
Writing a pipeline interface

Introduction

In the User Tutorial we walked you through creating a Looper workspace to run a pipeline on a dataset. That tutorial assumed you already wanted to run an existing looper-compatible pipeline. For pipelines that are already looper-compatible, you just point your looper configuration file at the pipeline's interface file, as described in the User Tutorial.

This Developer Tutorial goes into more detail on how to create a looper-compatible pipeline. If you're interested in building a looper-compatible pipeline, or taking an existing pipeline and making it work with looper, the critical point of contact is the pipeline interface. A pipeline interface describes how to run a pipeline.

This tutorial will show you how to write a pipeline interface. Once you've been through this, you can consult the formal pipeline interface format specification for further details and reference.

Learning objectives

  • What is a looper pipeline interface?
  • How do I write a looper pipeline interface for my custom pipeline?
  • Do I have to write a pipeline interface to have looper just run some command that isn't really a pipeline?

An example pipeline interface

Each Looper project requires one or more pipeline interfaces that points to sample and/or project pipelines. The looper config file points to the pipeline interface, and the pipeline interface tells looper how to run the pipeline.
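As a quick reminder of how these pieces fit together, the looper config file from the User Tutorial references the pipeline interface under a pipeline_interfaces key. The exact contents depend on how you set up your workspace, so treat this snippet as an approximate sketch rather than a config to copy verbatim:

.looper.yaml
pep_config: metadata/sample_table.csv  # or wherever your PEP lives
output_dir: results
pipeline_interfaces:
  - pipeline_interface.yaml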


Let's revisit the simple pipeline interface from the count_lines pipeline example:

pipeline_interface.yaml
pipeline_name: count_lines
sample_interface:
  command_template: >
    pipeline/count_lines.sh {sample.file_path}

There are 2 required keys. First, the pipeline_name can be anything, as long as it is unique to the pipeline. It will be used for communication and keeping track of which results belong to which pipeline. Think of it as a unique identifier label you assign to the pipeline.

Second, the sample_interface tells looper (and pipestat) that this is a sample-level pipeline. Under sample_interface is the command_template, which holds our command string with variables that will be populated. In this example, looper will run pipeline/count_lines.sh. Looper will pass a single argument to the script, {sample.file_path}, which is a variable that will be populated with the file_path attribute on the sample. This file_path field corresponds to a column header in the sample table. As the pipeline author, your command template can use any sample attribute with the {sample.<attribute>} syntax. You can make the command template much more complicated and refer to any sample or project attributes, as well as a number of other variables made available by looper.
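For instance, a hypothetical, more elaborate template might pass several sample attributes and the looper output directory. (The --sample-name and --outdir flags below are invented for illustration; the real count_lines.sh does not accept them.)

pipeline_interface.yaml
pipeline_name: count_lines
sample_interface:
  command_template: >
    pipeline/count_lines.sh {sample.file_path}
    --sample-name {sample.sample_name}
    --outdir {looper.output_dir}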

That's all you need for a basic pipeline interface! But there's also a lot more you can do, such as providing a schema to specify inputs or outputs, making input-size-dependent compute settings, and more. Let's walk through some of these more advanced options.

Initializing a generic pipeline interface

So that you don't have to remember the general syntax, you can run a command that will create a generic pipeline interface, a generic output_schema.yaml (for pipestat-compatible pipelines), and a generic pipeline to get you started:

looper init_piface

Making the pipeline interface portable with {looper.piface_dir}

One of the problems with the above pipeline interface is that it is hard-coding a relative path to the pipeline, pipeline/count_lines.sh:

pipeline_interface.yaml
pipeline_name: count_lines
sample_interface:
  command_template: >
    pipeline/count_lines.sh {sample.file_path}

Looper will literally run pipeline/count_lines.sh, which will only work if the working directory contains the pipeline/ subfolder holding count_lines.sh. This may work, since we typically call looper run from our looper workspace folder, where the .looper.yaml config file lives.

For example, it would work if you invoke looper run from this working directory:

.
├── pipeline
│   └── count_lines.sh
├── pipeline_interface.yaml
└── .looper.yaml

But what if we want to run the jobs from a different directory? It would be better if the jobs could run correctly from any working directory.

One way we could do that is to hard-code the absolute path to the script. Then, the resulting commands could run from anywhere on that system.

pipeline_interface.yaml
pipeline_name: count_lines
sample_interface:
  command_template: >
    /absolute/path/to/pipeline/count_lines.sh {sample.file_path}

But then this pipeline interface would not be portable. It could not be shared with another user on a different system.

Another solution would be to put the pipeline command on our shell PATH, so it can be run globally, like this:

pipeline_interface.yaml
pipeline_name: count_lines
sample_interface:
  command_template: >
    count_lines {sample.file_path}

This way, it would run correctly from any folder, and would work on any system. But then users of our pipeline would have to adjust their PATH to make the pipeline globally runnable. This is fine for an installed pipeline intended to be in the PATH, but it is not ideal for basic scripts or other non-global pipelines.
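For completeness, the PATH approach is just a matter of making the script executable and putting its folder on your shell PATH; something like this (using the same placeholder path as above):

chmod +x pipeline/count_lines.sh
export PATH="$PATH:/absolute/path/to/pipeline"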

Luckily, looper offers a better solution: specify paths relative to the pipeline interface file, and let looper resolve them as needed. For this purpose, looper provides the {looper.piface_dir} variable. The pipeline interface author can make any path portable by simply prepending {looper.piface_dir} to the script in a command_template.

Our improved pipeline interface looks like this:

pipeline_interface.yaml
pipeline_name: count_lines
sample_interface:
  command_template: >
    {looper.piface_dir}/pipeline/count_lines.sh {sample.file_path}

Looper will automatically populate {looper.piface_dir} on the system with the absolute path, giving us the best of both worlds. The pipeline script can be distributed alongside the pipeline interface, and this will automatically work from any working directory, on any system.

Therefore, we recommend pipeline authors always use {looper.piface_dir} to make sure their pipeline interfaces are portable. This is also true for pre-submit command templates.

Validating sample input attributes

Right now, looper will create and submit jobs for every row in the table. What if we want to prevent jobs from being submitted when a row lacks information required by the pipeline? This could help us fail early and avoid submitting jobs that won't succeed. For example, what if we want to make sure a particular required attribute is set before we submit a job?

To demonstrate, let's make a simple modification to the pipeline. We'll adjust it to report the area type.

pipeline/count_lines.sh
#!/bin/bash
linecount=`wc -l $1 | sed -E 's/^[[:space:]]+//' | cut -f1 -d' '`
export area_type=$2
echo "Number of ${area_type}s: $linecount"
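If you'd like to check the modified script by hand before wiring it into looper, you can call it directly with the two arguments (assuming data/mexico.txt exists from the earlier tutorial); it will print a line like Number of states: followed by the line count:

bash pipeline/count_lines.sh data/mexico.txt state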

Then, we'll adjust the pipeline interface to pass the area_type as the second argument to the pipeline.

pipeline_interface.yaml
# stuff before
sample_interface:
  command_template: >
    {looper.piface_dir}/pipeline/count_lines.sh {sample.file_path} {sample.area_type}

Now, let's add a new sample to the sample table that lacks the area_type.

metadata/sample_table.csv
sample_name,area_type
mexico,state
switzerland,canton
canada,province
usa,

Right now, if you invoke looper run, looper will create a job for all 4 samples. But what we want is to tell looper that the last sample is incomplete, so it should raise an error instead of creating and running the job. We will do that through the input_schema.

Create an input schema

The input schema formally specifies the inputs processed by this pipeline. It serves several purposes, like sample validation, specifying which attributes should be used to determine sample size, and specifying which attributes are files. We'll get into those details later. First, let's make an input schema. Paste this text into pipeline/input_schema.yaml:

input_schema.yaml
description: An input schema for the count_lines pipeline.
properties:
  samples:
    type: array
    items:
      type: object
      properties:
        sample_name: 
          type: string
          description: "Name of the sample"
        file_path: 
          type: string
          description: "Path to the input file to count"
        area_type:
          type: string
          description: "Name of the components of the country"
      required:
        - sample_name
        - file_path
        - area_type
required:
  - samples

This file specifies what inputs the pipeline uses, and what type they are. It is a JSON Schema, which allows us to use this file to validate the inputs, which we'll cover later. For now, it defines that our input samples have 3 properties: sample_name, file_path, and area_type.
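As an aside, looper runs this validation for you, but you can also validate a PEP against the schema manually with the eido package, which looper uses under the hood. This is just a sketch; point it at wherever your looper config's pep_config actually lives:

eido validate metadata/sample_table.csv -s pipeline/input_schema.yaml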

Adapt the pipeline interface to use the input schema

Next, we need to tell looper to use this as the input schema for the pipeline. Do this by adding an input_schema key to the pipeline interface.

pipeline_interface.yaml
pipeline_name: count_lines
input_schema: input_schema.yaml
sample_interface:
  command_template: >
    {looper.piface_dir}/pipeline/count_lines.sh {sample.file_path}

Now, the input schema is created and linked to the pipeline. If you invoke looper run, looper will read the schema, and first validate the sample before running the job. In this example, the first 3 samples will validate and run as before, but the final sample will not pass the validation stage.

To solve the problem, we need to specify the area_type for the usa sample. Let's add it in.

metadata/sample_table.csv
sample_name,area_type
mexico,state
switzerland,canton
canada,province
usa,state

Now, all 4 samples will validate and looper run will run all jobs.

Validating that input files exist

Another thing we may want to validate is that an input file actually exists. But how does looper know which attributes point to files that must exist? It is not enough to mark the file_path attribute as required; that only means the sample must specify something for that attribute. The attribute must also point to a file that actually exists. Looper handles this through the tangible key.

input_schema.yaml
description: An input schema for the count_lines pipeline.
properties:
  samples:
    type: array
    items:
      type: object
      properties:
        sample_name: 
          type: string
          description: "Name of the sample"
        file_path: 
          type: string
          description: "Path to the input file to count"
      tangible:
        - file_path
      required:
        - sample_name
        - file_path
required:
  - samples

By specifying file_path under required, we are telling looper to make sure that attribute is defined on the sample. Then, by adding file_path under tangible, we are telling looper that the file it points to must actually exist.

To test this, note that the usa sample is still missing its data file, so it should not run. Try it by running looper run.

You will see:

...
## [3 of 4] sample: canada; pipeline: count_lines
Writing script to /home/nsheff/code/hello_looper/input_schema_example/results/submission/count_lines_canada.sub
Job script (n=1; 0.05Gb): /home/nsheff/code/hello_looper/input_schema_example/results/submission/count_lines_canada.sub
Compute node: zither
Start time: 2024-09-19 16:27:32
Number of lines: 4091500
## [4 of 4] sample: usa; pipeline: count_lines
1 input files missing, job input size was not calculated accurately
> Not submitted: Missing files: data/usa.txt
Writing script to /home/nsheff/code/hello_looper/input_schema_example/results/submission/count_lines_usa.sub

Looper finished

Looper will not submit this job because it recognizes that there is no file at data/usa.txt. Let's make one. This command will create a new data file for usa with 100 million lines and a size of around 1 Gb.

shuf -r -n 100000000 data/mexico.txt > data/usa.txt
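You can confirm the size of the new file with a standard directory listing:

ls -lh data/usa.txt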

Now if you run looper again, the job will be submitted. But this raises another question: the usa.txt file is much larger than the other files. What if we need to adjust the job parameters based on the size of the input file?

Parameterizing job templates by sample through size-dependent variables

In the count_lines tutorials, we showed how to parameterize a job submission template via the compute section in the pipeline interface file, like this:

pipeline/pipeline_interface.yaml
pipeline_name: count_lines
sample_interface:
  command_template: >
    {looper.piface_dir}/pipeline/count_lines.sh {sample.file_path}
compute:
  partition: standard
  time: '01-00:00:00'
  cores: '32'
  mem: '32000'

This allows a job submission template to use computing variables like {CORES} and {MEM}, so you can control cluster resources. This provides the same compute parameters for every sample. What if we have a huge file and we want to use different parameters? It's common to need to increase memory usage for a larger sample. Looper provides a simple but powerful way to configure pipelines depending on the input size of the samples.
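To see where these variables end up, here is a rough sketch of the shape of a cluster submission template. The real templates ship with looper (via divvy), so this is illustrative rather than something you need to write yourself:

#!/bin/bash
#SBATCH --job-name='{JOBNAME}'
#SBATCH --output='{LOGFILE}'
#SBATCH --mem='{MEM}'
#SBATCH --cpus-per-task='{CORES}'
#SBATCH --time='{TIME}'
#SBATCH --partition='{PARTITION}'

{CODE}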

Replace the compute variables with a single variable, size_dependent_variables, which points to a .tsv file:

pipeline_interface.yaml
compute:
  partition: standard
  size_dependent_variables: resources-sample.tsv

(Here, we keep the partition variable, since it does not change with sample input size.)

Then, create resources-sample.tsv with these contents:

resources-sample.tsv
max_file_size   cores   mem time
0.05    1   1000    00-01:00:00
0.5 2   2000    00-03:00:00
NaN 4   4000    00-05:00:00

In this example, the partition will remain constant for all samples, but the cores, mem, and time variables will be modulated by sample file size. For each sample, looper will select the first row for which the sample's input file size does not exceed the row's max_file_size value in gigabytes. In this example:

  • Files up to 0.05 Gb (50 Mb) will use 1 core, 1000 Mb of RAM, and 1 hour of clock time.
  • Files up to 0.5 Gb (500 Mb) will use 2 cores, 2000 Mb of RAM, and 3 hours of clock time.
  • Anything larger will use 4 cores, 4000 Mb of RAM, and 5 hours of clock time.

You can think of it like a coin sorter machine, going through a priority list of parameter sets with increasing allowed file size, and the sample falls into the first bin in which it fits. This allows a pipeline author to specify different computing requirements depending on the size of the input sample.
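For example, the usa.txt file we just created is about 1 Gb, which exceeds both the 0.05 and 0.5 thresholds, so it falls through to the NaN row and will be submitted with 4 cores, 4000 Mb of memory, and 5 hours of clock time.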

To see this in action, we will need to use a compute template that uses these parameters. The slurm template should do the trick. We'll use -d to specify a dry run, so the job scripts are created, but not submitted.

looper run -p slurm -d

You should see output like this:

Activating compute package 'slurm'
## [1 of 4] sample: mexico; pipeline: count_lines
Writing script to /home/nsheff/code/hello_looper/input_schema_example/results/submission/count_lines_mexico.sub
Job script (n=1; 0.00Gb): /home/nsheff/code/hello_looper/input_schema_example/results/submission/count_lines_mexico.sub
Dry run, not submitted
## [2 of 4] sample: switzerland; pipeline: count_lines
Writing script to /home/nsheff/code/hello_looper/input_schema_example/results/submission/count_lines_switzerland.sub
Job script (n=1; 0.00Gb): /home/nsheff/code/hello_looper/input_schema_example/results/submission/count_lines_switzerland.sub
Dry run, not submitted
## [3 of 4] sample: canada; pipeline: count_lines
Writing script to /home/nsheff/code/hello_looper/input_schema_example/results/submission/count_lines_canada.sub
Job script (n=1; 0.00Gb): /home/nsheff/code/hello_looper/input_schema_example/results/submission/count_lines_canada.sub
Dry run, not submitted
## [4 of 4] sample: usa; pipeline: count_lines
Writing script to /home/nsheff/code/hello_looper/input_schema_example/results/submission/count_lines_usa.sub
Job script (n=1; 0.00Gb): /home/nsheff/code/hello_looper/input_schema_example/results/submission/count_lines_usa.sub
Dry run, not submitted

But wait! The usa sample file size is listed as 0Gb. Why is it not recognizing the true size of the input file? There is one more thing we need to do.

Configuring how looper determines sample size

You can use the input_schema sizing keyword to tell looper which attributes determine the input file size. In most cases, sizing will list the same attributes as tangible, because the files that must exist on disk are usually also the files you want to use to size the job. But you might have required input files that should not count toward the sample's size, so the two can be defined separately. All we have to do is add file_path to a sizing section in the input schema, like this.

input_schema.yaml
description: An input schema for the count_lines pipeline.
properties:
  samples:
    type: array
    items:
      type: object
      properties:
        sample_name: 
          type: string
          description: "Name of the sample"
        file_path: 
          type: string
          description: "Path to the input file to count"
        area_type:
          type: string
          description: "Name of the components of the country"
      required:
        - sample_name
        - file_path
        - area_type
      tangible:
        - file_path
      sizing:
        - file_path
required:
  - samples

The sizing section lists file_path, which tells looper that this attribute should be used to determine the size of the sample.

Now, if we try again, we will see that it works:

looper run -p slurm -d --skip 3
## [1 of 4] sample: usa; pipeline: count_lines
Writing script to /home/nsheff/code/hello_looper/input_schema_example/results/submission/count_lines_usa.sub
Job script (n=1; 0.99Gb): /home/nsheff/code/hello_looper/input_schema_example/results/submission/count_lines_usa.sub
Dry run, not submitted

Looper is computing this size based on the size of the file listed in file_path. Now, look at the job script that was created, which contains the parameters for large files:

cat results/submission/count_lines_usa.sub
#!/bin/bash
#SBATCH --job-name='count_lines_usa'
#SBATCH --output='/home/nsheff/code/hello_looper/input_schema_example/results/submission/count_lines_usa.log'
#SBATCH --mem='4000'
#SBATCH --cpus-per-task='4'
#SBATCH --time='00-05:00:00'
#SBATCH --partition='{PARTITION}'
#SBATCH -m block
#SBATCH --ntasks=1

echo 'Compute node:' `hostname`
echo 'Start time:' `date +'%Y-%m-%d %T'`

pipeline/count_lines.sh data/usa.txt  

We recommend pipeline authors configure their pipeline interfaces with an appropriate size_dependent_variables file, distributed alongside the pipeline interface.

Adding pre-submission commands to your pipeline run

Sometimes we need to run a set-up task before submitting the main pipeline. Looper handles this through a pre-submission hook system.

Looper provides several built-in hooks that do things like write out the sample information in different ways, which can be useful for different types of pipelines. If your pipeline would benefit from one of these representations, all you need to do is add a reference to the plugin in the pipeline interface, like this:

pipeline_interface.yaml
pre_submit:
  python_functions:
    - looper.write_sample_yaml

This instructs looper to run the plugin before each job submission. You can also parameterize the plugin by adding other variables to the pipeline interface; consult the plugin documentation for details on how to parameterize each plugin.
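For example, some built-in plugins read their settings from a var_templates section in the pipeline interface. The exact parameter names are defined in the plugin documentation; the sample_yaml_path key below is just a sketch of the general pattern:

pipeline_interface.yaml
pipeline_name: count_lines
var_templates:
  sample_yaml_path: "{looper.output_dir}/submission/{sample.sample_name}.yaml"
pre_submit:
  python_functions:
    - looper.write_sample_yaml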

You can also write your own custom plugins, which can be as simple as a shell script, or they can be Python functions. Here's an example of a Python script that will be executed through the command line:

pipeline_interface.yaml
pre_submit:
  command_templates:
    - "{looper.piface_dir}/hooks/script.py --genome {sample.genome} --log-file {looper.output_dir}/log.txt"

Looper will run this command before running each job, allowing you to create little set-up scripts to prepare things before a full job is run.
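As a concrete (and entirely hypothetical) illustration, such a hook could be nothing more than a small shell script distributed next to the pipeline interface and referenced from command_templates just like script.py above:

hooks/setup.sh
#!/bin/bash
# Hypothetical pre-submit hook: make sure the log directory exists
# and record when this job was prepared.
set -e
log_file=$1
mkdir -p "$(dirname "$log_file")"
echo "Run prepared at $(date +'%Y-%m-%d %T')" >> "$log_file"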

Project-level pipeline interfaces

Remember, looper distinguishes sample-level from project-level pipelines. This is explained in detail in Advanced run options and in How to run a project-level pipeline. Basically, sample-level pipelines run once per sample, whereas project-level pipelines run once per project. If this interface were describing a project-level pipeline, we would change sample_interface to project_interface.

pipeline_interface.yaml
pipeline_name: count_lines
project_interface:
  command_template: '{looper.piface_dir}/count_lines.sh'

You can also write an interface with both a sample_interface and a project_interface. This would make sense to do for a pipeline that had two parts, one that you run independently for each sample, and a second one that aggregates all those sample results at the project level.
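For example, a combined interface might look like this. (The summarize_counts.sh script is hypothetical, standing in for whatever project-level aggregation step your pipeline provides.)

pipeline_interface.yaml
pipeline_name: count_lines
sample_interface:
  command_template: >
    {looper.piface_dir}/pipeline/count_lines.sh {sample.file_path}
project_interface:
  command_template: >
    {looper.piface_dir}/pipeline/summarize_counts.sh {looper.output_dir}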

The difference between sample_interface and project_interface is pretty simple. When a user invokes looper run, looper reads the sample_interface section and creates one job per sample. In contrast, when a user invokes looper runp, looper reads the project_interface section and creates one job per project (one job total).

The only other difference is that the sample-level command template has access to the {sample.<attribute>} namespace, whereas the project-level command template has access to the {samples} array.

Summary

  • You can initialize and modify a generic pipeline interface using looper init_piface.
  • You should make your pipeline interfaces portable by using {looper.piface_dir}.
  • Pipeline-specific and sample-specific compute variables can be added to the pipeline interface to set items such as memory and number of cpus during the pipeline run.
  • You can specify pre-submission hooks for job setup in the pipeline interface.