# Advanced run options

## Introduction
The basic tutorials showed you how to use `looper run` with the defaults, and introduced a few of the arguments, like `--dry-run` and `--limit`. The computing tutorial covered some arguments related to computing resources, like `--package` and `--compute`. There are many other arguments to `looper run` that can help you control how looper creates and submits jobs. Let's introduce some of the more advanced capabilities of `looper run`.
## Learning objectives

- What are some of the more advanced options for `looper run`?
- How do I submit a job for the entire project, instead of a separate job for each sample?
- How can I submit different pipelines for different samples?
- How can I adjust pipeline arguments on-the-fly?
- What if I want to submit multiple samples in a single job?
- Can I exclude certain samples from a run?
## Grouping many jobs into one

By default, `looper` will translate each row in your `sample_table` into a single job. But perhaps you are running a project with tens of thousands of rows, and each job takes mere minutes to run; in this case, you'd rather submit a single job that processes many samples. Looper makes this easy with the `--lump`, `--lumpn`, and `--lumpj` command-line arguments.
### Lumping jobs by job count: `--lumpn`

It's quite simple: if you want to run 100 samples in a single job submission script, just tell looper `--lumpn 100`.
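For example, to bundle 100 sample commands into each submission script:

```bash
looper run --lumpn 100
```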
Lumping jobs by input file size: --lump
But what if your samples are quite different in terms of input file size? For example, your project may include many small samples, which you'd like to lump together with 10 jobs to 1, but you also have a few control samples that are very large and should have their own dedicated job. If you just use --lumpn
with 10 samples per job, you could end up lumping your control samples together, which would be terrible. To alleviate this problem, looper
provides the --lump
argument, which uses input file size to group samples together. By default, you specify an argument in number of gigabytes. Looper will go through your samples and accumulate them until the total input file size reaches your limit, at which point it finalizes and submits the job. This will keep larger files in independent runs and smaller files grouped together.
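For instance, to accumulate samples until each job's total input reaches 10 gigabytes (the number here is just illustrative):

```bash
looper run --lump 10
```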
### Lumping jobs by number of jobs: `--lumpj`

Alternatively, you can lump samples into a fixed number of jobs with `--lumpj`.
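For example, to divide all samples across 10 submission scripts (again, an illustrative number):

```bash
looper run --lumpj 10
```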
## Running project-level pipelines

### What are project-level pipelines?
The tutorials assume you're running a pipeline with one job per sample; i.e., the samples are independent, and you want to do the same thing to each of them. This is the most common use case for looper.

But sometimes, you're interested in running a job on an entire project. This could be a pipeline that integrates data across all the samples, or one that summarizes the results of your independent sample pipeline runs. In this case, you really only need to submit a single job. Looper's main benefit is handling the boilerplate needed to construct the submission scripts and submit a separate job for each sample, but looper also provides some help for the task of submitting a single job for the whole project.

So, there are really two types of pipelines looper can submit. The typical ones, which need one job per sample, we call sample-level pipelines; for many use cases, that's all you need to worry about. The broader ones, which need one job for the whole project, we call project-level pipelines.
### Split-apply-combine

One of the most common uses of a project-level pipeline is to summarize or aggregate the results of the sample-level pipeline. This approach essentially employs looper as an implementation of the MapReduce programming model, which applies a split-apply-combine strategy: we split the project into samples and apply the first tier of processing (the sample pipeline), then combine the results in the second tier of processing (the project pipeline). Looper doesn't require you to use this two-stage system, but it does make it easy to do so. Many pipelines operate only at the sample level and leave the downstream cross-sample analysis to the user.
### How to run a project-level pipeline

The usual `looper run` command runs sample-level pipelines, creating a separate job for each sample. Pipeline interfaces defining a sample pipeline do so under the `sample_interface` attribute.
Project pipelines are run with `looper runp` (where the "p" stands for *project*). A pipeline interface specifies that it is a project pipeline by using the `project_interface` attribute. Running a project pipeline operates in almost exactly the same way as a sample pipeline, with 2 key differences:
- First, instead of a separate command for every sample, `looper runp` creates a single command for the project (per pipeline).
- Second, the command template cannot access the `sample` namespace representing a particular sample, since it's not running on a particular sample; instead, it has access to a `samples` (plural) namespace, which contains the attributes from all the samples.
In a typical workflow, a user will first run the samples individually using `looper run`, and then, if the pipeline provides one, run the project component using `looper runp` to aggregate the results into a project-level output.
Example of a pipeline interface containing a sample-level interface AND a project-level interface:

```yaml
pipeline_name: example_pipeline
output_schema: pipestat_output_schema.yaml
sample_interface:
  command_template: >
    count_lines.sh {sample.file} {sample.sample_name}
project_interface:
  command_template: >
    count_lines_project.sh "data/*.txt"
```
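With an interface like this, a typical session runs the sample-level pipeline first and then the project-level aggregation (assuming your looper config points at this pipeline interface):

```bash
looper run    # one count_lines.sh job per sample
looper runp   # a single count_lines_project.sh job aggregating across samples
```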
## Running multiple pipelines

To run more than one pipeline, specify multiple pipeline interfaces in the looper config file:

```yaml
pep_config: pephub::databio/looper:default
output_dir: "$HOME/hello_looper-master/output"
pipeline_interfaces:
  - "$HOME/hello_looper-master/pipeline/pipeline_interface"
  - "$HOME/hello_looper-master/project/pipeline"
```
You can also link to a pipeline interface with a sample attribute. If you want the same pipeline to run on all samples, it's as easy as using an `append` modifier like this:

```yaml
sample_modifiers:
  append:
    pipeline_interfaces: "test.yaml"
```
But if you want to submit different samples to different pipelines, depending on a sample attribute like `protocol`, you can use an implied attribute:

```yaml
sample_modifiers:
  imply:
    - if:
        protocol: [PRO-seq, pro-seq, GRO-seq, gro-seq]  # OR
      then:
        pipeline_interfaces: ["peppro.yaml"]
```
This approach uses only PEP functionality to handle the connection to pipelines as sample attributes, which gives you full control and power through the familiar sample modifiers. It eliminates the need to re-invent this complexity within looper, which in turn allowed looper to drop its protocol mapping section and simplify the pipeline interface files. You can read more about the rationale of this change in issue 244.
## Modulating pipeline by sample

Usually, we think of looper as running the same pipeline for each sample. But if you have a project that contains samples of different types, you may want to submit a different pipeline for each sample. The best way to solve this problem is to split your sample table into separate tables and run a different pipeline on each; but if you really want to, you can modulate the pipeline by sample attributes.
You can use an `imply` modifier in your PEP to select which pipelines you want to run on which samples, like this:

```yaml
sample_modifiers:
  imply:
    - if:
        protocol: "RRBS"
      then:
        pipeline_interfaces: "/path/to/pipeline_interface.yaml"
    - if:
        protocol: "ATAC"
      then:
        pipeline_interfaces: "/path/to/pipeline_interface2.yaml"
```
A more complicated example, taken from PEPATAC, of a `project_config.yaml` file:

```yaml
pep_version: 2.0.0
sample_table: tutorial.csv
sample_modifiers:
  derive:
    attributes: [read1, read2]
    sources:
      # Obtain tutorial data from http://big.databio.org/pepatac/ then set
      # path to your local saved files
      R1: "${TUTORIAL}/tools/pepatac/examples/data/{sample_name}_r1.fastq.gz"
      R2: "${TUTORIAL}/tools/pepatac/examples/data/{sample_name}_r2.fastq.gz"
  imply:
    - if:
        organism: ["human", "Homo sapiens", "Human", "Homo_sapiens"]
      then:
        genome: hg38
        prealignment_names: ["rCRSd"]
        deduplicator: samblaster  # Default. [options: picard]
        trimmer: skewer           # Default. [options: pyadapt, trimmomatic]
        peak_type: fixed          # Default. [options: variable]
        extend: "250"             # Default. For fixed-width peaks, extend this distance up- and down-stream.
        frip_ref_peaks: None      # Default. Use an external reference set of peaks instead of the peaks called from this run
```
## Passing extra command-line arguments

Occasionally, a particular project needs to run a particular flavor of a pipeline. How can you adjust pipeline arguments for just this project? You can use looper command extras to solve this problem. Command extras let you pass any string on to the pipeline, which will be appended to the command.

There are 2 ways to use command extras: for sample pipelines, or for project pipelines.
### 1. Sample pipeline command extras

#### Adding sample command extras via sample attributes

Looper uses a reserved sample attribute called `command_extra`, which you can set using general PEP sample modifiers however you wish. For example, if your extras are the same for all samples, you could use an `append` modifier:
```yaml
sample_modifiers:
  append:
    command_extra: "--flavor-flag"
```
This will add `--flavor-flag` to the end of the command looper constructs. If you need to modulate the extras depending on another attribute value, you could use an `imply` modifier:
```yaml
sample_modifiers:
  imply:
    - if:
        protocol: "rrbs"
      then:
        command_extra: "-C flavor.yaml --epilog"
```
#### Adding sample command extras via the command line

You can also pass extra arguments using `--command-extra` like this:

```bash
looper run project_config.yaml --command-extra="--flavor-flag"
```
### 2. Project pipeline command extras

For project pipelines, you can specify command extras in the `looper` section of the PEP config:

```yaml
looper:
  output_dir: "/path/to/output_dir"
  cli:
    runp:
      command-extra: "--flavor"
```
or as an argument to the `looper runp` command:

```bash
looper runp project_config.yaml --command-extra="--flavor-flag"
```
### Overriding PEP-based command extras

By default, the CLI extras are appended to the `command_extra` specified in your PEP. If you want to override the command extras listed in the PEP instead, use `--command-extra-override`.

So, for example, make your looper call like this:

```bash
looper run --command-extra-override="-R"
```

That will remove any defined command extras and append `-R` to the end of any commands created by looper.
## Add CLI arguments to looper config

You can provide a `cli` keyword to specify any command-line (CLI) options from within the looper config file. The subsections within this section direct the arguments to the respective `looper` subcommands. For example, to specify a sample submission limit for a `looper run` command, use:

```yaml
cli:
  run:
    limit: 2
```
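This has the same effect as passing the option on the command line:

```bash
looper run --limit 2
```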
Keys in the `cli.<subcommand>` section must match the long argument parser option strings, such as `command-extra`, `limit`, `dry-run`, and so on. For more CLI options, refer to the subcommands usage.
## Selecting or excluding samples

Looper provides several ways to select (filter) samples, so you only submit certain ones.

### Sample selection by inclusion

To submit only certain samples, specify the sample attribute with `--sel-attr` and the values that attribute may take with `--sel-incl`.
For example, to choose only samples where the `species` attribute is `human`, `mouse`, or `fly`:

```bash
looper run \
  --sel-attr species \
  --sel-incl human mouse fly
```
Similarly, to submit only one sample, with `sample_name` as `sample1`, you could use:

```bash
looper run \
  --sel-attr sample_name \
  --sel-incl sample1
```
### Sample selection by exclusion

If it's more convenient to exclude samples by filter, you can use the analogous arguments: `--sel-attr` with `--sel-excl`. This will submit jobs for every sample except those whose attribute value matches one of the excluded values.
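For example, to submit every sample except those whose `species` attribute is `human` (mirroring the inclusion example above):

```bash
looper run \
  --sel-attr species \
  --sel-excl human
```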
### Toggling sample jobs through the sample table

You can also set a `toggle` attribute, either in your sample table or via a sample modifier. If the value of this attribute is not 1, `looper` will not submit the pipeline for that sample. This enables you to submit only a subset of samples.
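For instance, a minimal sketch that uses an `imply` modifier to switch off one group of samples (the `protocol` value here is hypothetical):

```yaml
sample_modifiers:
  imply:
    - if:
        protocol: "RRBS"  # hypothetical condition
      then:
        toggle: 0         # any value other than 1 means looper skips the sample
```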
## Summary

- You can use `--lump`, `--lumpn`, or `--lumpj` to group multiple jobs into the same submission script.
- `looper run` has sample selection and exclusion arguments.
- You can attach multiple pipeline interfaces to a looper project, resulting in multiple pipeline submissions.