Advanced metadata features

We already covered how you can specify sample metadata using either a simple CSV file or a PEP, but that tutorial covered only the basic features of PEPs. PEPs are a lot more powerful, and many of those features are useful for looper projects. Here, we'll show you a few of the more advanced features of PEPs and explain how they can be useful with looper. We still won't cover everything here, though. If you want to see all the features of PEP, you should consult the detailed PEP documentation.

Learning objectives

What else can PEPs do that can make my life easier?

Note

These concepts aren't strictly about looper; they are about PEP. The examples here just show how your looper project could take advantage of PEP features.

Implied attributes

At some point, you may need a single sample attribute (or column) to populate several different pipeline arguments with different values. In other words, the value of a given attribute may imply values for other attributes. It would be nice if you didn't have to enumerate all of these secondary, implied attributes, and could instead just infer them from the value of the original attribute.

For example, if my organism attribute is human, this implies a few other secondary attributes (which may be project-specific): For one project, I want to set genome to hg38 and macs_genome_size to hs. Of course, I could just define columns called genome and macs_genome_size, but these would be constant across samples, so it feels inefficient and unwieldy. Plus, changing the aligned genome would require changing the sample annotation sheet (every sample, in fact). You can certainly do this with looper, but a better way is to handle these things at the project level.

As a more elegant alternative, PEP provides the imply sample modifier. Instead of hard-coding genome and macs_genome_size in the sample annotation sheet, you can simply specify that the attribute organism implies additional attribute-value pairs, which vary by sample based on the value of the organism attribute. This lets you specify assemblies, genome size, and other similar variables all in your project config file.

To do this, just add an imply sample modifier. Example:

sample_modifiers:
  imply:
    - if:
        organism: "human"
      then:
        genome: "hg38"
        macs_genome_size: "hs"
    - if:
        organism: "mouse"
      then:
        genome: "mm10"
        macs_genome_size: "mm"

In this example, any samples with organism set to "human" will automatically also have attributes for genome ("hg38") and for macs_genome_size ("hs"). Any samples with organism set to "mouse" will have the corresponding values. A sample with organism set to "frog" would lack attributes for genome and macs_genome_size, since those columns are not implied by "frog".
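
The matching logic described above can be sketched in plain Python. This is only an illustrative model, not the PEP tooling's actual implementation: the function name apply_imply and the sample names s1, s2, and s3 are hypothetical, and the sketch handles only single-value conditions.

```python
# Illustrative only: a plain-Python model of the imply rules above,
# not the actual implementation used by the PEP tooling.
IMPLY_RULES = [
    {"if": {"organism": "human"}, "then": {"genome": "hg38", "macs_genome_size": "hs"}},
    {"if": {"organism": "mouse"}, "then": {"genome": "mm10", "macs_genome_size": "mm"}},
]


def apply_imply(sample, rules):
    """Return a copy of sample with attributes added by every matching rule."""
    result = dict(sample)
    for rule in rules:
        # A rule matches when every attribute in its `if` block has the listed value.
        if all(result.get(attr) == value for attr, value in rule["if"].items()):
            result.update(rule["then"])
    return result


# Hypothetical samples; only the organism attribute drives the rules.
samples = [
    {"sample_name": "s1", "organism": "human"},
    {"sample_name": "s2", "organism": "mouse"},
    {"sample_name": "s3", "organism": "frog"},
]
implied = [apply_imply(s, IMPLY_RULES) for s in samples]
for s in implied:
    print(s["sample_name"], s.get("genome"), s.get("macs_genome_size"))
```

In this sketch the frog sample passes through unchanged, which mirrors the behavior described above: no rule matches, so no genome or macs_genome_size attribute is added.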

This system essentially lets you set global, species-level attributes at the project level instead of duplicating that information for every sample that belongs to a species. Even better, it's generic, so you can do this for any partition of samples (just replace organism with whatever you like). This makes your project more portable and does a better job of conceptually separating sample attributes from project attributes. After all, a reference assembly is not a property of a sample, but is part of the broader project context.

The 'amend' project modifier for subprojects

PEP provides not only sample modifiers, but also project modifiers. You can use these to encode slightly different versions of a project without duplicating the settings. For example, say we have a project that we align to a particular reference genome, say "hg38", and we want to try running it on a different reference genome, say "hg19". Rather than duplicating the whole project or sample table and changing everything, we can do this using the amend project modifier. Consider this PEP:

sample_modifiers:
  append:
    genome: "hg38"

project_modifiers:
  amend:
    hg19_alignment:
      sample_modifiers:
        append:
          genome: "hg19"

This PEP uses the append modifier to set the genome attribute to hg38 for all samples. We can then use {sample.genome} in the pipeline interface to pass hg38 as a pipeline parameter. But we also have an amend section, which defines an amendment called hg19_alignment. If we activate this project with --amend hg19_alignment, then everything under that amendment will be attached to the PEP. In this example, it will add a new append modifier, which sets the genome attribute to hg19. Thus, reparameterizing this pipeline is as simple as choosing the amendment on the command line with --amend hg19_alignment.
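
One way to picture what activating an amendment does is as a recursive merge of the amendment's contents over the base configuration. The sketch below assumes exactly that; the helper names deep_update and activate_amendment are hypothetical, and the real merge is performed by the PEP tooling and may differ in detail.

```python
import copy

# Illustrative only: models an amendment as a recursive dictionary merge.
# The real merge is performed by the PEP tooling and may differ in detail.
BASE_CONFIG = {
    "sample_modifiers": {"append": {"genome": "hg38"}},
    "project_modifiers": {
        "amend": {
            "hg19_alignment": {
                "sample_modifiers": {"append": {"genome": "hg19"}},
            },
        },
    },
}


def deep_update(target, updates):
    """Recursively merge updates into target, overwriting leaf values."""
    for key, value in updates.items():
        if isinstance(value, dict) and isinstance(target.get(key), dict):
            deep_update(target[key], value)
        else:
            target[key] = value
    return target


def activate_amendment(config, name):
    """Return a new config with the named amendment merged over the base settings."""
    amendment = config["project_modifiers"]["amend"][name]
    return deep_update(copy.deepcopy(config), copy.deepcopy(amendment))


amended = activate_amendment(BASE_CONFIG, "hg19_alignment")
print(BASE_CONFIG["sample_modifiers"]["append"]["genome"])  # hg38
print(amended["sample_modifiers"]["append"]["genome"])      # hg19
```

The base configuration is left untouched, so the same PEP can serve both the hg38 run and the hg19 run.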

How to handle multiple input files

Sometimes a sample has multiple input files that belong to the same attribute. For example, a common use case is a single library that was spread across multiple sequencing lanes, yielding multiple input files that need to be merged, and then run through the pipeline as one. Dealing with multiple input files is described in detail in the PEP documentation, but covered briefly here. PEP has two ways to merge these:

Use shell expansion characters (like * or []) in your file path definitions (good for simple merges)
Specify a subsample table, which maps input files to samples for samples with more than one input file (infinitely customizable for more complicated merges)

Warning

Do not use both of these options for the same sample at the same time; that will lead to multiple mergers.

Using wildcards in derived sources

To use the first option, simply change the data source specification to use wildcard characters:

pep_config.yaml

pep_version: 2.1.0
sample_table: sample_table.csv
sample_modifiers:
  append:
    file_path: source1
  derive:
    attributes: [file_path]
    sources:
      source1: "data/{sample_name}_*.txt"
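
As a rough illustration of what the wildcard in source1 does, the sketch below fills in the sample name and then expands the pattern with Python's glob module. It is not how the PEP tooling is implemented: resolve_source is a hypothetical helper, the temporary directory and empty files stand in for a real data folder, and the sorted order is a choice made here for predictable output.

```python
import glob
import os
import tempfile

# Illustrative only: expands the source template with Python's glob module.
# The temporary directory and files stand in for a real data folder.
SOURCE_TEMPLATE = "data/{sample_name}_*.txt"


def resolve_source(template, sample, root):
    """Fill in sample attributes, then expand wildcards relative to root."""
    pattern = os.path.join(root, template.format(**sample))
    matches = sorted(glob.glob(pattern))
    return [os.path.relpath(path, root) for path in matches]


root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "data"))
for name in ("canada_1.txt", "canada_2.txt", "mexico_1.txt"):
    open(os.path.join(root, "data", name), "w").close()

canada_files = resolve_source(SOURCE_TEMPLATE, {"sample_name": "canada"}, root)
print([os.path.basename(path) for path in canada_files])
# ['canada_1.txt', 'canada_2.txt']
```

A single file_path attribute ends up holding both canada files, while the mexico file is ignored because it does not match the canada pattern.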

Using a subsample table

For the second option, provide a subsample table in your pep config file:

pep_config.yaml

pep_version: 2.1.0
sample_table: sample_table.csv
subsample_table: subsample_table.csv

Example subsample_table.csv

sample_name,file_path
canada,data/canada_1.txt
canada,data/canada_2.txt
switzerland,data/switzerland_1.txt
switzerland,data/switzerland_2.txt
mexico,data/mexico_1.txt
mexico,data/mexico_2.txt

Make sure the sample_name column of this table matches the sample_name column of the main sample table, and then include any columns needed to point to the data. Looper will automatically include all of these files as input passed to the pipelines.
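
The effect of the subsample table can be sketched as grouping rows by sample_name, so that each attribute collects one value per row. This is a simplified model using only the standard library; collect_subsample_values is a hypothetical helper, not part of looper or the PEP tooling.

```python
import csv
import io
from collections import defaultdict

# Illustrative only: groups subsample rows by sample_name, which is roughly
# the effect a subsample table has on the matching samples.
SUBSAMPLE_CSV = """sample_name,file_path
canada,data/canada_1.txt
canada,data/canada_2.txt
switzerland,data/switzerland_1.txt
switzerland,data/switzerland_2.txt
mexico,data/mexico_1.txt
mexico,data/mexico_2.txt
"""


def collect_subsample_values(csv_text, key="sample_name"):
    """Map each sample name to {attribute: [values, in row order]}."""
    grouped = defaultdict(lambda: defaultdict(list))
    for row in csv.DictReader(io.StringIO(csv_text)):
        name = row.pop(key)
        for attr, value in row.items():
            grouped[name][attr].append(value)
    return {name: dict(attrs) for name, attrs in grouped.items()}


merged = collect_subsample_values(SUBSAMPLE_CSV)
print(merged["canada"]["file_path"])
# ['data/canada_1.txt', 'data/canada_2.txt']
```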

Important

Mergers are not the way to handle different functional/conceptual kinds of input files (e.g., read1 and read2 for a sample sequenced with a paired-end protocol). Such cases should be handled as separate derived columns in the main sample annotation sheet if they're different arguments to the pipeline.

Behavior of multi-value sample attributes in pipeline interface command templates

Both subsample tables and shell expansion characters lead to sample attributes with multiple values. These are stored as a list of strings (file_path), as opposed to the standard scenario, where a single value is stored as a string (single_attr).

For example:

Sample
sample_name: canada
file_path: ['data/canada_1.txt', 'data/canada_2.txt']
single_attr: random_test_val

Access individual elements in lists

A pipeline interface author can take advantage of this and access the individual elements, e.g., iterate over them and append each one to a string using Jinja2 syntax:

pipeline_name: test_iter
pipeline_type: sample
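# --input-iter iterates over the multiple values; --input-single uses the single value as is.
# (Comments are kept outside the block scalar, where "#" would be literal text.)
command_template: >
  --input-iter {%- for x in sample.file_path -%} --test-individual {x} {% endfor %}
  --input-single {sample.single_attr}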

This results in a submission script that includes the following command:

--input-iter  --test-individual data/canada_1.txt  --test-individual data/canada_2.txt
--input-single  random_test_val

Concatenate elements in lists

The most common use case is simply concatenating the multiple values, separated by spaces, to provide multiple input values to a single argument on the command line. Therefore, any multi-value sample attribute that has not been processed with Jinja2 logic is automatically concatenated. For instance, the following command template in a pipeline interface will result in the submission script command presented below:

Pipeline interface:

pipeline_name: test_concat
pipeline_type: sample
# A multi-value attribute used directly is concatenated, with values separated by spaces.
command_template: >
  --input-concat {sample.file_path}

Command in the submission script:

--input-concat  data/canada_1.txt data/canada_2.txt
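
The concatenation rule can be modeled in a few lines of Python. This is a hedged sketch of the behavior described above, not looper's own code; render_value is a hypothetical helper, and the sample values are taken from the example earlier in this section.

```python
# Illustrative only: mimics the automatic concatenation of multi-value attributes.
def render_value(value, sep=" "):
    """Join list-valued attributes with sep; return single values as strings."""
    if isinstance(value, (list, tuple)):
        return sep.join(str(item) for item in value)
    return str(value)


sample = {
    "sample_name": "canada",
    "file_path": ["data/canada_1.txt", "data/canada_2.txt"],
    "single_attr": "random_test_val",
}
command = "--input-concat {} --input-single {}".format(
    render_value(sample["file_path"]), render_value(sample["single_attr"])
)
print(command)
# --input-concat data/canada_1.txt data/canada_2.txt --input-single random_test_val
```

Single-value attributes pass through unchanged, so the same template works whether a sample has one input file or several.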

Others

Some other things you might find interesting:

imports allow PEPs to import other PEPs, so you can re-use information across projects.

Summary

You can use the imply sample modifier to eliminate redundant columns in your sample table.
You can use the amend project modifier to define variations of a project, such as a different reference genome, without duplicating it.
You can use multiple input files for your pipeline using wildcards or subsample tables.
PEP provides many powerful features that work well with looper.