PEP specification version 2.1.0

Introduction and motivation

Bioinformatics projects often start from a sample table, a spreadsheet with samples as rows and attributes of those samples as columns. For example, attributes may include file paths to raw data, sample annotation like organism or treatment, and other experimental details. Unfortunately, each project is usually done differently: there is little standardization of these file formats and column names across projects. The downstream processing tools that consume the sample table typically expect a specific way of formatting the table, such as requiring certain columns, expecting certain file formats, and so on. These assumptions are often inherent in the tools, but rarely explained. And even where they are explained, they tend to be unique to each tool. There is no standard way to represent metadata that spans projects and tools. This restricts the portability and reusability of annotated datasets and of the software that processes them.

Portable Encapsulated Projects (PEP for short) seeks to make datasets and related software more portable and reusable. PEP does this by providing metadata standardization, metadata validation, and portability modifiers.

How PEP improves sample annotation portability

PEP provides 3 features to improve portability:

A standardized metadata structure. PEP standardizes sample metadata formats. This allows tools and pipelines to read data from different sources more easily.
A validation framework. PEP provides formal validation schemas. This allows us to confirm that a PEP complies with the requirements of an arbitrary tool.
Project and sample modifiers. PEP provides a powerful framework to programmatically modify sample- and project-level metadata. This allows us to systematize metadata so one input source can span multiple tools.

Definitions of terms and components of a PEP

A PEP can be made from any collection of metadata represented in tabular form. Typically, a PEP represents a data-intensive bioinformatics project with many samples, like individuals or cell lines. The key terms are:

Project: a collection of metadata that annotates a set of samples.
Sample: loosely defined; a unit that can be collected into a project, usually with one or more data files.
PEP specification: the way to organize project and sample metadata in files using a yaml + csv format.
PEP: a project that follows the PEP specification.

The PEP specification divides metadata into components: sample metadata, which can vary by sample, and project metadata, which applies to all samples. These components are stored in separate files. A complete PEP consists of up to 3 files:

Project config file - RECOMMENDED. A yaml file containing project-level metadata.
Sample table - RECOMMENDED. A csv file of sample metadata, with 1 row per sample.
Subsample table - OPTIONAL. A csv file with multiple rows for each sample, used to specify sample attributes with multiple values (e.g. to point to the inputs of a sequencing experiment when they are split across multiple files).

This document describes each of these 3 files in detail.

Validating a PEP

PEP uses an extended JSON Schema vocabulary with novel sample metadata features. The formal PEP spec is described as a schema at schema.databio.org/pep/2.1.0.yaml. You can validate a PEP against any PEP schema using the eido Python package like this:

eido validate path/to/your/PEP_config.yaml -s https://schema.databio.org/pep/2.1.0.yaml

The generic schema may be easily extended into a more specific schema that adds new requirements or optional attributes, requires input files, and so forth. You can find more detail about how to extend and use these schemas in the how-to guide for PEP validation.

Project config file specification

The project config file is the source of project-level information. It is the only required file and must be in yaml format. The config file includes five recognized project attributes, most being optional:

pep_version - REQUIRED
sample_table - RECOMMENDED
subsample_table - OPTIONAL
sample_modifiers - OPTIONAL
project_modifiers - OPTIONAL

These attributes may appear in any order.

Example

pep_version: 2.1.0
sample_table: "path/to/sample_table.csv"
subsample_table: ["path/to/subsample_table.csv", "path/to/subsample_table2.csv"]
sample_modifiers:
  append:
    attribute1: value
    attr2: val2 
  duplicate:
    oldattr: newattr
  imply:
    - if:
        genome: ["hg18", "hg19", "hg38"]
      then:
        organism: "human"
  derive:
    attributes: [read1, read2, other_attr]
    sources:
      key1: "path/to/derived/value/{attribute1}"
      key2: "path/to/derived/value/{attr2}"
project_modifiers:
  amend:
    variant1:
      sample_table: "path/to/alternative_table.csv"
  import:
    - external_pep.yaml
    - http://url.com/pep.yaml

Project attribute: `pep_version`

The only required project attribute, which documents the version of the PEP specification this PEP complies with. For PEP version 2.1.0, this must be the string "2.1.0".
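
A minimal sketch of this requirement in Python, assuming the config file has already been parsed into a dictionary. The function name check_pep_version is hypothetical and is not part of any PEP tooling; formal validation is done against the schema described earlier.

```python
def check_pep_version(config, expected="2.1.0"):
    """Return the declared version, or raise if it is missing or wrong.

    `config` is assumed to be the parsed project config (a dict).
    """
    if "pep_version" not in config:
        raise ValueError("pep_version is required")
    version = config["pep_version"]
    # The value must be a string. An unquoted YAML value such as 2.1
    # would be parsed as a float, so the type is checked explicitly.
    if not isinstance(version, str):
        raise TypeError("pep_version must be a string, got " + type(version).__name__)
    if version != expected:
        raise ValueError(f"expected pep_version {expected!r}, got {version!r}")
    return version
```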

Project attribute: `sample_table`

The sample_table is a path (string) to the sample csv file. It can be absolute or relative; relative paths are assumed relative to the location of the project_config.yaml file. The target file is expected to comply with the PEP specification for the sample table, described later.
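
The path rule can be illustrated with a short Python sketch. It assumes POSIX-style paths and a hypothetical helper name, resolve_table_path; it is not the implementation used by any PEP parser.

```python
from pathlib import PurePosixPath


def resolve_table_path(config_path, table_path):
    """Resolve a sample_table value against the config file's directory.

    Absolute paths are returned unchanged; relative paths are joined to
    the directory that contains the project config file.
    """
    table = PurePosixPath(table_path)
    if table.is_absolute():
        return str(table)
    return str(PurePosixPath(config_path).parent / table)
```

For instance, with a config at /proj/project_config.yaml, the value data/samples.csv would resolve to /proj/data/samples.csv, wherever the parsing tool is run from.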

Project attribute: `subsample_table`

The subsample_table is a path (string) to the subsample csv file or, when the subsamples are spread across multiple annotation sheets, a collection of paths (array of strings). As with the sample_table attribute, relative paths are assumed relative to the location of the project_config.yaml file. Each target file is expected to comply with the PEP specification for the subsample table.

Project attribute: `sample_modifiers`

Sample modifiers are project settings that modify samples.

Sample modifiers allow you to modify sample attributes from within the project configuration file. You can use them to add new attributes to samples in a variety of ways, including attributes whose values vary depending on the values of existing attributes, or whose values are composed of existing attribute values. This is a key feature of PEP that allows you to make sample tables more portable. There are 5 subsections corresponding to 5 types of sample modifier: remove, append, duplicate, imply, and derive. Samples are modified in that order. Within each modifier, samples are modified in the order in which the commands are listed.

Sample modifier: remove

*Remove* eliminates attributes from all samples.

The remove modifier eliminates one or more sample attributes.

Example:

sample_modifiers:
  remove: 
    - read_type
    - organism

This example eliminates the read_type and organism attributes from each sample. This modifier is useful when you need to override an attribute with another on the fly, without editing the annotation sheet by hand.

Sample modifier: append

*Append* adds a constant attribute to all samples.

The append modifier adds additional sample attributes with a constant value across all samples.

Example:

sample_modifiers:
  append:
    read_type: SINGLE

This example adds a read_type attribute to each sample, with the value SINGLE for all samples. This modifier is useful on its own to add constant attributes, and can also be combined with derive and/or imply.

Sample modifier: duplicate

*Duplicate* copies an attribute to a new name.

The duplicate modifier copies an existing sample attribute to a new attribute with a different name. This can be useful if you need to tweak a PEP to work under a different tool that specifies a different schema for the same data.

Example:

sample_modifiers:
  duplicate:
    old_attribute_name: new_attribute_name

This example would copy the value of old_attribute_name to a new attribute called new_attribute_name.
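
The three simpler modifiers described so far (remove, append, duplicate) can be sketched together in Python, applied in the documented order. This is an illustration on plain dictionaries, not a parser implementation. The function name is hypothetical. The choice that append leaves an existing value untouched is an assumption, suggested by the stated purpose of remove rather than specified in this section.

```python
def apply_simple_modifiers(sample, modifiers):
    """Apply remove, append, and duplicate to one sample, in that order.

    `sample` is a dict of attribute -> value; `modifiers` mirrors the
    sample_modifiers section of the config. The input is not mutated.
    """
    result = dict(sample)
    # remove: drop the listed attributes if present
    for attr in modifiers.get("remove", []):
        result.pop(attr, None)
    # append: add constant attributes (ASSUMPTION: existing values are kept)
    for attr, value in modifiers.get("append", {}).items():
        result.setdefault(attr, value)
    # duplicate: copy an existing attribute under a new name
    for old, new in modifiers.get("duplicate", {}).items():
        if old in result:
            result[new] = result[old]
    return result
```

Under that assumption, combining remove with append is what replaces a value: removing read_type and then appending read_type: SINGLE yields SINGLE for every sample, whatever the table said.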

Sample modifier: imply

*Imply* adds attributes that depend on other attribute values.

The imply modifier adds sample attributes with values that depend on the value of an existing attribute. Under the imply keyword is a list of items. Each item has an if section and a then section. The if section defines one or more attributes, each with one or more possible values. If every attribute listed has one of the values given for that attribute, then the sample passes the conditional and the implied attributes are added. One or more attributes to imply are listed under the then section as key: value pairs.

Example:

sample_modifiers:
  imply:
    - if:
        organism: "human"
      then:
        genome: "hg38"
        macs_genome_size: "hs"

This example will take any sample with organism attribute set to the string "human" and add attributes of genome (with value "hg38") and macs_genome_size (with value "hs"). This example shows only 1 implication, but you can include as many as you like.
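
The matching rule can be sketched in Python as follows. This is a minimal illustration on dictionaries with a hypothetical function name. It assumes that implied values overwrite existing ones and that rules are evaluated in the order listed, so an earlier implication may satisfy a later condition.

```python
def apply_imply(sample, implications):
    """Apply a list of {"if": ..., "then": ...} rules to one sample."""
    result = dict(sample)
    for rule in implications:
        matched = True
        for attr, allowed in rule["if"].items():
            # A condition may be a single value or a list of accepted values.
            if not isinstance(allowed, list):
                allowed = [allowed]
            if result.get(attr) not in allowed:
                matched = False
                break
        if matched:
            # ASSUMPTION: implied attributes overwrite existing values.
            result.update(rule["then"])
    return result
```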

Implied attributes can be useful for pipeline arguments. For instance, it may be that one sample attribute implies several more. Rather than encoding these each as separate columns in the annotation sheet for a particular pipeline, you may simply indicate in the project_config.yaml that samples of a certain type should automatically inherit additional attributes. For more details, see how to eliminate project-level attributes from a sample table.

Sample modifier: derive

*Derive* builds new attributes from existing values.

The derive sample modifier replaces existing sample attribute values with new values built from other existing sample attribute values. It contains two sections: attributes is a list of existing attributes that should be derived, and sources is a mapping of key-value pairs that defines the templates used to derive the new attribute values. The sources templates are available to all entries under attributes.

Example:

sample_modifiers:
  derive:
    attributes: [read1, read2, data_1]
    sources:
      key1: "/path/to/{sample_name}_{sample_type}.bam"
      key2: "/from/collaborator/weirdNamingScheme_{ext_id}.fastq"
      key3: "${HOME}/{test_id}.fastq"

In this example, the samples should already have attributes named read1, read2, and data_1, which are flagged as attributes to derive. These attribute values should originally be set to one of the keys in the sources section: key1, key2, or key3. The derive modifier will replace any such attribute value set to key1 with the specified string ("/path/to/{sample_name}_{sample_type}.bam"), but with variables like {sample_name} populated with the values of other sample attributes. The variables in the file paths are formatted as {variable}, and are populated by sample attributes (columns in the sample annotation sheet). For example, your files may be stored in /path/to/{sample_name}.fastq, where {sample_name} will be populated individually for each sample in your PEP. You can also use shell environment variables (like ${HOME}) or wildcards (like *).
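
A rough Python sketch of the substitution, assuming samples are plain dictionaries; the function name apply_derive is hypothetical. Environment variables are expanded before the {variable} placeholders are filled, and wildcard expansion is left out of the sketch.

```python
import os


def apply_derive(sample, attributes, sources):
    """Replace source keys in the flagged attributes with filled-in templates."""
    result = dict(sample)
    for attr in attributes:
        key = sample.get(attr)
        # Only values that name a key in `sources` are derived;
        # anything else (for example a literal path) is left as it is.
        if isinstance(key, str) and key in sources:
            # Expand ${VAR} first; an unset variable is left in place and
            # would then fail in format() as an unknown placeholder.
            template = os.path.expandvars(sources[key])
            # A placeholder with no matching sample attribute raises KeyError.
            result[attr] = template.format(**sample)
    return result
```

With the sources above, a sample whose sample_name is frog_1, sample_type is rrbs, and read1 is key1 would end up with read1 set to /path/to/frog_1_rrbs.bam.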

Using derive is a powerful and flexible way to point to data files on disk. This enables you to point to more than one input file for each sample. For more details and a complete example, see how to eliminate paths from the sample table.

Project attribute: `project_modifiers`

Project modifiers allow you to modify project-level attributes from within the project configuration file. There are 2 subsections corresponding to 2 types of project modifier: import and amend. Imports run first, followed by amendments.

Project modifier: import

*Imports* include external PEP config files.

The import project modifier allows the config file to import other external PEP config files. The values in the imported files will be overridden by the corresponding entries in the current config file. Imports are recursive, so an imported file that imports another file is allowed; the imports are resolved in cascading order with the most distant imports happening first, so the closest configuration options override the more distant ones.

Example:

project_modifiers:
  import:
    - path/to/parent_project_config.yaml

Imports can be used to record and manage complex relationships among analysis components. In a sense, imports are the opposite of amendments, because they allow combining multiple PEP files into one. When used in combination with amendments, they make it possible to orchestrate complex analyses. For more information, see how to integrate imports and amendments.
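
The cascading resolution order can be sketched in Python. To stay self-contained, the sketch replaces file reading with a hypothetical `configs` dictionary that maps file names to already-parsed configs. It makes three assumptions that the text above does not settle: the merge is shallow (top-level keys only), later entries in an import list override earlier ones, and circular imports are rejected.

```python
def resolve_imports(name, configs, _seen=()):
    """Return config `name` with all of its imports merged in.

    More distant imports are merged first, so values in closer files,
    and finally in the current file, take precedence.
    """
    if name in _seen:
        # Guard added for the sketch; not described in the specification.
        raise ValueError("circular import: " + " -> ".join(_seen + (name,)))
    config = configs[name]
    merged = {}
    for parent in config.get("project_modifiers", {}).get("import", []):
        merged.update(resolve_imports(parent, configs, _seen + (name,)))
    # The current file overrides everything it imported (shallow merge).
    merged.update(config)
    return merged
```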

Project modifier: amend

*Amendments* specify project variations within one file.

The amend project modifier specifies multiple variations of a project within one file. When a PEP is parsed, you may select one or more included amendments, which will amend the values in the processed PEP. Unlike all other sample or project modifiers, amendments are optional and must be activated individually when the PEP is loaded.

For example:

sample_table: annotation.csv
project_modifiers:
  amend:
    my_project2:
      sample_table: annotation2.csv
    my_project3:
      sample_table: annotation3.csv
...

If you load this configuration file, by default it sets sample_table to annotation.csv. If you don't activate any amendments, they are ignored. But if you choose, you may activate one of the two amendments, which are called my_project2 and my_project3. If you activate my_project2, by passing amendments=my_project2 when parsing the PEP, the resulting object will use the annotation2.csv sample_table instead of the default annotation.csv. All other project settings will be the same as if no amendment was activated because there are no other values specified in the my_project2 amendment.
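
The behaviour described above can be sketched in Python on a parsed config dictionary. The function name is hypothetical. The sketch overrides top-level keys only, and it assumes that when several amendments are activated, later ones take precedence over earlier ones.

```python
def activate_amendments(config, amendments=()):
    """Return a copy of `config` with the named amendments applied in order."""
    result = dict(config)
    available = config.get("project_modifiers", {}).get("amend", {})
    for name in amendments:
        if name not in available:
            raise KeyError("unknown amendment: " + name)
        # Each amendment overrides only the keys it lists.
        result.update(available[name])
    return result
```

With the config above, activating nothing leaves sample_table as annotation.csv, while activating my_project2 gives annotation2.csv and changes nothing else.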

Amendments are useful to define multiple similar projects within a single project config file. Under the amend key, you specify names of amendments, and then underneath these you specify any project config variables that you want to override for that particular amendment. It is also possible to activate more than one amendment in priority order, which allows you to combine different project features on the fly. For more details, see how to mix and match amendments.

Sample table specification

The sample_table is a .csv file containing information about all samples (or pieces of data) in a project. A sample table may contain any number of columns with any column names. Each column corresponds to an attribute of a sample. For this reason, we sometimes use the words column and attribute interchangeably.

Sample table index specification for sample identification

Sample tables must include an identifier attribute, or index, whose values identify each sample. Each value should be a string without whitespace. By default, PEP uses the sample_name column as the index for the sample table, but this can be changed in the project configuration. The sample table index selection priority order is:

Value specified in Project constructor
Value specified in Config with sample_table_index attribute
Default value (sample_name)
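
The priority order above amounts to a simple fallback chain, sketched here in Python. The function and parameter names are hypothetical; only the sample_table_index attribute and the sample_name default come from this document.

```python
def pick_sample_table_index(constructor_value=None, config=None):
    """Choose the sample table index column following the priority order."""
    # 1. Value passed when the project object is created
    if constructor_value is not None:
        return constructor_value
    # 2. Value set in the config with the sample_table_index attribute
    if config and config.get("sample_table_index") is not None:
        return config["sample_table_index"]
    # 3. Default
    return "sample_name"
```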

Typically, one row corresponds to one sample, so the sample_name attribute would be unique in the table; however, PEP v2.1.0 allows multiple rows per sample as a way to specify multi-value attributes. Here are examples of both approaches. First, here is a table with one row per sample:

"sample_name","protocol","organism","flowcell","lane","data_source"
"albt_0h","RRBS","albatross","BSFX0190","1","bsf_sample"
"albt_1h","RRBS","albatross","BSFX0190","1","bsf_sample"
"albt_2h","RRBS","albatross","BSFX0190","1","bsf_sample"
"albt_3h","RRBS","albatross","BSFX0190","1","bsf_sample"
"frog_0h","RRBS","frog","","","frog_data"
"frog_1h","RRBS","frog","","","frog_data"
"frog_2h","RRBS","frog","","","frog_data"
"frog_3h","RRBS","frog","","","frog_data"

In a table with duplicate values in the index column, the rows with the same identifier will be merged into a single sample, with potentially many values for other attributes. Here's an example where the sample_name column has a duplicated albt_1h value.

"sample_name","organism","flowcell","lane"
"albt_0h","albatross","BSFX0190","1"
"albt_1h","albatross","BSFX0190","1"
"albt_1h","albatross","BSFX0190","2"
"albt_2h","albatross","BSFX0190","1"

This table has 4 rows, but the processed PEP has only 3 samples. The albt_1h sample has the following attributes:

organism: `albatross`
flowcell: `BSFX0190`
lane: [`1`, `2`]
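
One way to picture the merge is the following Python sketch, which uses only the standard csv module. The function name is hypothetical. The sketch assumes that repeated identical values collapse to a single value, which matches the result shown above but is not claimed to be how any particular parser implements it.

```python
import csv
import io


def merge_rows(csv_text, index="sample_name"):
    """Merge rows that share an index value into one sample each."""
    collected = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        attrs = collected.setdefault(row[index], {})
        for attr, value in row.items():
            if attr == index:
                continue
            values = attrs.setdefault(attr, [])
            if value not in values:  # keep each distinct value once
                values.append(value)
    # Attributes with one distinct value stay scalar; others become lists.
    return {
        name: {a: v[0] if len(v) == 1 else v for a, v in attrs.items()}
        for name, attrs in collected.items()
    }
```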

A sample table with no attributes satisfies the generic PEP requirement, but it isn't really useful. Therefore, tools that use PEPs should make use of the PEP validation framework to specify further requirements. For more details, see the how-to guide for PEP validation.

Subsample table specification

For users who prefer to keep to one-row-per-sample, PEP can accommodate multi-value attributes with a subsample_table, a second .csv file. This approach keeps the multi-layered data structure out of the sample table, keeping it cleaner and simpler at the cost of an additional csv file. In the subsample table, multiple values for an attribute are specified as multiple rows with the same sample name. The subsample table contains an index column that maps to the index column in the sample table. This may be configured with the subsample_table_index value in the project configuration.

One common use case for subsample tables is for when samples have multiple input files of the same type. For example, in a sequencing experiment, it's common to split samples across multiple sequencing lanes, which each yield a separate file. Subsample tables are one way to associate many files to a single sample attribute.

Here's a simple example. If you define the sample_table like this:

sample_name,library
frog_1,anySampleType
frog_2,anySampleType

Then point subsample_table to the following, which maps sample_name to a new column called file:

sample_name,file
frog_1,data/frog1a_data.txt
frog_1,data/frog1b_data.txt
frog_1,data/frog1c_data.txt
frog_2,data/frog2a_data.txt
frog_2,data/frog2b_data.txt

This sets up a simple relational database that maps multiple files to each sample. You can also combine a subsample table with derived attributes; attributes will first be derived and then merged, leading to a very flexible way to point to many files of a given type for a single sample.
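
The relationship between the two tables can be sketched in Python with the standard csv module. The function name is hypothetical. The sketch assumes that subsample attributes always become lists and that subsample rows naming an unknown sample are ignored; neither point is specified here.

```python
import csv
import io


def attach_subsamples(sample_csv, subsample_csv, index="sample_name"):
    """Attach multi-value attributes from a subsample table to samples."""
    samples = {
        row[index]: dict(row)
        for row in csv.DictReader(io.StringIO(sample_csv))
    }
    # Gather every subsample value per sample and attribute.
    collected = {}
    for row in csv.DictReader(io.StringIO(subsample_csv)):
        attrs = collected.setdefault(row[index], {})
        for attr, value in row.items():
            if attr != index:
                attrs.setdefault(attr, []).append(value)
    # Merge the gathered lists into the matching samples.
    for name, attrs in collected.items():
        if name in samples:  # ASSUMPTION: unknown samples are skipped
            samples[name].update(attrs)
    return samples
```

Applied to the two tables above, frog_1 would carry a file attribute listing three paths and frog_2 one listing two, while library stays a single value for both.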

Subsample table index

By default, PEP uses the subsample_name and sample_name columns as the indexes for the subsample table. However, it is possible to use custom columns as the subsample table index, which can be specified with the subsample_table_index attribute or on the fly at the project creation stage.

This is the subsample table index selection priority order:

Value specified in Project constructor
Value specified in Config
Default value (subsample_name and sample_name)
