PEP specification version 2.1.0
Table of contents:
- Introduction and motivation
- How PEP improves sample annotation portability
- Definitions of terms and components of a PEP
- Validating a PEP
- Project config file specification
- Sample table specification
- Subsample table specification
Introduction and motivation
Bioinformatics projects often start from a sample table, a spreadsheet of samples as rows with attributes of those samples in columns. For example, the some attributes may include file paths to raw data, sample annotation like organism or treatment, and other experimental details. Unfortunately, each project is usually done differently -- there is little standardization of these file formats and column names across projects. The downstream processing tools that consume the sample table typically expect specific way of formatting the table, such as requiring certain columns, expecting certain file formats, and so on. These assumptions are often inherent in the tools, but rarely explained. And even where they are explained, they tend to be unique for each tool. There is no standard way to represent metadata that spans projects and tools. This restricts the portability and reusability of annotated datasets and software that processes them.
Portable Encapsulated Projects (PEP for short) seeks to make datasets and related software more portable and reusable. PEP does this by providing metadata standardization, metadata validation, and portability modifiers.
How PEP improves sample annotation portability
PEP provides 3 features to improve portability:
- A standardized metadata structure. PEP standardizes sample metadata formats. This allows tools and pipelines to read data from different sources more easily.
- A validation framework. PEP provides formal validation schemas. This allows us to confirm that a PEP complies with a requirements for an arbitrary tool.
- Project and sample modifiers. PEP provides a powerful framework to programmatically modify sample- and project-level metadata. This allows us to systematize metadata so one input source can span multiple tools.
Definitions of terms and components of a PEP
A PEP can be made from any collection of metadata represented in tabular form. Typically, a PEP represents a data-intensive bioinformatics project with many samples, like individuals or cell lines. The key terms are:
- Project: a collection of metadata that annotates a set of samples.
- Sample: loosely defined; a unit that can be collected into a project, usually with one or more data files.
- PEP specification: the way to organize project and sample metadata in files using a
yaml
+csv
format. - PEP: a project that follows the PEP specification.
The PEP specification divides metadata into components: sample metadata, which can vary by sample, and project metadata, which applies to all samples. These components are stored in separate files. A complete PEP consists of up to 3 files:
- Project config file - RECOMMENDED. a
yaml
file containing project-level metadata - Sample table - RECOMMENDED. a
csv
file of sample metadata, with 1 row per sample - Subsample table - OPTIONAL. A
csv
file of sample with multiple rows for each sample, used to specify sample attributes with multiple values (e.g. used to point to inputs in sequencing experiments when split across multiple files).
This document describes each of these 3 files in detail.
Validating a PEP
PEP uses an extended JSON Schema vocabulary with novel sample metadata features. The formal PEP spec is described as a schema at schema.databio.org/pep/2.1.0.yaml. You can validate a PEP against any PEP schema using the eido Python package like this:
eido validate path/to/your/PEP_config.yaml -s https://schema.databio.org/pep/2.1.0.yaml
The generic schema may be easily extended into a more specific schema that adds new requirements or optional attributes, requires input files, and so forth. You can find more detail about how to extend and use these schemas in the how-to guide for PEP validation.
Project config file specification
The project config file is the source of project-level information. It is the only required file and must be in yaml
format. The config file includes five recognized project attributes, most being optional:
pep_version
- REQUIREDsample_table
- RECOMMENDEDsubsample_table
- OPTIONALsample_modifiers
- OPTIONALproject_modifiers
- OPTIONAL
These attributes may appear in any order.
Example
pep_version: 2.1.0
sample_table: "path/to/sample_table.csv"
subsample_table: ["path/to/subsample_table.csv", "path/to/subsample_table2.csv"]
sample_modifiers:
append:
attribute1: value
attr2: val2
duplicate:
oldattr: newattr
imply:
- if:
genome: ["hg18", "hg19", "hg38"]
then:
organism: "human"
derive:
attributes: [read1, read2, other_attr]
sources:
key1: "path/to/derived/value/{attribute1}"
key2: "path/to/derived/value/{attr2}"
project_modifiers:
amend:
variant1:
sample_table: "path/to/alternative_table.csv"
import:
- external_pep.yaml
- http://url.com/pep.yaml
Project attribute: pep_version
The only required project attribute, which documents the version of the PEP specification this PEP complies with. For PEP version 2.1.0, this must be the string "2.1.0"
.
Project attribute: sample_table
The sample_table
is a path (string) to the sample csv file. It can be absolute or relative; relative paths are assumed relative to the location of the project_config.yaml
file. The target file is expected to comply with the PEP specification for the sample table, described later.
Project attribute: subsample_table
The subsample_table
is a path (string) to the subsample csv file or, in case the subsamples are dispersed across multiple annotation sheets, a collection of paths (array of strings). Like with the sample_table attribute, relative paths are assumed relative to the location of the project_config.yaml
file. The target file is expected to comply with the PEP specification for the subsample table.
Project attribute: sample_modifiers
The sample modifiers allows you to modify sample attributes from within the project configuration file. You can use this to add new attributes to samples in a variety of ways, including attributes whose value varies depending on values of existing attributes, or whose values are composed of existing attribute values. This is a key feature of PEP that allows you to make the sample tables more portable. There are 5 subsections corresponding to 5 types of sample modifier: remove
, append
, duplicate
, imply
, and derive
; and the samples will be modified in that order. Within each modifier, samples will be modified in the order in which the commands are listed.
Sample modifier: remove
The remove
modifier elimiantes one or more sample attributes.
Example:
sample_modifiers:
remove:
- read_type
- organism
This example eliminates read_type
and organism
attributes from each sample. This modifier is useful when one is in need to override an attribute with another on-the-fly. This allows that without editing the annotation sheet by hand.
Sample modifier: append
The append
modifier adds additional sample attributes with a constant value across all samples.
Example:
sample_modifiers:
append:
read_type: SINGLE
This example adds a read_type
attribute to each sample, with the value SINGLE
for all samples. This modifier is useful on its own to add constant attributes, and can also be combined with derive
and/or imply
.
Sample modifier: duplicate
The duplicate
modifier copies an existing sample attribute to a new attribute with a different name. This can be useful if you need to tweak a PEP to work under a different tool that specifies a different schema for the same data.
Example:
sample_modifiers:
duplicate:
old_attribute_name: new_attribute_name
This example would copy the value of old_attribute_name
to a new attribute called new_attribute_name
.
Sample modifier: imply
The imply
modifier adds sample attributes with values that depends on the value of an existing attribute. Under the imply
keyword is a list of items. Each item has an if
section and a then
section. The if
section defines one or more attributes, each with one or more possible values. If all attributes listed have any of the values in the list for that attribute, then the sample passes the conditional and the implied attributes will be added. One or more attributes to imply are listed under the then
section as key: value
pairs.
Example:
sample_modifiers:
imply:
- if:
organism: "human"
then:
genome: "hg38"
macs_genome_size: "hs"
This example will take any sample with organism
attribute set to the string "human" and add attributes of genome
(with value "hg38") and macs_genome_size
(with value "hs"). This example shows only 1 implication, but you can include as many as you like.
Implied attributes can be useful for pipeline arguments. For instance, it may be that one sample attribute implies several more. Rather than encoding these each as separate columns in the annotation sheet for a particular pipeline, you may simply indicate in the project_config.yaml
that samples of a certain type should automatically inherit additional attributes. For more details, see how to eliminate project-level attributes from a sample table.
Sample modifier: derive
The derive
sample modifier converts existing sample attribute values into new values derived from other existing sample attribute values. It contains two sections; in attributes
is a list of existing attributes that should be derived; in sources
is a mapping of key-value pairs that defines the templates used to derive the new attribute values. The sources
templates are available for all entries under attributes
.
Example:
sample_modifiers:
derive:
attributes: [read1, read2, data_1]
sources:
key1: "/path/to/{sample_name}_{sample_type}.bam"
key2: "/from/collaborator/weirdNamingScheme_{ext_id}.fastq"
key3: "${HOME}/{test_id}.fastq"
In this example, the samples should already have attributes named read1
, read2
, and data_1
, which are flagged as attributes to derive. These attribute values should originally be set to one of the keys in the sources
section: key1
, key2
, or key3
. The derive
modifier will replace any samples set as key1
with the specified string ("/path/to/{sample_name}_{sample_type}.bam"
), but with variables like {sample_name}
populated with the values of other sample attributes. The variables in the file paths are formatted as {variable}
, and are populated by sample attributes (columns in the sample annotation sheet). For example, your files may be stored in /path/to/{sample_name}.fastq
, where {sample_name}
will be populated individually for each sample in your PEP. You can also use shell environment variables (like ${HOME}
) or wildcards (like *
).
Using derive
is a powerful and flexible way to point to data files on disk. This enables you to point to more than one input file for each sample. For more details and a complete example, see how to eliminate paths from the sample table.
Project attribute: project_modifiers
The project modifiers allows you to modify project-level attributes from within the project configuration file. There are 2 subsections corresponding to 2 types of project modifier: import
and amend
. Imports run first, followed by amendments.
Project modifier: import
The import
project modifier allows the config file to import other external PEP config files. The values in the imported files will be overridden by the corresponding entries in the current config file. Imports are recursive, so an imported file that imports another file is allowed; the imports are resolved in cascading order with the most distant imports happening first, so the closest configuration options override the more distant ones.
Example:
project_modifiers:
import:
- path/to/parent_project_config.yaml
Imports can be used to record and manage complex analysis relationships among analysis components. In a sense, imports are the opposite of amendments, because they allow combining multiple PEP files into one. When used in combination with amendments, they make it possible to orchestrate very powerful analysis. For more information, see how to integrate imports and amendments.
Project modifier: amend
The amend
project modifier specifies multiple variations of a project within one file. When a PEP is parsed, you may select one or more included amendments, which will amend the values in the processed PEP. Unlike all other sample or project modifiers, amendments are optional and must be activated individually when the PEP is loaded.
For example:
sample_table: annotation.csv
project_modifiers:
amend:
my_project2:
sample_table: annotation2.csv
my_project3:
sample_table: annotation3.csv
...
If you load this configuration file, by default it sets sample_table
to annotation.csv
. If you don't activate any amendments, they are ignored. But if you choose, you may activate one of the two amendments, which are called my_project2
and my_project3
. If you activate my_project2
, by passing amendments=my_project2
when parsing the PEP, the resulting object will use the annotation2.csv
sample_table instead of the default annotation.csv
. All other project settings will be the same as if no amendment was activated because there are no other values specified in the my_project2
amendment.
Amendments are useful to define multiple similar projects within a single project config file. Under the amendments key, you specify names of amendments, and then underneath these you specify any project config variables that you want to override for that particular amendment. It is also possible to activate more than one amendment in priority order, which allows you to combine different project features on-the-fly. For more details, see how to mix and match amendments.
Sample table specification
The sample_table
is a .csv
file containing information about all samples (or pieces of data) in a project. A sample table may contain any number of columns with any column names. Each column corresponds to an attribute of a sample. For this reason, we sometimes use the word column
and attribute
interchangeably.
Sample table index specification for sample identification
Samples tables must include an identifier attribute, or index, which specifies unique strings identifying each sample. This should be a string without whitespace. By default, PEP uses sample_name
column as the index for the sample table, but this can be changed in the project configuration. The sample table index selection priority order is:
- Value specified in Project constructor
- Value specified in Config with
sample_table_index
attribute - Default value (
sample_name
)
Typically, one row corresponds to one sample, so the sample_name attribute would be unique in the table; however, PEP v2.1.0 allows multiple rows per sample as a way to specify multi-value attributes. Here are some examples of both approaches: First, here is a table with one row per sample:
"sample_name","protocol","organism","flowcell","lane", "data_source"
"albt_0h","RRBS","albatross","BSFX0190","1","bsf_sample"
"albt_1h","RRBS","albatross","BSFX0190","1","bsf_sample"
"albt_2h","RRBS","albatross","BSFX0190","1","bsf_sample"
"albt_3h","RRBS","albatross","BSFX0190","1","bsf_sample"
"frog_0h","RRBS","frog","","","frog_data"
"frog_1h","RRBS","frog","","","frog_data"
"frog_2h","RRBS","frog","","","frog_data"
"frog_3h","RRBS","frog","","","frog_data"
In a table with duplicate values in the index column, the rows with the same identifier will be merged into a single sample, with potentially many values for other attributes. Here's an example where the sample_name
column has a duplicated albt_1h
value.
"sample_name","organism","flowcell","lane"
"albt_0h","albatross","BSFX0190","1"
"albt_1h","albatross","BSFX0190","1"
"albt_1h","albatross","BSFX0190","2"
"albt_2h","albatross","BSFX0190","1"
This table has 4 rows, but the processed PEP has only 3 samples. The albt_1h
sample has the following attributes:
organism: `albatross`
flowcell: `BSFX0190`
lane: [`1`, `2`]
A sample table with no attributes satisfies the generic PEP requirement, but it isn't really useful. Therefore, tools that use PEPs should make use of the PEP validation framework to specify further requirements. For more details, see the how-to guide for PEP validation.
Subsample table specification
For users who prefer to keep to one-row-per-sample, PEP can accommodate multi-value attributes with a subsample_table
, a second .csv
file. This approach keeps the multi-layered data structure out of the sample table, keeping it cleaner and simpler at the cost of an additional csv file. In the subsample table, multiple values for an attribute are specified as multiple rows with the same sample name. The subsample table contains an index column that maps to the index column in the sample table. This may be configured with the subsample_table_index
value in the project configuration.
One common use case for subsample tables is for when samples have multiple input files of the same type. For example, in a sequencing experiment, it's common to split samples across multiple sequencing lanes, which each yield a separate file. Subsample tables are one way to associate many files to a single sample attribute.
Here's a simple example. If you define the sample_table
like this:
sample_name,library
frog_1,anySampleType
frog_2,anySampleType
Then point subsample_table
to the following, which maps sample_name
to a new column called file
sample_name,file
frog_1,data/frog1a_data.txt
frog_1,data/frog1b_data.txt
frog_1,data/frog1c_data.txt
frog_2,data/frog2a_data.txt
frog_2,data/frog2b_data.txt
This sets up a simple relational database that maps multiple files to each sample. You can also combine a subsample table with derived attributes; attributes will first be derived and then merged, leading to a very flexible way to point to many files of a given type for single sample.
Subsample table index
By default, PEP uses subsample_name
and sample_name
columns as the indexes for the subsample table. However, it is possible to use a custom column as the sample table index, which can be specified with subsample_table_index
attribute or on-the-fly at the project creation stage.
This is the subsample table index selection priority order:
- Value specified in Project constructor
- Value specified in Config
- Default value (
subsample_name
andsample_name
)