Learn sample subannotations in peppy
This vignette shows how and why to use the subsample table functionality of the peppy package.
- For basic information about the PEP concept, visit the project website.
- For a broader theoretical description, see the subsample table documentation section.
Problem/Goal
The examples below demonstrate how and why to use the sample subannotation functionality to provide multiple input files of the same type for a single sample.
Solutions
Example 1: basic sample subannotation table
This example demonstrates how the sample subannotation functionality is used. In this example, two samples (frog_1 and frog_2) have multiple input files that need merging, while one sample (frog_3) does not. Therefore, frog_3 specifies its file in the sample_table.csv file, while the others leave that field blank and instead specify several files in the subsample_table.csv file.
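To make the merge concrete, here is a minimal sketch in plain Python (standard library only) of the kind of grouping a subsample table implies. It is not peppy's implementation: merge_subsamples and the in-memory CSV strings are hypothetical names introduced for illustration, and the rows mirror the data of this example.

```python
import csv
import io
from collections import defaultdict

# Hypothetical in-memory copies of the two tables used in this example.
SAMPLE_CSV = """sample_name,protocol,file
frog_1,anySampleType,multi
frog_2,anySampleType,multi
frog_3,anySampleType,multi
"""

SUBSAMPLE_CSV = """sample_name,subsample_name,file
frog_1,sub_a,data/frog1a_data.txt
frog_1,sub_b,data/frog1b_data.txt
frog_1,sub_c,data/frog1c_data.txt
frog_2,sub_a,data/frog2a_data.txt
frog_2,sub_b,data/frog2b_data.txt
"""


def merge_subsamples(sample_csv, subsample_csv):
    """Return {sample_name: attributes}; subsample columns become lists."""
    samples = {
        row["sample_name"]: dict(row)
        for row in csv.DictReader(io.StringIO(sample_csv))
    }
    # Collect every subsample value per sample and per column, in row order.
    collected = defaultdict(lambda: defaultdict(list))
    for row in csv.DictReader(io.StringIO(subsample_csv)):
        for column, value in row.items():
            if column != "sample_name":
                collected[row["sample_name"]][column].append(value)
    # Overwrite the sample-level value with the collected list.
    for name, columns in collected.items():
        if name in samples:
            samples[name].update(columns)
    return samples


merged = merge_subsamples(SAMPLE_CSV, SUBSAMPLE_CSV)
for name, attrs in merged.items():
    print(name, attrs["file"])
```

Samples without any rows in the subsample table (frog_3 here) keep whatever value the sample table gave them.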
This example is made up of these components:
- Project config file:
examples_dir = "../tests/data/example_peps-cfg2/example_subtable1/"
project_config = examples_dir + "project_config.yaml"
%cat $project_config
pep_version: "2.0.0"
sample_table: sample_table.csv
subsample_table: subsample_table.csv
output_dir: $HOME/example_results
- Sample table:
sample_table = examples_dir + "sample_table.csv"
%cat $sample_table | column -t -s, | cat
sample_name protocol file
frog_1 anySampleType multi
frog_2 anySampleType multi
frog_3 anySampleType multi
- Subsample table:
subsample_table = examples_dir + "subsample_table.csv"
%cat $subsample_table | column -t -s, | cat
column: line too long
sample_name subsample_name file
frog_1 sub_a data/frog1a_data.txt
frog_1 sub_b data/frog1b_data.txt
frog_1 sub_c data/frog1c_data.txt
frog_2 sub_a data/frog2a_data.txt
Let's load the project config, create the Project object, and check whether multiple files are present:
from peppy import Project
p = Project(project_config)
samples = p.sample_table
samples
sample_name (index) | file | protocol | sample_name | subsample_name |
---|---|---|---|---|
frog_1 | [data/frog1a_data.txt, data/frog1b_data.txt, d... | anySampleType | frog_1 | [sub_a, sub_b, sub_c] |
frog_2 | [data/frog2a_data.txt, data/frog2b_data.txt] | anySampleType | frog_2 | [sub_a, sub_b] |
frog_3 | multi | anySampleType | frog_3 | NaN |
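Note that the file column above holds a list for merged samples and a plain string for the unmerged one. Code that consumes these values may want to treat both cases uniformly. The following is a small sketch under that assumption; as_list and files_by_sample are hypothetical names, not part of peppy, and the values are copied from this example's tables.

```python
def as_list(value):
    """Wrap a single value in a list; return a copy of list-like values."""
    if isinstance(value, (list, tuple)):
        return list(value)
    return [value]


# Values as they appear in the "file" column of this example.
files_by_sample = {
    "frog_1": [
        "data/frog1a_data.txt",
        "data/frog1b_data.txt",
        "data/frog1c_data.txt",
    ],
    "frog_2": ["data/frog2a_data.txt", "data/frog2b_data.txt"],
    "frog_3": "multi",
}

# Count the input files per sample without special-casing strings.
file_counts = {name: len(as_list(value)) for name, value in files_by_sample.items()}
print(file_counts)
```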
Example 2: subannotations and derived attributes
This example uses a subsample_table.csv file and a derived attribute to point to files. This is a rather complex example. Notice that we must include the file_id column in the sample_table.csv file and leave it blank; it is then populated for just some of the samples (frog_1 and frog_2) in the subsample_table.csv, but is left empty for the samples that are not merged.
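The interaction between a list-valued file_id and a derived attribute can be sketched as follows. This is an illustration of the idea only, assuming that each template is filled once per subsample value; derive_file and TEMPLATES are hypothetical names, and the template strings are the ones from this example's project config.

```python
# Source templates, keyed by the value found in the sample's "file" column.
TEMPLATES = {
    "local_files": "../data/{identifier}{file_id}_data.txt",
    "local_files_unmerged": "../data/{identifier}_data.txt",
}


def derive_file(sample, templates):
    """Fill the template named by sample["file"] with the sample's attributes."""
    template = templates[sample["file"]]
    file_ids = sample.get("file_id")
    if isinstance(file_ids, list):
        # One path per subsample: substitute each file_id in turn.
        return [template.format(**{**sample, "file_id": fid}) for fid in file_ids]
    # Unmerged sample: a single path, no file_id needed by the template.
    return template.format(**sample)


frog_1 = {"identifier": "frog1", "file": "local_files", "file_id": ["a", "b", "c"]}
frog_3 = {"identifier": "frog3", "file": "local_files_unmerged"}

frog_1_files = derive_file(frog_1, TEMPLATES)
frog_3_file = derive_file(frog_3, TEMPLATES)
print(frog_1_files)
print(frog_3_file)
```

In this sketch, the unmerged sample needs its own template without {file_id}, because it has no value to substitute for that placeholder.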
This example is made up of these components:
- Project config file:
examples_dir = "../tests/data/example_peps-cfg2/example_subtable2/"
project_config = examples_dir + "project_config.yaml"
%cat $project_config
pep_version: "2.0.0"
sample_table: sample_table.csv
subsample_table: subsample_table.csv
output_dir: $HOME/hello_looper_results
pipeline_interfaces: [../pipeline/pipeline_interface.yaml]
sample_modifiers:
  derive:
    attributes: [file]
    sources:
      local_files: "../data/{identifier}{file_id}_data.txt"
      local_files_unmerged: "../data/{identifier}_data.txt"
- Sample table:
sample_table = examples_dir + "sample_table.csv"
%cat $sample_table | column -t -s, | cat
column: line too long
sample_name protocol identifier file
frog_1 anySampleType frog1 local_files
frog_2 anySampleType frog2 local_files
frog_3 anySampleType frog3 local_files_unmerged
- Subsample table:
subsample_table = examples_dir + "subsample_table.csv"
%cat $subsample_table | column -t -s, | cat
column: line too long
sample_name file_id subsample_name
frog_1 a a
frog_1 b b
frog_1 c c
frog_2 a a
Let's load the project config, create the Project object, and check whether multiple files are present:
p = Project(project_config)
samples = p.sample_table
samples
sample_name (index) | file | file_id | identifier | protocol | sample_name | subsample_name |
---|---|---|---|---|---|---|
frog_1 | [../data/frog1a_data.txt, ../data/frog1b_data.... | [a, b, c] | frog1 | anySampleType | frog_1 | [a, b, c] |
frog_2 | [../data/frog2a_data.txt, ../data/frog2b_data.... | [a, b] | frog2 | anySampleType | frog_2 | [a, b] |
frog_3 | ../data/frog3_data.txt | NaN | frog3 | anySampleType | frog_3 | NaN |
frog_4 | ../data/frog4_data.txt | NaN | frog4 | anySampleType | frog_4 | NaN |
Example 3: subannotations and expansion characters
This example gives the same results as Example 2, but in this case uses a wildcard for frog_2 instead of including it in the subsample_table.csv file. Since we can't use a wildcard and a subannotation for the same sample, this necessitates specifying a second data source class (local_files_unmerged) that uses an asterisk (*). The outcome is the same (file columns match).
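How an asterisk in a source template can turn into one or several paths may be easier to see with Python's standard glob module. The sketch below is an assumption about the general mechanism, not peppy's code: expand_source is a hypothetical helper, and the files are created in a temporary directory so the snippet can run anywhere.

```python
import glob
import os
import tempfile


def expand_source(pattern):
    """Return a sorted list for several matches, a string for exactly one."""
    matches = sorted(glob.glob(pattern))
    if len(matches) > 1:
        return matches
    if len(matches) == 1:
        return matches[0]
    return pattern  # nothing matched: leave the pattern untouched


with tempfile.TemporaryDirectory() as data_dir:
    # Create empty stand-ins for the example data files.
    for name in ("frog2a_data.txt", "frog2b_data.txt", "frog3_data.txt"):
        open(os.path.join(data_dir, name), "w").close()

    # Escape the directory part so only the asterisk acts as a wildcard.
    base = glob.escape(data_dir)
    frog2 = expand_source(os.path.join(base, "frog2*_data.txt"))
    frog3 = expand_source(os.path.join(base, "frog3*_data.txt"))

    frog2_names = [os.path.basename(path) for path in frog2]
    frog3_name = os.path.basename(frog3)

print(frog2_names)
print(frog3_name)
```

Because an asterisk also matches the empty string, a pattern such as frog2*_data.txt would pick up a file named frog2_data.txt as well, if one exists in the data directory.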
This example is made up of these components:
- Project config file:
examples_dir = "../tests/data/example_peps-cfg2/example_subtable3/"
# need to cd to the example dir so that the glob works as expected
%cd $examples_dir
project_config = "project_config.yaml"
%cat $project_config
/Users/mstolarczyk/Uczelnia/UVA/code/peppy/tests/data/example_peps-cfg2/example_subtable3
pep_version: "2.0.0"
sample_table: sample_table.csv
subsample_table: subsample_table.csv
output_dir: $HOME/hello_looper_results
pipeline_interfaces: [../pipeline/pipeline_interface.yaml]
sample_modifiers:
  derive:
    attributes: [file]
    sources:
      local_files: "../data/{identifier}{file_id}_data.txt"
      local_files_unmerged: "../data/{identifier}*_data.txt"
- Sample table:
%cat sample_table.csv | column -t -s, | cat
sample_name protocol identifier file file_id
frog_1 anySampleType frog1 local_files
frog_2 anySampleType frog2 local_files_unmerged
frog_3 anySampleType frog3 local_files_unmerged
frog_4 anySampleType frog4 local_files_unmerged
- Subsample table:
%cat subsample_table.csv | column -t -s, | cat
sample_name file_id
frog_1 a
frog_1 b
frog_1 c
Let's load the project config, create the Project object, and check whether multiple files are present:
p = Project(project_config)
samples = p.sample_table
samples
sample_name (index) | file | file_id | identifier | protocol | sample_name | subsample_name |
---|---|---|---|---|---|---|
frog_1 | [../data/frog1a_data.txt, ../data/frog1b_data.... | [a, b, c] | frog1 | anySampleType | frog_1 | [0, 1, 2] |
frog_2 | [../data/frog2_data.txt, ../data/frog2a_data.t... | NaN | frog2 | anySampleType | frog_2 | NaN |
frog_3 | ../data/frog3_data.txt | NaN | frog3 | anySampleType | frog_3 | NaN |
frog_4 | ../data/frog4_data.txt | NaN | frog4 | anySampleType | frog_4 | NaN |