Project models
peppy models projects and samples as Python objects.
import peppy
my_project = peppy.Project("path/to/project_config.yaml")
my_samples = my_project.samples
Once you have your project and samples in your Python session, the possibilities are endless. For example, one way we use these objects is for post-pipeline processing. After we use looper to run each sample through its pipeline, we can load the project and its sample objects into an analysis session, where we do comparisons across samples.
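For instance, a post-pipeline analysis session might look roughly like the sketch below; the sample attributes referenced (sample_name, protocol) are typical annotation-sheet columns and are used here only for illustration.
import peppy

# Load the project and its samples for cross-sample analysis.
proj = peppy.Project("path/to/project_config.yaml")

# Group samples by protocol so comparisons can be made within and across
# experiment types. The attribute names below come from the sample annotation
# sheet and are illustrative, not required by peppy itself.
by_protocol = {}
for sample in proj.samples:
    by_protocol.setdefault(sample.protocol, []).append(sample.sample_name)

for protocol, names in by_protocol.items():
    print(protocol, "-", len(names), "samples:", ", ".join(names))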
Exploration:
To interact with the various models and become acquainted with their features and behavior, there is a lightweight module that provides small working versions of a couple of the core objects. Specifically, from within the tests directory, the Python code in the tests.interactive module can be copied and pasted into an interpreter. This provides a Project instance called proj and a PipelineInterface instance called pi. Additionally, this provides logging information in great detail, affording visibility into some of what's happening as the models are created and used.
Extending sample objects
By default we use generic models (see Peppy API docs for more) that can be used in many contexts via Python import, or by object serialization and deserialization via YAML.
Since these models provide useful methods to store, update, and read attributes in the objects created from them (most notably the Sample object), a frequent use case is during the run of a pipeline. A pipeline can create a more specialized Sample model, adding or altering properties and methods.
Use case
You have several samples, of different experiment types, each yielding different varieties of data and files. For each sample of a given experiment type that uses a particular pipeline, the set of file path types that are relevant for the initial pipeline processing or for downstream analysis is known. For instance, a peak file with a certain genomic location will likely be relevant for a ChIP-seq sample, while a transcript abundance/quantification file will probably be used when working with an RNA-seq sample. This common situation, in which one or more file types are specific to a pipeline and analysis, both benefits from and is amenable to a bespoke Sample type.
Rather than working with a base Sample
instance and
repeatedly specifying paths to relevant files, those locations can be provided
just once, stored in an instance of the custom Sample
type, and later
used or modified as needed by referencing a named attribute on the object.
This approach can dramatically reduce the number of times that a full filepath
must be precisely typed, improving pipeline readability and accuracy.
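As a purely illustrative sketch, downstream code can then refer to those stored locations by name; here, peaks is a hypothetical attribute set by a custom subtype such as the ATACseqSample example shown under Mechanics below.
# Hypothetical pipeline step: 'sample.peaks' was set once by a custom Sample
# subtype, so the full path never needs to be retyped here.
def count_peaks(sample):
    with open(sample.peaks) as peaks_file:
        return sum(1 for _ in peaks_file)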
Mechanics
looper uses the specification of both an experiment or data type ("library" or "protocol") and a pipeline with which to process that input type to determine which type of Sample object(s) to create for pipeline processing and analysis (i.e., which Sample extension to use). There's a pair of symmetric reasons for this: the relationship between input type and pipeline can be one-to-many in both directions. That is, it's possible for a single pipeline to process more than one input type, and a single input type may be processed by more than one pipeline.
There are a few different Sample
extension scenarios. Most basic is the
one in which an extension, or subtype, is neither defined nor needed--the
pipeline author does not provide one, and users do not request one. Almost
equally effortless on the user side is the case in which a pipeline author
intends for a single subtype to be used with her pipeline. In this situation,
the pipeline author simply implements the subtype within the pipeline module,
and nothing further is required--of the pipeline author or of a user! The
Sample
subtype will be found within the pipeline module, and the inference
will be made that it's intended to be used as the fundamental representation
of a sample within that pipeline.
If a pipeline author extends the base Sample type in the pipeline module, it's likely that the pipeline's proper functionality depends on the use of that subtype. In some cases, though, it may be desirable to use the base Sample type even if the pipeline author has provided a more customized version with the pipeline. To favor the base Sample over the tailored one created by a pipeline author, the user may simply set sample_subtypes to null in an altered version of the pipeline interface, either for all types of input to that pipeline or for just a subset.
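For illustration, such an altered interface might look like the sketch below; the pipeline key and elided entries are placeholders.
# Altered pipeline_interface.yaml (sketch): force use of the base Sample type
pipelines:
  atacseq.py:
    ...
    sample_subtypes: null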
# atacseq.py
import os
import pandas as pd
from peppy import Sample


class ATACseqSample(Sample):
    """
    Class to model ATAC-seq samples based on the generic Sample class.

    :param pandas.Series series: data defining the Sample
    """
    def __init__(self, series):
        if not isinstance(series, pd.Series):
            raise TypeError("Provided object is not a pandas Series.")
        super(ATACseqSample, self).__init__(series)
        self.make_sample_dirs()

    def set_file_paths(self, project=None):
        """Sets the paths of all files for this sample."""
        # Inherit paths from Sample by running Sample's set_file_paths()
        super(ATACseqSample, self).set_file_paths(project)
        self.fastqc = os.path.join(self.paths.sample_root, self.name + ".fastqc.zip")
        self.trimlog = os.path.join(self.paths.sample_root, self.name + ".trimlog.txt")
        self.fastq = os.path.join(self.paths.sample_root, self.name + ".fastq")
        self.trimmed = os.path.join(self.paths.sample_root, self.name + ".trimmed.fastq")
        self.mapped = os.path.join(self.paths.sample_root, self.name + ".bowtie2.bam")
        self.peaks = os.path.join(self.paths.sample_root, self.name + "_peaks.bed")
To leverage the power of a Sample subtype, the relevant model is the PipelineInterface. For each pipeline defined in the pipelines section of pipeline_interface.yaml, there's accommodation for a sample_subtypes subsection to communicate this information. The value for each such key may be either a single string or a collection of key-value pairs. If it's a single string, the value is the name of the class that's to be used as the template for each Sample object created for processing by that pipeline. If instead it's a collection of key-value pairs, the keys should be names of input data types (as in the protocol_mapping), and each value is the name of the class that should be used for each sample object of the corresponding key for that pipeline. This underscores that it's the combination of a pipeline and input type that determines the subtype.
# Content of pipeline_interface.yaml
protocol_mapping:
  ATAC: atacseq.py
pipelines:
  atacseq.py:
    ...
    ...
    sample_subtypes: ATACseqSample
    ...
    ...
  ...
  ...
If a pipeline author provides more than one subtype, the sample_subtypes
section is needed to select from among them once it's time to create
Sample
objects. If multiple options are available, and the
sample_subtypes
section fails to clarify the decision, the base/generic
type will be used. The responsibility for supplying the sample_subtypes
section, as is true for the rest of the pipeline interface, therefore rests
primarily with the pipeline developer. It is possible for an end user to
modify these settings, though.
Since the mechanism for subtype detection is inspection (via the inspect module) of each of the pipeline module's classes and retention of those that satisfy a subclass status check against Sample, it's possible for pipeline authors to implement a class hierarchy with multi-hop inheritance relationships. For example, consider the addition of the following class to the previous example of a pipeline module, atacseq.py:
class DNaseSample(ATACseqSample):
...
In this case there are now two Sample
subtypes available, and more
generally, there will necessarily be multiple subtypes available in any
pipeline module that uses a subtype scheme with multiple, serial inheritance
steps. In such cases, the pipeline interface should include an unambiguous
sample_subtypes
section.
# Content of pipeline_interface.yaml
protocol_mapping:
  ATAC: atacseq.py
  DNase: atacseq.py
pipelines:
  atacseq.py:
    ...
    ...
    sample_subtypes:
      ATAC: ATACseqSample
      DNase: DNaseSample
    ...
    ...
  ...
  ...
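For intuition about the detection mechanism described above, here is a rough, hypothetical sketch of inspect-based subtype discovery; it is not looper's actual implementation, just an illustration of the subclass check.
import inspect
from peppy import Sample

def find_sample_subtypes(pipeline_module):
    """Collect classes in a pipeline module that subclass Sample (sketch only)."""
    subtypes = {}
    for name, obj in inspect.getmembers(pipeline_module, inspect.isclass):
        if issubclass(obj, Sample) and obj is not Sample:
            subtypes[name] = obj
    return subtypes

# Applied to the atacseq.py module above, this would find both ATACseqSample and
# DNaseSample, which is why an explicit sample_subtypes section is needed.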