# How to write a PEP schema
If you are a tool developer, we recommend you write a PEP schema that describes what sample and project attributes are required for your tool to work. PEP schemas use the JSON Schema vocabulary, plus some additional features. This guide will walk you through everything you need to know to write your own schema. It assumes you already have a basic familiarity with JSON Schema.
## Importing the base PEP schema
One of the features added by eido is the `imports` attribute, which allows you to extend existing schemas. We recommend your new PEP schema start by importing the base PEP schema. This ensures that the putative PEP at least follows the basic PEP specification, which you can then build on with your tool-specific requirements. Here's how to start by importing the generic base PEP schema:
```yaml
description: An example schema for a pipeline.
imports:
  - http://schema.databio.org/pep/2.0.0.yaml
```
You can also use `imports` to build other schemas that subclass your own schemas.
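For instance, a downstream schema can import your schema exactly the way your schema imports the base PEP schema. This sketch assumes your schema has been published at a hypothetical URL; the narrowed `genome` attribute is just an illustration:

```yaml
description: A stricter schema that subclasses a pipeline schema.
imports:
  # Hypothetical URL where your own schema is hosted
  - http://example.com/pep/my_pipeline_schema.yaml
properties:
  samples:
    type: array
    items:
      type: object
      properties:
        genome:
          type: string
          # Tighten the imported schema's genome attribute to a single value
          enum: ["hg38"]
```

A PEP validated against this schema must satisfy both the imported schema and the additional constraints added here.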
## Project and sample sections
Like the PEP itself, the schema is divided into two sections: one for the project config, and one for the samples. So, the base PEP schema defines an object with two components: a `config` object and a `samples` array:
```yaml
description: An example schema for a pipeline.
imports:
  - http://schema.databio.org/pep/2.0.0.yaml
properties:
  config:
    type: object
  samples:
    type: array
required:
  - samples
  - config
```
## Required sample attributes
Let's say you're writing a PEP-compatible tool that requires 3 arguments, `read1`, `read2`, and `genome`, and also offers an optional argument, `read_length`. Validating against the generic PEP specification will not confirm that these required attributes are present, so you want to write an extended schema. Starting from the base above, we're not changing the `config` section, so we can drop it, and we add new parameters for the required sample attributes like this:
```yaml
description: An example schema for a pipeline.
imports:
  - http://schema.databio.org/pep/2.0.0.yaml
properties:
  samples:
    type: array
    items:
      type: object
      properties:
        read1:
          type: string
          description: "Fastq file for read 1"
        read2:
          type: string
          description: "Fastq file for read 2"
        genome:
          type: string
          description: "Refgenie genome registry identifier"
        read_length:
          type: integer
          description: "Length of the sequencing reads"
      required:
        - read1
        - read2
        - genome
required:
  - samples
```
This document defines the required and optional sample attributes for this pipeline. That's all you need to do; your users can then validate an existing PEP to see if it meets the requirements of your tool.
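For example, a user could check a PEP against this schema with eido's command-line validator (the file names here are placeholders for the user's project config and your published schema):

```shell
# Validate a PEP, identified by its config file, against a tool schema
eido validate project_config.yaml -s my_pipeline_schema.yaml
```

If any sample is missing a required attribute, validation fails with a message identifying the offending attribute.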
## Required input files
In the example above, we listed the `read1` and `read2` attributes as required. This enforces that these attributes must be defined on the samples, but for this example, that is not enough -- they must also point to files that exist. Checking for files is outside the scope of JSON Schema, which only validates JSON documents, so eido extends JSON Schema with the ability to specify which attributes should point to files.
Eido provides two ways to do this: `files` and `required_files`. The basic `files` list simply specifies which attributes point to files, without requiring that those files exist. This is useful, for example, for tools that want to calculate the total size of any provided inputs. The `required_files` list specifies attributes that point to files that must exist; otherwise, the PEP doesn't validate. Here's an example specifying an optional and a required input attribute:
```yaml
description: A PEP for ATAC-seq samples for the PEPATAC pipeline.
imports:
  - http://schema.databio.org/pep/2.0.0.yaml
properties:
  samples:
    type: array
    items:
      type: object
      properties:
        sample_name:
          type: string
          description: "Name of the sample"
        organism:
          type: string
          description: "Organism"
        protocol:
          type: string
          description: "Must be an ATAC-seq or DNAse-seq sample"
        genome:
          type: string
          description: "Refgenie genome registry identifier"
        read_type:
          type: string
          description: "Is this single or paired-end data?"
          enum: ["SINGLE", "PAIRED"]
        read1:
          type: string
          description: "Fastq file for read 1"
        read2:
          type: string
          description: "Fastq file for read 2 (for paired-end experiments)"
      required_files:
        - read1
      files:
        - read1
        - read2
```
This could be a valid schema for a pipeline that accepts either single-end or paired-end data: `read1` must point to a file that exists, whereas `read2` isn't required; but if it does point to a file, that file is also considered an input file.
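To make this concrete, here is a hypothetical sample table that would satisfy the schema above, assuming the listed fastq paths exist on disk (all names and paths are invented for illustration):

```csv
sample_name,organism,protocol,genome,read_type,read1,read2
frog_1,frog,ATAC-seq,frog123,PAIRED,data/frog1_R1.fq.gz,data/frog1_R2.fq.gz
frog_2,frog,ATAC-seq,frog123,SINGLE,data/frog2_R1.fq.gz,
```

Note that `frog_2` leaves `read2` empty, which is fine because only `read1` is in `required_files`.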
## Example schemas
If you need more inspiration, it's a good idea to browse existing example schemas for ideas.