How to write a pipestat schema
Introduction
Pipestat requires a schema, in which all the results that the pipeline can report are specified.
It is written in JSON schema which defines specific data types:
Data types
Each result reported by a pipeline must have a specified data type. The supported basic types include:
- string
- number
- integer
- boolean
- null
Pipestat also extends the json schema vocabulary by adding two additional types, which are common results of a pipeline: image and file. These types require reporting objects with the following attributes:
file:path: path to the reported filetitle: human readable description of the file
image:path: path to the reported image, usually PDFthumbnail: path to the reported thumbnail, usually PNG or JPEGtitle: human readable description of the image
Complex objects
Pipestat also supports reporting more complex objects
Unsupported data types
tuples are currently unsupported for reporting and retrieving.
A simple example
The pipestat output schema is a YAML-formatted file. The top level keys are the unique result identifiers. The associated values are jsonschema types. The type attribute is required. This is an example of a minimal component, specifying only an identifier, and its type:
result_identifier:
type: <type>
Here, result_identifier can be whatever name you want to use to identify this result. Here's a simple schema example that showcases most of the supported types:
title: Example Pipestat Output Schema
description: A pipeline that uses pipestat to report sample level results.
type: object
properties:
pipeline_name: "default_pipeline_name"
samples:
type: array
properties: # result identifiers are properties of the samples object
number_of_things:
type: integer
description: "Number of things"
percentage_of_things:
type: number
description: "Percentage of things"
name_of_something:
type: string
description: "Name of something"
switch_value:
type: boolean
description: "Is the switch on or off"
The top level schema is of type object. It contains properties that define samples. Here, the samples's properties are the results. So in the above example, the results that can be reported are: number_of_things,percentage_of_things,name_of_something, and switch_value.
A more complex example
Here's a more complex schema example that showcases some of the more advanced jsonschema features:
title: An example Pipestat output schema
description: A pipeline that uses pipestat to report sample and project level results.
type: object
properties:
pipeline_name: "default_pipeline_name"
samples:
type: array
items:
type: object
properties:
number_of_things:
type: integer
description: "Number of things"
percentage_of_things:
type: number
description: "Percentage of things"
name_of_something:
type: string
description: "Name of something"
switch_value:
type: boolean
description: "Is the switch on or off"
md5sum:
type: string
description: "MD5SUM of an object"
highlight: true
collection_of_images:
description: "This store collection of values or objects"
type: array
items:
properties:
prop1:
description: "This is an example file"
$ref: "#/$defs/file"
output_file_in_object:
type: object
properties:
prop1:
description: "This is an example file"
$ref: "#/$defs/file"
prop2:
description: "This is an example image"
$ref: "#/$defs/image"
description: "Object output"
output_file_in_object_nested:
type: object
description: First Level
properties:
prop1:
type: object
description: Second Level
properties:
prop2:
type: integer
description: Third Level
output_file:
$ref: "#/$defs/file"
description: "This a path to the output file"
output_image:
$ref: "#/$defs/image"
description: "This a path to the output image"
$defs:
image:
type: object
object_type: image
properties:
path:
type: string
thumbnail_path:
type: string
title:
type: string
required:
- path
- thumbnail_path
- title
file:
type: object
object_type: file
properties:
path:
type: string
title:
type: string
required:
- path
- title
In this example, we define reusable type definitions in image and file.
For more details, see pipestat specification.