Setting up pipestat
Introduction
In our previous tutorials, we deployed the count_lines.sh pipeline.
The result of that pipeline was the number of provinces in several countries, which was simply printed into the log file.
In a real-life pipeline, we usually don't just want to dig results out of log files.
We would like to view the results in a nice tabular or HTML-based report.
Looper does this through pipestat, another component in PEPkit.
Like the other PEPkit components, pipestat is a standalone tool.
You can read the complete details about pipestat as a standalone tool in the pipestat documentation.
You can use pipestat without looper, and vice versa, but using pipestat alongside looper unlocks a set of helpful tools, such as HTML reports via looper report.
This tutorial will show you how to do that.
Learning objectives
- What is pipestat? Why is it useful?
- What is a pipestat-compatible pipeline?
- How can I configure my looper workspace to use a pipestat-compatible pipeline effectively?
- Where are pipestat results saved? How can I store results of my pipeline somewhere else?
- How can I make my pipeline store results in PEPhub?
- Can looper monitor whether my jobs are still running, failed, or already completed?
A basic pipestat-compatible pipeline
All the options and features in this tutorial require a pipestat-compatible pipeline. What does that mean? The page "Configuring a pipeline to use pipestat" goes into detail about how to make a pipeline pipestat-compatible. Briefly, it means these three criteria are fulfilled:
- The pipeline specifies a pipestat output schema. This just tells pipestat what results a pipeline can report.
- The pipeline uses pipestat to report results.
- The looper pipeline interface specifies the path to that output schema in the output_schema key, like this:
pipeline_name: count_lines
output_schema: pipestat_output_schema.yaml
sample_interface:
  command_template: >
    pipeline/count_lines.sh {sample.file_path} {sample.sample_name} {pipestat.config_file}
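The placeholders in the command template are filled in once per sample. As a rough illustration only, the sketch below mimics that substitution with Python's built-in str.format, which happens to support the same {namespace.attribute} syntax. Looper's own templating may behave differently, and the sample values and the pipestat config path shown here are hypothetical.

```python
from types import SimpleNamespace

# Command template copied from the pipeline interface above.
COMMAND_TEMPLATE = (
    "pipeline/count_lines.sh {sample.file_path} "
    "{sample.sample_name} {pipestat.config_file}"
)


def render_command(template, sample, pipestat):
    """Fill {namespace.attribute} placeholders using str.format attribute lookup."""
    return template.format(sample=sample, pipestat=pipestat)


# Hypothetical values for one sample; real values come from the PEP and from looper.
sample = SimpleNamespace(file_path="data/mexico.txt", sample_name="mexico")
pipestat = SimpleNamespace(config_file="results/pipestat_config.yaml")

command = render_command(COMMAND_TEMPLATE, sample, pipestat)
print(command)
```

The point to take away is the third argument: the pipeline receives the path to a pipestat config file, which is how it learns where to report its results.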
A pipeline that satisfies these criteria is pipestat-compatible, and for these pipelines, looper can give you a nice, web-browsable report of results. It can also help you manage the job status of your runs.
To demonstrate, let's use a modified version of our count_lines pipeline that has been made pipestat-compatible.
Navigate to the pipestat_example directory in the hello_looper repo.
Configure where pipestat results will be stored
One goal of pipestat is to let you configure a pipeline to store results in different places. You can store results in a simple file, in a database, or in PEPhub. We'll start with the simplest option, a results file, which you configure with these lines in the looper config file:
pep_config: metadata/pep_config.yaml
output_dir: results
pipeline_interfaces:
  - pipeline/pipeline_interface.yaml
pipestat:
  results_file_path: results.yaml
This instructs looper to configure pipestat so that the pipeline reports its results into a results.yaml file.
Execute the run with:
looper run
You can now see the results reported in the results.yaml output file.
Reporting results back to a database
Using a results file and a database backend
If you provide database credentials and a results file path, the results file path will take priority and results will only be reported to the local file.
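The precedence rule can be sketched in a few lines. The helper below is hypothetical and models only the file-versus-database rule stated here; it is not pipestat's actual implementation.

```python
def choose_backend(pipestat_section):
    """Return which backend receives results, following the rule stated above.

    Hypothetical helper: it models only the file-vs-database precedence
    described in this tutorial, not pipestat's real logic.
    """
    if pipestat_section.get("results_file_path"):
        return "file"
    if pipestat_section.get("database"):
        return "database"
    raise ValueError("neither results_file_path nor database is configured")


# A pipestat section that (perhaps accidentally) configures both backends.
both = {
    "results_file_path": "results.yaml",
    "database": {"sqlite_url": "sqlite:///yourdatabase.sqlite3"},
}
print(choose_backend(both))
```

In practice, this means you should remove results_file_path from the looper config when you want results to go to a database.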
PostgreSQL
Pipestat also supports PostgreSQL databases as a backend. You will need to set up your own database or be provided the credentials to an existing database.
Using docker to set up a temporary PostgreSQL database
If you are comfortable using docker, you can quickly set up an instance of a PostgreSQL database using the following command:
docker run --rm -it --name looper_tutorial \
  -e POSTGRES_USER=looper_test_user \
  -e POSTGRES_PASSWORD=looper_test_pw \
  -e POSTGRES_DB=looper-test-db \
  -p 127.0.0.1:5432:5432 \
  postgres
Once you have database credentials, add them to the pipestat section of the looper config file:
pep_config: metadata/pep_config.yaml
output_dir: results
pipeline_interfaces:
  - pipeline/pipeline_interface.yaml
pipestat:
  database:
    dialect: postgresql
    driver: psycopg2
    name: looper-test-db
    user: looper_test_user
    password: looper_test_pw
    host: 127.0.0.1
    port: 5432
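The dialect and driver keys resemble the parts of a SQLAlchemy-style database URL. Assuming the fields map onto that URL format (this tutorial does not say how pipestat combines them, so treat it as an assumption), the following sketch assembles such a URL from the values above. This can be handy for testing the same credentials with other tools. The build_db_url helper is hypothetical.

```python
from urllib.parse import quote_plus


def build_db_url(db):
    """Assemble a URL of the form dialect+driver://user:password@host:port/name."""
    return "{dialect}+{driver}://{user}:{password}@{host}:{port}/{name}".format(
        dialect=db["dialect"],
        driver=db["driver"],
        # Percent-encode credentials so special characters do not break the URL.
        user=quote_plus(db["user"]),
        password=quote_plus(db["password"]),
        host=db["host"],
        port=db["port"],
        name=db["name"],
    )


# Values copied from the database section of the looper config above.
database = {
    "dialect": "postgresql",
    "driver": "psycopg2",
    "name": "looper-test-db",
    "user": "looper_test_user",
    "password": "looper_test_pw",
    "host": "127.0.0.1",
    "port": 5432,
}

url = build_db_url(database)
print(url)
```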
SQLite
You can also report results to a SQLite database. You will need to provide a path to the local SQLite database.
pep_config: metadata/pep_config.yaml
output_dir: results
pipeline_interfaces:
  - pipeline/pipeline_interface.yaml
pipestat:
  database:
    sqlite_url: "sqlite:///yourdatabase.sqlite3"
Once the database credentials are added for either PostgreSQL or SQLite backends, execute the run with:
looper run
Using a database browser, you will now be able to view the reported results within the database of your choice.
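If you don't have a database browser at hand, the SQLite file can also be inspected with Python's standard library. The sketch below makes no assumption about which tables pipestat creates: it lists whatever tables exist and returns a few rows from each. The dump_sqlite helper is hypothetical.

```python
import sqlite3
from contextlib import closing


def dump_sqlite(db_path, limit=5):
    """Return {table_name: rows}, with at most `limit` rows per table."""
    contents = {}
    with closing(sqlite3.connect(db_path)) as conn:
        tables = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
        for (table,) in tables:
            # Quote the identifier in case the table name has unusual characters.
            quoted = '"' + table.replace('"', '""') + '"'
            rows = conn.execute(
                f"SELECT * FROM {quoted} LIMIT ?", (limit,)
            ).fetchall()
            contents[table] = rows
    return contents
```

After looper run has finished, calling dump_sqlite("yourdatabase.sqlite3") should show the reported records. Note that sqlite3.connect creates an empty database file if the path does not exist, so an empty result usually means the path is wrong.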
Reporting results back to PEPhub
In the previous tutorial, you configured looper to read sample metadata from PEPhub.
Now, by adding in pipestat integration, we can also report pipeline results back to PEPhub.
In this example, we'll report the results back to the demo PEP we used earlier, databio/pipestat_demo:default.
However, you won't be able to report results to the demo repository because you don't have write permissions.
So if you want to follow along, you'll first need to create your own PEP on PEPhub to hold these results.
Then, you can run this section yourself by replacing databio/pipestat_demo:default with the registry path to a PEP you control.
To configure pipestat to report results to PEPhub instead of to a file, we just change our looper config to point to a pephub_path:
pep_config: metadata/pep_config.yaml
output_dir: results
pipeline_interfaces:
  - pipeline/pipeline_interface.yaml
pipestat:
  pephub_path: "databio/pipestat_demo:default"
  flag_file_dir: results/flags
No other changes are necessary.
You will have to authenticate with PEPhub using phc login, and then looper will pass along the information in the generated pipestat config file.
Pipestat will read the pephub_path from the config file and report results directly to PEPhub using its API!
Generating result reports
Now that you have your first pipestat pipeline configured with looper, there are many other, more powerful things you can add to make this even more useful. For example, now that looper knows the structure of results your pipeline reports, it can automatically generate beautiful, project-wide results summary HTML pages for you.
HTML reports
Looper provides an easy report command that creates an HTML report of all reported results.
You've already configured everything above.
To get the report, run the command:
looper report
This command calls pipestat summarize on the results in your configured results location, which in this case is the results.yaml file.
Here is an example HTML report for the above tutorial examples: count lines report
A more advanced example of an HTML report using looper report can be found here: PEPATAC Gold Summary
Create tables and stats summaries
Having a nice HTML-browsable record of results is great for human browsing, but you may also want the aggregated results in a machine-readable form for downstream analysis.
Looper can also create summaries in a computable format as .tsv and .yaml files.
Run:
looper table
This will produce a .tsv file for aggregated primitive results (integers, strings, etc.), as well as a .yaml file for any aggregated object results:
Looper version: 2.0.0
Command: table
Using looper config (.looper.yaml).
Creating objects summary
'count_lines' pipeline stats summary (n=4): results/count_lines_stats_summary.tsv
'count_lines' pipeline objects summary (n=0): results/count_lines_objs_summary.yaml
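The stats summary is a plain tab-separated file, so it can be loaded for downstream analysis with Python's standard csv module. This minimal sketch assumes the first row holds column headers. It makes no assumption about which columns are present, since those depend on your output schema. The read_stats_summary helper is hypothetical.

```python
import csv


def read_stats_summary(tsv_path):
    """Read a tab-separated summary into a list of dicts keyed by column header."""
    with open(tsv_path, newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))
```

For the run above, you would call read_stats_summary("results/count_lines_stats_summary.tsv"). All values come back as strings, so convert numeric columns yourself before doing arithmetic on them.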
Setting and checking status
Besides reporting results, another feature of pipestat is that it allows users to set pipeline status. If your pipeline uses pipestat to set status flags, then looper can be used to check the status of pipeline runs. To check the status of all samples, use:
looper check
For this example, the 'running' flag doesn't really help because the pipeline runs so fast that it immediately finishes.
But in a pipeline that will take minutes or hours to complete, it can be useful to know how many and which jobs are running.
That's why looper check can be helpful for these long-running pipelines.
Do I have to use pipestat?
No. You can use looper just as we did in the first two tutorials to run any command.
Often, you'll want to use looper to run an existing pipeline that you didn't create.
In that case, you won't have the option of using pipestat, since you're unlikely to go to the effort of adapting someone else's pipeline to use it.
For non-pipestat-compatible pipelines, you can still use looper to run pipelines, but you won't be able to use looper report or looper check to manage their output.
What benefits does pipestat give me?
If you are developing your own pipeline, then you might want to consider using pipestat in your pipeline.
This will allow users to use looper check to check on the status of pipelines.
It will also enable looper report and looper table to create summarized outputs of pipeline results.
Summary
- Pipestat is a standalone tool that can be used with or without looper.
- Pipestat standardizes reporting of pipeline results. It provides a standard specification for how pipeline outputs should be stored, and an implementation to easily write results to that format from within Python or from the command line.
- A pipeline user can configure a pipestat-compatible pipeline to record results in a file, in a database, or in PEPhub.
- Looper works together with pipestat to add powerful features such as checking job status and generating HTML reports.