Running project-level pipelines

So far, our example has been what we call a sample-level pipeline. The count_lines pipeline from our basic tutorial runs one job per sample, and we want to do the same thing to each sample.

This is the most common use case for looper.

But sometimes, we need to run a single job on an entire project. Looper calls that a project-level pipeline.

Here, we'll walk through a basic example of running a project-level pipeline using looper.

A project-level pipeline is meant to run on the entire group of your samples, not each sample individually.

First we need to create a project-level pipeline. We would like to take the number of regions from each sample (file) and collect them into a single plot. This how-to guide assumes you've completed the tutorial, using a pep for derived attributes. You can always download the completed tutorial files from the hello looper repository.

Ensure you are in the pep_derived_attrs folder from the previous tutorial or after downloading the folder from the hello looper repository.

cd pep_derived_attrs

Next run the sample level pipeline so that the pipeline log files are available:

looper run

You can confirm the existence of log files by checking the results/submission folder:

ls results/submission/

count_lines_canada.log  count_lines_mexico.log  count_lines_switzerland.log 
count_lines_canada.sub  count_lines_mexico.sub  count_lines_switzerland.sub

If you open one of the .log files, you'll see that each contains the pipeline's results for this sample:

Number of lines: 10

This is because our sample-level pipeline prints the number of lines to the terminal. Looper logs this terminal output to a .log file. While not intended to be used as a results file, we can use the .log file in this case as an example.

First, we will need to add a project_interface to the pipeline_interface contained within pep_derived_attrs/pipeline:

pipeline_interface.yaml

pipeline_name: count_lines
sample_interface:
  command_template: >
    pipeline/count_lines.sh {sample.file_path}
project_interface:
  command_template: >
    python3 {looper.piface_dir}/count_lines_plot.py {looper.output_dir}/submission/

Next, we must create our pipeline. Create a python file in the pipeline folder named count_lines_plot.py:

count_lines_plot.py

import matplotlib.pyplot as plt
import os
import sys

results_dir = sys.argv[1] # Obtain the looper results directory passed via the looper command template

# Extract the previously reported sample-level data from the .log files
countries = []
number_of_regions = []
for filename in os.listdir(results_dir):
    if filename.endswith('.log'):
        file = os.path.join(results_dir, filename)
        with open(file, 'r') as f:
            for line in f:
                if line.startswith('Number of lines:'):
                    region_count = int(line.split(':')[1].strip())
                    number_of_regions.append(region_count)
                    country = filename.split('_')[2].split('.')[0]
                    countries.append(country)

# Create a bar chart of regions per country
plt.figure(figsize=(8, 5))
plt.bar(countries, number_of_regions, color=['blue', 'green', 'purple'])
plt.xlabel('Countries')
plt.ylabel('Number of regions')
plt.title('Number of regions per country')

# Save the image locally
save_location =  os.path.join(os.path.dirname(results_dir), "regions_per_country.png")
plt.savefig(save_location, dpi=150)

Make sure count_lines_plot.py is executable:

chmod 755 pipeline/count_lines_plot.py

If you run the project-level pipeline with the following command, looper runp -c .looper.yaml, the project-level pipeline will run and a regions_per_country.png should appear in your results folder:

Generated image

That is how you can run a project-level pipeline via looper!

You can find even more details about project-level pipelines in advanced run options.