Setting up a looper project
Introduction
This guide will walk you through creating a Looper workspace to run a pipeline on a dataset. Looper is ideal for processing data split into independent samples, allowing you to run the same pipeline in parallel on each sample. In our demonstration, we'll work with three sample files, each representing a different country: Canada, Mexico, and Switzerland. Each file contains a list of the states or provinces in that country. In this context, each country is considered a "sample," and our goal is to run a pipeline on each sample. The sample pipeline we’ll use is a shell script that counts the number of lines in its input file. By running this pipeline, we’ll compute the number of states or provinces in each country.
After setting things up, it will be easy to launch all our jobs like this:
```bash
cd looper_csv_example
looper run
```
In a real workspace, you would likely have hundreds or thousands of samples, each with multiple input files. Your pipeline could be an advanced Python or R script that takes hours or days to run. You might be running the jobs on a compute cluster. The principles of setting up and running looper will be the same as for our simple demonstration project.
This User Tutorial does not cover how to actually create a looper-compatible pipeline. If you want to create a new pipeline, that's the goal of the Pipeline Developer Tutorial. Here, we assume you already have a looper-compatible pipeline and you're trying to run it on some data.
Let's get started!
Learning objectives
- How do I get started using looper?
- What components do I need to run looper?
- What is a looper config file, a pipeline interface, and a metadata table?
Components of a looper workspace
A typical Looper workspace includes the following components:

- Sample data: Usually stored as one or more files per sample.
- Sample metadata: A `.csv` file that describes the samples. Looper will run your pipeline once for each row in this table. In a future guide, we'll explore how to leverage more advanced features of PEP to describe your sample metadata.
- Pipeline: The script or command you want to run on each sample file. You'll also need a pipeline interface file that tells looper how to execute the pipeline.
- Looper configuration file: This file points to the other components and includes any additional configuration options you want to pass to looper.
We'll walk through how to create each of these components. If you'd rather skip creating the files yourself, you can download the completed project from the hello world git repository, which contains the files as you'd end up with at the end of this tutorial. But we recommend following these instructions the first time, because they will help you get a feel for the contents of the files.
Step 1: Set up a directory
Usually, a looper workspace corresponds to a directory. So the first step is to create a new folder for this tutorial. We'll run the commands from within the folder:
```bash
mkdir looper_csv_example
cd looper_csv_example
```
In practice, looper can handle any file paths for any of the components, but to keep things simple in this tutorial, we'll keep all the components in this workspace folder, in subfolders for data, metadata, and pipeline.
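For reference, here's the layout we'll end up with by the end of this tutorial (looper will additionally create a `results/` folder when you run it):

```
looper_csv_example/
├── .looper.yaml
├── data/
│   ├── canada.txt
│   ├── mexico.txt
│   └── switzerland.txt
├── metadata/
│   └── sample_table.csv
└── pipeline/
    ├── count_lines.sh
    └── pipeline_interface.yaml
```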
Step 2: Grab the data
First, we'll need some data. Download these 3 files from the looper csv example repository and save them in a subfolder called `data/`:
```bash
mkdir data
wget -P data https://raw.githubusercontent.com/pepkit/hello_looper/tutorial-restructure/looper_csv_example/data/mexico.txt
wget -P data https://raw.githubusercontent.com/pepkit/hello_looper/tutorial-restructure/looper_csv_example/data/canada.txt
wget -P data https://raw.githubusercontent.com/pepkit/hello_looper/tutorial-restructure/looper_csv_example/data/switzerland.txt
```
Take a peek at the contents of the files. Each lists the states or provinces of the country, one per line, like this:
```bash
head -n 5 data/mexico.txt
```

```
Aguascalientes
Baja California
Baja California Sur
Campeche
Chiapas
```
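As a quick sanity check, and a preview of what our pipeline will compute, you can count the lines yourself with `wc`:

```bash
# Preview the per-country line counts our pipeline will reproduce
wc -l data/*.txt
```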
Step 3: Create a metadata table
Looper needs a list of samples in the form of a sample metadata table, which names each sample and includes paths to the data files. Looper will accept a PEP, which we'll discuss more later, or just a simple CSV file, which is where we'll start. Create a `metadata/` subfolder, then a new file called `metadata/sample_table.csv`, and paste this content into it:
```csv
sample_name,area_type,file_path
mexico,state,data/mexico.txt
switzerland,canton,data/switzerland.txt
canada,province,data/canada.txt
```
Each row corresponds to a sample, with a unique identifier under `sample_name`, a pointer to its corresponding file in `file_path`, and any other information you want to include about the sample (in this case, `area_type`). These will be the different values available to pass to your pipeline.
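Since the `file_path` values are relative to the workspace folder, it's worth confirming they resolve before running anything. Here's a minimal sketch, assuming the three-column layout above:

```bash
# Check that every file_path in the sample table exists
# (skips the header row; assumes file_path is the third column)
tail -n +2 metadata/sample_table.csv | cut -d',' -f3 | while read -r f; do
  if [ -f "$f" ]; then echo "OK: $f"; else echo "MISSING: $f"; fi
done
```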
Step 4: Create the pipeline
Our example pipeline is a shell script that counts the lines in an input file. Since our data has one line per province, this script will tell us how many provinces there are in each country. Create a `pipeline/` subfolder and a file `pipeline/count_lines.sh` with this content:
```bash
#!/bin/bash
# Count lines in the input file, strip wc's leading whitespace,
# and keep only the number
linecount=$(wc -l "$1" | sed -E 's/^[[:space:]]+//' | cut -f1 -d' ')
echo "Number of lines: $linecount"
```
All this script does is run the unix `wc` command, parse the output using `sed` and `cut`, and then print the result with `echo`.
In a real workspace, your pipeline is more likely to be a powerful Python script or something else.
The important thing for looper is just that there's a command you can run to execute the pipeline, and you can pass arguments on the command line.
Since looper will need to execute it, make sure your script has execute permission:

```bash
chmod 755 pipeline/count_lines.sh
```
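At this point you can test the pipeline by hand on one sample, which is a good habit before wiring it into looper:

```bash
# Run the pipeline directly on a single input file;
# it should print "Number of lines: " followed by the count
./pipeline/count_lines.sh data/mexico.txt
```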
Step 5: Create a pipeline interface
In order to run a pipeline, looper needs to know how to construct the command, which will be different for each sample.
A pipeline developer does this through the pipeline interface.
Our `count_lines.sh` pipeline just takes a single argument, which is the file path, so the command would be `count_lines.sh path/to/file.txt`.
Here's how to create the appropriate pipeline interface for this pipeline.
Create a file at `pipeline/pipeline_interface.yaml` and paste this content into it:
```yaml
pipeline_name: count_lines
sample_interface:
  command_template: >
    pipeline/count_lines.sh {sample.file_path}
```
This is the pipeline interface file. It specifies the `pipeline_name`, which looper uses to keep track of pipelines in case it is running multiple of them. It also specifies the `command_template`, which uses `{sample.file_path}` to tell looper to use the value of the `file_path` attribute on samples. Later we'll cover more powerful features you can add to the pipeline interface. For now, the important part is just that the `command_template` tells looper how to run the `count_lines.sh` script we just created, passing it a single argument, which it will get from the `file_path` column of the sample metadata table.
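To make that concrete: given our sample table, looper will render the template once per row, producing these three commands:

```bash
pipeline/count_lines.sh data/mexico.txt
pipeline/count_lines.sh data/switzerland.txt
pipeline/count_lines.sh data/canada.txt
```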
Step 6: Create the looper config
Now you have all the components finished for your looper project.
The last step is to create the looper configuration file.
The easiest way to do this is to run `looper init` from within your workspace folder. This will walk you through a wizard that will build the file for you. Here's how to answer the questions:
| Prompt | You should respond |
| --- | --- |
| `Project name: (new_csv)` | (leave blank to accept the default) |
| `Registry path or file path to PEP: (databio/example)` | `metadata/sample_table.csv` |
| `Path to output directory: (results)` | (leave blank to accept the default) |
| `Add each path to a pipeline interface: (pipeline_interface.yaml)` | `pipeline/pipeline_interface.yaml` |
This will create a file at `.looper.yaml` with this content:
```yaml
pep_config: metadata/sample_table.csv
output_dir: results
pipeline_interfaces:
  - pipeline/pipeline_interface.yaml
```
Right now, this file is basically just 3 pointers. The first line points to the metadata table we just created. The second line points to the `output_dir` where looper will store results. Finally, `pipeline_interfaces` is a list that includes the path to the pipeline interface we just created. Future tutorials will cover how to use this to run multiple pipelines on each sample, but most of the time you'll just have a single entry here. You could also create this `.looper.yaml` file manually, but the `init` function makes it so you don't have to remember the config syntax.
Now that the project is configured, you can run it like this:

```bash
looper run
```
This will submit a job for each sample. You've just run your first looper project!
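If you're curious what looper actually did, look inside the output directory. The exact layout may vary between looper versions, but the generated submission scripts and their logs should be under `results/`:

```bash
# Inspect what looper generated; exact subfolder names
# may vary between looper versions
ls results/
```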
Basically, all looper does is load your metadata table and then, for each row, populate the `command_template` for that sample and run the resulting command. That is the gist of it.
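Conceptually (this is just a sketch, not looper's actual implementation), the core loop is something like:

```bash
# Rough shell equivalent of what `looper run` does with our project:
# read each data row from the sample table and run the filled-in template
tail -n +2 metadata/sample_table.csv | while IFS=',' read -r sample_name area_type file_path; do
  pipeline/count_lines.sh "$file_path"
done
```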
After this, there are just more options and tweaks you can use to control your jobs. For example, here are some looper arguments:

- Dry runs. You can use `-d, --dry-run` to create the job submission scripts, but not actually run them. This is really useful for testing that everything is set up correctly before you commit to submitting hundreds of jobs (see the example after this list).
- Limiting the number of jobs. You can use `-l, --limit` to test a few before running all samples. You can also use the `--selector-*` arguments to select certain samples to include or exclude.
- Grouping jobs. You can use `-u, --lump`, `-n, --lumpn`, or `-j, --lumpj` to group jobs. More details on grouping jobs.
- Changing compute settings. You can use `-p, --package`, `-s, --settings`, or `--compute` to change the compute templates. Read more in advanced computing.
- Time delay. You can stagger submissions to avoid overloading a submission engine using `--time-delay`.
- Use rerun to resubmit jobs. To run only jobs that previously failed, try `looper rerun`.
- Tweak the command on the fly. The `--command-extra` arguments allow you to pass extra arguments to every command straight through from looper. See advanced run options.
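For example, a cautious first invocation might combine a dry run with a limit, using the two flags described above:

```bash
# Generate the submission script for just one sample, without running it
looper run --dry-run --limit 1
```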
In the upcoming tutorials, we'll cover these and other features in more detail.
Summary
- To use looper, you'll need to have data, metadata, a pipeline (with pipeline interface), and a looper configuration file.
- Looper is best suited for data that is split into samples, where you're trying to run a pipeline independently on each sample.
- Once you've configured everything correctly, you run your samples with the command `looper run`.