Using Bpipe – A tutorial

Preface

This tutorial is part of the "e-Infrastructure for Massively Parallel Sequencing" workshop, organized by SciLifeLab and UPPNEX. It aims to provide a gentle introduction to the design of bioinformatics pipelines using the Bpipe software. For more in-depth information on Bpipe, including a comprehensive list of available commands, see https://code.google.com/p/bpipe/wiki/Overview. In the interest of full disclosure, you'll note that we took the liberty of "borrowing" some examples from the same website. No need to reinvent the wheel!

Also note that this tutorial assumes that you are working on the UPPMAX cluster, so that you can use the included setup script to load all the relevant software (tophat2, cufflinks2, bowtie2, samtools).

Prerequisites

For the Bpipe tutorial you will need an account at UPPMAX, our local HPC center. To apply for an account, go to https://supr.snic.se/ and select “Register New Person”. Note that you need to apply using an institutional email address (university, company, governmental etc.); gmail, hotmail and similar addresses will not be accepted. After your application has been approved, you can log into the system and select “View and Manage Projects”, where you can request membership in projects. Here you should enter “g2015001” and click “Request” on the subsequent page. We will then handle the request manually.

When your account is approved you will receive a link to a temporary password. The link works only once, so make sure to note the password down. If you connect to UPPMAX from abroad, see http://www.uppmax.uu.se/using-the-uppmax-gateway

You log in to the UPPMAX system via SSH2. Guides are available at http://www.uppmax.uu.se/support

Introduction

Bpipe is a purpose-built pipeline engine, or workflow manager, designed specifically with bioinformatics in mind. Two key features make Bpipe particularly attractive to bioinformatics users:

Firstly, it requires minimal code to design even complex workflows (we are lazy, after all). In fact, most of a pipeline consists of simple shell commands that Bpipe strings together following an easy-to-understand graph defined by the user. These processing stages can be thought of as modules and may be included with the pipeline or kept in a system-wide library, allowing you to rapidly build a wide range of workflows from a common set of functional pieces.

Secondly, Bpipe includes many convenience methods and is designed with the concept of implicit input and output in mind. In other words, data will pass through the workflow automatically without requiring specific rules or commands.

The following tutorial will demonstrate some simple use cases for Bpipe in bioinformatics and close with a few suggestions on how to set up more complex pipeline projects.

1. Setting up for the course

Bpipe is a Java application designed to run on most UNIX-like systems, most prominently the major flavors of Linux (Red Hat, Ubuntu) and Mac OS X. Bpipe itself is not complicated to install: you can just drop the folder containing Bpipe anywhere and run it from there.

We have included a version of Bpipe with the course data, located here: /proj/g2015001/tutorials/bpipe/bpipe

Copy the tutorial folder to your home directory, e.g.:

mkdir $HOME/bpipe_tutorial

cp -R /proj/g2015001/tutorials/bpipe/* $HOME/bpipe_tutorial

Load the different bioinformatics packages (the path is relative, so run this from the folder that contains setup/):

source setup/setup_environment.sh

Add the folder containing the Bpipe executable to your PATH:

export PATH=$PATH:$HOME/bpipe_tutorial/bpipe/bpipe-0.9.8.6/bin
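If you run the export line more than once in the same session, the same directory is appended to PATH each time. A minimal sketch of a guard against that, assuming the Bpipe version folder shown above (adjust BPIPE_BIN if your copy differs):

```shell
# Directory holding the bpipe executable (path taken from the export line above).
BPIPE_BIN="$HOME/bpipe_tutorial/bpipe/bpipe-0.9.8.6/bin"

# Append the directory only if PATH does not already contain it.
case ":$PATH:" in
    *":$BPIPE_BIN:"*) ;;                        # already there, nothing to do
    *) PATH="$PATH:$BPIPE_BIN"; export PATH ;;
esac

# Report whether the shell can now find bpipe.
if command -v bpipe >/dev/null 2>&1; then
    echo "bpipe found at: $(command -v bpipe)"
else
    echo "bpipe not found, check that $BPIPE_BIN exists"
fi
```

The setting only lasts for the current session, so it has to be repeated after each new login.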

The other folders include:

- examples/

The examples discussed below

- rnaseq_pipeline/

A real-world example of a (simple) RNA-seq processing pipeline and all the input data required

- rnaseq_pipeline_v2/

The same pipeline, but done a little neater.

So, let’s have a look at how Bpipe works…

3. Understanding the Bpipe language

Bpipe has two main functions: it lets you define workflows and then runs them for you. Workflows are written as a combination of standard shell commands wrapped in a tiny bit of Groovy code. Groovy is a scripting language with similarities to Perl or Python, but it lives in the same realm as the Java programming language and runs on the Java Virtual Machine (JVM), which means it has access to some of the goodies that come with Java. From this point of view, it is useful to note that Bpipe is a so-called domain-specific language (DSL). All this means is that Bpipe is really good for writing code that gets interpreted in a structured way (like a pipeline).

Each stage of a pipeline is defined in the same way:

stage_name = {

exec "shell command to execute your pipeline stage"

}

The stage name defines how this processing step is referred to in the pipeline (see below) and wraps around the shell command that is to be executed.
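As a concrete instance of this template, the sketch below writes a two-stage "Hello World" pipeline to a file. The stage names hello and world and the file name hello.bpipe are illustrative choices, and the file goes into a temporary directory so that nothing in the tutorial folder is overwritten.

```shell
# Work in a scratch directory (illustrative; any writable folder will do).
WORKDIR=$(mktemp -d)
PIPELINE="$WORKDIR/hello.bpipe"

# Two stages, each wrapping one shell command, joined with '+' so that
# 'world' runs after 'hello'. The quoted 'EOF' stops the shell from
# expanding anything inside the pipeline text.
cat > "$PIPELINE" <<'EOF'
hello = {
    exec "echo Hello"
}

world = {
    exec "echo World"
}

run { hello + world }
EOF

# Show what was written.
cat "$PIPELINE"

# To execute it (requires bpipe on your PATH):
#   bpipe run "$PIPELINE"
```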

4. Running on a cluster

A word on how Bpipe translates your workflow to the cluster:

[Slides: Bpipe config; Bpipe basics; Hello World; inputs and outputs; branches]

A real example

A simple RNA-seq pipeline

There is much more to say about Bpipe and how it works. However, after this much theory, let’s try the concepts introduced above to write a ‘real’ bioinformatics pipeline. For this example, we will use an RNA-seq pipeline that takes reads in FASTQ format, aligns them against the genome, generates statistics from that alignment and quantifies the expression of a set of genes in the aligned region.

The data that you will need for the following exercise is located in the folder rnaseq_pipeline.

What's included in the example:

bpipe.config (a configuration file to run Bpipe on a cluster)
dmel_L1.left.fq.gz (RNA-seq reads, left mate)
dmel_L1.right.fq.gz (RNA-seq reads, right mate)
dmel_L3.left.fq.gz
dmel_L3.right.fq.gz
genome (folder with the genome sequence and bowtie index + gene annotations)
rnaseq.bpipe (the actual pipeline)

Basically, we have two RNA-seq samples (Drosophila, larval stages L1 and L3), a genome file and corresponding bowtie2 index (to align reads against), a configuration file to tell Bpipe how to submit jobs to UPPMAX (bpipe.config) and the actual pipeline (rnaseq.bpipe).

The pipeline will do the following things:

  1. Align the reads (L1 and L3) to the bowtie index with Tophat 2
  2. Take the resulting read alignments in BAM format and quantify expression for the gene models
  3. Compute statistics from the read alignment

To describe this workflow, we could write:

run { tophat + cufflinks + samtools_index + samtools_flagstat }

Looks simple enough! But wait, we have two samples (L1 and L3) – how does that work? Do we need to run them one after the other? Of course not, Bpipe is smarter:

run { "%.*.fq.gz" * [tophat + cufflinks + samtools_index + samtools_flagstat] }

The first element in the curly braces tells Bpipe which pattern to use to sort input files into groups (in our case, the paired-end reads belonging to the same sample). The ‘%’ works as a placeholder for the common part of the file names, and the ‘*’ is the wildcard that marks where the file names are allowed to differ (in our case ‘left’ and ‘right’, for the left and right mate). The rest of the pattern defines the suffix (".fq.gz", a gzipped FASTQ file).
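To see which files end up together, the grouping can be imitated in plain shell. This only illustrates the naming logic for the example files and is not how Bpipe implements it; the helper name sample_key is made up for this sketch.

```shell
# Return the part of a file name that '%' stands for in "%.*.fq.gz".
sample_key() {
    name=${1%.fq.gz}            # strip the fixed suffix
    printf '%s\n' "${name%.*}"  # strip the variable part ('left' or 'right')
}

# Reduce the four read files from the example to their distinct sample keys.
for f in dmel_L1.left.fq.gz dmel_L1.right.fq.gz \
         dmel_L3.left.fq.gz dmel_L3.right.fq.gz; do
    sample_key "$f"
done | sort -u
```

Files that share a key (here dmel_L1 and dmel_L3) form one group, so each group holds the left and right mate of one sample.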

When executing Bpipe, this will create 2 workflow streams, one for the L1 sample and one for the L3 sample.

bpipe run rnaseq.bpipe *.fq.gz

Note how we pass all the read files to bpipe by simply specifying "*.fq.gz". If you want to only pass a subset of read files in the folder, you could modify this pattern to reflect that.

This pipeline is of course pretty simplistic. For example, we haven’t even put in a branch even though the alignment statistics and quantification do not depend on each other and could happily run in parallel. Let’s update the workflow:

run { "%.*.fq.gz" * [ tophat + [ samtools_index + samtools_flagstat , cufflinks ] ] }

Now the pipeline will quantify the expression of genes while at the same time computing some statistics on the read alignments.
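The effect of the inner bracketed, comma-separated block is similar to starting two jobs in the background and waiting for both. The shell analogy below uses made-up placeholder functions (stats_branch, quant_branch) that only write marker files; it does not call tophat, samtools or cufflinks.

```shell
OUT=$(mktemp -d)   # scratch directory for the marker files

# Placeholders for the two independent branches (hypothetical names).
stats_branch() { echo "index + flagstat done" > "$OUT/stats.txt"; }
quant_branch() { echo "cufflinks done"        > "$OUT/quant.txt"; }

# Stand-in for the alignment stage, which both branches depend on.
echo "tophat done" > "$OUT/align.txt"

# Launch both branches at once, then block until each has finished.
stats_branch &
quant_branch &
wait

ls "$OUT"
```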

Read the pipeline file and see if everything in there makes sense. Note that the pipeline contains three distinct sections:

- Variables: Location of files, binaries or even parameters to pass to different tools

- Stages: The definition of the individual stages ('how do I run this stage?!')

- Workflow: The logic of the pipeline, i.e. how to connect the stages

Looks a bit messy, but we'll fix this later.

Execute the pipeline and see if everything works as expected.

bpipe run -r rnaseq.bpipe *.fq.gz

Note the flag '-r' - this will create a report of this pipeline run in doc/index.html.

Looking at the outputs

The pipeline has produced several things:

- Output folder for each sample (dmel_L1 and dmel_L3)

- Output folder for each stage within each sample folder (cufflinks, tophat, statistics)

In addition, Bpipe did a few other things:

- Created a log file of the commands that were executed (commandLog.txt)

- Created a report file in doc/index.html

- Created a hidden folder called '.bpipe' that holds all the bits and pieces of the workflow execution. This is where you need to look to debug a pipeline etc.

A slightly improved RNA-seq pipeline

As mentioned above, the pipeline file looks messy with all these different parts in it. A better way to organize your pipelines is to keep the three sections (variables, stages, workflow) separated. This has the added advantage of being able to share stages between different pipelines (i.e. it's more modular this way).

We have prepared a second version of the RNA-seq pipeline example, located in the folder "rnaseq_pipeline_v2".

Note the additional files:

bpipe.config
dmel_L1.left.fq.gz
dmel_L1.right.fq.gz
dmel_L3.left.fq.gz
dmel_L3.right.fq.gz
genome
modules
pipeline.config
rnaseq.bpipe

The pipeline variables are now located in a file called 'pipeline.config', which is loaded by rnaseq.bpipe through the statement:

load 'pipeline.config'

But what about the stages? Well, Bpipe will look for pipeline stages using the environment variable BPIPE_LIB. Any file matching the pattern '*.groovy' in this location will be parsed for pipeline stages, which are then available when executing a pipeline. This comes in handy if you want to build a library of stages for a wide range of different pipelines. Write once, use multiple times.

Set BPIPE_LIB to the module folder:

export BPIPE_LIB=$HOME/bpipe_tutorial/rnaseq_pipeline_v2/modules
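Before re-running, it can help to confirm that the folder actually contains stage files. A small sketch; the function name list_stage_files is made up here, and it relies only on the '*.groovy' naming rule described above.

```shell
# Print every *.groovy file directly inside the given folder.
list_stage_files() {
    for f in "$1"/*.groovy; do
        [ -e "$f" ] && printf '%s\n' "$f"   # skip the unexpanded pattern
    done
    return 0
}

# Use BPIPE_LIB if it is set, otherwise fall back to the tutorial location.
LIB_DIR="${BPIPE_LIB:-$HOME/bpipe_tutorial/rnaseq_pipeline_v2/modules}"
list_stage_files "$LIB_DIR"
```

If nothing is printed, the variable points at the wrong folder or the stage files use a different extension.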

And there you have it - the pipeline workflow (rnaseq.bpipe), the configuration file (pipeline.config) and the pipeline stages ($BPIPE_LIB). Try running the pipeline again.

Some interesting tidbits

Did you know that...?

- You can specify all variables directly from the command line without needing to write a pipeline.config file?

With the '-p' flag you can pass whatever variable you like to the workflow. This is especially useful if you run Bpipe as part of a larger workflow system and need to execute it through e.g. an event manager.

- You can create branch variables that are passed to downstream stages

branch.your_variable_name is a built-in mechanism that lets you pass certain values along a given branch, such as the name of the input file, a tag or whatever else you need to keep track of.

- Outputs can be flagged so that they are not deleted if you run the cleanup command

By adding the 'preserved' flag to a stage, 'bpipe cleanup' will not remove the product of that stage.
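As a sketch of the first tip, the lines below assemble a command that passes one variable with '-p' in name=value form. The parameter name THREADS is hypothetical, and the command is only printed, not executed, so it is safe to try without a working pipeline.

```shell
# Hypothetical parameter; replace it with a variable your pipeline actually uses.
PARAM="THREADS=4"

# Build the command as text and print it (a dry run; nothing is executed).
CMD="bpipe run -p $PARAM rnaseq.bpipe *.fq.gz"
echo "$CMD"
```

A value passed this way plays the same role as a variable that would otherwise be set in pipeline.config.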

Other useful things

This tutorial ends here, but Bpipe has many more interesting features, many of which are discussed here:

https://code.google.com/p/bpipe/wiki/Overview

Bpipe also has an active discussion group:

https://groups.google.com/forum/#!forum/bpipe-discuss

One of the more exciting Bpipe projects can be seen here:

https://bitbucket.org/drambaldi/bpipe_config/overview

And finally, BILS is working on a pipeline code base that is open to everyone for feedback and contribution:

https://projects.bils.se/projects/bpipe/wiki

Topic revision: r11 - 2015-01-20 - EInfraMPS2015Org