Using Nextflow#
Nextflow is a workflow framework that allows you to easily create and use computationally heavy microbial bioinformatics pipelines. Nextflow is a “Domain Specific Language” (DSL) built upon Groovy. Some terms that will be used throughout this page:
- processes: the tasks to be run, e.g. commands or scripts
- channels: asynchronous queues of data
- workflows/pipelines: a set of processes and channels
- modules: ready-to-use processes
Nextflow allows users to easily join together different processes and scripting languages (Bash, Perl, Ruby, Python, etc.). Processes are isolated from each other and executed independently, but can interact via input and output data channels.
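To make the terms above concrete, here is a toy workflow you could create from the terminal. The script name, process name and command are purely illustrative (they are not part of any CLIMB-provided pipeline); this is just a minimal sketch of how a process and a workflow fit together:

```shell
# Write a minimal toy workflow to hello.nf (illustrative example only)
cat > hello.nf <<'EOF'
// A process wraps the command or script to run
process SAY_HELLO {
    output:
    stdout

    script:
    "echo Hello from Nextflow"
}

// The workflow joins processes together; their outputs flow through channels
workflow {
    SAY_HELLO() | view
}
EOF
ls hello.nf
```

You could then launch it with `nextflow run hello.nf` from the same directory.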
For further information on how to create modules and pipelines please see the nextflow documentation.
This page will guide you through the basics of Nextflow with the following sections:
Why should you use Nextflow?#
- Accessibility: Nextflow makes bioinformatics accessible to everyone. Instead of performing multiple computational steps manually, you can run a single Nextflow command to go from raw sequence reads to multiple outputs.
- Continuity: Checkpoints within Nextflow pipelines allow you to resume your workflow if it stops.
- Reproducibility: Nextflow supports the use of containers.
- Parallelisation: You can easily scale up analyses, running a large number of samples at the same time and speeding up analysis!
How to start?#
You'll find an up-to-date, stable version pre-installed in your JupyterLab environment and, more importantly, pre-configured to take advantage of our scalable Kubernetes infrastructure. You can access Nextflow via the terminal.
For more information on how to create a JupyterLab environment and use the terminal see our JupyterLab environments and Terminal pages.
Once you have your terminal open you can access Nextflow as follows:
jovyan:~$ nextflow -v
[...]
nextflow version 25.04.8.5956
Tip
Your version may be newer than the one above. You should use this pre-installed version rather than downloading another version.
CLIMB Nextflow config defaults#
We have tried to make it as easy as possible to use Nextflow on CLIMB, and to make full use of available resources via our Kubernetes infrastructure. Out of the box, we set a number of configuration defaults.
- Nextflow home is set to /shared/team/nxf_work/$JUPYTERHUB_USER
- WorkDir is set to /shared/team/nxf_work/$JUPYTERHUB_USER/work
- Executor is set to k8s (plus some supporting config)
- /shared/team and /shared/public (read only) are mounted as PVCs to all Nextflow pods
- A K8s service account is pre-mounted (no credentials setup required)
- S3 bucket path-style access is enabled, with s3.climb.ac.uk set as the endpoint
- S3 keys have also been injected from Bryn
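The first two defaults above are derived from the `$JUPYTERHUB_USER` environment variable, which CLIMB sets automatically in your environment. As a sketch (the username below is made up; you never need to set this variable yourself):

```shell
# Illustrative only: on CLIMB, JUPYTERHUB_USER is already set for you
JUPYTERHUB_USER=demouser.climb-big-data-d

# These are the paths the CLIMB config defaults expand to
echo "Nextflow home: /shared/team/nxf_work/$JUPYTERHUB_USER"
echo "Work dir:      /shared/team/nxf_work/$JUPYTERHUB_USER/work"
```

Because both paths live under /shared/team, your Nextflow assets and intermediate files are visible to your whole team.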
How does Nextflow work with CLIMB's infrastructure?#
The CLIMB infrastructure utilises Kubernetes (also known as K8s), an open-source system for automating deployment, scaling, and management of containerised applications.
Nextflow has built-in support for Kubernetes, which allows workflows to be executed in Kubernetes clusters.
Nextflow -> Kubernetes executor -> Kubernetes pod(s) created to run containers for each process in the workflow
TL;DR: don't worry about it. Nextflow and Kubernetes run everything for you!
Each time a JupyterLab environment is launched, a pod will be created. Pods are also used when launching nextflow pipelines.
Once you have launched an environment, you can track which processes are running through kubectl, the Kubernetes command-line tool. It's pre-installed for you and pre-configured with credentials that map to your team. These credentials mean an isolated part of CLIMB's system is used specifically for your team, so you can track all JupyterLab environment and Nextflow pods currently created.
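Because every Nextflow task pod gets a name beginning with `nf-`, you can filter pod listings down to just workflow tasks. The snippet below demonstrates the filtering on a captured sample listing; on CLIMB you would pipe the output of `kubectl get pods --no-headers` into the same `awk` command instead:

```shell
# Sample text standing in for `kubectl get pods --no-headers` output
sample='jupyter-demouser-env   1/1   Running     0   12m
nf-0e32425fc6d3dd42        0/1   Pending     0   4s
nf-8491621dd73c4481        0/1   Completed   0   13s'

# Keep only Nextflow task pods (column 1 starts with "nf-"), print name and STATUS
printf '%s\n' "$sample" | awk '$1 ~ /^nf-/ {print $1, $3}'
```

This prints one line per Nextflow pod with its current status, and skips your JupyterLab pod.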
Warning
When a team's vCPUs are maxed out, no new pods for JupyterLab environments or Nextflow can be launched.
The easiest way to use Nextflow is to use pre-existing pipelines such as those from nf-core.
Using nf-core#
nf-core is a community-curated set of analysis pipelines that use Nextflow.
We'll try running nf-core/rnaseq as an example, to demonstrate some features of how Nextflow is configured to work on CLIMB.
We'll be using the flag -profile test, which points at a small test dataset provided with the pipeline. As a result, we'll only need to specify the --outdir flag for now.
Info
If there is a suggestion to update the version, ignore it!
jovyan:~$ nextflow run nf-core/rnaseq -profile test --outdir nfout
[...]
[- ] process > NFCORE_RNASEQ:RNASEQ:ALIGN_STAR:BAM_SORT_STATS_SAMTOOLS:BAM_STATS_SAMTOOLS:SAMTOOLS_IDXSTATS -
[c5/3af707] process > NFCORE_RNASEQ:RNASEQ:QUANTIFY_STAR_SALMON:SALMON_QUANT (RAP1_UNINDUCED_REP1) [ 50%] 1 of 2
[- ] process > NFCORE_RNASEQ:RNASEQ:QUANTIFY_STAR_SALMON:SALMON_TX2GENE -
[- ] process > NFCORE_RNASEQ:RNASEQ:QUANTIFY_STAR_SALMON:SALMON_TXIMPORT -
[- ] process > NFCORE_RNASEQ:RNASEQ:QUANTIFY_STAR_SALMON:SALMON_SE_GENE -
[- ] process > NFCORE_RNASEQ:RNASEQ:QUANTIFY_STAR_SALMON:SALMON_SE_GENE_LENGTH_SCALED -
[- ] process > NFCORE_RNASEQ:RNASEQ:QUANTIFY_STAR_SALMON:SALMON_SE_GENE_SCALED -
[- ] process > NFCORE_RNASEQ:RNASEQ:QUANTIFY_STAR_SALMON:SALMON_SE_TRANSCRIPT -
[- ] process > NFCORE_RNASEQ:RNASEQ:DESEQ2_QC_STAR_SALMON -
[9f/b3b437] process > NFCORE_RNASEQ:RNASEQ:BAM_MARKDUPLICATES_PICARD:PICARD_MARKDUPLICATES (RAP1_UNINDUCED_REP2) [ 0%] 0 of 2
[- ] process > NFCORE_RNASEQ:RNASEQ:BAM_MARKDUPLICATES_PICARD:SAMTOOLS_INDEX
[...etc...]
To track all the Nextflow pods currently created, open another terminal tab and run the following command:
jovyan:~$ kubectl get pods
NAME READY STATUS RESTARTS AGE
jupyter-demouser-2eclimb-2dbig-2ddata-2dd 1/1 Running 0 12m
nf-0e32425fc6d3dd42c9a3cbb8dd3ccc8c 0/1 Pending 0 4s
nf-6171723bfd88a03f1417e2aead99f180 0/1 Terminating 0 12s
nf-0e69c114a366b717ad115431277c01d7 1/1 Running 0 9s
nf-31f1f1f534f51863f2e19320ca7447e0 0/1 ContainerCreating 0 4s
nf-75dc1d907794e300ff2117a49be85c63 0/1 Pending 0 4s
nf-8491621dd73c44811c49bee448771ae5 0/1 Completed 0 13s
nf-9aad711fc9476c27e510b9402e3089d5 0/1 ContainerCreating 0 3s
nf-12656c698dd83e7dc2a31e3c88818227 0/1 ContainerCreating 0 4s
nf-a57472aaef0073c9e77a0a7e6001a849 0/1 ContainerCreating 0 3s
nf-f8d94c52e9120b7b8ba117d530f77c3a 0/1 Completed
You'll see your JupyterLab environment (jupyter-username-team) and all the pods that are currently running workflow process containers. The STATUS of each of the pods will change as they execute and then disappear.
Once the workflow has finished, run the above command again:
jovyan:~$ kubectl get pods
NAME READY STATUS RESTARTS AGE
jupyter-demouser-2eclimb-2dbig-2ddata-2dd 1/1 Running 0 17m
jovyan:~$
Now you're back to just your JupyterLab environment. You may also see those belonging to others in your team.
Where did my output data go?#
Once the workflow completes, in your first terminal you'll see something like:
-[nf-core/rnaseq] Pipeline completed successfully with skipped sample(s)-
-[nf-core/rnaseq] Please check MultiQC report: 1/5 samples failed strandedness check.-
Completed at: 16-Apr-2023 13:26:51
Duration : 5m 50s
CPU hours : 0.4
Succeeded : 196
We specified nfout as our outdir, and you'll see the directory in the file browser on the left-hand side of the JupyterLab interface. Take a look in nfout/pipeline_info/ inside your file browser. Try double-clicking the various HTML, YAML and CSV files here and you'll see that they open in new tabs for immediate reading.
Info
One thing to note: when opening HTML files such as execution_report_[date].html, JavaScript is disabled in the tab by default. Right-click the file and select Open in New Browser Tab from the context menu to see the full report.
Where are Nextflow assets and temporary/intermediate (workdir) outputs stored?#
By default, the CLIMB Nextflow config sets the Nextflow home to /shared/team/nxf_work/$JUPYTERHUB_USER. You'll see a number of subdirectories at that location, including assets, which will now contain the rnaseq workflow we just used, and work, where the intermediate outputs are located.
jovyan:~$ ls /shared/team/nxf_work/demouser.climb-big-data-d/
assets capsule framework plugins secrets tmp work
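The work directory can grow large over time, since it holds every task's intermediate files. Once a run has finished and you have the outputs you need, it is safe to remove old task directories. The snippet below simulates this on a throwaway directory (`demo_work` is purely illustrative; on CLIMB you would point `du` at your own work path instead):

```shell
# Simulate a Nextflow work directory and check its size before cleaning up.
# "demo_work" stands in for /shared/team/nxf_work/<user>/work (illustrative).
mkdir -p demo_work/ab/123abc
printf 'intermediate data\n' > demo_work/ab/123abc/tmp.txt

# How much space are intermediate outputs using?
du -sh demo_work

# Remove the intermediates once you are sure the run's outputs are saved
rm -r demo_work
```

Nextflow also ships a `nextflow clean` subcommand for removing work files from past runs, which is generally safer than deleting by hand.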
How to run Nextflow locally with Mamba#
When working with limited computing resources, or on tasks that don't require additional cores, launching new pods might not be the most efficient choice. In such cases, Nextflow allows you to use the cores already assigned to your notebook for execution.
To achieve this, Nextflow lets you specify the Mamba profile and set the process executor to local. By doing so, you can optimise resource usage and minimise unnecessary overhead.
To run Nextflow with Mamba for nf-core pipelines, follow these steps:
nextflow run <your_nfcore_pipeline.nf> -profile mamba -process.executor=local
Replace <your_nfcore_pipeline.nf> with the name of the nf-core pipeline you want to execute. The options -profile and -process.executor should be specified to ensure proper configuration.
If Mamba encounters issues with older Pipelines, you can use the -profile conda option. However, note that this may be slower:
nextflow run <your_nfcore_pipeline.nf> -profile conda -process.executor=local
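If you run locally often, these command-line options can equally be placed in a nextflow.config file in your launch directory, which Nextflow reads automatically and which takes precedence over the system-wide defaults. A minimal sketch (standard Nextflow configuration syntax, not CLIMB-specific):

```groovy
// Sketch of a nextflow.config that overrides the CLIMB k8s default,
// so tasks run on the cores already assigned to your notebook.
// Equivalent to passing -process.executor on the command line.
process {
    executor = 'local'
}
```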
What is next ?#
Once you know some Nextflow basics, you can also try using Jupyter notebooks and RStudio.