Large-scale Genomics with Nextflow and AWS Batch

Brendan Bouffler, Phil Ewels, Paolo Di Tommaso

Public clouds provide researchers with unprecedented computing capacity and flexibility at a cost comparable to on-premises infrastructure. This represents a big opportunity for modern scientific applications, but at the same time it poses new challenges for researchers, who must contend with new standards, application and infrastructure stacks, and APIs that evolve at a fast pace. And most researchers, who use computing as a tool to power their work, have a strong desire to avoid becoming sysadmins.

For this reason, it is important that new tools emerge to reduce complexity for scientists performing data analysis. Moreover, the increasing heterogeneity between cloud platforms and legacy clusters makes workflow portability and ease of migration critically important.

This presentation will give a quick overview of how we address these problems with Nextflow and AWS Batch. Nextflow is a tool developed at the Centre for Genomic Regulation (CRG) for the deployment and execution of computational workflows in a portable manner across clouds and clusters.
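As a sketch of this portability, the following minimal Nextflow workflow (the task and file names are illustrative, not from the presentation) declares a container per process, so the same script runs unchanged on a laptop, an HPC cluster, or the cloud:

```nextflow
// main.nf -- a minimal, illustrative Nextflow pipeline (DSL2).
nextflow.enable.dsl = 2

process COUNT_LINES {
    // The task runs inside this container on any supported platform.
    container 'ubuntu:22.04'

    input:
    path reads

    output:
    path 'counts.txt'

    script:
    """
    wc -l ${reads} > counts.txt
    """
}

workflow {
    // params.reads is supplied on the command line, e.g.
    //   nextflow run main.nf --reads 'data/*.fastq'
    Channel.fromPath(params.reads) | COUNT_LINES
}
```

The execution platform is chosen by configuration, not by the workflow script, which is what makes migration between clusters and clouds straightforward.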

Nextflow and AWS Batch share the same vision regarding workflow containerisation. Nextflow's built-in support for AWS Batch allows the writing of highly scalable computational pipelines while hiding most low-level, platform-dependent implementation details. This approach enables researchers to migrate legacy tools to the cloud with ease, taking advantage of flexibility and scalability not possible with typical on-premises clusters.
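To illustrate how the platform details stay out of the workflow script, deploying a containerised pipeline to AWS Batch only requires a configuration file like the sketch below (the queue name, S3 bucket, region, and CLI path are placeholders for a user's own setup):

```groovy
// nextflow.config -- switch execution to AWS Batch without touching main.nf.
process.executor = 'awsbatch'
process.queue    = 'my-batch-queue'       // hypothetical AWS Batch job queue
workDir          = 's3://my-bucket/work'  // hypothetical S3 bucket for intermediate files
aws.region       = 'eu-west-1'
// Path to the AWS CLI on the compute environment's AMI,
// used to stage data to and from S3.
aws.batch.cliPath = '/home/ec2-user/miniconda/bin/aws'
```

With this configuration in place, `nextflow run main.nf` submits each task as an AWS Batch job; removing the file runs the same pipeline locally or on a cluster scheduler.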