The European Bioinformatics Institute (EMBL-EBI), an outstation of the European Molecular Biology Laboratory based at the Wellcome Genome Campus, has recently been running a course on what they call ‘ResOps’. By this they mean, essentially, applying DevOps-style practices and methods to research computing, pushing it in a more infrastructure-as-code, cloud-native direction in the hope of improving reproducibility and scalability.
The course was delivered by Dario Vianello and Erik van den Bergh, from the Technology and Science Integration team at the EBI. All materials from the course, including presentations and code, were available online at https://bit.ly/resops_june. The team also provided course delegates with temporary accounts on their Embassy Cloud, on which we could run the practical exercises.
Dario contrasted two possible approaches to moving a scientific pipeline to the cloud: mimicking on-site infrastructure, or going cloud-native. The former is clearly easier, but means missing out on many of the benefits of running on a cloud. The latter requires a lot more effort initially and may not be feasible, depending on factors such as application support for object storage, but it offers great potential advantages, not least lower costs.
As well as taking us through the process of porting a pipeline to a cloud, the course included three lightning talks showing concrete examples of existing and ongoing ResOps work: Marine Metagenomics, RenSeq and the PhenoMeNal project. One common theme was that the porting work provided a very valuable opportunity to profile and optimise the existing pipeline, deriving detailed performance characteristics that in some cases weren’t available before. In the Metagenomics case, it was necessary to remove site-specific code and assumptions that only made sense when running on local compute resources. It is also important to work on the scalability of research codes, and to test scaling up on clouds, because in some cases adding more cloud instances beyond a certain point will not increase performance. On a public cloud, that would mean paying more for no benefit. The PhenoMeNal project was particularly impressive in its adoption of modern technologies for its virtual research environment portal, while the RenSeq work was an interesting example of containerising a workload and hosting the containers within VMs. The RenSeq team found it useful to include a specific reference genome in the Docker image, even though that meant a very large 9 GB image.
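To make the scaling point concrete, here is a minimal sketch of my own (not from the course materials) that uses Amdahl's law as a simple model. It assumes a hypothetical pipeline in which 90% of the work parallelises perfectly and billing is per instance-hour; the function names and the 0.9 figure are purely illustrative.

```python
def speedup(instances: int, parallel_fraction: float) -> float:
    """Amdahl's law: best-case speedup when only part of the work parallelises."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / instances)


def relative_cost(instances: int, parallel_fraction: float) -> float:
    """Cost relative to a single instance, assuming per-instance-hour billing.

    Cost is (number of instances) x (run time), and run time shrinks
    by the speedup factor.
    """
    return instances / speedup(instances, parallel_fraction)


if __name__ == "__main__":
    PARALLEL_FRACTION = 0.9  # hypothetical value, chosen for illustration
    for n in (1, 2, 4, 8, 16, 32, 64):
        s = speedup(n, PARALLEL_FRACTION)
        c = relative_cost(n, PARALLEL_FRACTION)
        print(f"{n:3d} instances: speedup {s:5.2f}x, relative cost {c:5.2f}x")
```

Under these assumptions the speedup can never exceed 10x however many instances are added, while the bill keeps growing, which is why profiling a real pipeline before scaling it out on a public cloud matters.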
In the afternoon we worked on our own test application, using Terraform to create the cloud resources, and Ansible for configuration, all pulling from publicly available GitHub repositories.
I was very impressed with the thoroughness of the course. While recognising the advantages of cloud-native computing, the presenters were realistic about the challenges of moving scientific pipelines in that direction, and shared valuable real-world experience that will help with decisions about how and where to host those pipelines. The fact that the materials are completely open made it even more valuable. I thoroughly recommend the course to anyone interested in moving science to the cloud.