Sam Proctor, Wellcome Sanger Institute
During the COVID-19 pandemic, the Wellcome Sanger Institute has been responsible for providing PHE/UKHSA with timely information on COVID-19 lineages. To provide the most accurate picture of the pandemic, all samples required frequent re-processing as new lineages were detected. As a proof of concept, we implemented a hybrid pipeline using Airflow and the AWS cloud to enable at-scale processing of all samples, resulting in processing times an order of magnitude shorter than those achieved on local infrastructure. Airflow orchestrated tasks running locally, whilst AWS Step Functions managed tasks running in the cloud; this combination worked well in practice. The pipeline was architected to ensure that sensitive data remained on local infrastructure. Using the AWS CDK to author the cloud stack in C# and Python allowed for fast development and easy separation of environments. Docker images allowed us to reuse existing code and deploy it rapidly into AWS, whilst extensive use of AWS Lambda and AWS Fargate eliminated the need to manage clusters. We discuss the lessons learnt from this project and the benefits that we have seen, as a reference for those wishing to undertake a similar project.