Vivek Iyer, Wellcome Sanger Institute
The Hail platform, built by the Broad Institute, is a strong contender (if not the only contender) for handling genomic variation data at scale (tens to hundreds of thousands of genomes) on conventional hardware. It runs on an Apache Spark cluster. The Broad typically recommend running it on Google’s Dataproc managed Spark service for ease of use. We at Sanger decided first to roll our own cluster on an on-prem OpenStack cloud. We ran into bottlenecks in large-scale processing and retreated to Google Cloud Platform (GCP) to establish what performance we _should_ be able to achieve (and, critically, to do some production work). Then, carried along by our systems groups, we modified the local clusters so that they could run at scale. Both local and public cloud platforms had their role to play (and keep playing) in this story, and I aim to sketch out this journey.