Challenges in Scaling STFC Cloud

Alex Dibbo and Martin Summers, STFC

The STFC Cloud provides compute resource for over 160 science projects across STFC. The cloud provides specialised analysis computing environments for ISIS science groups, and has provided 2.5 million core hours of compute to X-Chem (DLS) in the last 3 months alone. It delivers compute resource to UKAEA, EUCLID, eMERLIN, IRIS, UK-Tier-1, LSST and many others. In 2019/20, £2m was spent on hardware, which now includes more than 12,000 CPU cores, GPUs and storage.

This talk will summarise development since 2015. Since then, demand has more than doubled each year. We consider the challenges faced in meeting this demand, including:

o Networking: Managing throughput, resiliency and latency
o Storage: Moving from deploying raw capacity to high performance storage at scale
o Support: Optimising support with limited staff for the growing diversity of science use cases
o Databases: Scaling for resiliency and throughput
o Control plane: Scaling out to provide adequate throughput for operations
o Managing growing user communities
o Maintaining overall availability throughout continuous improvements and deployments

The talk will conclude with a summary of lessons learned and a brief forward look.