This session included presentations on a range of applications run on public and hybrid cloud environments, exploring issues such as portability across providers, HPC performance and costs.
In the first, Andrew Lahiff presented work trialling the use of container and container-orchestration technologies on different public cloud providers. This is part of a technical pilot initiated by the Cloud Working Group to look into the use of such solutions as an interoperability layer between different cloud providers.
The CMS experiment was picked as a source of research-domain-specific workloads. The LHC computing model is already designed for a highly distributed infrastructure in which components may fail: jobs are created centrally and then pulled down by worker nodes to be computed, and any new workloads need to fit into this framework. Experiments have been run across many cloud environments (private and public) and at scale.
However, putting the services into containers and then running them through the Kubernetes orchestration framework is new. There is still a need for further technical development in Kubernetes (e.g. security), but there are also interesting features (e.g. the ability to federate across multiple cloud platforms). CMS Monte Carlo production ran successfully across RAL, Azure and GCP. More data-intensive workflows are starting to be run on Azure and GCP as the next step.
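As an illustration of the approach (not taken from the presentation itself; the image name, namespace and resource requests below are hypothetical), a containerised payload can be submitted as a Kubernetes Job via the official Python client, with the same job definition usable against clusters running at RAL, on Azure or on GCP:

```python
# Minimal sketch: submitting a containerised workload as a Kubernetes Job.
# The image name, namespace and resource requests are placeholders, not
# details from the CMS pilot itself.
from kubernetes import client, config


def submit_job(name, image, command):
    config.load_kube_config()  # kubeconfig selects whichever cluster/cloud is targeted

    container = client.V1Container(
        name=name,
        image=image,
        command=command,
        resources=client.V1ResourceRequirements(
            requests={"cpu": "1", "memory": "2Gi"},
        ),
    )
    template = client.V1PodTemplateSpec(
        metadata=client.V1ObjectMeta(labels={"app": name}),
        spec=client.V1PodSpec(restart_policy="Never", containers=[container]),
    )
    job = client.V1Job(
        metadata=client.V1ObjectMeta(name=name),
        spec=client.V1JobSpec(template=template, backoff_limit=3),
    )
    client.BatchV1Api().create_namespaced_job(namespace="default", body=job)


if __name__ == "__main__":
    # Hypothetical payload; a real CMS worker would pull its work description
    # from the central queue rather than hard-code a command.
    submit_job("mc-production-test", "example/cms-worker:latest", ["run-payload.sh"])
```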
Q & A
Price comparison is not the primary goal of this study – more to see if the work is portable.
At the moment the plan is to stream data directly to the running nodes. Staging the data into the cloud is an area to explore in the optimisation phase. Egress costs have not yet been examined.
It is not clear what the source of the failures is – not site specific.
Both of the next presentations explored experiences running HPC-like workloads on public cloud. The first, from Simon Wilson, was a study to see how climate models run on AWS (HPC) and, in future, on Azure (HPC). The code is based around Fortran and MPI, scaling out to 1000 cores; the model is massively parallel, based on a grid decomposition. Comparisons were made by running a small and a large model configuration on Archer (national HPC), Lotus (JASMIN, NERC/STFC) and AWS. For the smaller model at fewer than 200 cores the scaling is comparable across all systems, but Archer scales to a higher number of cores. The larger model scales up to around 500 cores on both AWS and Archer, but again Archer scales to a higher core count, which is the more realistic use case.
At no point was AWS cheaper than running on Archer – AWS was roughly 50% more expensive than Archer, and Archer scales better. Local I/O performance was eliminated as the cause of the difference. AWS uses 10 Gb Ethernet with relatively high latency. Data transferred back from AWS to STFC reached 150 MB/s.
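To illustrate why the interconnect matters so much for this class of model, the stand-in sketch below (mpi4py, purely illustrative, not the model's actual Fortran code) shows the grid-decomposed pattern: each rank owns a slab of the grid and must exchange halo data with its neighbours every timestep, so network latency limits scaling long before per-core compute does.

```python
# Illustrative halo-exchange sketch for a grid-decomposed model (mpi4py
# stand-in for the Fortran/MPI climate code; sizes are arbitrary).
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

local = np.random.rand(100, 100)     # this rank's slab of the global grid
halo_up = np.empty(100)
halo_down = np.empty(100)

up = rank - 1 if rank > 0 else MPI.PROC_NULL
down = rank + 1 if rank < size - 1 else MPI.PROC_NULL

for _ in range(10):                  # each model timestep...
    # ...requires a halo exchange with both neighbours; on high-latency
    # 10 Gb Ethernet this communication dominates as core counts grow.
    comm.Sendrecv(local[0], dest=up, recvbuf=halo_down, source=down)
    comm.Sendrecv(local[-1], dest=down, recvbuf=halo_up, source=up)
    local += 0.01                    # stand-in for the local computation
```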
Q & A
Can you use spot prices with this application? Yes – there is checkpointing in the application, so interrupted runs can be resumed (see the sketch after this Q & A). Not using spot pricing would make it roughly four times the price.
Why calculate costs when TCO figures are provided on the provider's website? Because we wanted to!
Scalability is bound by the performance of the network.
The team will look into optimisation around the use of spot pricing.
Do you use accelerators? Not in this generation of the model.
SSD storage was used during the run, with EBS for persistent storage.
There was an impact from Brexit on the AWS costs.
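On the spot-pricing point above: checkpointing is what makes spot instances viable, since an interrupted run resumes from its last saved state. A minimal, purely illustrative sketch of the pattern (not the climate model's actual checkpointing code; file names and step counts are placeholders):

```python
# Minimal checkpoint/restart sketch: illustrates why spot-instance
# interruptions are tolerable when the application saves state regularly.
import os
import pickle

CHECKPOINT_FILE = "checkpoint.pkl"  # hypothetical path


def load_checkpoint():
    """Resume from the last saved state, or start fresh."""
    if os.path.exists(CHECKPOINT_FILE):
        with open(CHECKPOINT_FILE, "rb") as f:
            return pickle.load(f)
    return {"step": 0, "state": 0.0}


def save_checkpoint(ckpt):
    """Write state atomically so an interruption cannot corrupt it."""
    tmp = CHECKPOINT_FILE + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(ckpt, f)
    os.replace(tmp, CHECKPOINT_FILE)


def run(total_steps=1000, checkpoint_every=50):
    ckpt = load_checkpoint()
    for step in range(ckpt["step"], total_steps):
        ckpt["state"] += 1.0           # stand-in for one model timestep
        ckpt["step"] = step + 1
        if ckpt["step"] % checkpoint_every == 0:
            save_checkpoint(ckpt)      # a reclaimed spot node loses at most 50 steps


if __name__ == "__main__":
    run()
```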
The second example HPC workload was HemeWeb. Again the software was run on Archer and on AWS. The software is hard to build and needs a parallel machine to do the computation, so this complex process could be better offered as a 'Science as a Service' model by capturing the software in a Docker container distributed through Docker Hub. NASA has benchmarked four HPC applications on AWS against an InfiniBand cluster: single-node performance was the same, but scalability was worse on AWS. A web-based interface allows specification of the job and submission to AWS. All the data is stored in the Docker image, so runs are repeatable. AWS was slightly better for single-node performance, but scalability became an issue at around 120 cores. AWS (on demand) was approximately three times more expensive than Archer's external charging rate, although AWS can be bought on demand while Archer requires submission of an application for access.
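A sketch of the 'Science as a Service' pattern described above, using the Docker SDK for Python to pull a published image from Docker Hub and run a self-contained simulation; the image name, command and mount path are hypothetical, not HemeWeb's actual ones:

```python
# Illustrative 'Science as a Service' sketch: pull a pre-built simulation
# image from Docker Hub and run it with the inputs baked into the image.
# Image name, command and mount path are hypothetical, not HemeWeb's.
import docker


def run_simulation(results_dir="/tmp/results"):
    client = docker.from_env()
    client.images.pull("example/hemeweb-sim", tag="latest")  # published image
    output = client.containers.run(
        "example/hemeweb-sim:latest",
        command="run-simulation --cores 4",                  # hypothetical entry point
        volumes={results_dir: {"bind": "/results", "mode": "rw"}},
        remove=True,                                         # clean up the container afterwards
    )
    return output.decode()


if __name__ == "__main__":
    print(run_simulation())
```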
The last presentation looked at hybrid cloud, allowing jobs to be run on-premises as well as on public cloud. This is needed to help tackle some of the data-growth challenges at EBI, and EBI want to be able to use multiple clouds (including internal clouds). The choice is between moving the pipelines (the workload) into the cloud, or moving the infrastructure into the clouds so that the pipeline does not need to be changed (much). Terraform and Ansible have been used to abstract the differences and deployment models between clouds – the goal is to get the science running on the clouds. The team worked with researchers to describe both the infrastructure and the pipeline so that they can be deployed by these tools (a sketch of such a wrapper follows this paragraph). To reduce pipeline deployment time they switched to building images and deploying those. A 'ResOps' (rather than DevOps) environment needs to be established and adopted by the researchers, so that their code is developed within it from the start.
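As a sketch of how Terraform and Ansible might be driven together to stand up infrastructure on a chosen cloud and then deploy a pipeline onto it (directory layout, playbook and variable-file names are hypothetical, not EBI's actual configuration):

```python
# Minimal sketch of a Terraform + Ansible deployment wrapper: provision the
# cloud infrastructure, then configure the nodes and deploy the pipeline.
# Paths, playbook and variable-file names are placeholders.
import subprocess


def provision(terraform_dir, cloud):
    """Create the infrastructure for the chosen cloud; provider differences
    are hidden behind per-cloud Terraform variable files."""
    subprocess.run(["terraform", "init"], cwd=terraform_dir, check=True)
    subprocess.run(
        ["terraform", "apply", "-auto-approve", f"-var-file={cloud}.tfvars"],
        cwd=terraform_dir,
        check=True,
    )


def deploy_pipeline(inventory, playbook="pipeline.yml"):
    """Configure the provisioned nodes and install the analysis pipeline."""
    subprocess.run(["ansible-playbook", "-i", inventory, playbook], check=True)


if __name__ == "__main__":
    provision("infrastructure/", cloud="aws")   # or "gcp", an internal OpenStack, ...
    deploy_pipeline("inventory/aws_hosts.ini")
```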
Q & A
Cost comparison is a concern… but there will be an activity within the NeI to publish costs so that transparent comparisons can be made.
Is the barrier to accessing public clouds down to costs or technical infrastructure? For fluid simulations, 100+ cores and good networking are needed to get a good turnaround of results… AWS is OK for smaller models.
Lack of scalability in AWS is leading to higher costs.
Notes from Steven Newhouse:
Q & A
Is it still legal to use OpenLava, given the recent lawsuit? It would be possible to migrate to other clusters.
Where should you start now? Two cloud models are being supported at the moment.