Technical Workshop November 2018

This week we held a technical workshop with a small but dedicated group of people. We’ve heard that our large one-day annual workshop is great, but there are areas where people wanted a more in-depth discussion.

The workshop was built around an “un-conference” format: we had two pre-arranged talks, but the rest of the day was open for discussion, guided by the topics people suggested as part of the registration process.

The morning session kicked off with some high-level discussion of possible topics, followed by one of our invited talks – Stig Telfer from StackHPC spoke about performance work they have been doing with Ceph to support Cray and the Human Genome Project. It was interesting to note the difference moving to BlueStore made to Ceph performance, and also how the use of LVM and partitioning on the NVMe devices had a huge impact on the performance of the storage system.

The slides for the talk are available online.

The morning session then split into three discussion topics:


OpenStack

We couldn’t run a cloud workshop without people wanting at least some focus on OpenStack, and specifically OpenStack versus other private cloud platforms. There was some discussion of moving from single-node clusters to rack scale, as well as of operational upgrades to OpenStack. Generally, three key ways of splitting the infrastructure emerged: a single shared cluster, a managed-services cloud (users don’t get root), and a “public” cloud (users do get root).

There was also quite a lot of discussion on how people access pre-defined stacks for workloads, with almost everyone seemingly using a different technology to achieve this!

Containers and Kubernetes

One of the big topics discussed in the container group was resource management – there is no batch-compute-style way of managing container deployments. There was also discussion of the security challenges of containers and the different technology solutions (e.g. Docker, Singularity, etc.). This included how to provide secure access to shared storage, as well as some challenges around multi-tenanted environments, including how people build their containers and what privileges are required to do so.
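As a toy sketch of what batch-style admission control over container jobs could look like (all job names, quotas and numbers here are invented for illustration, and this is not any particular scheduler’s API):

```python
from collections import deque

class BatchQueue:
    """Minimal FIFO batch queue: admit container jobs only while they fit
    inside a shared CPU/memory quota - the kind of admission control a
    batch scheduler gives HPC users but plain container deployments lack."""

    def __init__(self, cpu_quota, mem_quota_gb):
        self.cpu_quota = cpu_quota
        self.mem_quota_gb = mem_quota_gb
        self.pending = deque()   # jobs waiting for resources, in order
        self.running = []        # jobs currently admitted

    def submit(self, name, cpus, mem_gb):
        self.pending.append((name, cpus, mem_gb))
        self._schedule()

    def _used(self):
        return (sum(j[1] for j in self.running),
                sum(j[2] for j in self.running))

    def _schedule(self):
        # Admit jobs strictly in submission order while they fit the quota.
        while self.pending:
            name, cpus, mem_gb = self.pending[0]
            used_cpu, used_mem = self._used()
            if (used_cpu + cpus <= self.cpu_quota
                    and used_mem + mem_gb <= self.mem_quota_gb):
                self.running.append(self.pending.popleft())
            else:
                break

    def finish(self, name):
        # A job completing frees quota, so try to admit waiting jobs.
        self.running = [j for j in self.running if j[0] != name]
        self._schedule()

queue = BatchQueue(cpu_quota=8, mem_quota_gb=32)
queue.submit("preprocess", cpus=4, mem_gb=16)
queue.submit("train", cpus=6, mem_gb=24)   # waits: would exceed the quota
queue.submit("report", cpus=2, mem_gb=4)   # waits behind "train" (FIFO)
```

When `preprocess` finishes, both waiting jobs fit within the freed quota and are admitted together – the queue-and-quota behaviour the group felt was missing from plain container deployments.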


Deployment and automation

This topic overlapped a lot with the other two morning topics, as people discussed deployment approaches and technologies, along with some of the big challenges: how to upgrade the underlying automation technology (with a specific focus on Kubernetes), and how portable workloads and deployment tools actually are between cloud providers.


After lunch we reconvened with another invited talk – this time from Martin O’Reilly (Alan Turing Institute) on what their $1m annual donation from Azure gives them. My biggest take-away from this was how expensive cloud storage is at scale, i.e. at the petabyte scale. Again, the slides are available online.

We then followed up with a round of rapidly prepared lightning talks, each 5–10 minutes long:

  • Jacob Tomlinson (MetOffice Informatics Lab) – Pangeo
  • Aiman Shaikh (STFC) – Containers for HPC
  • John Garbutt (StackHPC) – Demo from Berlin – OpenStack Magnum, Manila, k8s
  • Matt Pryor (STFC) – JASMIN cloud Portal: simplifying provisioning for scientists

The group then broke again into three topics (though there was quite a lot of side discussion over coffee!).


Storage

Storage always seems to be a perennial discussion in these forums, with some discussion of data transfer, the age-old debate of POSIX versus object storage (or perhaps some key-value store?), and of running traditional HPC storage software in the cloud versus FUSE-style access to S3.

Discussion also covered performant ways to access Ceph storage, running databases on cloud storage, and the tricky problem of authenticating access to storage in a cloud-enabled world.

The topic of charging was also touched on, with some discussion of people’s experiences of cloud egress charges and the waivers in place with a number of cloud providers.

Workflows and Cluster as a service

This discussion covered using VPNs to connect on-premises clusters to public “bursted” clusters, how these links are often unreliable or slow, and some issues with having the right ports open to bring up the VPN links.

Quite a lot of discussion was also had on different workflow types and their suitability for cloud bursting; for example, bioscience applications tend to be long pipelines of transformations, with each step having different resource requirements.
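As an illustration of that last point, here is a minimal sketch of matching each pipeline step to the cheapest instance type that satisfies it – the step requirements, instance names and prices are all invented, not any real provider’s catalogue:

```python
# Invented instance catalogue: (name, cpus, mem_gb, price_per_hour_usd).
INSTANCE_TYPES = [
    ("small",   2,   8, 0.10),
    ("medium",  8,  32, 0.40),
    ("himem",   8, 128, 0.90),
    ("large",  32, 128, 1.60),
]

# Hypothetical bioinformatics-style pipeline: each step declares its own
# CPU and memory needs: (step, cpus_needed, mem_gb_needed).
PIPELINE = [
    ("align",   32,  64),   # CPU-heavy
    ("sort",     4,  16),   # modest
    ("variant",  8, 100),   # memory-heavy
]

def pick_instance(cpus, mem_gb):
    """Return the cheapest instance type that fits the step, or None."""
    fits = [t for t in INSTANCE_TYPES if t[1] >= cpus and t[2] >= mem_gb]
    return min(fits, key=lambda t: t[3])[0] if fits else None

plan = {step: pick_instance(c, m) for step, c, m in PIPELINE}
print(plan)  # {'align': 'large', 'sort': 'medium', 'variant': 'himem'}
```

The point of the sketch is simply that each stage of the pipeline ends up on different hardware, which is what makes a one-size-fits-all bursting target awkward for these workloads.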

Managing costs, monitoring with public cloud

Various people discussed their experience of cost control (or the lack of it) with various cloud providers, as well as whether researchers should worry about the cost of their compute jobs – cloud usage is a very clear way of demonstrating that cost to researchers. There was also some interesting discussion on whether it is possible to use the spot-market APIs of various providers to carefully target jobs to take advantage of the lowest spot pricing.
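As a sketch of that spot-targeting idea (with invented offers and prices – a real implementation would query each provider’s pricing API, which is not shown here):

```python
# Invented spot offers: (provider, region, instance, price_per_hour_usd).
SPOT_OFFERS = [
    ("aws",   "eu-west-1",    "m5.2xlarge",    0.17),
    ("aws",   "us-east-1",    "m5.2xlarge",    0.14),
    ("azure", "westeurope",   "D8s_v3",        0.19),
    ("gcp",   "europe-west1", "n2-standard-8", 0.12),
]

def cheapest_offer(offers, allowed_regions=None):
    """Pick the lowest-priced spot offer, optionally restricted to regions
    acceptable for the data (e.g. for data-protection reasons)."""
    if allowed_regions is not None:
        offers = [o for o in offers if o[1] in allowed_regions]
    return min(offers, key=lambda o: o[3]) if offers else None

best = cheapest_offer(SPOT_OFFERS)
print(best[0], best[3])  # gcp 0.12

eu_only = cheapest_offer(
    SPOT_OFFERS,
    allowed_regions={"eu-west-1", "westeurope", "europe-west1"},
)
```

In practice the hard parts the group identified sit outside this comparison: keeping the price data fresh across providers, and handling spot interruption once the job is placed.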

You can view the Etherpad used for the day.

Overall the day seemed to go well, and the general consensus was that it was useful, so it’s something we’ll probably look to arrange again in the future!
