OpenStack Days UK 2017

OpenStack Days are community events with a mixed audience of operators, vendors and people interested in the cloud generally. They are organised independently in different regions around the world and the most recent UK edition took place in London on September 26th.

There were speakers from the OpenStack Foundation, from prominent suppliers of OpenStack services and from users, including representatives from the UK academic community. The opening keynote was from Thierry Carrez, VP of Engineering at the Foundation, discussing “The Four Opens” and how they apply to OpenStack:

  • Open source
  • Open development
  • Open design
  • Open community

He also addressed the increasing complexity in the OpenStack ecosystem and provided an excellent start to the day. The subsequent keynote talks from AVI Networks, Red Hat, Canonical and Mellanox addressed automation and networking.

The rest of the schedule was split across different rooms, and here are my impressions from the talks I attended. Two speakers from Huawei presented the latest developments of the Kuryr project in OpenStack, which integrates Neutron with Docker and Kubernetes to present network services to containers running in an OpenStack cloud, thereby allowing a more cloud-native approach. Kuryr uses the Dragonflow distributed SDN controller, which allows for good scaling (as demonstrated in tests with Redis) and now supports OpenFlow pipelines in Open vSwitch. Interestingly, it achieves these scaling improvements by running on the compute nodes themselves. Recent efforts have extended Kuryr to embrace Kubernetes, including a controller and CNI driver. Where possible, these components re-use existing OpenStack infrastructure, such as Keystone for authentication/authorization and projects within Nova. When Kubernetes pods are running inside VMs, the speaker advised using trunk ports to avoid the performance penalty of double packet encapsulation. The next developments for Kuryr will include scaling for controllers as well as for ports, multi-network support and performance improvements. This was a very useful overview, including reports of performance testing – a feature often missing in such presentations.

Next was Tim Cutts presenting on the work at Sanger on Secure Lustre with OpenStack. Their approach, given the large legacy of existing scientific pipelines that make conservative assumptions about the infrastructure on which they’re running (e.g. POSIX file access), is to regard traditional HPC clusters and cloud computing as complementary, with the latter providing a flexible compute environment. Those workloads need to be supported as the level of cloud provision increases, giving time for them to be re-written with a cloud-native architecture. While they now have 14PB of Ceph storage (4.5PB usable), they needed to be able to use the very large datasets hosted on local Lustre resources, and to provide good security isolation between tenants when doing so. Tim described their work, which relied on new features only available since Lustre 2.9. Using one bare-metal Lustre router shared amongst all the tenants, they achieved 3GB/second aggregate performance across multiple clients. For greater isolation, they have also implemented a separate virtualised router for each tenant. Some wrinkles they found are that they needed to turn off port security, which is an acceptable tradeoff in a fairly tightly controlled environment, and that there is some inadvertent asymmetric routing by Lustre, which implies it does not check the origin of packets sufficiently. More details of their work are available at and there is a video of a similar talk at ISC17.

Prometheus has been adopted as a supported project for monitoring by the Cloud Native Computing Foundation, and it is often discussed in connection with Kubernetes. I was interested to hear about monitoring OpenStack with Prometheus, in a talk by Csaba Patyi of Component Soft. OpenStack is composed of many services, each with their own logs, making the logging and monitoring situation quite complex. Csaba demonstrated how to use the well-established Elastic Stack for OpenStack. This works well, is easily configurable and does not require Logstash. However, it can be hard to separate data from metadata, and extra work is needed to handle multi-line logs, which are quite common in OpenStack. He showed how you can manipulate logs based on their information structure, which enables useful views in Kibana and the creation of dashboards based on log information. The code for the demo environment used in the talk is available on GitHub. For monitoring and alerting he discussed Prometheus, which was designed from the outset for cloud environments. Advantages he mentioned are that there are lots of Prometheus data exporters for OpenStack, and that it can be configured to discover new hosts via DNS, rather than relying upon static configuration.
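The multi-line problem is easy to see with a concrete example: OpenStack services emit Python tracebacks, whose continuation lines a naive shipper would treat as separate records. A minimal sketch of the grouping logic, assuming each new entry starts with a timestamp (the record format here is illustrative, not the exact OpenStack log format):

```python
import re

# OpenStack-style logs typically begin each record with a timestamp;
# continuation lines (e.g. traceback frames) do not.
ENTRY_START = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}")

def group_multiline(lines):
    """Merge continuation lines into the preceding log entry."""
    entries = []
    for line in lines:
        if ENTRY_START.match(line) or not entries:
            entries.append(line)
        else:
            entries[-1] += "\n" + line
    return entries

raw = [
    "2017-09-26 10:00:01 ERROR nova.compute [req-1] Unhandled exception",
    "Traceback (most recent call last):",
    '  File "manager.py", line 42, in spawn',
    "2017-09-26 10:00:02 INFO nova.compute [req-2] Instance spawned",
]
print(group_multiline(raw))  # two entries, traceback folded into the first
```

Log shippers such as Filebeat implement essentially this pattern/negate logic via configuration rather than code.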

My first talk in the afternoon session was on High Availability, by Kenneth Tan of Sardina Systems. Kenneth began by highlighting the new expectations that users have, based on their exposure to public cloud services such as AWS, and by pointing out the different perspectives that consumers of services have, as opposed to operators of those services. The first safeguard needed for high availability is to take an infrastructure-as-code approach, allowing for easy redeployment, whereas a simple safeguard for data is to use replication. He discussed the differences between extrinsic and intrinsic ‘death risks’ and the need to detect and distinguish what he termed ‘sick states’ e.g. when a node is affected but not dead – it is ‘unhealthy’. He emphasised the importance of looking for correlations and causality in monitoring data, and of looking for anomalies. Pushed data streams are better for scaling than polling, he suggested. The infrastructure implications of storing metrics and logs were listed, along with the need for systems that can handle unbalanced I/O patterns i.e. large numbers of small writes, along with a small number of very large reads. With good monitoring you should be able to predict imminent faults, and deal with latent threats, before they become patent threats.
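The ‘sick but not dead’ distinction can be made operational with simple anomaly detection over pushed metrics. The following is a hedged sketch of the idea; the three-state model, thresholds and heartbeat-latency metric are my own illustration, not Sardina’s implementation:

```python
from statistics import mean, stdev

def node_state(latencies_ms, window=10, z_threshold=3.0, dead_threshold=5000):
    """Classify a node from its recent heartbeat latencies (ms).

    'dead'      -> last heartbeat exceeded the hard timeout
    'unhealthy' -> last latency is a statistical outlier vs the window
    'healthy'   -> otherwise
    """
    recent = latencies_ms[-window:]
    last = recent[-1]
    if last >= dead_threshold:
        return "dead"
    baseline = recent[:-1]
    if len(baseline) >= 2 and stdev(baseline) > 0:
        z = (last - mean(baseline)) / stdev(baseline)
        if z > z_threshold:
            return "unhealthy"  # 'sick': still responding, but anomalously slow
    return "healthy"

print(node_state([10, 11, 9, 10, 12, 10, 11, 9, 10, 400]))   # statistical outlier
print(node_state([10, 11, 9, 10, 12, 10, 11, 9, 10, 6000]))  # hard timeout
```

The point of the middle state is exactly what the talk stressed: a binary alive/dead check would miss the node that is degrading but still answering.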

OpenStack as a project has not been immune from the remorseless spread of containers, and one of the more interesting aspects for me has been the effort to allow running OpenStack services as containers. Steve Hardy from Red Hat gave a talk entitled “Deploying OpenStack at Scale with TripleO, Ansible and Containers” which discussed the changes in the TripleO project to accommodate exactly such containerisation of services, and to make more use of Ansible in general for deployments. TripleO creates a small OpenStack installation (“the undercloud”) which is then used to install the main installation (“the overcloud”). Historically it largely used Puppet for all the installation tasks and lacked some flexibility, which made it harder than it should have been to customise deployments to meet local requirements. Node roles in TripleO are now composable and the Mistral project effectively provides an API for TripleO as a whole. One of the benefits of containerisation is dependency isolation, which makes it much easier to roll backwards and forwards between different versions. TripleO has been collaborating with the Kolla community on this effort. He also mentioned the Paunch tool which manages the containers used by TripleO.

Link to talk PDF.

The final talk I attended in the technical tracks was by Julien Danjou, who is one of the developers in the Gnocchi project. Gnocchi was created when it became clear that Ceilometer lacked the performance required for time-series data, in large part because it was originally designed for billing and included a lot of flexibility that was irrelevant for monitoring. Some interesting features of Gnocchi are that it computes metric aggregations itself, that it can batch measurements together and send them in a single HTTP request, and that horizontal scaling is achieved simply by adding more nodes. He referred to recent performance testing in Gnocchi version 4, described on his blog at It’s an interesting alternative to the more generic monitoring approaches listed above.
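Gnocchi’s key design decision is to aggregate at write time, so queries read small pre-computed series rather than raw points. A minimal illustration of that rollup idea, with batched measures as input (the function, window size and aggregation names are my illustration, not Gnocchi’s actual API):

```python
from collections import defaultdict

def rollup(measures, granularity=300):
    """Pre-aggregate (timestamp, value) measures into fixed-size windows.

    Returns {window_start: {"mean", "min", "max", "count"}} per window,
    computed once at ingest instead of on every query.
    """
    buckets = defaultdict(list)
    for ts, value in measures:
        buckets[ts - ts % granularity].append(value)
    return {
        start: {
            "mean": sum(vals) / len(vals),
            "min": min(vals),
            "max": max(vals),
            "count": len(vals),
        }
        for start, vals in buckets.items()
    }

# A batch of CPU-utilisation measures posted in one request.
batch = [(0, 10.0), (60, 30.0), (120, 20.0), (300, 50.0)]
print(rollup(batch))
```

Because each worker only needs to fold new measures into these per-window summaries, adding more nodes scales ingestion horizontally, as the talk described.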

Link to talk PDF.

The final talk was by Jonathan Bryce, who is the Executive Director of the OpenStack Foundation. This was quite informal and in part a Q&A, particularly useful for people new to the OpenStack community.

It was a packed day and as always I could only see a fraction of the talks I wanted to attend. A convenient way of catching up with the latest developments, seeing how people are using OpenStack, and meeting familiar faces in the community.

EBI ResOps Course 3rd July 2017

Recently the European Bioinformatics Institute (EMBL-EBI), an outstation of the European Molecular Biology Laboratory based at the Wellcome Genome Campus, has been running a course on what they are calling ‘ResOps’. By this they mean essentially the application of DevOps-style practices and methods to research computing, notably pushing the latter in a more infrastructure-as-code, cloud-native direction, thereby hoping to improve reproducibility and scalability.

The course was delivered by Dario Vianello and Erik van den Bergh, from the Technology and Science Integration team at the EBI. All materials from the course including presentations and code were available online, at The team had also provided temporary accounts for course delegates on their Embassy Cloud, on which we could run practical exercises.

Dario contrasted two possible approaches to moving a scientific pipeline to the cloud – mimicking on-site infrastructure versus a cloud-native approach. The former is clearly easier, but does mean that you will miss out on many of the benefits of running on a cloud. The latter requires a lot more effort initially and may not be feasible, depending on things like application support for object storage, but it does offer great potential advantages, not least lower costs.

As well as taking us through the process of porting a pipeline to a cloud, there were three lightning talks showing concrete examples of existing and ongoing ResOps work. These comprised Marine Metagenomics, RenSeq and the PhenoMeNal project. One common theme here is that the porting work provided a very valuable opportunity to profile and optimise the existing pipeline, deriving detailed performance characteristics that in some cases weren’t available before. In the Metagenomics case, it was necessary to remove site-specific code and assumptions that made sense when running on local compute resources. It is also important to work on the scalability of research codes, and to test scaling up on clouds, because in some cases it may be that beyond a certain point adding more cloud instances will not increase performance. If working on a public cloud, this would mean paying more for no benefit. The PhenoMeNal project was particularly impressive in its adoption of modern technologies for the virtual research environment portal, while the RenSeq work was an interesting example of containerising a workload and hosting the containers within VMs. They found that it was useful to include a specific reference genome in the Docker image, even though that meant a very large 9GB image size.


PhenoMeNal architecture

In the afternoon we worked on our own test application, using Terraform to create the cloud resources, and Ansible for configuration, all pulling from publicly available GitHub repositories.

The EBI Cloud Portal

I was very impressed with the thoroughness of the course and its realism. While recognising the advantages of cloud-native computing, they were realistic about the challenges of moving scientific pipelines in that direction, and provided valuable real-world experiences that will help when making decisions about how and where to host those pipelines. The fact that the materials are completely open made it even more valuable. I thoroughly recommend taking the course to anyone interested in moving science to the cloud.


Cloud Workshop

The Francis Crick Institute

It’s been a few months since our November workshop so there’s been some time to digest and reflect on some of the common themes emerging. Having attended a couple of other conferences and workshops from my community (AGU and AMS) it’s interesting to compare.

Firstly, it was great to see such a variety of application areas represented. For this, our second annual workshop, we opened it up for the submission of abstracts and this made a real difference. There was a great response, with Life Sciences ahead of the other domain areas. We had 160 register and 120 attend on the day. It was fantastic to have the Crick as a venue; it worked really well.

The first session looked at applications of hybrid and public cloud. Two really interesting use cases (Edinburgh, and NCAS/NERC) looked at trying out HPC workloads on public cloud. This raised issues around comparative performance and costs between public cloud and on-premises HPC facilities.

On AWS, Placement Groups allow instances to be placed close to one another to improve inter-node communication for MPI-based workloads. This showed comparable performance with Archer (the UK national supercomputer) for smaller workloads, but clearly there was some limit, as performance tailed off as the number of nodes increased, whereas Archer continued to scale linearly. This tallies with what I’ve seen anecdotally at the AMS conference, where there seemed to be increasing uptake of public cloud for Numerical Weather Prediction jobs (which need MPI), but for smaller-scale workloads that stay within the envelope of the node-affinity features available.
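The tail-off is what you would expect once per-node communication costs start to dominate the parallel gains. A toy model based on Amdahl’s law with an added per-node overhead term; the serial fraction and overhead figures are made up purely for illustration, not measured values from either benchmark:

```python
def speedup(nodes, serial_fraction=0.02, comm_overhead=0.005):
    """Amdahl's law with a per-node communication cost.

    speedup = 1 / (serial + parallel/nodes + overhead * nodes)
    The overhead term models slower inter-node links (e.g. on cloud
    without fast interconnect), so speedup peaks and then declines
    as more nodes are added.
    """
    parallel = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel / nodes + comm_overhead * nodes)

for n in (1, 8, 16, 32, 64):
    print(f"{n:3d} nodes -> speedup {speedup(n):5.1f}")
```

With these (invented) parameters the curve rises, peaks in the mid-teens of nodes, then falls – on a public cloud, everything past the knee is money spent for negative return.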

Another theme was portability – what kinds of approaches can be used to engineer workloads so that they can be easily moved between providers. Andrew Lahiff from STFC presented a very different use case, showing how container technologies can be used to run Particle Physics workloads – cases where the focus is high-throughput rather than HPC, and so much more amenable to cloud. This work has been done as part of a pilot for the Cloud Working Group to investigate specifically how containers and container orchestration technology can be used to provide an abstraction layer for cloud interoperability. A really nice slide showed Kubernetes clusters running on Azure and Google Cloud managed from the same command-line console app. Dario Vianello’s talk (EMBL-EBI) showed an alternative approach, using a combination of Ansible and Terraform to deploy workloads across multiple clouds.

Microsoft’s Kenji Takeda presents on recent Azure developments

It was great to have talks from the hyper-scale cloud providers AWS, Azure and Google. The scale in hyper-scale is as ever impressive, as is the pace of change in technology: very interesting to see Deep Learning applications driving the development of custom hardware – TPUs and FPGAs. Plans underway to host data centres in the UK will ease uptake. Talks from the OpenStack Foundation and Andy McNab showed examples of federation across OpenStack clouds.

In the private cloud session, Stig Telfer gave a nice illustration of network performance for VMs, showing how a number of aspects of virtualisation can be changed or removed to progressively improve network performance towards line rate. Alongside the talks on private cloud, the parallel session looked at Legal, Policy and Regulatory Issues – a critical area for adoption of public cloud. Steven Newhouse gave some practical insights from a recent cloud tender for EMBL. There is clearly a need for further work around these issues so that the community can be better informed about choices. This is something that the working group will be taking forward.

For this workshop, we experimented with an interactive session – bringing together a group of around 20 delegates to work together on some technical themes agreed ahead of time including bulk data movement and cloud-hosting of Jupyter notebooks. There was plenty of useful interaction and discussion but we will need to look at the networking provision for the next time to ensure groups can get on with technical work on the day.

We discussed next steps in the final session. There is clearly interest in taking particular areas forward from the meeting: focus groups on technical areas like HTC and the use of parallel file systems with cloud, or organised around specific domains within the research community. Training also figured, in the form of a cloud carpentry course so that researchers can more readily get up and running with cloud. Looking forward, in each of these cases we’re looking for discrete activities with an agreed set of goals and something to deliver at the end. Where possible we’re seeking to support relevant work that is already underway and to initiate new work where there are perceived gaps. We will be looking at running smaller workshops targeted at specific themes in the coming months as a means to engage and disseminate some of this work.

Phil Kershaw, STFC & Cloud-WG chair