January 2018 workshop – session titles and abstracts

Talk title Speaker Abstract
Session 1 – International Collaborations
Future Science on Future OpenStack: developing next generation infrastructure at CERN and SKA Stig Telfer, StackHPC The next generation of research infrastructure and large-scale scientific instruments will face data at new orders of magnitude.

This talk presents two flagship programmes: the next generation of the Large Hadron Collider (LHC) at CERN and the Square Kilometre Array (SKA) radio telescope. Each, in its own way, will push infrastructure to the limit.

The LHC has been a significant user of OpenStack in scientific computing. The SKA is now working towards a final software architecture design and is focusing on OpenStack as an underlying middleware layer.

Together, they plan to develop a common platform for scaling science: to accommodate new applications and software services, to deliver high ingest rate real-time and batch processing, to integrate high performance storage and to unlock the potential of software defined networking.

EOSC-hub: overview and cloud federation activities Enol Fernández, EGI The presentation will provide an overview of the H2020 project EOSC-hub and of how it will contribute to the implementation of the European Open Science Cloud initiative, with a focus on the project's cloud federation activities. The EOSC-hub project creates the integration and management system (the Hub) of the future European Open Science Cloud, which acts as a single contact point for researchers and innovators to discover, access, use and reuse a broad spectrum of resources for advanced data-driven research. EOSC-hub builds on existing technology already at TRL 8 and addresses the need for interoperability by promoting the adoption of open standards and protocols. IaaS clouds will be incorporated into the EOSC-hub baseline services to provide the computing infrastructure that supports both generic advanced services and community-specific thematic services. This talk will also provide an overview of how the EGI Cloud Federation can be used as a blueprint for the federation of IaaS, and of the plans for extending this blueprint to support the federation of PaaS and SaaS services.
Public Clouds, OpenStack and Federation Ildikó Vancsa, OpenStack Foundation The presentation will briefly introduce the Passport program launched by the Public Cloud Working Group in OpenStack to start exploring how it can be leveraged for research and academic use. Beyond this the session will also highlight activities in the area of federating research clouds around the globe.
Session 2a – Technical Challenges – Containers, portability of compute, data movement
Running a Container service with OpenStack/Magnum Spiros Trigazis, CERN In this talk, I’ll be presenting the latest features of OpenStack/Magnum for the Pike and Queens OpenStack releases. You can expect to learn which versions of the popular container orchestrators you can provide to users as a cloud provider, and what Magnum provides to you as a cloud operator. In the second part, we will look into CERN’s container service, how Magnum was adapted for the CERN cloud, and the daily challenges the CERN cloud team faces to satisfy the needs of the HEP community. The focus will be on cutting dependencies outside the private cloud, network architecture for container clusters, security architecture for applications, and monitoring the container orchestrator and the application across tens or hundreds of clusters.
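As a rough illustration of the user-side workflow Magnum enables, the following sketch (not from the talk) requests a small cluster through the openstacksdk Python client. It assumes an openstacksdk release that exposes the container-infrastructure proxy; the cloud, template and keypair names are placeholders, and attribute names may differ between releases.

```python
# Hedged sketch: asking Magnum for a Kubernetes cluster via openstacksdk.
# Assumes the SDK provides the container_infrastructure_management proxy;
# the cloud entry, cluster template and keypair below are placeholders.
import openstack

conn = openstack.connect(cloud="my-cloud")  # reads credentials from clouds.yaml

# Look up a cluster template prepared in advance by the cloud operator.
template = conn.container_infrastructure_management.find_cluster_template(
    "kubernetes-template"
)

# Ask Magnum to build a small cluster from that template.
cluster = conn.container_infrastructure_management.create_cluster(
    name="demo-cluster",
    cluster_template_id=template.id,
    node_count=2,
    keypair="my-keypair",
)
print("Cluster requested:", cluster.id)
```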
Large-scale Genomics with Nextflow and AWS Batch Brendan Bouffler, AWS Authors: Brendan Bouffler, Phil Ewels, Paolo Di Tommaso

Public clouds provide researchers with unprecedented computing capacity and flexibility at a cost comparable to on-premises infrastructure. This represents a big opportunity for modern scientific applications, but at the same time it poses new challenges for researchers, who must contend with new standards, applications, infrastructure stacks and APIs that evolve at a fast pace. And for most researchers, who use computing as a tool to power their work, there’s a strong desire to avoid becoming sysadmins.
For this reason, it’s important for new tools to emerge that reduce complexity for scientists performing data analysis. Moreover, the increasing heterogeneity between cloud platforms and legacy clusters makes workflow portability and migration really important.
This presentation will give a quick overview of how we solve these problems with Nextflow and AWS Batch. Nextflow is a tool developed by the Center for Genomic Regulation (CRG) for the deployment and execution of computational workflows in a portable manner across clouds and clusters.

Nextflow shares with AWS Batch the same vision of workflow containerisation. The built-in support for AWS Batch allows the writing of highly scalable computational pipelines while hiding most of the low-level, platform-dependent implementation details. This approach enables researchers to migrate legacy tools to the cloud with ease, taking advantage of a flexibility and scalability not possible with common on-premises clusters.
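Nextflow pipelines themselves are written in its Groovy-based DSL; purely to illustrate the platform detail that the AWS Batch executor hides, the sketch below (not from the talk) submits a single containerised job directly to a Batch queue with boto3. The queue and job-definition names are placeholders.

```python
# Illustrative only: one containerised job submitted directly to AWS Batch with boto3.
# Nextflow's AWS Batch executor issues calls of this kind on the researcher's behalf.
# The job queue and job definition named here are assumed to exist already.
import boto3

batch = boto3.client("batch", region_name="eu-west-1")

response = batch.submit_job(
    jobName="fastqc-sample-01",
    jobQueue="genomics-queue",           # placeholder queue
    jobDefinition="fastqc-container:1",  # placeholder job definition (Docker image)
    containerOverrides={
        "command": ["fastqc", "sample_01.fastq.gz"],
        "vcpus": 2,
        "memory": 4096,
    },
)
print("Submitted Batch job:", response["jobId"])
```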

Best practice in porting applications to Cloud Dario Vianello, EMBL-EBI Cloud computing is now everywhere, and many herald it as the solution to most, if not all, compute and storage needs across the board. But how true is this, especially in science? Should we abandon on-premises datacenters and transfer years’ worth of effort to cloud-based environments? Or should the two be integrated to exploit the best of both worlds? EMBL-EBI has been piloting the adoption of cloud resources at many levels of its operations, ranging from shifting entire workloads into independent “compute islands”, to hybrid scenarios where cloud compute becomes an integral part of our on-premises resources, and all the way down to disaster recovery. This has required adapting, or defining from scratch, policies to cope with these new scenarios, in particular around the non-trivial issues of data privacy and procurement. It has also helped us define best practices for porting science applications, which now form the basis of our Research Operations (ResOps) training, because boarding the cloud requires new concepts and practices to take full advantage of the benefits it can offer. Most pipelines, and the infrastructures underpinning them, will need to be reworked to fully unlock the benefits of this fundamentally different environment, where flexibility is everything and efficiency is key to a reasonable and sustainable bill. This presentation will provide insights into the lessons we’ve learned, and taught, in our efforts to reach for the clouds, and our experience in EU projects such as “Helix Nebula – The Science Cloud”, which will soon deliver important results on how commercial clouds can be procured and exploited.
Demystifying Hybrid Cloud with Microsoft Azure Mike Kiernan, Microsoft This session will cover a handful of example real-world scenarios and technologies for migrating research workloads into Azure in a hybrid setting.
Session 2b – Practical challenges
Aerospace and Cloud Computing Leigh Lapworth, Rolls Royce Cloud computing offers numerous advantages to the aerospace sector but the rate of adoption is driven by the need to have (a) regulatory compliance, (b) robust security models and (c) reliable economic forecasting models. This presentation will discuss some of the requirements for running simulation workflows in the cloud.
Processing patient identifiable data in the cloud – what you need to consider technically and process-wise to keep your data safe. Peter Rossi, UKCloud
Jisc ExpressRoute Circuit Service David Salmon and Gary Blake, Jisc ExpressRoute allows Jisc members to extend their on-premises networks into the Microsoft cloud over a virtual private circuit facilitated by Jisc. The presentation outlines the features of the product and gives a technical overview of the connectivity models.
The Janet End-to-End Performance Initiative Duncan Rand, Jisc We are seeing a growth in data-intensive science applications and flows. Whilst the current state of the art involves the transfer of large datasets between sites at up to 30 Gb/s, there are many researchers who are frustrated by an inability to transfer data at an acceptable speed. The Janet End-to-End Performance Initiative aims to help sites and researchers improve data transfer and so make optimal use of their Janet connectivity. I will describe the Initiative and some typical use cases.
Session 3a – Innovative applications, usability and training
Visualizing Urban IoT data using Cloud Supercomputing Nick Holliman, Newcastle University In the last year the commercial cloud has begun to provide access to high-performance visual computing on demand via IaaS GPU provision. This allows research groups to plan and use supercomputing-scale resources in visualization projects where previously it would have been unaffordable to do so. We will describe and demonstrate our pilot terapixel visualization project, the Terascope, and argue that the cloud opens a new research path to explore a range of novel, accessible visualization techniques.
Accelerate time-to-insight with a serverless big data platform Hatem Nawar, Google Cloud
Azure at the Turing Martin O’Reilly, Turing Institute An overview of how The Alan Turing Institute is using Azure to support its research programme. The talk describes how we use Azure at the Turing and discusses the benefits and challenges of using the cloud for our primary institutional level compute resource. It discusses how we support researchers in using Azure and how we hope to make cloud computing easier for our researchers in future.
HPC – There’s plenty of room at the bottom Mike Croucher, University of Sheffield Despite an increasing reliance on software, large datasets and complex simulations, traditional HPC is irrelevant for most researchers. In this talk I will discuss why this is the case, explore options for the future and discuss what Research Software Engineers need from computational substrates.
Session 3b – Virtual Laboratories and Research Environments
CLIMB Thomas Connor, Cardiff University / Nick Loman, Birmingham University The large datasets that are routinely generated by modern biological instruments frequently outstrip the analytical and storage capacity available to microbiological research groups. Despite investment in national cloud resources, many biologists, even those familiar with data analysis, find that for all the power and flexibility of cloud, using cloud resources introduces new complexity to their analyses (for example around provisioning VMs and installing software stacks). Within the MRC-funded CLoud Infrastructure for Microbial Bioinformatics (CLIMB) we have sought to mitigate the complexities of cloud by developing a launcher that enables researchers to rapidly provision preconfigured instances on our federated cloud infrastructure, on demand. Our default preconfigured instance is the Genomics Virtual Laboratory, providing users with a personal research gateway preconfigured with web services including Galaxy, Jupyter Hub and a range of command-line tools. In addition, we also host community resources (such as EDGE) and provide tools for users to develop, and share, their own VMs within the system. In this talk we will introduce our system and our service, with a particular focus on how our launcher and the GVL have driven adoption of CLIMB by the community, to the extent that we now support over 290 research groups across the UK.
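As a hedged illustration of the kind of call such a launcher makes under the hood (not CLIMB's actual code), the sketch below boots a preconfigured appliance image on an OpenStack cloud using the openstacksdk Python client; the image, flavour, network and key names are placeholders.

```python
# Hedged sketch: booting a preconfigured appliance image (e.g. a GVL-style
# image) on an OpenStack cloud with openstacksdk. All names are placeholders.
import openstack

conn = openstack.connect(cloud="climb")                   # placeholder clouds.yaml entry

image = conn.compute.find_image("gvl-appliance")          # preconfigured appliance image
flavor = conn.compute.find_flavor("climb.user")           # placeholder flavour
network = conn.network.find_network("user-net")           # placeholder tenant network

# Launch the instance and wait until it is active.
server = conn.compute.create_server(
    name="my-gvl-instance",
    image_id=image.id,
    flavor_id=flavor.id,
    networks=[{"uuid": network.id}],
    key_name="my-keypair",
)
server = conn.compute.wait_for_server(server)
print("Instance active at:", server.access_ipv4)
```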
CyVerse UK: a Cloud Cyberinfrastructure for life science Alice Minotto, Earlham Institute Minotto A., Davey R.P. (Earlham Institute, Norwich Research Park, Norwich, NR4 7UZ)

CyVerse UK aims to provide life science researchers all over the world with HPC access, with a geographical advantage for users located in the UK or EU, while tackling many known issues in bioinformatics, such as the need for training, difficulties in reproducing analyses, and problems with dependencies and installation processes. The core CyVerse cloud consists of an HTCondor pool, which includes 12 worker nodes and 3 submit machines that act as entry points for different projects. The existing integration between Condor and Docker provides increased flexibility, decoupling application development from the underlying infrastructure. We also hope the use of Docker will encourage both software developers and end users to contribute actively to the growth of the project. CyVerse UK was born as a branch of CyVerse, a now mature US project. This collaboration is reflected, on one side, in efforts to federate the data storage of the two projects through iRODS and, on the other, in work to allow flocking from the US to the UK Condor pool, which will allow jobs to run in the most favourable location. The CyVerse OpenStack cloud also hosts other projects, with the purpose of enabling interoperability and data sharing.
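As a hedged sketch of the Condor/Docker integration described above (not the CyVerse UK implementation), the HTCondor Python bindings can queue a Docker-universe job roughly as follows; the image, script and resource values are placeholders, and the transaction-style submission shown reflects the bindings of that era.

```python
# Minimal sketch of a Docker-universe HTCondor job via the htcondor Python bindings.
# The container image and script are placeholders; submit attribute names follow
# the classic condor_submit keywords and may differ slightly between releases.
import htcondor

submit = htcondor.Submit({
    "universe": "docker",
    "docker_image": "quay.io/biocontainers/fastqc:0.11.8--1",  # example image
    "executable": "run_analysis.sh",            # script shipped with the job
    "transfer_input_files": "sample_01.fastq.gz",
    "request_cpus": "2",
    "request_memory": "4GB",
    "output": "job.out",
    "error": "job.err",
    "log": "job.log",
})

schedd = htcondor.Schedd()                      # local submit machine
with schedd.transaction() as txn:               # classic transaction-based submission
    cluster_id = submit.queue(txn)
print("Queued HTCondor cluster:", cluster_id)
```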

EBI Cloud Portal Jose Dianes, EMBL-EBI The traditional pipeline model for biological data is being challenged by the increasing size and geographical distribution of the data. The massive size of datasets produced by advances in technology, and the need to analyse data coming from different sources, make it harder to run pipelines in-house at scale. Cloud computing addresses many of these issues. By leveraging additional resources on demand, we can deal with scalability problems when they occur. Moreover, when data are stored in the cloud, we can deploy a pipeline close to them, effectively moving compute to data and avoiding the problems associated with transferring datasets over long distances and across legal jurisdictions. The EBI Cloud Portal is being developed to provide scientists and cloud experts with a platform where computational biology becomes scalable, reproducible, and abstracted away from the complexity associated with different cloud providers. It provides an application model where developers can focus on these challenges in order to make cloud-ready applications available for scientists to use on the cloud. Any cloud provider can be used given the right credentials and configuration, provided either by the scientists themselves or shared within organisational teams.
Data Labs: A Collaborative Analysis Platform for Environmental Research Jamie Downing, Josh Foster, Tessella Data Labs is a new collaborative platform built on the unmanaged JASMIN cloud. It is designed to make it easier for NERC researchers to access collaborative analysis tools. In this brief introduction we will show: what a Data Lab can do now; the architecture and technologies that make this possible; and our roadmap of new features.
Session 4a – Technical Challenges – batch compute on cloud
Matching cloud technologies to Theoretical Astrophysics and Particle Physics applications Jeremy Yates, UCL We present a software engineering methodology for determining which application types in Theoretical Astrophysics and Particle Physics are best suited to the hardware and cloud technologies offered by cloud providers. This study, funded by STFC DiRAC, will look at the suitability of cloud technologies provided by OpenStack, Azure, Singularity and containers for applications that are embarrassingly parallel, loosely parallel, weakly parallel and strongly parallel.
Hybrid HPC – on-premise and cloud. Wil Mayers, Alces Flight At first it was cloud vs. on-premise HPC… but what about a balance between the two?
Running HPC Workloads on AWS using Alces Flight Igor Kozin, ICR Public clouds have democratised HPC. Anyone can start a cluster in a cloud using tools such as Alces Flight with minimal knowledge of how it is done and quickly get a familiar HPC environment. While starting a personal HPC cluster on demand brings great benefit, the user is typically required to have elevated privileges to do so, and with privileges come responsibilities that the user may not want. We describe a simple solution for such situations which requires only read and write permission to a user folder on S3. By creating a simple configuration file, a user can start a pre-configured Alces Flight cluster and then kill it just as easily. We will describe our implementation, which uses S3 triggers and AWS Lambda.
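The abstract gives no implementation details beyond S3 triggers and AWS Lambda, so the following is only an assumed shape of such a handler (not the ICR implementation): it reads the user's configuration object from S3 and launches a cluster stack through CloudFormation. The bucket layout, template URL and parameter names are invented for illustration.

```python
# Assumed shape of an S3-triggered Lambda handler that launches a pre-configured
# cluster from a CloudFormation template. Template URL, config format and
# parameter names are hypothetical.
import json
import boto3

s3 = boto3.client("s3")
cfn = boto3.client("cloudformation")

TEMPLATE_URL = "https://s3.amazonaws.com/example-bucket/alces-flight.template"  # placeholder

def handler(event, context):
    # The S3 put event identifies the user config file that was just written.
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = record["object"]["key"]

    # Read the user's simple configuration file (assumed to be JSON here).
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    config = json.loads(body)

    # Launch a pre-configured cluster stack on the user's behalf.
    cfn.create_stack(
        StackName=config["cluster_name"],
        TemplateURL=TEMPLATE_URL,
        Parameters=[
            {"ParameterKey": "ComputeType", "ParameterValue": config["instance_type"]},
            {"ParameterKey": "InitialNodes", "ParameterValue": str(config["nodes"])},
        ],
        Capabilities=["CAPABILITY_IAM"],
    )
    return {"launched": config["cluster_name"]}
```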
OpenFOAM batch compute on AWS James Shaw, Reading University Our research group writes atmospheric model code using OpenFOAM, an open source computational fluid dynamics C++ library. Amazon Web Services forms a key part of our computational workflow: atmospheric simulations run on EC2 virtual machines inside Singularity containers, and results are uploaded automatically to S3 storage. A code commit to GitHub triggers a build that compiles, tests, and deploys Debian packages to a web repository backed by S3 storage. This way, code updates are distributed to atmospheric modellers by the standard system update process. We’re interested to learn from other groups about their computational workflows. How can work be scheduled across a cluster of cloud compute nodes? Can compute nodes be started and stopped dynamically? And how do we grow interest and support amongst researchers to create better, automated computational workflows?
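As a small, hedged illustration of the "results uploaded automatically to S3" step (not the group's actual tooling), a post-run hook might do little more than the following; the bucket name, key prefix and output directory are placeholders.

```python
# Sketch: push every file from a finished run's output directory to S3 with boto3.
# Bucket name, key prefix and output directory are placeholders.
import pathlib
import boto3

s3 = boto3.client("s3")
results_dir = pathlib.Path("postProcessing")     # placeholder OpenFOAM output directory

for path in results_dir.rglob("*"):
    if path.is_file():
        s3.upload_file(
            Filename=str(path),
            Bucket="example-atmos-results",      # placeholder bucket
            Key=f"runs/2018-01-15/{path.relative_to(results_dir)}",
        )
```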
Session 4b – Technical Challenges – Storage
Semantic Storage of Climate Data on Object Store Neil Massey, NCAS / Centre for Environmental Data Analysis, STFC Object stores, and cloud-based storage accessed via the Amazon S3 HTTP API, present a number of opportunities and challenges for storing climate data. The distributed nature of the objects allows large datasets to be broken into “fragments”, each fragment containing a subset of the data. This allows parallel access to the fragments, improving the performance of reading the data across a network. However, it also presents a number of problems: firstly, determining the fragment size and the optimum method of splitting multi-dimensional data; secondly, enabling meaningful search of data when the data may be widely distributed. This poster will present a new method of splitting netCDF files into fragments and storing each fragment as an object in an S3-compatible object store. The locations of the fragments, and the metadata and dimensions for each climate variable, are stored in a master file, which can be written to a location outside the object store, for example a POSIX file system on an SSD. This allows fast search of the metadata without having to reassemble the fragments. Each fragment is stored as a self-contained netCDF file, allowing reconstruction of the data if the master file is lost.
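To make the fragment-plus-master-file idea concrete, here is a hedged sketch (not the actual implementation) that slices one variable of a netCDF file along its first dimension, uploads each fragment to an S3 bucket, and records the layout in a small JSON master file kept outside the object store; the variable, bucket and file names are placeholders, and a real master file would also carry the variable's metadata and dimension values.

```python
# Hedged sketch of splitting one netCDF variable into fragments stored on S3,
# with a JSON "master" file describing the layout. Names are placeholders.
import json
import boto3
from netCDF4 import Dataset

SRC = "tas_monthly.nc"          # placeholder input file
VAR = "tas"                     # placeholder variable name
BUCKET = "example-climate-store"
FRAG_LEN = 12                   # time steps per fragment

s3 = boto3.client("s3")
master = {"variable": VAR, "fragments": []}

with Dataset(SRC) as src:
    var = src.variables[VAR]
    time_len = var.shape[0]

    for start in range(0, time_len, FRAG_LEN):
        stop = min(start + FRAG_LEN, time_len)
        frag_name = f"{VAR}_{start:06d}_{stop:06d}.nc"

        # Write the fragment as a small, self-contained netCDF file.
        with Dataset(frag_name, "w") as frag:
            frag.createDimension(var.dimensions[0], stop - start)
            for i, dim in enumerate(var.dimensions[1:], start=1):
                frag.createDimension(dim, var.shape[i])
            out = frag.createVariable(VAR, var.dtype, var.dimensions)
            out[:] = var[start:stop]

        # Upload the fragment and record where it lives.
        key = f"fragments/{frag_name}"
        s3.upload_file(frag_name, BUCKET, key)
        master["fragments"].append({"key": key, "time_range": [start, stop]})

# The master file lives outside the object store (e.g. on a local POSIX/SSD filesystem).
with open(f"{VAR}_master.json", "w") as fh:
    json.dump(master, fh, indent=2)
```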
Accessing S3 with FUSE Jacob Tomlinson, Informatics Lab My data lives in an object store, but my tools expect a POSIX file path; what do I do? In the Informatics Lab we have been experimenting with using FUSE filesystems on a distributed compute cluster to provide parallel access to files on AWS S3. We have also created a library called pysssix which provides a slim, intentionally minimal interface to S3 for fast data access.
OpenStack Manila John Garbutt, StackHPC As part of the Square Kilometre Array (SKA) project we have been exploring OpenStack Manila for Filesystem-as-a-Service with Ceph and Gluster. I would like to discuss what we have found.
Providing Lustre access from OpenStack Thomas Stewart / Francesco Gianoccaro, Public Health England Providing access to Lustre from OpenStack (using Lustre via an external network for single tenancy, in production with PHE’s on-premises HPC cloud).
Implementing a medical image processing platform using OpenStack and Lustre Wojciech Turek, Cambridge University In this talk we will describe our experience and the challenges of integrating HPC Lustre storage with OpenStack for a production image processing platform at the Wolfson Brain Imaging Centre.