Programme for UKRI Cloud Workshop 2022

The workshop was held on March 29th at the Francis Crick Institute in central London.

Programme now includes links to slides – updated 27/04/2022

Timing	Plenary / Strand A Sessions Auditorium 2	Strand B Sessions Auditorium 1
09:15	Registration + teas/coffee
10:15	Introduction
10:25	Opening Plenary Chair: Cristin Merritt
	Reaching the cloud – Steven Chapman, University of Bath
	Question time
10:55	Break – teas/coffee
11:15	Session 1a – User Experiences Chair: Jay DesLauriers	Session 1b – User Experiences Chair: Cristin Merritt
	Transitioning research computing workloads to the cloud: A thematic approach at Cardiff University – Tom Green, Cardiff University	Global Symmetry is important for the detection of abnormality in mammograms – Cameron Kyle-Davidson, University of York
	On the creation of a secure ‘serverless’ workflow between a Mapbox frontend and a SalesForce backend for the Tekkatho Foundation – Mike Jones, Independent Researcher	Capturing a Moment in the Cloud Workshop: How our perception and use of cloud computing has changed over time – Wil Mayers, Alces Flight
	Reducing time-to-science with self-service HPC and AI platforms in the Azimuth portal – Matt Pryor, Stack HPC	Crawlers, Bots, Flows, Lambdas, Glues and Autopilots: Applying AI and ML to Radiological Sensor Networks for Safety and Security – Peter Martin, University of Bristol
	Question time	Question time
12:15	Lunch – buffet lunch
13:30	Plenary (invited speakers) Chair: Cristin Merritt
	Developing and using the UK Biobank Research Analysis Platform, a large-scale Trusted Research Environment – Oliver Gray and Przemyslaw Stempor, UK BioBank
	CLIMB-COVID: Cloud Infrastructure to Support the UK’s Covid-19 Response – Radoslaw Poplawski and Nick Loman, University of Birmingham
	Question time
14:15	Session 2a – Trusted Research Environments Chair: David Fergusson	Session 2b – HPC in the Cloud – Applications Chair: Stig Telfer
	TREEHOOSE: Trusted Research Environment and Enclave Hosting Open Original Scientific Exploration – Simon Li, University of Dundee	Twins in the Cloud: Simplifying the Deployment of Digital Twins for Manufacturing-as-a-Service – Jay DesLauriers, University of Westminster
	Data Safe Haven Classification and Trusted environments in the cloud: extending a Turing Django based application across multiple institutions – Rebecca Osselton, Newcastle University	An Introduction to DosNA: Distributed Numpy Arrays for High-performance cloud computing – Gabryel Mason-Williams, Rosalind Franklin Institute
	The Genes And Health TRE in the Google cloud – Vivek Iyer, Wellcome Sanger Institute	Maintaining versioned 3D digital designs using a hybrid and multi-cloud solution – Niall Kennedy, YellowDog
	Question time	Question time
15:15	Break – teas/coffee
15:35	Session 3a HPC in Cloud Chair: Alex Dibbo	Session 3b Covid & Cloud Chair: Stephanie Thompson
	The PITHIA-NRF e-Science Centre – towards a Cloud-based Platform to support Ionosphere, Thermosphere, and Plasmasphere Research – Tamas Kiss, University of Westminster	Supporting UK Covid-19 surveillance with AWS Step Functions and Fargate at Wellcome Sanger Institute – Sam Proctor, Wellcome Sanger Institute
	King’s CREATE: a new research computing ecosystem and the journey so far – Matt Penn, King’s College London	Laying the groundwork for discovering the next novel coronavirus – Brendan Bouffler, Amazon Web Services
	Ensuring fairer access and reducing obstacles to research in fixed capacity clouds – Paul Browne, University of Cambridge & Pierre Riteau, Stack HPC	Cloud-based nonequilibrium simulations to investigate regulation and environmental effects in the SARS-CoV-2 spike protein – Sofia Oliveira, University of Bristol
	Question time	Question time
16:35	Reconvene in Auditorium 2 …
16:45	Final Plenary Chair: James Grant
	DRI Update – Justin O’Byrne, UKRI Digital Research Infrastructure
17:05	Sum-up, feedback, next steps
17:10	Reception – drinks, light refreshments
18:00	Close


Featured by Steph Thompson

UKRI Cloud Workshop 2022 – Call for Participation

Our next workshop is on March 29th 2022!

We will be back at The Francis Crick Institute in central London. As with previous events we are looking forward to a varied programme from the UK research community. We expect the workshop to be a mix of technical talks with researchers reporting on their use of cloud technologies.

Abstract Submission

We’re looking for talk submissions covering all aspects of research computing using cloud, both public and private. We wish to have a diverse set of viewpoints represented at this workshop and encourage individuals and institutions of all backgrounds (for example academic, technical, business, or user experience) to apply.

We’ve provided some suggested topics and themes below, but submissions outside of these areas are also welcome. Talk sessions are typically 20 minutes and should include time for questions. We’d also like to record and share the videos and slides afterwards, so please make it clear in your submission if this is likely to be an issue. To submit an abstract complete this form:

https://forms.gle/u7XEPpB79o2Y4rM27

The deadline for submissions has been extended to Monday 24th January (EOB). The Cloud Working Group will review the submissions and we’ll let successful submitters know by 31st January.

Workshop Presentation Format

Due to the ongoing situation, we may need to limit attendance. As such, this year we will be providing an in-person and live streaming experience and we can now accommodate virtual speakers for this event.

Please note that all plans being made are subject to change based on the current health and safety guidance set by the UK Government.

Proposed Workshop Themes

High Performance Computing (HPC)

We would like to hear from those who have user stories involving running HPC-class workloads in public cloud. Stories could also cover, for example, using cloud-native methods to create software-defined HPC infrastructure, hybrid solutions that extend on-premise compute infrastructure with cloud bursting, or adapting HPC workflows to exploit cloud-native technologies.

Cloud Pilots and User Experiences

We would like to hear from operators and users about their experiences of running scientific workloads in private and public cloud environments. How does this compare with traditional HTC and HPC facilities? Have you found any advantages and/or disadvantages that we should know about? Do you use any abstraction layers to make them more usable?

Hybrid Cloud

We are looking for examples of deployments which bridge the gap between on-premise infrastructure and public cloud, or between cloud providers. This could include, for example, efforts to make workloads portable between clouds, creating cloud services or cloud access to enhance the current solution offered, or technologies to support migration and bursting. In addition, topics covering data movement/migration and data collaboration/sharing have been of particular interest to our community in the past and would fit within this theme.

UKRI and Trusted Environments

Following the recent UKRI and DARE UK call to inform the design of cross-council digital research environments, we are keen to hear from projects that are using cloud to enable research and collaboration with sensitive data. We are particularly keen to hear from groups that have been awarded funding – to provide an early platform for the projects to share previous experience that helped them secure funding and to hear what they aim to achieve, as well as how they plan to share their solutions with the wider community.

COVID and Cloud

The global COVID pandemic has presented unprecedented challenges in health and economics, and research has been at the forefront of addressing these, from modelling transmission to simulating viral proteins and treatments. We are keen to showcase stories from the research community where cloud has enabled projects, to share practices for operating under demanding conditions and time constraints, and also to celebrate the work that is helping to ease us out of the pandemic restrictions.

Performance/ResOps

The recent COP26 conference has fired the starting gun on reducing greenhouse gas emissions to Net Zero. It is no longer an option to simply write code, applications and workflows without ensuring these have been performance-optimised within reason. Neither cloud resource and technology providers, application providers nor cloud users can leave it to each other to ensure that workflows have as small a carbon footprint as possible. We are keen to hear from the cloud communities about work that has increased or maintained performance while reducing energy use. In particular we would like to hear about the role of the ResOps professional in ensuring that workflows and applications interact with cloud technologies in a power-efficient manner. We hope this will share best practice and perhaps lead to further workshops and work in this challenging area.

Proposed talks are not limited to these themes but can also be in other areas of interest. Past themes have included storage, governance, IoT/data analytics and challenges faced and overcome when implementing a cloud solution at both business and technical levels.

The Program Committee listed here will make the final decision on the inclusion of any presentations to the meeting.

Featured by philipjkershaw

Thoughts from Cloud Workshop 2019

It’s a couple of months since the workshop, which has left plenty of time to let the dust settle and reflect on the content. You can find most of the presentations from the workshop if you follow the links from the programme.

As I mentioned in my introduction at the meeting, I’ve noticed a transition over the past year in the adoption and application of cloud, and this is evident in the abstracts submitted for this meeting. There are signs of maturing – in the first couple of annual workshops we held, cloud usage was very much at the experimental stage with early forays into private cloud deployment and first pilots testing out public capability. This year there were good examples of sophisticated application of cloud technology, whether cloud-native applications like Chris Woods’ (use of serverless to dynamically trigger provision of clusters for batch computing) or in-depth demos of DevOps tooling from StackHPC and others.

Late last year, the Cloud WG ran a smaller technical meeting with no formal agenda – in ‘unconference’ style. This gave us an opportunity to do more of a deep dive with DevOps technologies. The positive feedback we received reflected the value in networking and learning together with peers. Something of this continued at this year’s workshop with the afternoon demo session. It was great to have this in-depth technical input alongside higher-level presentations, whether overviews of projects or talks around challenge areas such as policy. João Fernandes spoke about the OCRE project in his presentation. This builds on the work of the GÉANT IaaS Framework, which is important for establishing agreements with public cloud providers for access to their resources for the research community.

On the topic of policy, the debate continued around the relative merits of public cloud versus on-premise hosting. Cliff Addison (University of Liverpool) highlighted the tensions between quantifying benefits, budgeting at scale and maintaining portability between cloud vendors. Owen Thomas (Red Oak Consulting) challenged assumptions about traditional HPC provision and made the case for assessing overall value, not just cost, when making comparisons with public cloud. Andrew Jones (NAG) argued against absolutes when considering the complexities in choosing hosting for any given application. Migration to cloud can present enormous challenges, as Tony Wildish’s presentation illustrated. He provided a walkthrough of different approaches for migrating legacy code developed for on-premise to operate efficiently on cloud, drawn from EMBL-EBI’s experiences. Elsewhere in the meeting, the HEPCloud and UKAEA presentations showed how hybrid models can be built up to select the required computing resources from on-premise and public cloud resources. HEPCloud in particular illustrated the benefit of using public cloud to absorb overspill from research infrastructure in order to meet peaks in demand.

CRC Canada is an example of a complete public cloud solution architected from the ground up. What is interesting here is the organisational and cultural shifts needed to support that model, in particular the setting up of dedicated effort for auditing and accounting when moving to a consumption-based approach to billing. Pangeo – presented by the Met Office Informatics Lab – demonstrates another cloud-enabled solution, but what is of interest is the formation of a collaboration bringing together open source solutions to make a platform that is cloud-ready. At its core is a virtual research environment built largely on Jupyter and Dask together with use of Kubernetes and deployment glue to make it cloud-agnostic. This kind of solution fits data analytics, where typically datasets have been imported into a cloud environment and manipulated into a form that is analysis-ready. Use of BinderHub – shown with Pangeo and Sarah Gibson’s demo (Turing) – allows infrastructure to be dynamically provisioned and scientific workflows conveniently shared via Jupyter Notebooks.

In general though, examples of long-term hosting of large volumes of research data on public cloud are still absent. If there’s a pattern from the sample of submissions for the workshop, it’s one of use of public cloud for compute rather than data storage: continued use of on-premise for long-term hosting of data with some bursting to public cloud for batch computing. Cloud is utilised as a means to obtain an ephemeral computational resource: set up an environment, stage data, perform the calculation, get results and tear down again. Even so, there appeared to be an increased awareness of the challenges of data hosting with cloud in some of the questions and discussion in the sessions. These included issues around hybrid, public cloud and multi-cloud scenarios. For example, if data is hosted in one cloud, how can it be accessed readily by a client with an existing presence hosted on another cloud? There are definite signs of progress in the community but clearly there are still big challenges for cloud to be more fully utilised for research workloads.
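
The ephemeral pattern described above (set up, stage data, calculate, collect results, tear down) can be sketched with a Python context manager. This is only an illustration: `ephemeral_environment` and `run_job` are hypothetical names, and appending to a list stands in for the real provisioning and teardown calls that any particular cloud would need.

```python
from contextlib import contextmanager


@contextmanager
def ephemeral_environment(log):
    """Provision a short-lived environment and always tear it down."""
    log.append("set up")            # stand-in for creating cloud resources
    try:
        yield {"staged": []}        # hypothetical handle passed to the caller
    finally:
        log.append("tear down")     # runs even if the calculation fails


def run_job(inputs, log):
    """Stage data, compute, and return results that outlive the environment."""
    with ephemeral_environment(log) as env:
        env["staged"].extend(inputs)    # stage data into the environment
        log.append("stage data")
        result = sum(env["staged"])     # placeholder for the real calculation
        log.append("compute")
    return result


if __name__ == "__main__":
    steps = []
    print(run_job([1, 2, 3], steps), steps)
```

Putting teardown in the `finally` clause means resources are released even when a job fails, which matters when billing is by consumption.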

Featured by philipjkershaw

Programme for the UKRI Cloud Workshop

The programme is available here:

https://cloud.ac.uk/workshops/feb2019/

You can participate through Twitter (#cloudwg) and contribute to a crowdsourced write-up via Etherpad.

Francis Crick Institute, venue for the workshop

Featured by philipjkershaw

Save the date 12 Feb 2019 – next Cloud Workshop

We will be holding our 4th annual workshop early next year on the 12th February 2019. We’re pleased to be back at our familiar venue the Francis Crick Institute in central London. Please save the date!

In past years we’ve had a great set of speakers from public cloud companies and major research institutes to individual researchers reporting on how they are exploiting cloud computing to meet their research goals. More details to follow soon.

Featured by Adam

EBI ResOps Course 3rd July 2017

Recently the European Bioinformatics Institute (EMBL-EBI), an outstation of the European Molecular Biology Laboratory based at the Wellcome Genome Campus, has been running a course on what they are calling ‘ResOps’. By this they mean essentially the application of DevOps-style practices and methods to research computing, notably pushing the latter in more of an infrastructure-as-code, cloud-native direction, thereby hoping to improve reproducibility and scalability.

The course was delivered by Dario Vianello and Erik van den Bergh, from the Technology and Science Integration team at the EBI. All materials from the course including presentations and code were available online, at https://bit.ly/resops_june. The team had also provided temporary accounts for course delegates on their Embassy Cloud, on which we could run practical exercises.

Dario contrasted two possible approaches to moving a scientific pipeline to the cloud – mimicking on-site infrastructure versus a cloud-native approach. The former is clearly easier, but does mean that you will miss out on many of the benefits of running on a cloud. The latter requires a lot more effort initially and may not be feasible, depending on things like application support for object storage, but it does offer great potential advantages, not least of all being lower costs.

As well as taking us through the process of porting a pipeline to a cloud, there were three lightning talks showing concrete examples of existing and ongoing ResOps work. These comprised Marine Metagenomics, RenSeq and the PhenoMeNal project. One common theme here was that the porting work provided a very valuable opportunity to profile and optimise the existing pipeline, deriving detailed performance characteristics that in some cases weren’t available before. In the Metagenomics case, it was necessary to remove site-specific code and assumptions that made sense when running on local compute resources. It is also important to work on the scalability of research codes, and to test scaling up on clouds, because in some cases it may be that beyond a certain point adding more cloud instances will not increase performance. If working on a public cloud, this would mean paying more for no benefit. The PhenoMeNal project was particularly impressive in its adoption of modern technologies for the virtual research environment portal, while the RenSeq work comprised an interesting example of containerising a workload and hosting the containers within VMs. They found that it was useful to include a specific reference genome in the Docker image, even though that meant a very large 9GB image size.
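
The scaling point above can be made concrete with a small model. The sketch below uses Amdahl’s law, which is my choice of illustration rather than anything presented on the course, and the 90% parallel fraction, 10-hour baseline and flat hourly rate are made-up numbers. It assumes every instance is billed for the full wall-clock time of the run.

```python
def amdahl_speedup(parallel_fraction: float, instances: int) -> float:
    """Speedup predicted by Amdahl's law for a given parallel fraction."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / instances)


def cost_per_run(base_hours: float, hourly_rate: float,
                 parallel_fraction: float, instances: int) -> float:
    """Total bill, assuming each instance is paid for the whole run."""
    wall_clock = base_hours / amdahl_speedup(parallel_fraction, instances)
    return wall_clock * instances * hourly_rate


if __name__ == "__main__":
    # Illustrative numbers only: 10-hour single-instance run, 90% parallel.
    for n in (1, 2, 4, 8, 16, 32, 64):
        speedup = amdahl_speedup(0.9, n)
        cost = cost_per_run(10.0, 1.0, 0.9, n)
        print(f"{n:>3} instances: speedup {speedup:5.2f}x, cost {cost:6.2f}")
```

Under these assumptions the speedup can never exceed 10x however many instances are added, while the bill keeps rising with the instance count. This is why profiling a pipeline before scaling it out on a public cloud is worthwhile.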

PhenoMeNal architecture

In the afternoon we worked on our own test application, using Terraform to create the cloud resources, and Ansible for configuration, all pulling from publicly available GitHub repositories.

The EBI Cloud Portal

I was very impressed with the thoroughness of the course and its realism. While recognising the advantages of cloud-native computing, the team were realistic about the challenges of moving scientific pipelines in that direction, and provided valuable real-world experiences that will help in making decisions about how and where to host those pipelines. The fact that the materials are completely open made it even more valuable. I thoroughly recommend the course to anyone interested in moving science to the cloud.

Featured by philipjkershaw

Cloud Workshop

The Francis Crick Institute

It’s been a few months since our November workshop so there’s been some time to digest and reflect on some of the common themes emerging. Having attended a couple of other conferences and workshops from my community (AGU and AMS) it’s interesting to compare.

Firstly, it was great to see such a variety of application areas represented. For this, our second annual workshop, we opened up the submission of abstracts and this made such a difference. There was a great response, with Life Sciences having the edge over other domain areas. We had 160 register and 120 attend on the day. It was fantastic to have the Crick as a venue. It worked really well.

The first session looked at applications of hybrid and public cloud. Two really interesting use cases (Edinburgh and NCAS, NERC) looked at trying out HPC workloads on public cloud. This raised issues around comparative performance and costs between public cloud and on-prem HPC facilities.

On AWS, Placement Groups allow instances to be placed close to one another to improve inter-node communication for MPI-based workloads. This showed comparable performance with Archer (the UK national supercomputer) for smaller workloads, but clearly there was some limit, as performance tailed off as the number of nodes increased, whereas Archer performance continued to scale linearly. This tallies with what I’ve seen anecdotally at the AMS conference, where there seemed to be increasing uptake of public cloud for Numerical Weather Prediction jobs (which need MPI). However, this seemed to be for smaller-scale workloads that can stay within the envelope of the node affinity features available.

Another theme was portability: what kinds of approaches can be used to engineer workloads so that they can be easily moved between providers? Andrew Lahiff from STFC presented a very different use case showing how container technologies can be used to run workloads for Particle Physics, where the focus is on high-throughput rather than HPC requirements and which are therefore much more amenable to cloud. This work has been done as part of a pilot for the Cloud Working Group to specifically investigate how containers and container orchestration technology can be used to provide an abstraction layer for cloud interoperability. A really nice slide showed Kubernetes clusters running on Azure and Google cloud managed from the same command line console app. Dario Vianello’s talk (EMBL-EBI) showed how an alternative approach using a combination of Ansible and Terraform can be used to deploy workloads across multiple clouds.

Microsoft’s Kenji Takeda presents on recent Azure developments

It was great to have talks from hyper-scale cloud providers AWS, Azure and Google. The scale in hyper-scale is as ever impressive as is the pace of change in technology: very interesting to see Deep Learning applications driving the development of custom hardware – TPUs and FPGAs. Plans underway to host data centres in the UK will ease uptake. OpenStack Foundation and Andy McNab’s talks showed examples of federation across OpenStack clouds.

In the private cloud session, Stig Telfer gave a nice illustration of network performance for VMs showing how a number of aspects of virtualisation can be changed or removed to progressively improve network performance towards line rate. Alongside talks on private cloud, the parallel session looked at Legal, Policy and Regulatory Issues, a critical area for adoption of public cloud. Steven Newhouse gave some practical insights from a recent cloud tender for EMBL. There is clearly a need for further work around these issues so that the community can be better informed about choices. This is something that the working group will be taking forward.

For this workshop, we experimented with an interactive session – bringing together a group of around 20 delegates to work together on some technical themes agreed ahead of time including bulk data movement and cloud-hosting of Jupyter notebooks. There was plenty of useful interaction and discussion but we will need to look at the networking provision for the next time to ensure groups can get on with technical work on the day.

We discussed next steps in the final session. There is clearly interest in taking particular areas forward from the meeting: focus groups on technical areas like HTC and use of parallel file systems with cloud, or groups organised around specific domains within the research community. Training also figured, in the form of a cloud carpentry course so that researchers can more readily get up and running using cloud. Looking forward, in each of these cases we’re looking for discrete activities with an agreed set of goals and something to deliver at the end. Where possible we’re seeking to support relevant work that is already underway and initiate new work where there are perceived gaps. We will be looking at running smaller workshops targeted at specific themes in the coming months as a means to engage and disseminate some of this work.

Phil Kershaw, STFC & Cloud-WG chair

April 26, 2022 by Steph Thompson

DRI Update

Justin O’Byrne, UKRI Digital Research Infrastructure

April 26, 2022 by Steph Thompson

Supporting UK Covid-19 surveillance with AWS Step Functions and Fargate at Wellcome Sanger Institute

Sam Proctor, Wellcome Sanger Institute

During the COVID-19 pandemic the Wellcome Sanger Institute has been responsible for providing PHE/UKHSA with timely information on COVID-19 lineages. In order to provide the most accurate picture of the pandemic, all samples required frequent re-processing as new lineages were detected. As a proof of concept we implemented a hybrid pipeline using Airflow and the AWS cloud to enable at-scale processing of all samples, resulting in processing times an order of magnitude shorter than on local infrastructure. Airflow was used to orchestrate tasks running locally whilst AWS Step Functions were used to manage tasks running in the cloud; this combination worked well in practice. The pipeline was architected to ensure sensitive data remained on local infrastructure. Use of the AWS CDK to author the cloud stack in C# and Python allowed for fast development and ease of environment separation. Docker images allowed us to reuse existing code and deploy it rapidly into AWS, whilst extensive use of AWS Lambda and AWS Fargate eliminated the need to manage clusters. We discuss the lessons learnt from this project and the benefits that we have seen. It serves as a useful reference for those wishing to undertake a similar project.

April 26, 2022 by Steph Thompson

Ensuring fairer access and reducing obstacles to research in fixed capacity clouds

Paul Browne, University of Cambridge & Pierre Riteau, Stack HPC

Presenting resources as cloud now provides a familiar access mode for many research disciplines. For large-scale HPC and ML, optimal infrastructure is still best provided on-premise, often in siloed systems that can create a disconnect in operation or service; recently, however, it has become possible to deliver such systems in a cloud-native form. Hybrid cloud, in which both on- and off-premise resources can be exploited, promises the best of both worlds as research organisations explore the most cost-effective way of providing computational resources across the full gamut of education and research. In this talk we present an overview of the on-premise Cambridge cloud and how we are presenting clusters via a CaaS portal that can be used to create and manage platforms within multiple clouds, heralding a path to wider exploitation of resources.

April 26, 2022 by Steph Thompson

King’s CREATE: a new research computing ecosystem and the journey so far

Matt Penn, King’s College London

In summer 2020 the King’s College London e-Research team started reviewing options for replacing an ageing HPC cluster and OpenStack private cloud. Our primary goal for the refresh was to build a tightly integrated ecosystem with a high degree of flexibility, catering to traditional scheduled HPC and more bespoke virtualised enclaves. Building King’s Computational Research, Engineering And Technology Environment (CREATE, launching Q1 2022) has taken us on a journey involving selection of OpenStack provisioning frameworks, opening a new data centre facility, adoption of Ubuntu and CephFS, institutional MFA integration, re-integration of incumbent storage and compute hardware, and re-tooling our approach to software builds and vulnerability management. Our talk will describe our experiences building this ecosystem from the ground up to support a highly diverse research community with workloads of all shapes and sizes.

April 26, 2022 by Steph Thompson

The PITHIA-NRF e-Science Centre – towards a Cloud-based Platform to support Ionosphere, Thermosphere, and Plasmasphere Research

Tamas Kiss, University of Westminster

The PITHIA Network of Research Facilities (PITHIA-NRF) project, funded by the European Commission’s H2020 programme, aims at building a distributed network, integrating observing facilities, data collections, data processing tools and prediction models dedicated to ionosphere, thermosphere and plasmasphere research. One of the core components of PITHIA-NRF is the PITHIA-NRF e-Science Centre that supports access to distributed data resources and facilitates the execution of various scientific applications on cloud computing infrastructures. The development is led by the University of Westminster, in strong collaboration with EGI. When designing and implementing the e-Science Centre, we follow a novel approach, based on the dynamic creation and instantiation of cloud-based reference architectures composed of multiple application components or microservices, described in the form of a deployment descriptor, that can be automatically deployed and managed at run-time. A reference architecture can include various components, such as generic or custom GUIs, data analytics, machine learning, simulation or other scientific applications, databases, and any other components that are required to realise a particular user scenario. This presentation focuses on the design principles of the e-Science Centre and demonstrates proof-of-concept case studies.

April 26, 2022 by Steph Thompson

Maintaining versioned 3D digital designs using a hybrid and multi-cloud solution

Niall Kennedy, YellowDog

CAE Tech’s OASES3D (Open Architecture Storage and Execution Service for 3D) feasibility project is designing and prototyping a software architecture for maintaining 3D digital design and analysis data in a robust, version-controlled, scalable and future-proof way. The architecture is platform-agnostic; however, a reference implementation is demonstrating and validating its feasibility as a core approach for management of design and analysis data at UKAEA (UK Atomic Energy Authority). A key feature of the OASES3D solution is to react to new versions of CAD or other design data by triggering analyses or simulations, so results can be tracked over the history of the design project. For the UKAEA STEP (Spherical Tokamak for Energy Production) project this is important as the design process will be long, complex and involve a large team, with a need for traceability of all decisions made. This dynamic creation of computing tasks results in a need for flexible provisioning of cloud compute resources and task scheduling. The scale of each task is not known in advance, and potential impacts of a design change could include a multitude of tasks. YellowDog has been chosen to provide this elasticity to create cloud compute clusters on demand at any scale, anywhere.

April 26, 2022 by Steph Thompson

An Introduction to DosNA: Distributed Numpy Arrays for High-performance cloud computing

Gabryel Mason-Williams, Rosalind Franklin Institute

The cloud primarily deals with data in object stores such as S3; however, HPC data processing is primarily done using file-based formats such as HDF5, which can make offloading data to the cloud difficult. DosNa is a Python wrapper that can distribute N-dimensional arrays over an object store server. The main goal of DosNa is to provide an easy and seamless interface to store and manage N-dimensional datasets over a remote cloud. It supports S3 and Ceph backends and allows parallelised data access through the MPI engine. Currently, features for converting HDF5 files to DosNa objects, an API to visualise data, object locking, BLOSC compression, and checksums are underway. This talk introduces DosNa and showcases the current features and what’s to come.
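
To illustrate the general idea of distributing an array over an object store, here is a minimal sketch that splits a 2-D array into fixed-size tiles and stores each tile under its own key. It is not DosNa’s API: the `ChunkedArray2D` class and the key naming scheme are hypothetical, and a plain Python dict stands in for an S3 bucket or Ceph pool.

```python
from itertools import product


class ChunkedArray2D:
    """A 2-D array split into fixed-size tiles, one stored object per tile.

    `store` is any dict-like mapping of key -> value. A plain dict is used
    here as a stand-in for a real object store (an assumption).
    """

    def __init__(self, store, name, shape, chunk):
        self.store, self.name = store, name
        self.shape, self.chunk = shape, chunk

    def _key(self, ci, cj):
        # Hypothetical naming scheme: "<dataset>/<tile row>.<tile column>"
        return f"{self.name}/{ci}.{cj}"

    def write(self, rows):
        """Split a list of lists into tiles and store each under its own key."""
        n_rows, n_cols = self.shape
        c_rows, c_cols = self.chunk
        for i, j in product(range(0, n_rows, c_rows), range(0, n_cols, c_cols)):
            tile = [row[j:j + c_cols] for row in rows[i:i + c_rows]]
            self.store[self._key(i // c_rows, j // c_cols)] = tile

    def read(self, i, j):
        """Fetch one element, touching only the tile that holds it."""
        c_rows, c_cols = self.chunk
        tile = self.store[self._key(i // c_rows, j // c_cols)]
        return tile[i % c_rows][j % c_cols]


if __name__ == "__main__":
    bucket = {}
    data = [[r * 6 + c for c in range(6)] for r in range(4)]
    arr = ChunkedArray2D(bucket, "demo", shape=(4, 6), chunk=(2, 3))
    arr.write(data)
    print(sorted(bucket), arr.read(3, 4))
```

Because each tile is an independent object, separate workers can read or write different tiles at the same time. That independence is the property that makes parallel access to an object store practical.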

April 26, 2022 by Steph Thompson

The Genes And Health TRE in the Google cloud

Vivek Iyer, Wellcome Sanger Institute

The Genes & Health (G&H) project is a large (c. 50,000 donors) population-based cohort of British Pakistanis and Bangladeshis. The project’s goal is to investigate the genetic contribution to common diseases in this community, and to identify rare homozygous gene knockouts and their consequences. This involves jointly analysing genetic data and electronic health record data inside a certified TRE. The project is a collaboration between QMUL, KCL, WSI and a consortium of pharma partners. The G&H TRE is provisioned and administered by us (WSI) entirely in a Google cloud environment built solely for this project. It allows separated bubbles of scientists from different organisations to work securely with virtual desktops on sensitive data, whilst still being able to share selected data between bubbles. The TRE also allows users to run High Performance Compute (HPC) within their secure bubbles. I will sketch why the project chose this TRE, its architecture, the way it’s used, and how we are approaching certification. (Note: The codebase was licensed from U Helsinki and written by Solita for the FinnGen project.)

April 26, 2022 by Steph Thompson

Data Safe Haven Classification and Trusted environments in the cloud: extending a Turing Django based application across multiple institutions

Rebecca Osselton, Newcastle University

Data Safe Havens provide a cloud-deployed, secure and robust research environment for dataset exploration. This data may be sensitive in nature and the Data Safe Haven gives institutions a trusted environment in which to develop and extend their research. Safe Havens require a security classification, ranging from a level where no sensitive data is used to the highest level of security, such as that needed by governments and defence agencies. The Data Safe Haven Classification app is a web-based Information Governance application that guides stakeholders through a process to determine the correct level of classification. Users have defined roles within the system and must answer a sequence of questions to determine the correct level of security. The app exists independently from the Safe Haven environment and allows the linking of multiple datasets across work packages, giving flexibility to institutions, while holding no sensitive data internally. Work to improve and increase the portability of the classification app is underway with multiple institutions including University College London, Newcastle University, University of Cambridge and the Alan Turing Institute. This presentation will discuss features of the app and challenges in its distribution across different institutions, in terms of technology and policy.

April 26, 2022 by Steph Thompson

TREEHOOSE: Trusted Research Environment and Enclave Hosting Open Original Scientific Exploration

Simon Li, University of Dundee

Trusted Research Environments (TREs) are critical for enabling research on sensitive data. Traditionally, they require large, up-front capital investment in specialist infrastructure, which can struggle to keep pace with user demands for increased power and flexibility. At the Health Informatics Centre (HIC) we have designed a TRE in the cloud to be able to scale with the additional demands made by complex imaging datasets and machine learning experiments. The implementation required considerable custom work, with a challenging learning curve for operations staff. We are developing an open-source toolkit including code and documentation to streamline the deployment of a public cloud TRE. It will share the knowledge and lessons learnt so far in developing and running the HIC TRE and assist more institutions in making their data securely accessible at scale. This includes processes for management of customised research environments, and examples of taking advantage of specialised cloud services that are challenging in a traditional TRE. The toolkit will enable future federated analytical workflows across TREs, since a common codebase, and ultimately open standards, aid portability of code and reproducibility.