Neil Massey, NCAS / Centre for Environmental Data Analysis, STFC
Object stores, and cloud-based storage that uses the Amazon S3 HTTP API, present a number of opportunities and challenges for storing climate data. The distributed nature of the objects allows large data sets to be broken into “fragments”, each containing a subset of the data. The fragments can then be read in parallel, which improves the performance of reading the data across a network.
However, fragmenting data in this way presents two problems. The first is determining the fragment size and the optimum method of splitting multi-dimensional data. The second is enabling meaningful search of the data when it may be widely distributed.
This poster will present a new method of splitting netCDF files into fragments and storing each fragment as an object in an S3-compatible object store. The location of each fragment, together with the metadata and dimensions for each climate variable, is stored in a master file, which can be written to a location outside the object store, for example a POSIX file system on an SSD. This allows fast search of the metadata without having to reassemble the fragments. Each fragment is stored as a self-contained netCDF file, allowing the data to be reconstructed if the master file is lost.
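As a rough illustration of the master-file idea, the sketch below tiles a multi-dimensional variable into regular fragments and records where each fragment would live. It is not the method presented in the poster: the JSON-style layout, the key names, the regular tiling, the fragment naming scheme, the bucket URL, and the functions `fragment_slices` and `build_master` are all assumptions made for the example, and no data is actually written to an object store.

```python
import itertools
import json
import math


def fragment_slices(shape, fragment_shape):
    """Yield (index, slices) for each fragment in a regular tiling of `shape`.

    `slices` holds one (start, stop) pair per dimension; edge fragments are
    clipped so that they never extend past the end of a dimension.
    """
    counts = [math.ceil(n / f) for n, f in zip(shape, fragment_shape)]
    for index in itertools.product(*(range(c) for c in counts)):
        slices = [
            (i * f, min((i + 1) * f, n))
            for i, f, n in zip(index, fragment_shape, shape)
        ]
        yield index, slices


def build_master(variable, dims, shape, fragment_shape, attrs, bucket_url):
    """Build a hypothetical master record for one variable.

    The record keeps the variable metadata and dimensions next to the
    location of every fragment, so it can be searched without fetching
    any fragment from the object store.
    """
    fragments = []
    for index, slices in fragment_slices(shape, fragment_shape):
        # Assumed naming scheme: <variable>.<i>.<j>.<k>.nc
        name = "{}.{}.nc".format(variable, ".".join(str(i) for i in index))
        fragments.append({
            "location": "{}/{}".format(bucket_url, name),
            "index": list(index),
            "slices": [list(s) for s in slices],
        })
    return {
        "variable": variable,
        "dimensions": dict(zip(dims, shape)),
        "attributes": attrs,
        "fragments": fragments,
    }


# Example values are illustrative only.
master = build_master(
    variable="tas",
    dims=["time", "lat", "lon"],
    shape=[365, 180, 360],
    fragment_shape=[100, 90, 180],
    attrs={"standard_name": "air_temperature", "units": "K"},
    bucket_url="s3://example-bucket/tas",
)

# The serialised record could be written to a fast local file system.
master_text = json.dumps(master, indent=2)
print(len(master["fragments"]), "fragments")
```

Because each record carries the fragment's position within the full array, a reader could pick out only the fragments that overlap a requested region and fetch those in parallel.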