Fake it until you make it — Using Kerchunk to read NetCDF4 data on AWS S3 as Zarr for rapid data access

Lucas Sterzinger
Published in pangeo · Jul 19, 2021


Note: This story has been updated to reflect the renaming of fsspec-reference-maker to kerchunk and the updated MultiZarrToZarr API.

This article and affiliated code/images are licensed CC-BY

Cloud storage is becoming a popular option for publishing large scientific datasets to the public. One example is NOAA GOES-16/17 (Geostationary weather satellite) data being hosted on Amazon Web Services S3 storage.

While data are available for rapid access, the file format of the data (NetCDF4) is not optimized for cloud storage and can therefore be slow to access in traditional cloud-native computing workflows.

Zarr is becoming a popular alternative to NetCDF4 for cloud storage and access, thanks to its parallel I/O and its ability to handle multi-petabyte datasets. More background on analysis-ready, cloud-optimized (ARCO) data can be found in a recent paper by the Pangeo community.

However, reformatting the entire GOES-16/17 archive to Zarr would require large amounts of computing and storage capabilities. This begs the question:

Is there a way to (easily) utilize the parallel chunk-reading of Zarr on a NetCDF file stored in the cloud?

The answer, it turns out, is “yes” thanks to a new project called kerchunk, part of the fsspec project. This code creates metadata JSONs describing existing NetCDF4/HDF5 files, their chunks, and where to access them. These references can then easily be used by Zarr or Xarray to open single- and multi-file datasets from the remote data and do fast computations.

A follow-along Jupyter notebook for this tutorial can be found at https://github.com/lsterzinger/kerchunk-medium-tutorial, with an interactive version available on the Pangeo Binder environment.

The README file in the Git repo also provides information on setting up the proper environment and installing kerchunk.

Let’s go!

Import the required packages

from kerchunk.hdf import SingleHdf5ToZarr 
from kerchunk.combine import MultiZarrToZarr
import xarray as xr
import numpy as np
import matplotlib.pyplot as plt
import s3fs
import datetime as dt
import logging
import fsspec
import ujson
from tqdm import tqdm
from glob import glob

List the files we will want to open

First, we need to list the files on S3 that we want to generate references for. We can do this using fsspec:

fs = fsspec.filesystem('s3', anon=True)
urls = ['s3://' + f for f in fs.glob("s3://noaa-goes16/ABI-L2-SSTF/2020/210/*/*.nc")]

This code returns all NetCDF files from the 210th day of 2020 (July 28, 2020) and prepends “s3://” to each URL.

Start a local dask client

import dask
from dask.distributed import Client
client = Client(n_workers=8)
client

Generate the JSONs

Now, we will loop over each file in the file list, open it with fsspec, then use SingleHdf5ToZarr to generate a JSON object and write it to disk as ./jsons/<netcdf_filename>.json.

def gen_json(u):
    # fsspec options for the remote file: anonymous access, no read caching
    so = dict(
        mode="rb", anon=True, default_fill_cache=False,
        default_cache_type="none"
    )
    with fsspec.open(u, **so) as inf:
        # Scan the HDF5 file and translate its chunk layout into reference metadata
        h5chunks = SingleHdf5ToZarr(inf, u, inline_threshold=300)
        with open(f"jsons/{u.split('/')[-1]}.json", 'wb') as outf:
            outf.write(ujson.dumps(h5chunks.translate()).encode())

Make sure the ./jsons/ directory exists, then run the gen_json() function on each URL with Dask:

import pathlib
pathlib.Path('./jsons/').mkdir(exist_ok=True)
dask.compute(*[dask.delayed(gen_json)(u) for u in urls])

You can also run it in serial without Dask with

for u in urls:
    gen_json(u)

This step will take some time: about 1.5 minutes with 8 Dask workers, or roughly 5–10 minutes in serial, on a Pangeo Cloud deployment.

You should then have 24 JSON files in the ./jsons directory. These files contain metadata on the NetCDF4 files’ location, as well as their variables and chunk locations. They can also be shared for others to open the same files with this method without having to generate their own metadata JSONs.
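If you’re curious what one of these reference files actually contains, here’s a quick peek. This is just a sketch: the exact key names (like the SST/0.0 chunk used below) depend on the variables and chunking of the particular GOES-16 file, and older versions of the tool wrote the references without the top-level "refs" wrapper.

from glob import glob
import ujson

first_json = sorted(glob("./jsons/*.json"))[0]
with open(first_json) as f:
    d = ujson.load(f)

refs = d.get("refs", d)      # newer reference specs nest everything under "refs"
print(list(refs)[:10])       # Zarr metadata keys plus chunk keys like "SST/0.0"
print(refs.get("SST/0.0"))   # a large chunk maps to [source_url, byte_offset, byte_length]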

Reading the individual files as Zarr with xarray

Now that we have the metadata JSONs created, it’s easy to tell xarray to open them as Zarr files, bypassing the NetCDF4/HDF5 APIs entirely.

First, we’ll get a list of the JSON files in our directory:

json_list = sorted(glob("./jsons/*.json"))

Then, we need to create an fsspec mapper for each file described by a reference JSON:

m_list = []
for js in tqdm(json_list):
    with open(js) as f:
        m_list.append(fsspec.get_mapper("reference://",
                      fo=ujson.load(f), remote_protocol="s3",
                      remote_options={"anon": True}))

Now m_list (or any of its contents) can be passed to xarray. To open all files with xr.open_mfdataset(), just make sure to specify engine='zarr':

ds = xr.open_mfdataset(m_list, combine='nested', concat_dim='t',
                       engine='zarr', coords='minimal',
                       data_vars='minimal', compat='override')

Making a combined JSON file

The method above, where a single JSON represents a single NetCDF file, works well for small datasets but does not scale well to larger ones.

But good news! We can combine a set of existing JSONs into one multi-datafile JSON. First, set up the MultiZarrToZarr class:

json_list = sorted(glob('./jsons/*.json'))

mzz = MultiZarrToZarr(
    json_list,
    remote_protocol="s3",
    remote_options={'anon': True},
    concat_dims='t',
    inline_threshold=0
)

The first argument to MultiZarrToZarr is a list of our JSON files. remote_protocol and remote_options are parameters describing access to the remote data files (in our case, anonymous access to S3).

concat_dims is the name of the dimension along which to combine the datasets; in our case, we’re combining along time (t).

Once this class is set up, create the combined JSON file with

mzz.translate('./combined.json')

Translating to a combined JSON will take a little bit of time, but it shouldn’t take too long. Once it’s been generated, you can use it to access the data in a similar way to the open_mfdataset() method above (except this way, we open a single dataset):

fs = fsspec.filesystem(
    "reference",
    fo="./combined.json",
    remote_protocol="s3",
    remote_options={"anon": True},
    skip_instance_cache=True
)
m = fs.get_mapper("")
ds = xr.open_dataset(m, engine='zarr')

From here, we can do all the fun stuff we’re used to doing with xarray datasets. For example, by taking a spatial subset of the data over the US East Coast and averaging in time, we can clearly see the Gulf Stream.

When I used s3fs to open each file and pass those to xarray directly, making this plot took 30+ minutes. Using the JSONs to open the data as Zarr instead brought this time down to <60 seconds!

subset = ds.sel(x=slice(-0.01, 0.07215601), y=slice(0.12, 0.09))  # reduce to GS region
masked = subset.SST.where(subset.DQF == 0)
masked.mean("t", skipna=True).plot(vmin=14+273.15, vmax=30+273.15, cmap='inferno')
A new method of ‘faking’ Zarr formats from NetCDF4 files reduced the compute time for this image from ~30 minutes to <1 minute
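For reference, the direct-read workflow that took 30+ minutes looked roughly like the sketch below: it passes s3fs file objects straight to xarray, with no knowledge of the chunk byte ranges. (This assumes the h5netcdf backend and reuses the urls list from earlier; your exact arguments may differ.)

# The slow way (sketch): open each NetCDF file over S3 and let xarray's
# HDF5-based backend read it chunk metadata and all, over the network.
fs_s3 = s3fs.S3FileSystem(anon=True)
file_objs = [fs_s3.open(u) for u in urls]

ds_slow = xr.open_mfdataset(file_objs, engine='h5netcdf',
                            combine='nested', concat_dim='t',
                            coords='minimal', data_vars='minimal',
                            compat='override')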

The reason behind this incredible speed improvement is the way Zarr accesses data. Zarr is able to pull chunks in parallel, greatly speeding up the reading of the raw data. The generated JSONs outline each file’s location, its variables, and the byte ranges of its chunks.

Then, instead of using the NetCDF4/HDF5 APIs (which are not optimized for cloud read/write in the way Zarr is), Xarray is able to use the Zarr engine to pull the data chunks directly from the binary data. Since all the information on the chunk locations in the binary file is outlined as byte-ranges in the JSON, Xarray is able to only pull the chunks it needs rather than the entire data file.
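To make that concrete, here’s a rough illustration of the mechanism (not literally what xarray calls internally): each chunk reference is just a URL plus a byte offset and length, so a single chunk can be fetched with one ranged read. This reuses the refs dict we loaded from one of the JSONs earlier, and the SST/0.0 key is an assumed example chunk name.

# Fetch one chunk's raw (still-compressed) bytes directly via its byte range
url, offset, length = refs["SST/0.0"]
s3 = fsspec.filesystem("s3", anon=True)
chunk_bytes = s3.cat_file(url, start=offset, end=offset + length)
print(len(chunk_bytes))   # should match the 'length' recorded in the reference JSON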

Final thoughts

Rather than converting the existing data into the more cloud-friendly Zarr format, this method allowed us to harness some of the benefits of Zarr without any modification at the file source.

This has huge benefits: instead of needing to convert and re-host enormous datasets (like GOES), one can simply host the JSON files publicly (and, importantly, cheaply), allowing anyone to get the performance increases shown here.
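For example, if combined.json were hosted somewhere public, anyone could open the full dataset with just a few lines; the reference filesystem can fetch the JSON itself over HTTP. The URL below is hypothetical, just to sketch the idea.

fs = fsspec.filesystem(
    "reference",
    fo="https://example.com/goes16-sst/combined.json",   # hypothetical URL
    remote_protocol="s3",
    remote_options={"anon": True}
)
ds = xr.open_dataset(fs.get_mapper(""), engine="zarr")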

About me

I’m a PhD candidate in the UC Davis Cloud Physics Research Group; my research involves cloud microphysics and modeling.

I’m on Twitter @lucassterzinger

I would like to thank Rich Signell (USGS) and Martin Durant (Anaconda) for their help in learning this process. If you’re interested in seeing more detail on how this works, I recommend Rich’s article from 2020 on the topic.

I would also like to recognize Pangeo and Pangeo Forge, who work hard to make working with big data in geoscience as easy as possible. Work on this project was done on the Pangeo AWS deployment.

This tutorial was done during my time at the National Center for Atmospheric Research (NCAR)’s Summer Internship in Parallel Computational Sciences (SIParCS) in Summer 2021. I would like to thank Chelle Gentemann (Farallon Institute), Julia Kent (NCAR), and Kevin Paul (NCAR) for their outstanding mentorship during this program.

I’m not an expert on this software or what it’s doing in the backend, but I hope this tutorial will serve as a quick-start reference for those looking to use it.
