In addition, this extends the range of applications for the so far rarely used cloud object storage.
## Swift backend for zarr:
Thanks to Pavan, it is now possible to open and write zarr data with xarray directly in the object storage at DKRZ. See
https://github.com/siligam/zarr-swiftstore
for details.
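For orientation, a minimal write/read sketch following the store interface from the zarr-swiftstore README; the container name, prefix, and the `OS_STORAGE_URL`/`OS_AUTH_TOKEN` environment variables are assumptions of this example:

```python
import os
import zarr
from zarrswift import SwiftStore

# Pre-authenticated swift credentials, assumed to be set in the environment.
store = SwiftStore(
    container="demo",     # hypothetical container name
    prefix="zarr-demo",   # hypothetical object prefix
    storage_options={
        "preauthurl": os.environ["OS_STORAGE_URL"],
        "preauthtoken": os.environ["OS_AUTH_TOKEN"],
    },
)

# Write a small array into the object storage ...
root = zarr.group(store=store, overwrite=True)
z = root.zeros("foo", shape=(100, 100), chunks=(10, 10), dtype="i4")
z[:] = 42

# ... and read it back.
print(zarr.open_group(store, mode="r")["foo"][0, :5])
```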
## Convert netCDF into zarr and save it in the object storage, all in one script:
Xarray allows us to open a netCDF file and save it as zarr. With the `zarr-swiftstore` backend, we can now write it directly into the object storage of DKRZ.
See
`notebooks/cmip6_nc_to_zarr.ipynb`
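A hedged sketch of such a one-script conversion; the file name, container, and prefix are placeholders, see the notebook for the actual workflow:

```python
import os
import xarray as xr
from zarrswift import SwiftStore

# Open the netCDF source (hypothetical file name) ...
ds = xr.open_dataset("tas_Amon_MPI-ESM1-2-HR_historical_r1i1p1f1_gn_185001-201412.nc")

# ... and write it as zarr straight into the object storage.
store = SwiftStore(
    container="cmip6",                           # placeholder container
    prefix="tas_Amon_MPI-ESM1-2-HR_historical",  # placeholder prefix
    storage_options={
        "preauthurl": os.environ["OS_STORAGE_URL"],
        "preauthtoken": os.environ["OS_AUTH_TOKEN"],
    },
)
ds.to_zarr(store=store, mode="w", consolidated=True)
```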
## Zarr-formatted CMIP6 data
In the notebook
`notebooks/access_and_download_cmip6.ipynb`
you will find a tutorial on how to access a zarr-formatted subset of a CMIP6 experiment.
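As a rough sketch (assuming the data were written with consolidated metadata and using the same placeholder store setup as above), reading works like this:

```python
import os
import xarray as xr
from zarrswift import SwiftStore

store = SwiftStore(
    container="cmip6",                           # placeholder
    prefix="tas_Amon_MPI-ESM1-2-HR_historical",  # placeholder
    storage_options={
        "preauthurl": os.environ["OS_STORAGE_URL"],
        "preauthtoken": os.environ["OS_AUTH_TOKEN"],
    },
)

# Open the zarr store lazily as an xarray dataset.
ds = xr.open_zarr(store, consolidated=True)
print(ds)
```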
### Best practices for conversion and upload of netCDF into object storage as zarr within memory:
- Set up a dask client with `processes=True` (see the first sketch at the end of this section)
- We use `xarray.open_mfdataset`. This function works with dask arrays, so we have to work with dask anyway.
- "None" leads to memory issues
- Set sufficient memory for the job and dask!
- We convert datasets covering the entire time range of an experiment. Depending on the frequency of the variable, this requires a lot of memory.
- ToDo: What happens if we use zarr in append mode?
- Close the dask client when the upload of the dataset is finished.
  - The memory of the dask workers is not cleaned up correctly, so otherwise we run into memory leaks
- Consider size limitations for chunk uploads (from small to big):
  - 2 GB is the limit for specific zarr compressors
  - 5 GB is the upload limit of swift
- Consider limitations for chunks by zarr and swift:
  - Chunks <1 MB lead to problems when uploading because `.zarray` needs to be updated very often
  - Zarr requires uniform chunk sizes, but xarray chunks the dataset per input file. This is problematic when chunking along the time dimension of daily variables because the number of time steps differs from year to year. Therefore:
- Use temporary disk storage for daily variables (see the second sketch at the end of this section):
  - This allows you to rechunk the data. Save it with one chunk per time step; when uploading, you can switch to a new uniform chunk size.
- Set the following options when using xarray's open_mfdataset:
```
concat_dim="time",
data_vars='minimal',
coords='minimal',
compat='override'
```
- The `compat` option is especially helpful: without it, the arrays of the coordinate variables can get an artificial time dimension
- Do not use a `preprocess` function in `open_mfdataset` UNLESS you have very, very much memory :)
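For illustration, a sketch of the client setup and the `open_mfdataset` call with the options listed above; worker count, memory limit, and the file pattern are assumptions to be adapted to the job:

```python
import xarray as xr
from dask.distributed import Client

# Separate worker processes; n_workers and memory_limit are assumptions.
client = Client(processes=True, n_workers=8, memory_limit="16GB")

ds = xr.open_mfdataset(
    "tas_day_MPI-ESM1-2-HR_historical_*.nc",  # hypothetical file pattern
    concat_dim="time",
    combine="nested",     # newer xarray requires combine when concat_dim is set
    data_vars="minimal",
    coords="minimal",
    compat="override",    # prevents artificial time dimensions on coordinates
)

# ... conversion and upload happen here ...

# Close the client once the upload is finished; the worker memory is not
# cleaned up correctly otherwise and we run into memory leaks.
client.close()
```

And a sketch of the temporary-disk detour for daily variables, continuing from the `ds` of the previous sketch: stage with one chunk per time step, then rechunk uniformly for the upload. The scratch path, chunk size, and store setup are again placeholders:

```python
import os
import xarray as xr
from zarrswift import SwiftStore

# Step 1: stage on disk with one chunk per time step, which removes the
# non-uniform per-file chunking produced by open_mfdataset.
ds.chunk({"time": 1}).to_zarr("/scratch/tmp_store.zarr", mode="w")

# Step 2: reopen and rechunk to a uniform chunk size for the upload.
staged = xr.open_zarr("/scratch/tmp_store.zarr")
staged = staged.chunk({"time": 365})  # stays well below the 2 GB/5 GB limits

# Drop the stale chunk encoding from staging, otherwise to_zarr complains
# about a mismatch between encoding and the new dask chunks.
for var in staged.variables.values():
    var.encoding.pop("chunks", None)

store = SwiftStore(
    container="cmip6",                          # placeholder
    prefix="tas_day_MPI-ESM1-2-HR_historical",  # placeholder
    storage_options={
        "preauthurl": os.environ["OS_STORAGE_URL"],
        "preauthtoken": os.environ["OS_AUTH_TOKEN"],
    },
)
staged.to_zarr(store=store, mode="w", consolidated=True)
```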
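Note that both sketches keep the whole pipeline lazy: nothing is loaded into memory until `to_zarr` triggers the computation, which is why the dask client and its memory limits matter here.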