## Proof of concept
This repo contains a proof of concept for:
- Converting NetCDF into Zarr
- Zarr I/O with a swift backend in object storage
- Doing all of this in one script without requiring additional disk space
## Tutorial Notebooks
We develop notebooks and packages to enable zarr I/O on the cloud object storage at dkrz.
This aims at unleashing the processing benefits that zarr-formatted files in the cloud allow, such as
- lazy processing
- parallel I/O
- decreased download traffic.

[Tutorials](https://gitlab.dkrz.de/mipdata/zarr-in-swift-objectstorage)

In addition, this extends the application options for the so far rarely used cloud object storage.
## Swift backend for zarr:
Thanks to Pavan, it is now possible to open and write zarr data with xarray directly in the object storage at dkrz. See
https://github.com/siligam/zarr-swiftstore
for details.
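As a minimal sketch (not taken from the notebooks): the snippet below assumes the `zarrswift` package from the repository above and a pre-authenticated swift token in the environment variables `OS_STORAGE_URL` and `OS_AUTH_TOKEN`; the container and prefix names are placeholders.
```python
import os

import zarr
from zarrswift import SwiftStore

# Authentication via a pre-authenticated token; adjust to your swift setup.
auth = {
    "preauthurl": os.environ["OS_STORAGE_URL"],
    "preauthtoken": os.environ["OS_AUTH_TOKEN"],
}
store = SwiftStore(container="my-container", prefix="zarr-demo", storage_options=auth)

# Write a small zarr array directly into the object storage and read it back.
root = zarr.group(store=store, overwrite=True)
arr = root.zeros("example", shape=(100, 100), chunks=(10, 10), dtype="f8")
arr[:] = 42.0
print(zarr.open_group(store=store)["example"][0, 0])
```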
## Convert netCDF into zarr and save it in the object storage, all in one script:
Xarray allows us to open a netCDF file and save it as zarr. With the backend `zarr-swiftstore` we are now also able to write it directly into the object storage of dkrz.
See `notebooks/cmip6_nc_to_zarr.ipynb`.
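A minimal sketch of that workflow (the full version is in the notebook above); the input path and chunking are placeholder assumptions, and `store` is a `SwiftStore` as in the previous section.
```python
import xarray as xr

# Open the netCDF file lazily and choose explicit, uniform chunks.
ds = xr.open_dataset("/work/path/to/input.nc", chunks={"time": 12})  # placeholder path

# Write the dataset as zarr directly into the object storage.
ds.to_zarr(store=store, mode="w", consolidated=True)
```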
## Zarr-formatted CMIP6 data
In the notebook
`notebooks/access_and_download_cmip6.ipynb`
you will find a tutorial on how to access a zarr-formatted subset of a CMIP6 experiment.
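For orientation, reading such a zarr dataset back is short; the sketch below assumes the same `SwiftStore` object as above, and the variable name `tas` is only an example.
```python
import xarray as xr

# Opening is lazy: only metadata is read here, no data is downloaded yet.
ds = xr.open_zarr(store, consolidated=True)

# Only the chunks needed for this selection are fetched from the object storage.
monthly_mean = ds["tas"].sel(time="2000-01").mean().compute()
print(monthly_mean.values)
```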
------------------------------------------------------------
### Next steps:
We will write the zarr-formatted data back to disk as netCDF and compare it with the originals.
This will show us whether we can switch from netCDF to zarr for implementations of project data standards.
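A rough sketch of such a round-trip check (this is planned work, not an existing notebook); the paths and the `store` object are placeholder assumptions.
```python
import xarray as xr

# Read the zarr data back from the object storage and write it to disk as netCDF.
ds_zarr = xr.open_zarr(store, consolidated=True)
ds_zarr.to_netcdf("/scratch/roundtrip.nc")  # placeholder path

# Compare with the original netCDF file; raises an AssertionError on mismatch.
ds_orig = xr.open_dataset("/work/path/to/original.nc")  # placeholder path
xr.testing.assert_allclose(ds_orig, xr.open_dataset("/scratch/roundtrip.nc"))
```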
------------------------------------------------------------
### Best practice for conversion and upload of netCDF into object storage as zarr within memory:
The individual recommendations below are combined into one sketch after this list.
- Set up a dask client with `processes=True`.
  - We use `xarray.open_mfdataset`, which works with dask arrays, so we have to work with dask anyway.
  - `None` leads to memory issues.
- Set sufficient memory for the job and for dask!
- We convert datasets covering the full time range of an experiment. Depending on the frequency of the variable, this requires a lot of memory.
  - ToDo: What happens if we use zarr in append mode?
- Close the dask client when the upload of a dataset is finished.
  - The memory of the dask workers is not cleaned up correctly, so otherwise we run into memory leaks.
- Consider the size limits for chunk uploads (from small to big):
  - 2 GB is the limit for specific zarr compressors.
  - 5 GB is the limit imposed by swift.
- Consider the lower limits for chunks set by zarr and swift:
  - Chunks smaller than 1 MB lead to problems when uploading because `.zarray` needs to be updated very often.
- Zarr requires uniform chunk sizes, while xarray only chunks per file of the dataset. This is problematic when chunking along the time dimension for daily variables because the number of time steps per year differs. Therefore:
  - Use temporary disk storage for daily variables.
  - This allows you to rechunk the data: save the data to disk with one chunk per time step, and switch to a new chunk size when uploading.
- Set the following options when using xarray's `open_mfdataset`:
  ```
  concat_dim="time",
  data_vars='minimal',
  coords='minimal',
  compat='override'
  ```
  - Especially the `compat` option is helpful, since otherwise the arrays of the coordinate variables can get an artificial time dimension.
- Do not use a `preprocess` function in `open_mfdataset` UNLESS you have a lot of memory :)
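The recommendations above, put together as one hedged sketch. All paths, the container name, the chunk sizes, and the worker and memory settings are assumptions to be adapted; `combine="nested"` is added because newer xarray versions require it together with `concat_dim`.
```python
import os

import xarray as xr
from dask.distributed import Client
from zarrswift import SwiftStore

# Dask client with processes and sufficient memory per worker (placeholder values).
client = Client(processes=True, n_workers=4, memory_limit="16GB")

# Open the full time range of the experiment with the recommended options.
ds = xr.open_mfdataset(
    "/work/path/to/experiment/*.nc",  # placeholder pattern
    combine="nested",                 # required by newer xarray together with concat_dim
    concat_dim="time",
    data_vars="minimal",
    coords="minimal",
    compat="override",
    parallel=True,
)

# Rechunk to uniform chunks: each chunk should stay well below the 2 GB / 5 GB
# limits but above roughly 1 MB.
ds = ds.chunk({"time": 120})

# Swift backend as in the sections above (pre-authenticated token assumed).
auth = {
    "preauthurl": os.environ["OS_STORAGE_URL"],
    "preauthtoken": os.environ["OS_AUTH_TOKEN"],
}
store = SwiftStore(container="my-container", prefix="experiment", storage_options=auth)
ds.to_zarr(store=store, mode="w", consolidated=True)

# Close the client so that the worker memory is actually released.
client.close()
```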
## Performance Tests
[Performance Tests](https://gitlab.dkrz.de/b381359/parallelrechnerevaluation-projekt)