
Rechunking NetCDF data.

Rechunk existing NetCDF files to an optimal chunk size. This package provides a simple command line interface (CLI) that rechunks existing NetCDF data to an optimal chunk size of around 128 MB.
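
To give a feel for what a chunk of around 128 MB means in practice, the following minimal sketch estimates how many full horizontal slices fit into one chunk. It is not taken from the package: the function name, the grid dimensions, the 4-byte element size (float32) and the convention 1 MB = 1024**2 bytes are all assumptions made for illustration.

```python
# Assumption: 1 MB is counted as 1024**2 bytes; the package may count differently.
TARGET_BYTES = 128 * 1024**2


def time_steps_per_chunk(lat, lon, itemsize=4, target_bytes=TARGET_BYTES):
    """Return how many full (lat x lon) slices fit into one chunk.

    Hypothetical helper for illustration only. At least one slice is
    always returned, even if a single slice exceeds the target size.
    """
    bytes_per_slice = lat * lon * itemsize
    return max(1, target_bytes // bytes_per_slice)


# A 1-degree global grid of float32 values (hypothetical example).
print(time_steps_per_chunk(lat=180, lon=360))  # prints 517
```

For this grid, a chunk would span roughly 517 time steps if the horizontal dimensions are kept whole.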

Installation

To install the cli simply use the following pip command:

python -m pip install rechunk-data --extra-index-url https://gitlab.dkrz.de/api/v4/projects/66397/packages/pypi/simple

Use the --user flag if you do not have superuser rights and are not using Anaconda, pipenv or a virtual environment.

Usage

Using the python module

import xarray as xr
from rechunk_data import rechunk_dataset

# Open all files under /data as one lazily loaded dataset
dset = xr.open_mfdataset("/data/*", parallel=True, combine="by_coords")
# Apply the new chunking to the dataset
new_data = rechunk_dataset(dset)
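
To check the result of a call like the one above, the size of a single chunk can be estimated from its shape and the size of one element. The helper below is a hypothetical sketch and not part of rechunk_data. It again assumes 1 MB = 1024**2 bytes, and the chunk shape used here is made up.

```python
from math import prod


def chunk_megabytes(chunk_shape, itemsize):
    """Estimate the size of one chunk in MB (assumed: 1 MB = 1024**2 bytes).

    Hypothetical helper: chunk_shape is the number of elements per
    dimension, itemsize the number of bytes per element.
    """
    return prod(chunk_shape) * itemsize / 1024**2


# Made-up chunk shape (time, lat, lon) holding float32 values.
size = chunk_megabytes((517, 180, 360), itemsize=4)
print(f"{size:.1f} MB")
```

For a dask-backed xarray variable, the inputs would typically come from `var.data.chunksize` and `var.dtype.itemsize`.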

Using the command line interface:

rechunk-data --help
usage: rechunk-data [-h] [-o OUTPUT] [--netcdf-engine {h5netcdf,netcdf4}] [--skip-cf-convention] [--force-horizontal] [--check-chunk-size] [-v] [-V] input

Rechunk input netcdf data to optimal chunk-size. approx. 126 MB per chunk

positional arguments:
  input                 Input file/directory. If a directory is given all ``.nc`` files in all sub directories will be processed

options:
  -h, --help            show this help message and exit
  -o, --output OUTPUT   Output file/directory of the chunked netcdf file(s). Note: If ``input`` is a directory output should be a directory. If None given (default) the
                        ``input`` is overridden. (default: None)
  --netcdf-engine {h5netcdf,netcdf4}
                        The netcdf engine used to create the new netcdf file. (default: netcdf4)
  --skip-cf-convention  Do not assume assume data variables follow CF conventions. (default: False)
  --force-horizontal, -fh
                        Force horizontal chunking (~126 MB per chunk). (default: False)
  --check-chunk-size, -c
                        Check the chunk size of the input dataset (in MB). (default: False)
  -v                    Increase verbosity (default: 0)
  -V, --version         show program's version number and exit

You can use the CLI in several ways:

  • Specify an input and output file pair. Here both input and output have to be files.
  • Store all files within an input directory in an output directory. Here both input and output have to be directories.
  • Overwrite a specified input file, or all files within an input directory. Here, omit the --output flag.
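
The three modes above can be sketched as argument lists, for example to drive the CLI from a Python script. This is a minimal illustration: the helper name and all paths are hypothetical, and only the `--output` flag and the positional input from the help text are used.

```python
import shlex


def rechunk_command(input_path, output_path=None):
    """Build the rechunk-data argument list (hypothetical helper).

    Without output_path the --output flag is omitted, which according
    to the help text means the input is overwritten.
    """
    cmd = ["rechunk-data"]
    if output_path is not None:
        cmd += ["--output", output_path]
    cmd.append(input_path)
    return cmd


# Hypothetical paths, one entry per mode.
examples = [
    rechunk_command("/data/in/file.nc", "/data/out/file.nc"),  # file to file
    rechunk_command("/data/in", "/data/out"),  # directory to directory
    rechunk_command("/data/in/file.nc"),  # overwrite in place
]

for cmd in examples:
    print(shlex.join(cmd))
    # To actually run it: subprocess.run(cmd, check=True)
```

Building the command as a list avoids shell quoting issues when paths contain spaces.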

Support

If you need help, submit an issue in the GitLab repository.