%% Cell type:markdown id:6e0b3c7d-b3ac-4013-979a-f4b2012b48f7 tags:
# Compression methods for NetCDF files
with Xarray
This notebook gives a guideline on how to use recent basic lossy and lossless compression for netCDF output via `xarray`. It is an implementation of [these slides](https://www.researchgate.net/publication/365006139_NetCDF_Compression_Improvements) using example data of the Earth System Model AWI-CM. We will compare the writing speed and the compression ratio of the different methods.
To use lossless compression methods on netCDF files, we rely on the [hdf5plugin](https://github.com/silx-kit/hdf5plugin/blob/b53d70b702154a149a60f653c49b20045228aa16/doc/index.rst) package, which makes additional HDF5 filters available within Python.
For *lossy* compression, we use the [numcodecs](https://github.com/zarr-developers/numcodecs) library to calculate the bit rounding.
We apply the [BitShuffle](https://github.com/kiyo-masui/bitshuffle) filter prior to compression whenever possible. It rearranges the binary data in order to improve compression (a small demonstration follows the requirements below).
## Requirements
- Lossy compression requires a lot of memory.
- Reading data compressed with methods other than deflate requires netCDF version 4.9 or newer with access to the HDF5 filters.
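%% Cell type:markdown id:f3a1c2d0-5b6e-4c21-9a7d-2e8f0c1b3a44 tags:
To illustrate why shuffling helps, here is a minimal sketch (not part of the benchmark) that compresses the same float32 array with and without bit shuffling, using the Blosc codec from `numcodecs`; the exact sizes depend on the data and library versions:
%% Cell type:code id:b7d4e8f2-1a3c-4d5e-8f90-6c2b1a0d9e77 tags:
``` python
import numcodecs
import numpy as np

# smooth synthetic float32 field, loosely mimicking geophysical data
data = np.sin(np.linspace(0, 100, 1_000_000)).astype("float32")

for shuffle in (numcodecs.Blosc.NOSHUFFLE, numcodecs.Blosc.BITSHUFFLE):
    codec = numcodecs.Blosc(cname="zstd", clevel=5, shuffle=shuffle)
    # bit shuffling groups bits of equal significance, typically shrinking the output
    print(f"shuffle={shuffle}: {len(codec.encode(data))} bytes")
```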
%% Cell type:markdown id:0f4d7d8d-e120-4d56-9b45-1bfd7eb4ba4b tags:
## Lossless compression methods
### [zlib](https://github.com/madler/zlib)
zlib has been the standard compression method since compression was introduced in netCDF. It is based on the deflate algorithm; other compression packages use dictionaries.
### [Zstandard](https://github.com/facebook/zstd)
Zstandard allows multithreading. It is used by package registries and is supported by Linux systems. Zstandard offers features beyond zlib/zlib-ng, including better performance and compression.
### [LZ4](https://github.com/lz4/lz4)
LZ4 has a focus on very fast compression. It scales with CPUs.
### [Blosc](https://github.com/Blosc/c-blosc)
Blosc uses a blocking technique to not only reduce the size of large datasets on disk or in memory, but also to accelerate memory-bound computations. Its default compressor is based on LZ4 and Zstandard.
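%% Cell type:markdown id:9c0d2e4f-7a1b-4c3d-8e5f-0a6b2c4d8e11 tags:
All of these compressors are exposed by `hdf5plugin` as filter objects that behave like mappings of the low-level HDF5 `compression`/`compression_opts` encoding keys, which is why they can be unpacked with `**` into an encoding dictionary below. A small sketch (the exact filter ids and default options depend on the installed `hdf5plugin` version):
%% Cell type:code id:5e2f8a1c-3b7d-4f60-9c2e-d41a0b8c7f22 tags:
``` python
import hdf5plugin

# each filter object expands to HDF5 dataset creation keywords
for name, filt in [("zstd", hdf5plugin.Zstd()),
                   ("lz4", hdf5plugin.LZ4()),
                   ("blosc", hdf5plugin.Blosc(cname="zstd", clevel=5))]:
    print(name, dict(filt))
```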
%% Cell type:code id:911b4efb-47e2-422b-acf2-a3d8d37b1baf tags:
``` python
import hdf5plugin
import time
import fsspec as fs
import glob
import xarray as xr
import tqdm
```
%% Cell type:code id:888676a5-4c4c-4c49-8f97-7144e2bee12d tags:
``` python
help(hdf5plugin)
```
%% Cell type:code id:a598a965-73c3-4700-8de2-4afa26ec3702 tags:
``` python
%store -r times
#print(times)
times=0
```
%% Cell type:markdown id:48311eee-fdf4-4e75-9ae6-ace523381037 tags:
On Levante, you can use the plugins from the CMIP pool directory `/work/ik1017/hdf5plugin/plugins/`:
%% Cell type:code id:cad827c9-1445-481e-b24b-0017bf78e19b tags:
``` python
hdf5plugin.PLUGIN_PATH="/work/ik1017/hdf5plugin/plugins/"
%set_env HDF5_PLUGIN_PATH={hdf5plugin.PLUGIN_PATH}
```
%% Cell type:markdown id:c7a91342-9e70-40ba-9389-54fca72bb949 tags:
We use the ocean surface temperature `tos` in this example:
%% Cell type:code id:c346fc8b-f437-41e3-9c83-6f2ef49fe749 tags:
``` python
source="/work/ik1017/CMIP6/data/CMIP6/ScenarioMIP/AWI/AWI-CM-1-1-MR/ssp370/r1i1p1f1/Omon/tos/gn/v20181218/tos_Omon_AWI-CM-1-1-MR_ssp370_r1i1p1f1_gn_201501-202012.nc"
pwd=!pwd
pwd=pwd[0]
source_uncompressed=f"{pwd}/temp.nc"
# select 12 time steps to keep the example lightweight
sds=xr.open_mfdataset(source).isel(time=slice(1,13))
# write an uncompressed copy as the reference for speed and ratio
for var in sds.variables:
    sds[var].encoding["zlib"]=False
sds.to_netcdf(source_uncompressed)
```
%% Cell type:code id:6c6bf599-ba32-416f-ad51-1ddb058cbd83 tags:
``` python
sds
```
%% Cell type:code id:a2b12787-39e0-45cb-8ebf-b02b95e50b3d tags:
``` python
omon2d=xr.open_mfdataset(
    source_uncompressed,
    engine="h5netcdf",
    parallel=False
)
```
%% Cell type:markdown id:727f4776-5667-4e74-8c08-cea0855d217c tags:
The following "*compression* : *configuration*" dictionary is used to configure the `encoding` keyword argument in xarray's *to_netcdf*:
%% Cell type:code id:8a419d87-1dd1-4b3c-9d48-68c5ced53549 tags:
``` python
comprdict=dict(
    zlib=dict(
        engine="h5netcdf",
        compr=dict(
            zlib=True,
            complevel=5
        )
    ),
    zstd=dict(
        engine="h5netcdf",
        #from python 3.11:
        compr=dict(**hdf5plugin.Bitshuffle(cname="zstd"))
        #compr=dict(**hdf5plugin.Zstd())
    ),
    lz4=dict(
        engine="h5netcdf",
        #from python 3.11:
        compr=dict(**hdf5plugin.Bitshuffle(cname="lz4"))
        #compr=dict(**hdf5plugin.Bitshuffle(lz4=True))
    ),
    blosc=dict(
        engine="h5netcdf",
        compr=dict(**hdf5plugin.Blosc(cname='blosclz', shuffle=1))
    )
)
```
%% Cell type:code id:34e5bc3f-164e-4c40-b89f-a4e8d5866276 tags:
``` python
comprdict["lz4"]
```
%% Cell type:code id:1c356ce8-7d6f-4d47-a673-e8f1f698799f tags:
``` python
sourcesize=fs.filesystem("file").du(source_uncompressed)
print(f"The size of the uncompressed source file is {sourcesize/1024/1024} MB")
```
%% Cell type:code id:e6202eb6-0897-4d18-b15e-69dd30b53c23 tags:
``` python
resultdir={}
for compr,config in tqdm.tqdm(comprdict.items()):
    enc=dict()
    for var in omon2d.data_vars:
        enc[var]=config["compr"]
    start=time.time()
    omon2d.to_netcdf(f"{pwd}/test_{compr}_compression.nc",
        mode="w",
        engine=config["engine"],
        unlimited_dims="time",
        encoding=enc,
    )
    end=time.time()
    resultdir[compr]=dict(
        # write speed in MB/s, measured against the uncompressed size
        speed=sourcesize/(end-start)/1024/1024,
        # compressed size relative to the uncompressed source
        ratio=fs.filesystem("file").du(f"{pwd}/test_{compr}_compression.nc")/sourcesize
    )
```
%% Cell type:code id:a054a804-0cd6-4f89-9d95-8805d79bac77 tags:
``` python
with open(f"results_{str(times)}.csv","w") as f:
    for k,v in resultdir.items():
        f.write(f"{k},{sourcesize},{v['speed']},{v['ratio']}\n")
```
%% Cell type:markdown id:c1befda5-4bd5-4d1e-b586-8794b0046d85 tags:
### Reading not-deflated data
Before opening a file compressed with anything other than zlib, you have to import hdf5plugin first:
%% Cell type:code id:d2f290c4-31a3-48ab-b0a6-c360a8bf171e tags:
``` python
import hdf5plugin
import xarray as xr
outf=xr.open_dataset(f"{pwd}/test_zstd_compression.nc",engine="h5netcdf")
```
%% Cell type:code id:6848c002-42d2-47e3-93a5-988995f40208 tags:
``` python
outf
```
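%% Cell type:markdown id:2a4b6c8d-0e1f-4a2b-9c3d-5e7f9a1b3c55 tags:
The filter that was applied can be checked in the variable's `encoding`; the exact keys reported depend on the engine:
%% Cell type:code id:7f1e3d5c-9b2a-4e6f-8d0c-1a3b5c7d9e88 tags:
``` python
# inspect the compression settings xarray read back from the file
outf["tos"].encoding
```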
%% Cell type:markdown id:d6b06f6c-4e6c-41ef-84c7-f3ef0c011c8f tags:
## Lossy
1. Direct `BitRound`ing, keeping 16 bits. This precision can be considered similar to e.g. ERA5 data (24-bit integer space). A short sketch of the effect follows this list.
1. Calculate the number of bits that carry 99% of the information (information level 0.99) via *xbitinfo*.
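%% Cell type:markdown id:4d6e8f0a-2b3c-4d5e-a6f7-8091a2b3c4d6 tags:
As a minimal sketch of what bit rounding does: keeping 16 of float32's 23 mantissa bits barely changes a value, while more aggressive rounding visibly quantizes it.
%% Cell type:code id:8b0c2d4e-6f7a-4b8c-9d0e-1f2a3b4c5d69 tags:
``` python
import numcodecs
import numpy as np

x = np.array([0.123456789], dtype="float32")
for keepbits in (16, 8, 4):
    rounding = numcodecs.BitRound(keepbits=keepbits)
    # round away the trailing mantissa bits, then view the result as float again
    print(keepbits, rounding.decode(rounding.encode(x))[0])
```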
%% Cell type:code id:bf7139de-3dda-4b9f-a04b-b70d3eb018f8 tags:
``` python
losstimestart=time.time()
import numcodecs
rounding = numcodecs.BitRound(keepbits=16)
for var in omon2d.data_vars:
    # skip coordinate bounds, which should stay exact
    if "bnds" in var:
        continue
    omon2d[var].data=rounding.decode(
        rounding.encode(
            omon2d[var].load().data
        )
    )
losstimeend=time.time()
```
%% Cell type:code id:358dc7c4-5186-4265-a5f5-8d7ee7e73c18 tags:
``` python
resultdir={}
for compr,config in tqdm.tqdm(comprdict.items()):
    enc=dict()
    for var in omon2d.data_vars:
        enc[var]=config["compr"]
    start=time.time()
    omon2d.to_netcdf(f"{pwd}/test_{compr}_compression_lossy.nc",
        mode="w",
        engine=config["engine"],
        unlimited_dims="time",
        encoding=enc,
    )
    end=time.time()
    resultdir[compr]=dict(
        speed=sourcesize/(end-start+losstimeend-losstimestart)/1024/1024,
        ratio=fs.filesystem("file").du(f"{pwd}/test_{compr}_compression_lossy.nc")/sourcesize
    )
```
%% Cell type:code id:bbc7a341-cf54-4073-bf26-c1549ebe44e7 tags:
``` python
with open(f"results_{str(times)}.csv","a") as f:
    for k,v in resultdir.items():
        f.write(f"{k}_lossy,{sourcesize},{v['speed']},{v['ratio']}\n")
```
%% Cell type:markdown id:6cece0ed-0ab0-4fd1-a515-93f36b6753a8 tags:
### Xbitinfo
We reload the uncompressed data so that the bit-information analysis starts from unrounded values:
%% Cell type:code id:8ff2e4f1-10ae-4b0b-9d4f-5ffbf871dbdb tags:
``` python
omon2d=xr.open_mfdataset(
    source_uncompressed,
    engine="h5netcdf",
    parallel=False
)
```
%% Cell type:code id:09e45692-8bdd-4a78-82eb-9bead71875bb tags:
``` python
import xbitinfo as xb
```
%% Cell type:code id:3c021ba4-7172-4035-9305-ae61e92bb684 tags:
``` python
import time
bitinfostart=time.time()
for var in omon2d.data_vars:
    # skip coordinate bounds, which should stay exact
    if "bnds" in var:
        continue
    dims=[dim for dim in omon2d[[var]].dims.keys() if "ncell" in dim]
    print(dims)
    if dims:
        bitinfo = xb.get_bitinformation(omon2d[[var]], dim=dims, implementation="python")
        keepbits = xb.get_keepbits(bitinfo, inflevel=0.99)
        print(keepbits)
        if keepbits[var][0] > 0:
            print(keepbits[var][0])
            omon2d[var] = xb.xr_bitround(omon2d[[var]], keepbits)[var] # this wraps numcodecs' bitround
bitinfoend=time.time()
```
%% Cell type:code id:0d2def50-c209-41ee-8604-5fdb8554f640 tags:
``` python
resultdir={}
for compr,config in tqdm.tqdm(comprdict.items()):
    enc=dict()
    for var in omon2d.data_vars:
        enc[var]=config["compr"]
    start=time.time()
    omon2d.to_netcdf(f"{pwd}/test_{compr}_compression_lossy_xbit.nc",
        mode="w",
        engine=config["engine"],
        unlimited_dims="time",
        encoding=enc,
    )
    end=time.time()
    resultdir[compr]=dict(
        speed=sourcesize/(end-start+bitinfoend-bitinfostart)/1024/1024,
        ratio=fs.filesystem("file").du(f"{pwd}/test_{compr}_compression_lossy_xbit.nc")/sourcesize
    )
```
%% Cell type:code id:4f95632f-ac07-4192-8a96-b3bb98ebfde2 tags:
``` python
with open(f"results_{str(times)}.csv","a") as f:
    for k,v in resultdir.items():
        f.write(f"{k}_lossy_xbit,{sourcesize},{v['speed']},{v['ratio']}\n")
```
%% Cell type:markdown id:3d000b88-f995-4369-b87d-98e910c57e9a tags:
### Write the results
%% Cell type:code id:1b054323-937d-4bd8-a785-d3f709a4ee26 tags:
``` python
import pandas as pd
import glob
df = pd.concat((pd.read_csv(f,names=["type","insize","write_speed [MB/s]","ratio"]) for f in glob.glob("results*.csv")), ignore_index=True)
```
%% Cell type:code id:9c381d31-1156-4e95-ae07-b6cc942a9288 tags:
``` python
df.groupby("type").mean()[["write_speed [MB/s]","ratio"]].sort_values(by="write_speed [MB/s]",ascending=False)
```
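%% Cell type:markdown id:6c8d0e2f-4a5b-4c6d-8e9f-a0b1c2d3e4f7 tags:
For a quick visual impression of the speed/ratio trade-off, the mean results can also be plotted (a sketch assuming `matplotlib` is available):
%% Cell type:code id:0d2e4f6a-8b9c-4d0e-af1b-2c3d4e5f6a80 tags:
``` python
import matplotlib.pyplot as plt

means = df.groupby("type").mean(numeric_only=True)
ax = means.plot.scatter(x="ratio", y="write_speed [MB/s]")
for label, row in means.iterrows():
    # annotate each point with its compression method
    ax.annotate(label, (row["ratio"], row["write_speed [MB/s]"]))
plt.show()
```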
%% Cell type:code id:802c5d41-2791-4e0d-bf1f-35a1e748c9ed tags:
``` python
!rm test_*compression*.nc
```
%% Cell type:code id:0fd679fe-140a-4ab3-9a09-0349ad6ab821 tags:
``` python
!rm temp.nc
```
%% Cell type:code id:31141e83-3c2c-4b05-a5ee-d070a9f7d004 tags:
``` python
```