Commit 2b2fbd10 authored by Fabian Wachsmann

Setup for ci
%% Cell type:markdown id:6e0b3c7d-b3ac-4013-979a-f4b2012b48f7 tags:
# Compression methods for NetCDF files
with Xarray
This notebook gives a guideline on how to use recent basic lossy and lossless compression for netCDF output via `xarray`. It is an implementation of [these slides](https://www.researchgate.net/publication/365006139_NetCDF_Compression_Improvements) using example data of the Earth System Model AWI-CM. We will compare the writing speed and the compression ratio of different methods.
For lossless compression of netCDF files, we use the [hdf5plugin](https://github.com/silx-kit/hdf5plugin/blob/b53d70b702154a149a60f653c49b20045228aa16/doc/index.rst) package, which makes additional HDF5 filters available within Python.
For *lossy* compression, we use the [numcodecs](https://github.com/zarr-developers/numcodecs) library to calculate the bitrounding.
We use the [BitShuffle](https://github.com/kiyo-masui/bitshuffle) filter prior to compression whenever possible. It rearranges the binary data in order to improve compression.
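As an illustration of why shuffling helps, the following sketch applies a *byte*-wise shuffle (BitShuffle works the same way, but at bit granularity) to smooth float32 data before deflate compression. This is a standalone toy using only `numpy` and the stdlib `zlib`, not the actual HDF5 filter:

``` python
import zlib
import numpy as np

# Smooth synthetic float32 data, loosely resembling a model field
data = np.linspace(0.0, 1.0, 100_000, dtype=np.float32)
raw = data.tobytes()

# Byte-wise shuffle: collect the 1st byte of every value, then the 2nd, ...
# Sign/exponent bytes of similar values are highly repetitive, so grouping
# them produces long runs that deflate compresses well.
shuffled = data.view(np.uint8).reshape(-1, 4).T.copy().tobytes()

print("unshuffled:", len(zlib.compress(raw, 5)))
print("shuffled:  ", len(zlib.compress(shuffled, 5)))
```

On data like this the shuffled buffer compresses noticeably better; the effect on real model output depends on how smooth the fields are.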
## Requirements
- Lossy compression requires a lot of memory.
- Reading lossless compressions other than deflate requires netcdf-c version 4.9 or newer with access to the HDF5 filters
%% Cell type:markdown id:0f4d7d8d-e120-4d56-9b45-1bfd7eb4ba4b tags:
## Lossless compression methods
### [zlib](https://github.com/madler/zlib)
zlib has been the standard compression method since compression was introduced in netCDF. It is based on the deflate algorithm; other compression packages use dictionaries.
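Deflate's `complevel` (1-9) trades compression time against output size; the value 5 used later in this notebook is a middle ground. A minimal stdlib sketch of that trade-off, on synthetic data:

``` python
import zlib
import numpy as np

payload = np.arange(200_000, dtype=np.float32).tobytes()

# netCDF's `complevel` encoding option maps onto this 1-9 deflate scale.
for level in (1, 5, 9):
    compressed = zlib.compress(payload, level)
    assert zlib.decompress(compressed) == payload  # lossless round-trip
    print(f"level {level}: {len(compressed)/len(payload):.3f} of original size")
```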
### [Zstandard](https://github.com/facebook/zstd)
Zstandard allows multithreading. It is used by package registries and is supported by Linux systems. Zstandard offers features beyond zlib/zlib-ng, including better performance and compression.
### [LZ4](https://github.com/lz4/lz4)
LZ4 focuses on very fast compression. It scales with the number of CPU cores.
### [Blosc](https://github.com/Blosc/c-blosc)
Blosc uses a blocking technique not only to reduce the size of large datasets on disk or in memory, but also to accelerate memory-bound computations. Its default compressor is based on LZ4 and Zstandard.
%% Cell type:code id:911b4efb-47e2-422b-acf2-a3d8d37b1baf tags:
``` python
import hdf5plugin
import time
import fsspec as fs
import glob
import xarray as xr
import tqdm
```
%% Cell type:code id:888676a5-4c4c-4c49-8f97-7144e2bee12d tags:
``` python
help(hdf5plugin)
```
%% Cell type:code id:a598a965-73c3-4700-8de2-4afa26ec3702 tags:
``` python
%store -r times
#print(times)
times=0
```
%% Cell type:markdown id:48311eee-fdf4-4e75-9ae6-ace523381037 tags:
On Levante, you can use the plugins from the CMIP pool directory `/work/ik1017/hdf5plugin/plugins/`:
%% Cell type:code id:cad827c9-1445-481e-b24b-0017bf78e19b tags:
``` python
hdf5plugin.PLUGIN_PATH="/work/ik1017/hdf5plugin/plugins/"
%set_env HDF5_PLUGIN_PATH={hdf5plugin.PLUGIN_PATH}
```
%% Cell type:markdown id:c7a91342-9e70-40ba-9389-54fca72bb949 tags:
We use the ocean surface temperature `tos` in this example:
%% Cell type:code id:c346fc8b-f437-41e3-9c83-6f2ef49fe749 tags:
``` python
source="/work/ik1017/CMIP6/data/CMIP6/ScenarioMIP/AWI/AWI-CM-1-1-MR/ssp370/r1i1p1f1/Omon/tos/gn/v20181218/tos_Omon_AWI-CM-1-1-MR_ssp370_r1i1p1f1_gn_201501-202012.nc"
pwd=!pwd
pwd=pwd[0]
source_uncompressed=f"{pwd}/temp.nc"
sds=xr.open_mfdataset(source).isel(time=slice(1,13))
for var in sds.variables:
    sds[var].encoding["zlib"]=False
sds.to_netcdf(source_uncompressed)
```
%% Cell type:code id:6c6bf599-ba32-416f-ad51-1ddb058cbd83 tags:
``` python
sds
```
%% Cell type:code id:a2b12787-39e0-45cb-8ebf-b02b95e50b3d tags:
``` python
omon2d=xr.open_mfdataset(
    source_uncompressed,
    engine="h5netcdf",
    parallel=False
)
```
%% Cell type:markdown id:727f4776-5667-4e74-8c08-cea0855d217c tags:
The following "*compression* : *configuration*" dictionary is used to configure the `encoding` keyword argument in xarray's *to_netcdf*:
%% Cell type:code id:8a419d87-1dd1-4b3c-9d48-68c5ced53549 tags:
``` python
comprdict=dict(
    zlib=dict(
        engine="h5netcdf",
        compr=dict(
            zlib=True,
            complevel=5
        )
    ),
    zstd=dict(
        engine="h5netcdf",
        # from python 3.11:
        compr=dict(**hdf5plugin.Bitshuffle(cname="zstd"))
        # compr=dict(**hdf5plugin.Zstd())
    ),
    lz4=dict(
        engine="h5netcdf",
        # from python 3.11:
        compr=dict(**hdf5plugin.Bitshuffle(cname="lz4"))
        # compr=dict(**hdf5plugin.Bitshuffle(lz4=True))
    ),
    blosc=dict(
        engine="h5netcdf",
        compr=dict(**hdf5plugin.Blosc(cname='blosclz', shuffle=1))
    )
)
```
%% Cell type:code id:34e5bc3f-164e-4c40-b89f-a4e8d5866276 tags:
``` python
comprdict["lz4"]
```
%% Cell type:code id:1c356ce8-7d6f-4d47-a673-e8f1f698799f tags:
``` python
sourcesize=fs.filesystem("file").du(source_uncompressed)
print(f"The size of the uncompressed source file is {sourcesize/1024/1024} MB")
```
%% Cell type:code id:e6202eb6-0897-4d18-b15e-69dd30b53c23 tags:
``` python
resultdir={}
for compr,config in tqdm.tqdm(comprdict.items()):
    enc=dict()
    for var in omon2d.data_vars:
        enc[var]=config["compr"]
    start=time.time()
    omon2d.to_netcdf(f"{pwd}/test_{compr}_compression.nc",
                     mode="w",
                     engine=config["engine"],
                     unlimited_dims="time",
                     encoding=enc,
                     )
    end=time.time()
    resultdir[compr]=dict(
        speed=sourcesize/(end-start)/1024/1024,
        ratio=fs.filesystem("file").du(f"{pwd}/test_{compr}_compression.nc")/sourcesize
    )
```
%% Cell type:code id:a054a804-0cd6-4f89-9d95-8805d79bac77 tags:
``` python
with open(f"results_{str(times)}.csv","w") as f:
    for k,v in resultdir.items():
        f.write(f"{k},{sourcesize},{v['speed']},{v['ratio']}\n")
```
%% Cell type:markdown id:c1befda5-4bd5-4d1e-b586-8794b0046d85 tags:
### Reading non-deflated data
Before opening a file compressed with anything other than zlib, you have to import hdf5plugin first:
%% Cell type:code id:d2f290c4-31a3-48ab-b0a6-c360a8bf171e tags:
``` python
import hdf5plugin
import xarray as xr
outf=xr.open_dataset(f"{pwd}/test_zstd_compression.nc",engine="h5netcdf")
```
%% Cell type:code id:6848c002-42d2-47e3-93a5-988995f40208 tags:
``` python
outf
```
%% Cell type:markdown id:d6b06f6c-4e6c-41ef-84c7-f3ef0c011c8f tags:
## Lossy
1. Direct `BitRound`ing, keeping 16 mantissa bits. This precision can be considered similar to, e.g., ERA5 data (24-bit integer space).
1. Calculating the number of bits to keep for an information level of 0.99 via *xbitinfo*.
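Bitrounding zeroes out the least significant mantissa bits so that the subsequent lossless filters find long runs of zeros. The following minimal sketch (the helper `bitround` is hypothetical; it rounds half up, while `numcodecs.BitRound` rounds to nearest even) shows the idea on float32 data:

``` python
import numpy as np

def bitround(a, keepbits):
    """Keep only `keepbits` float32 mantissa bits (round half up).

    Minimal sketch of what numcodecs.BitRound does; not the real codec.
    """
    assert a.dtype == np.float32 and 0 < keepbits < 23
    drop = 23 - keepbits
    bits = a.view(np.uint32).copy()
    bits += np.uint32(1 << (drop - 1))                  # round at first dropped bit
    bits &= np.uint32(0xFFFFFFFF ^ ((1 << drop) - 1))   # zero out dropped bits
    return bits.view(np.float32)

x = np.linspace(0.1, 1.0, 10, dtype=np.float32)
r = bitround(x, 16)
print(np.abs(x - r).max())  # error stays below half of the last kept bit
```

The zeroed trailing bits are what makes the rounded field compress so much better with the lossless methods above.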
%% Cell type:code id:bf7139de-3dda-4b9f-a04b-b70d3eb018f8 tags:
``` python
losstimestart=time.time()
import numcodecs
rounding = numcodecs.BitRound(keepbits=16)
for var in omon2d.data_vars:
    if "bnds" in var:
        continue
    omon2d[var].data=rounding.decode(
        rounding.encode(
            omon2d[var].load().data
        )
    )
losstimeend=time.time()
```
%% Cell type:code id:358dc7c4-5186-4265-a5f5-8d7ee7e73c18 tags:
``` python
resultdir={}
for compr,config in tqdm.tqdm(comprdict.items()):
    enc=dict()
    for var in omon2d.data_vars:
        enc[var]=config["compr"]
    start=time.time()
    omon2d.to_netcdf(f"{pwd}/test_{compr}_compression_lossy.nc",
                     mode="w",
                     engine=config["engine"],
                     unlimited_dims="time",
                     encoding=enc,
                     )
    end=time.time()
    resultdir[compr]=dict(
        speed=sourcesize/(end-start+losstimeend-losstimestart)/1024/1024,
        ratio=fs.filesystem("file").du(f"{pwd}/test_{compr}_compression_lossy.nc")/sourcesize
    )
```
%% Cell type:code id:bbc7a341-cf54-4073-bf26-c1549ebe44e7 tags:
``` python
with open(f"results_{str(times)}.csv","a") as f:
    for k,v in resultdir.items():
        f.write(f"{k}_lossy,{sourcesize},{v['speed']},{v['ratio']}\n")
```
%% Cell type:markdown id:6cece0ed-0ab0-4fd1-a515-93f36b6753a8 tags:
### Xbitinfo
%% Cell type:code id:8ff2e4f1-10ae-4b0b-9d4f-5ffbf871dbdb tags:
``` python
omon2d=xr.open_mfdataset(
    source_uncompressed,
    engine="h5netcdf",
    parallel=False
)
```
%% Cell type:code id:09e45692-8bdd-4a78-82eb-9bead71875bb tags:
``` python
import xbitinfo as xb
```
%% Cell type:code id:3c021ba4-7172-4035-9305-ae61e92bb684 tags:
``` python
import time
bitinfostart=time.time()
for var in omon2d.data_vars:
    if "bnds" in var:
        continue
    dims=[dim for dim in omon2d[[var]].dims.keys() if "ncell" in dim]
    print(dims)
    if dims:
        bitinfo = xb.get_bitinformation(omon2d[[var]], dim=dims, implementation="python")
        keepbits = xb.get_keepbits(bitinfo, inflevel=0.99)
        print(keepbits)
        if keepbits[var][0] > 0:
            print(keepbits[var][0])
            # xr_bitround wraps around numcodecs.bitround
            omon2d[var] = xb.xr_bitround(omon2d[[var]], keepbits)[var]
bitinfoend=time.time()
```
%% Cell type:code id:0d2def50-c209-41ee-8604-5fdb8554f640 tags:
``` python
resultdir={}
for compr,config in tqdm.tqdm(comprdict.items()):
    enc=dict()
    for var in omon2d.data_vars:
        enc[var]=config["compr"]
    start=time.time()
    omon2d.to_netcdf(f"{pwd}/test_{compr}_compression_lossy_xbit.nc",
                     mode="w",
                     engine=config["engine"],
                     unlimited_dims="time",
                     encoding=enc,
                     )
    end=time.time()
    resultdir[compr]=dict(
        speed=sourcesize/(end-start+bitinfoend-bitinfostart)/1024/1024,
        ratio=fs.filesystem("file").du(f"{pwd}/test_{compr}_compression_lossy_xbit.nc")/sourcesize
    )
```
%% Cell type:code id:4f95632f-ac07-4192-8a96-b3bb98ebfde2 tags:
``` python
with open(f"results_{str(times)}.csv","a") as f:
    for k,v in resultdir.items():
        f.write(f"{k}_lossy_xbit,{sourcesize},{v['speed']},{v['ratio']}\n")
```
%% Cell type:markdown id:3d000b88-f995-4369-b87d-98e910c57e9a tags:
### Write the results
%% Cell type:code id:1b054323-937d-4bd8-a785-d3f709a4ee26 tags:
``` python
import pandas as pd
import glob
df = pd.concat((pd.read_csv(f,names=["type","insize","write_speed [Mb/s]","ratio"]) for f in glob.glob("results*.csv")), ignore_index=True)
```
%% Cell type:code id:9c381d31-1156-4e95-ae07-b6cc942a9288 tags:
``` python
df.groupby("type").mean()[["write_speed [Mb/s]","ratio"]].sort_values(by="write_speed [Mb/s]",ascending=False)
```
%% Cell type:code id:802c5d41-2791-4e0d-bf1f-35a1e748c9ed tags:
``` python
!rm test_*compression*.nc
```
%% Cell type:code id:0fd679fe-140a-4ab3-9a09-0349ad6ab821 tags:
``` python
!rm temp.nc
```
%% Cell type:code id:31141e83-3c2c-4b05-a5ee-d070a9f7d004 tags:
``` python
```