Commit 2b2fbd10 authored by Fabian Wachsmann

Setup for ci
%% Cell type:markdown id:6e0b3c7d-b3ac-4013-979a-f4b2012b48f7 tags:
# Compression methods for NetCDF files
with Xarray
This notebook gives a guideline on how to use recent basic lossy and lossless compression for netCDF output via `xarray`. It is an implementation of [these slides](https://www.researchgate.net/publication/365006139_NetCDF_Compression_Improvements) using example data of the Earth System Model AWI-CM. We will compare the writing speed and the compression ratio of different methods.
For lossless compression of netCDF files, we use the [hdf5plugin](https://github.com/silx-kit/hdf5plugin/blob/b53d70b702154a149a60f653c49b20045228aa16/doc/index.rst) package, which makes additional HDF5 filters available within Python.
For *lossy* compression, we use the [numcodecs](https://github.com/zarr-developers/numcodecs) library to calculate the bitrounding.
We use the [BitShuffle](https://github.com/kiyo-masui/bitshuffle) filter prior to compression whenever possible. It rearranges the binary data in order to improve compression.
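As an illustration of why shuffling helps, the following sketch applies a *byte*-wise shuffle (BitShuffle works the same way, but at bit granularity) to smooth float32 data before deflate compression. This is a standalone toy using only `numpy` and the stdlib `zlib`, not the actual HDF5 filter:

``` python
import zlib
import numpy as np

# Smooth synthetic float32 data, loosely resembling a model field
data = np.linspace(0.0, 1.0, 100_000, dtype=np.float32)
raw = data.tobytes()

# Byte-wise shuffle: collect the 1st byte of every value, then the 2nd, ...
# Sign/exponent bytes of similar values are highly repetitive, so grouping
# them produces long runs that deflate compresses well.
shuffled = data.view(np.uint8).reshape(-1, 4).T.copy().tobytes()

print("unshuffled:", len(zlib.compress(raw, 5)))
print("shuffled:  ", len(zlib.compress(shuffled, 5)))
```

On data like this the shuffled buffer compresses noticeably better; the effect on real model output depends on how smooth the fields are.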
## Requirements
- Lossy compression requires a lot of memory.
- Reading lossless compressions other than deflate requires netcdf-c version 4.9 or newer with access to the HDF5 filters
%% Cell type:markdown id:0f4d7d8d-e120-4d56-9b45-1bfd7eb4ba4b tags:
## Lossless compression methods
### [zlib](https://github.com/madler/zlib)
zlib has been the standard compression method since compression was introduced in netCDF. It is based on the deflate algorithm; other compression packages use dictionaries.
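Deflate's `complevel` (1-9) trades compression time against output size; the value 5 used later in this notebook is a middle ground. A minimal stdlib sketch of that trade-off, on synthetic data:

``` python
import zlib
import numpy as np

payload = np.arange(200_000, dtype=np.float32).tobytes()

# netCDF's `complevel` encoding option maps onto this 1-9 deflate scale.
for level in (1, 5, 9):
    compressed = zlib.compress(payload, level)
    assert zlib.decompress(compressed) == payload  # lossless round-trip
    print(f"level {level}: {len(compressed)/len(payload):.3f} of original size")
```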
### [Zstandard](https://github.com/facebook/zstd)
Zstandard allows multithreading. It is used by package registries and is supported by Linux systems. Zstandard offers features beyond zlib/zlib-ng, including better performance and compression.
### [LZ4](https://github.com/lz4/lz4)
LZ4 focuses on very fast compression. It scales with the number of CPU cores.
### [Blosc](https://github.com/Blosc/c-blosc)
Blosc uses a blocking technique not only to reduce the size of large datasets on disk or in memory, but also to accelerate memory-bound computations. Its default compressor is based on LZ4 and Zstandard.
%% Cell type:code id:911b4efb-47e2-422b-acf2-a3d8d37b1baf tags:
``` python
import hdf5plugin
import time
import fsspec as fs
import glob
import xarray as xr
import tqdm
```
%% Cell type:code id:888676a5-4c4c-4c49-8f97-7144e2bee12d tags:
``` python
help(hdf5plugin)
```
%% Cell type:code id:a598a965-73c3-4700-8de2-4afa26ec3702 tags:
``` python
%store -r times
#print(times)
times=0
```
%% Cell type:markdown id:48311eee-fdf4-4e75-9ae6-ace523381037 tags:
On Levante, you can use the plugins from the CMIP pool directory `/work/ik1017/hdf5plugin/plugins/`:
%% Cell type:code id:cad827c9-1445-481e-b24b-0017bf78e19b tags:
``` python
hdf5plugin.PLUGIN_PATH="/work/ik1017/hdf5plugin/plugins/"
%set_env HDF5_PLUGIN_PATH={hdf5plugin.PLUGIN_PATH}
```
%% Cell type:markdown id:c7a91342-9e70-40ba-9389-54fca72bb949 tags:
We use the ocean surface temperature `tos` in this example:
%% Cell type:code id:c346fc8b-f437-41e3-9c83-6f2ef49fe749 tags:
``` python
source="/work/ik1017/CMIP6/data/CMIP6/ScenarioMIP/AWI/AWI-CM-1-1-MR/ssp370/r1i1p1f1/Omon/tos/gn/v20181218/tos_Omon_AWI-CM-1-1-MR_ssp370_r1i1p1f1_gn_201501-202012.nc"
pwd=!pwd
pwd=pwd[0]
source_uncompressed=f"{pwd}/temp.nc"
sds=xr.open_mfdataset(source).isel(time=slice(1,13))
for var in sds.variables:
    sds[var].encoding["zlib"]=False
sds.to_netcdf(source_uncompressed)
```
%% Cell type:code id:6c6bf599-ba32-416f-ad51-1ddb058cbd83 tags:
``` python
sds
```
%% Cell type:code id:a2b12787-39e0-45cb-8ebf-b02b95e50b3d tags:
``` python
omon2d=xr.open_mfdataset(
    source_uncompressed,
    engine="h5netcdf",
    parallel=False
)
```
%% Cell type:markdown id:727f4776-5667-4e74-8c08-cea0855d217c tags:
The following "*compression* : *configuration*" dictionary is used to configure the `encoding` keyword argument in xarray's *to_netcdf*:
%% Cell type:code id:8a419d87-1dd1-4b3c-9d48-68c5ced53549 tags:
``` python
comprdict=dict(
    zlib=dict(
        engine="h5netcdf",
        compr=dict(
            zlib=True,
            complevel=5
        )
    ),
    zstd=dict(
        engine="h5netcdf",
        # from python 3.11:
        compr=dict(**hdf5plugin.Bitshuffle(cname="zstd"))
        # compr=dict(**hdf5plugin.Zstd())
    ),
    lz4=dict(
        engine="h5netcdf",
        # from python 3.11:
        compr=dict(**hdf5plugin.Bitshuffle(cname="lz4"))
        # compr=dict(**hdf5plugin.Bitshuffle(lz4=True))
    ),
    blosc=dict(
        engine="h5netcdf",
        compr=dict(**hdf5plugin.Blosc(cname='blosclz', shuffle=1))
    )
)
```
%% Cell type:code id:34e5bc3f-164e-4c40-b89f-a4e8d5866276 tags:
``` python
comprdict["lz4"]
```
%% Cell type:code id:1c356ce8-7d6f-4d47-a673-e8f1f698799f tags:
``` python
sourcesize=fs.filesystem("file").du(source_uncompressed)
print(f"The size of the uncompressed source file is {sourcesize/1024/1024} MB")
```
%% Cell type:code id:e6202eb6-0897-4d18-b15e-69dd30b53c23 tags:
``` python
resultdir={}
for compr,config in tqdm.tqdm(comprdict.items()):
    enc=dict()
    for var in omon2d.data_vars:
        enc[var]=config["compr"]
    start=time.time()
    omon2d.to_netcdf(f"{pwd}/test_{compr}_compression.nc",
                     mode="w",
                     engine=config["engine"],
                     unlimited_dims="time",
                     encoding=enc,
                     )
    end=time.time()
    resultdir[compr]=dict(
        speed=sourcesize/(end-start)/1024/1024,
        ratio=fs.filesystem("file").du(f"{pwd}/test_{compr}_compression.nc")/sourcesize
    )
```
%% Cell type:code id:a054a804-0cd6-4f89-9d95-8805d79bac77 tags:
``` python
with open(f"results_{str(times)}.csv","w") as f:
    for k,v in resultdir.items():
        f.write(f"{k},{sourcesize},{v['speed']},{v['ratio']}\n")
```
%% Cell type:markdown id:c1befda5-4bd5-4d1e-b586-8794b0046d85 tags:
### Reading non-deflated data
Before opening a file compressed with anything other than zlib, you have to import hdf5plugin first:
%% Cell type:code id:d2f290c4-31a3-48ab-b0a6-c360a8bf171e tags:
``` python
import hdf5plugin
import xarray as xr
outf=xr.open_dataset(f"{pwd}/test_zstd_compression.nc",engine="h5netcdf")
```
%% Cell type:code id:6848c002-42d2-47e3-93a5-988995f40208 tags:
``` python
outf
```
%% Cell type:markdown id:d6b06f6c-4e6c-41ef-84c7-f3ef0c011c8f tags:
## Lossy
1. Direct `BitRound`ing, keeping 16 mantissa bits. This precision can be considered similar to, e.g., ERA5 data (24-bit integer space).
1. Calculating the number of bits to keep for an information level of 0.99 via *xbitinfo*.
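Bitrounding zeroes out the least significant mantissa bits so that the subsequent lossless filters find long runs of zeros. The following minimal sketch (the helper `bitround` is hypothetical; it rounds half up, while `numcodecs.BitRound` rounds to nearest even) shows the idea on float32 data:

``` python
import numpy as np

def bitround(a, keepbits):
    """Keep only `keepbits` float32 mantissa bits (round half up).

    Minimal sketch of what numcodecs.BitRound does; not the real codec.
    """
    assert a.dtype == np.float32 and 0 < keepbits < 23
    drop = 23 - keepbits
    bits = a.view(np.uint32).copy()
    bits += np.uint32(1 << (drop - 1))                  # round at first dropped bit
    bits &= np.uint32(0xFFFFFFFF ^ ((1 << drop) - 1))   # zero out dropped bits
    return bits.view(np.float32)

x = np.linspace(0.1, 1.0, 10, dtype=np.float32)
r = bitround(x, 16)
print(np.abs(x - r).max())  # error stays below half of the last kept bit
```

The zeroed trailing bits are what makes the rounded field compress so much better with the lossless methods above.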
%% Cell type:code id:bf7139de-3dda-4b9f-a04b-b70d3eb018f8 tags:
``` python
losstimestart=time.time()
import numcodecs
rounding = numcodecs.BitRound(keepbits=16)
for var in omon2d.data_vars:
    if "bnds" in var:
        continue
    omon2d[var].data=rounding.decode(
        rounding.encode(
            omon2d[var].load().data
        )
    )
losstimeend=time.time()
```
%% Cell type:code id:358dc7c4-5186-4265-a5f5-8d7ee7e73c18 tags:
``` python
resultdir={}
for compr,config in tqdm.tqdm(comprdict.items()):
    enc=dict()
    for var in omon2d.data_vars:
        enc[var]=config["compr"]
    start=time.time()
    omon2d.to_netcdf(f"{pwd}/test_{compr}_compression_lossy.nc",
                     mode="w",
                     engine=config["engine"],
                     unlimited_dims="time",
                     encoding=enc,
                     )
    end=time.time()
    resultdir[compr]=dict(
        speed=sourcesize/(end-start+losstimeend-losstimestart)/1024/1024,
        ratio=fs.filesystem("file").du(f"{pwd}/test_{compr}_compression_lossy.nc")/sourcesize
    )
```
%% Cell type:code id:bbc7a341-cf54-4073-bf26-c1549ebe44e7 tags:
``` python
with open(f"results_{str(times)}.csv","a") as f:
    for k,v in resultdir.items():
        f.write(f"{k}_lossy,{sourcesize},{v['speed']},{v['ratio']}\n")
```
%% Cell type:markdown id:6cece0ed-0ab0-4fd1-a515-93f36b6753a8 tags:
### Xbitinfo
%% Cell type:code id:8ff2e4f1-10ae-4b0b-9d4f-5ffbf871dbdb tags:
``` python
omon2d=xr.open_mfdataset(
    source_uncompressed,
    engine="h5netcdf",
    parallel=False
)
```
%% Cell type:code id:09e45692-8bdd-4a78-82eb-9bead71875bb tags:
``` python
import xbitinfo as xb
```
%% Cell type:code id:3c021ba4-7172-4035-9305-ae61e92bb684 tags:
``` python
import time
bitinfostart=time.time()
for var in omon2d.data_vars:
    if "bnds" in var:
        continue
    dims=[dim for dim in omon2d[[var]].dims.keys() if "ncell" in dim]
    print(dims)
    if dims:
        bitinfo = xb.get_bitinformation(omon2d[[var]], dim=dims, implementation="python")
        keepbits = xb.get_keepbits(bitinfo, inflevel=0.99)
        print(keepbits)
        if keepbits[var][0] > 0:
            print(keepbits[var][0])
            # xr_bitround wraps around numcodecs.bitround
            omon2d[var] = xb.xr_bitround(omon2d[[var]], keepbits)[var]
bitinfoend=time.time()
```
%% Cell type:code id:0d2def50-c209-41ee-8604-5fdb8554f640 tags:
``` python
resultdir={}
for compr,config in tqdm.tqdm(comprdict.items()):
    enc=dict()
    for var in omon2d.data_vars:
        enc[var]=config["compr"]
    start=time.time()
    omon2d.to_netcdf(f"{pwd}/test_{compr}_compression_lossy_xbit.nc",
                     mode="w",
                     engine=config["engine"],
                     unlimited_dims="time",
                     encoding=enc,
                     )
    end=time.time()
    resultdir[compr]=dict(
        speed=sourcesize/(end-start+bitinfoend-bitinfostart)/1024/1024,
        ratio=fs.filesystem("file").du(f"{pwd}/test_{compr}_compression_lossy_xbit.nc")/sourcesize
    )
```
%% Cell type:code id:4f95632f-ac07-4192-8a96-b3bb98ebfde2 tags:
``` python
with open(f"results_{str(times)}.csv","a") as f:
    for k,v in resultdir.items():
        f.write(f"{k}_lossy_xbit,{sourcesize},{v['speed']},{v['ratio']}\n")
```
%% Cell type:markdown id:3d000b88-f995-4369-b87d-98e910c57e9a tags:
### Write the results
%% Cell type:code id:1b054323-937d-4bd8-a785-d3f709a4ee26 tags:
``` python
import pandas as pd
import glob
df = pd.concat((pd.read_csv(f,names=["type","insize","write_speed [Mb/s]","ratio"]) for f in glob.glob("results*.csv")), ignore_index=True)
```
%% Cell type:code id:9c381d31-1156-4e95-ae07-b6cc942a9288 tags:
``` python
df.groupby("type").mean()[["write_speed [Mb/s]","ratio"]].sort_values(by="write_speed [Mb/s]",ascending=False)
```
%% Cell type:code id:802c5d41-2791-4e0d-bf1f-35a1e748c9ed tags:
``` python
!rm test_*compression*.nc
```
%% Cell type:code id:0fd679fe-140a-4ab3-9a09-0349ad6ab821 tags:
``` python
!rm temp.nc
```
%% Cell type:code id:31141e83-3c2c-4b05-a5ee-d070a9f7d004 tags:
``` python
```