How to deal with suboptimal, over-chunked datasets
I ran into some issues with (at least) the ERA5 data that I was generating, which basically relies on cdo, and also with Hyras data, which (afaik) relies on cdo and nco.
I am proposing an extra flag in a new branch to deal with this.
ERA5 data
(e.g. psl), 1-month file of 1hr data:
- original chunking: very small chunks all along the time dimension: 1 timestep == 1 chunk
psl: {'time': 744, 'lat': 640, 'lon': 1280}
* Chunk Sizes: (1, 640, 1280)
* Estimated Chunk Size: 3.12 MB
- after the rechunker is applied:
psl: {'time': 744, 'lat': 640, 'lon': 1280}
* Chunk Sizes: (40, 640, 1280)
* Estimated Chunk Size: 125.00 MB
* Chunks: {'time': (40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 40, 24), 'lat': (640,), 'lon': (1280,)}
As you can see, it is optimally chunked along the time dim, but it uses a single chunk for the whole lat/lon dims. That means that for a regional analysis the full horizontal field would need to be loaded anyway.
The alternative that I am proposing in the new branch, a new flag that forces (prioritises) horizontal chunking (-fh), would result in:
psl: {'time': 744, 'lat': 640, 'lon': 1280}
* Chunk Sizes: (744, 148, 297)
* Estimated Chunk Size: 124.75 MB
* Chunks: {'time': (744,), 'lat': (148, 148, 148, 148, 48), 'lon': (297, 297, 297, 297, 92)}
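To put numbers on the regional-analysis argument, here is a rough sketch in plain Python (not part of the tool) that counts how many chunks, and how many MB, a regional time-series read has to pull in under each of the two layouts above. The `read_cost` helper, the example region and the 4-byte item size (float32, inferred from the reported 3.12 MB per timestep) are assumptions for illustration only.

```python
def read_cost(selection, shape, chunks, itemsize=4):
    """Return (n_chunks, megabytes) needed to serve an index selection.

    selection: one (start, stop) index pair per dimension, stop exclusive.
    Whole chunks are always read, so the cost is set by the chunks touched.
    itemsize=4 assumes float32 data (an assumption, see text).
    """
    n_chunks, n_elems = 1, 1
    for (start, stop), n, c in zip(selection, shape, chunks):
        first, last = start // c, (stop - 1) // c
        n_chunks *= last - first + 1
        # Elements covered by the touched chunks; the last chunk may be short.
        n_elems *= min((last + 1) * c, n) - first * c
    return n_chunks, n_elems * itemsize / 2**20


shape = (744, 640, 1280)                       # psl: time, lat, lon
region = [(0, 744), (150, 250), (300, 500)]    # hypothetical regional box, full month

time_chunked = read_cost(region, shape, (40, 640, 1280))
horizontal = read_cost(region, shape, (744, 148, 297))
print("time-chunked:", time_chunked)
print("horizontal:  ", horizontal)
```

For this example box the time-chunked layout touches all 19 chunks, i.e. the whole file of roughly 2.3 GB, while the horizontally chunked layout touches a single chunk of about 125 MB.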
Hyras:
annual file of 1day data.
- Hyras shows an extreme case of very bad original chunking: 366 chunks on time, 1100 chunks on y, 1 chunk on x:
pr: {'time': 366, 'y': 1100, 'x': 1200}
* Chunk Sizes: (1, 1, 1200)
* Estimated Chunk Size: 0.00 MB
- after rechunking it still shows very small chunks (instead of the intended ~126MB):
pr: {'time': 366, 'y': 1100, 'x': 1200}
* Chunk Sizes: (366, 1, 1200)
* Estimated Chunk Size: 1.68 MB
- with my alternative (flag --force-horizontal):
pr: {'time': 366, 'y': 1100, 'x': 1200}
* Chunk Sizes: (366, 287, 313)
* Estimated Chunk Size: 125.42 MB
The method internally offers some leeway: you can define the target chunk size in MB, and also whether you want a similar number of chunks along time, y and x, or to prioritise chunking over x and y only, leaving time as a single block. These options live in the internal method and are not user-selectable for the moment.
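As an illustration of the two strategies described above, here is a minimal sketch of how such chunk sizes could be derived from a target size in MB. This is not the code in the branch: the function name `propose_chunks`, its parameters, the rounding rule and the float32 item size are all assumptions.

```python
import math


def propose_chunks(shape, target_mb=125.0, itemsize=4, keep_time_whole=True):
    """Suggest (time, y, x) chunk sizes of roughly target_mb each.

    keep_time_whole=True  -> time stays one block, only y/x are split.
    keep_time_whole=False -> all three dims shrink by the same factor,
                             aiming at a similar number of chunks per dim.
    itemsize=4 assumes float32 (an assumption that matches the sizes above).
    """
    nt, ny, nx = shape
    budget = target_mb * 2**20 / itemsize      # elements allowed per chunk
    ratio = budget / (nt * ny * nx)            # fraction of the array per chunk
    if keep_time_whole:
        # Shrink y and x by the same factor so tiles keep the grid aspect ratio.
        factor = min(1.0, math.sqrt(ratio))
        ct = nt
    else:
        # Shrink time, y and x by the same factor.
        factor = min(1.0, ratio ** (1 / 3))
        ct = max(1, round(nt * factor))
    cy = max(1, round(ny * factor))
    cx = max(1, round(nx * factor))
    return ct, cy, cx


def chunk_mb(chunks, itemsize=4):
    """Size of one full chunk in MB (2**20 bytes)."""
    return math.prod(chunks) * itemsize / 2**20


era5 = propose_chunks((744, 640, 1280))
hyras = propose_chunks((366, 1100, 1200))
hyras_balanced = propose_chunks((366, 1100, 1200), keep_time_whole=False)
for name, c in [("era5", era5), ("hyras", hyras), ("hyras balanced", hyras_balanced)]:
    print(name, c, f"{chunk_mb(c):.2f} MB")
```

Under these assumptions the ERA5 case gives (744, 148, 297), the same as the output reported above. The Hyras values can differ slightly from the reported ones, since the exact rounding used by the branch's method is not reproduced here.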