@@ -16,7 +16,7 @@ dependencies:
- netcdf4
- numpy
- pandas
- cartopy>=0.2
# - cartopy>0.20
- scipy
- conda-forge/label/dev::cdo
- python-cdo
%% Cell type:markdown id:d8957c6c-7375-4bff-8985-475be95a245f tags:
# Tzis - To Zarr in Swift
`tzis` is a small python package which
1. converts data into the [zarr](https://zarr.readthedocs.io/en/stable/) format and
1. writes it to the DKRZ's cloud storage space [swift](https://swiftbrowser.dkrz.de/)
in one step. It is based on a script which uses [xarray](http://xarray.pydata.org/en/stable/index.html) and the `fsspec` [implementation for swift](https://github.com/d70-t/swiftspec) from Tobias Kölling. `tzis` is optimized for DKRZ's High Performance Computer but can also be used from local computers.
`tzis` features
- writing of **different input file formats**. All files that can be passed to `xarray.open_mfdataset()` can be used.
- **writing** an atomic dataset, i.e. one variable covering many files, into the cloud per `write_zarr` call.
- **consolidated stores**. Metadata of many files are saved into one. Conflicting metadata with varying values are combined into a list, e.g. `tracking_id`s.
- **chunking** along the `time` dimension. Datasets without `time` will be written directly ("unmodified") to storage.
- **swift-store** implementation for using basic filesystem-like operations on the object store (like `listdir`)
%% Cell type:markdown id:6f216a5e-dd27-42ad-ad9b-baa4b0434f86 tags:
In this notebook, you will learn
- the [meaning](#define) of `zarr` and the `swift object storage`
- why you can [benefit](#moti) from `zarr` in cloud storage
- [when](#when) it is a good idea to write into cloud
- how to [initialize the swift store](#token) for `tzis` including creating a token
- how to [open and configure](#source) the source dataset
- how to [write](#write) data to swift
- how to [set options](#output) for the zarr output
- how to [access](#access) and use data from swift
- how to work with the [SwiftStore](#swiftstore) similar to file systems
%% Cell type:markdown id:8d6a1c4f-66b5-466a-84b6-8a011ac5bd82 tags:
<a class="anchor" id="define"></a>
## Definition
**Zarr** is a *cloud-optimised* format for climate data. By using *chunk*-based data access, `zarr` enables arrays that can be larger than memory. Both input and output operations can be parallelised. It features *customization* of compression methods and stores.
The **Swift** cloud object storage is a 🔑 *key-value* store where the key is a globally unique identifier and the value a representation of binary data. In contrast to a file system 📁 , there are no files or directories but *objects and containers/buckets*. Data access is possible via the internet, i.e. `http`.
%% Cell type:markdown id:82d19de7-889b-423a-beb2-0ed10972c90e tags:
<a class="anchor" id="moti"></a>
## Motivation
In recent years, object storage systems have become an alternative to traditional file systems because of
- **Independence** from computational resources. Users can access and download data from anywhere without needing HPC access or resources.
- **Scalability**, because no file system or system administrator has to manage the connected disks.
- **A lack of storage** space in general because of increasing model output volume.
- **No namespace conflicts**, because data is accessed via globally unique identifiers.
Large Earth System Science databases like the CMIP Data Pool at DKRZ contain [netCDF](https://github.com/Unidata/netcdf-c) formatted data. Access to and transfer of such data from an object storage can only be conducted at file level, which results in heavy download volumes and less reproducible workflows.
The cloud-optimised climate data format [Zarr](https://zarr.readthedocs.io/en/stable/) solves these problems by
- allowing programs to identify _chunks_ corresponding to the desired subset of the data before the download so that the **volume of data transfer is reduced**.
- allowing users to access the data via `http` so that **neither authentication nor software** on the cloud repository site is required
- saving **metadata** next to the binary data. That allows programs to quickly create a virtual representation of large and complex datasets.
Zarr formatted data in the cloud makes the data as *analysis ready* as possible.
With `tzis`, we developed a package that enables the use of DKRZ's institutional cloud storage as a back-end storage for Earth System Science data. It combines `swiftclient`-based scripts, a *Zarr storage* implementation and a high-level `xarray` application including `rechunking`. Download velocity can reach up to **400 MB/s**. Additional validation of the data transfer ensures its completeness.
%% Cell type:markdown id:87d9ffaf-55c4-4013-b961-3bfb94d9c99a tags:
<a class="anchor" id="when"></a>
## Which type of data is suitable?
Datasets in the cloud are useful if
- they are *fixed*. Moving data in the cloud is very inefficient.
- they will not be *prepended*. Data in the cloud can be easily *appended* but *prepending* most likely requires moving which is not efficient.
- they are *open*. One advantage comes from the easy access via http. This is even easier when users do not have to log in.
%% Cell type:markdown id:77feab5f-a512-4449-95c7-4daa0762f25f tags:
<a class="anchor" id="token"></a>
## Swift authentication and initialization
Central `tzis` functions require that you specify an `OS_AUTH_TOKEN` which allows the program to connect to the swift storage with your credentials. By default, this token is valid for one month; otherwise, you would have to log in for each new session. When you work with `swift`, this token is saved in the hidden file `~/.swiftenv` which contains the following parameters
- `OS_STORAGE_URL` which is the URL associated with the storage space of the project or the user. Note that this URL cannot be opened like a *swiftbrowser* link but instead it can be used within programs like `tzis`.
- `OS_AUTH_TOKEN`.
**Be careful** with the token. It should stay only readable for you. Especially, do not push it into git repos.
%% Cell type:markdown id:aa1c7b7f-6c8e-44db-9c29-7bb9eee88a84 tags:
<a class="anchor" id="token"></a>
### Get token and url
`Tzis` includes a function to get the token or, if not available, create the token:
```python
from tzis import swifthandling
token=swifthandling.get_token(
    "dkrz",
    project,
    user
)
```
When calling `get_token`,
1. it tries to read the configuration file `~/.swiftenv`
1. if there is a file, it checks whether the found configuration matches the specified *account*
1. if no file was found or the configuration is invalid, it will create a token
1. it asks you for a password
1. it writes two files: the `~/.swiftenv` with the configuration and `~/.swiftenv_useracc` which contains the account and user specification for that token.
1. it returns a dictionary with all configuration variables
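As a minimal sketch of how the returned dictionary can be reused afterwards (the project and user values here are placeholders, not real accounts):
```python
from tzis import swifthandling

# placeholder project and user names; replace them with your own
token = swifthandling.get_token("dkrz", "ab0123", "a123456")

# the returned dictionary holds the configuration variables described above
storage_url = token["OS_STORAGE_URL"]
auth_token = token["OS_AUTH_TOKEN"]
print(sorted(token.keys()))
```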
%% Cell type:markdown id:13396f94-6da3-4eb9-9602-1045fc9540c5 tags:
### Initializing an output container
After having the authentication for swift, we *initialize* a swift container in which we will save the data. We do that with
```python
target_fsmap=swifthandling.get_swift_mapper(
    token["OS_STORAGE_URL"],
    token["OS_AUTH_TOKEN"],
    container_name,
    os_name=prefix_for_object_storage
)
```
The mandatory arguments are:
- the storage URL `token["OS_STORAGE_URL"]`
- the authentication token `token["OS_AUTH_TOKEN"]`
- `container_name`, the *container name* / the *bucket*. A container is the highest of the two store levels in the swift object store.
These connect you to the swift store and initialize/create the container.
In addition, you can already specify an `os_name` which is the *zarr dataset name*, i.e. the *object* prefix under which the data will be stored in the end. Whether the write process runs in *verbose* mode is controlled later via `verbose=True` in `write_zarr`.
E.g.:
```python
target_fsmap=swifthandling.get_swift_mapper(
    token["OS_STORAGE_URL"],
    token["OS_AUTH_TOKEN"],
    "tzistest"
)
```
%% Cell type:code id:68d2c5f4-f248-401b-b300-85a37f748f49 tags:
``` python
from tzis import swifthandling
help(swifthandling.get_swift_mapper)
```
%% Cell type:markdown id:c2380bfc-22ab-41c0-96a6-aa9ce2203b13 tags:
### Setting a zarr dataset name (an object prefix)
You can switch to a different zarr dataset output name within the same container by creating a new mapper with another `os_name`:
```python
target_fsmap=swifthandling.get_swift_mapper(
    token["OS_STORAGE_URL"],
    token["OS_AUTH_TOKEN"],
    container_name,
    os_name=another_os_name
)
```
%% Cell type:markdown id:9ec4e347-4d16-4916-8cd0-355ddd512fe2 tags:
<a class="anchor" id="source"></a>
## Open and configure the source dataset
We can now read in the source file(s) which should be written into the cloud. These input files will be opened with `xarray`'s `open_mfdataset` function.
Tzis offers a convenient function to directly open a dataset such that its chunks fit the target chunk size. See the *Writing to swift* chapter for notes related to the chunking.
```python
from tzis import openmf
omo = openmf.open_mfdataset_optimize(
    glob_path_var,
    varname,
    target_fsmap,
    chunkdim=chunkdim,
    target_mb=target_mb
)
```
The mandatory arguments are
- `glob_path_var`: The dataset file(s). A `str` or a `list` of source files which can be opened with
```python
mf_dset = xarray.open_mfdataset(glob_path_var,
                                decode_cf=True,
                                use_cftime=True,
                                data_vars='minimal',
                                coords='minimal',
                                compat='override',
                                combine_attrs="drop_conflicts")
```
- `varname`: The variable from the dataset which will be selected and then written into the object store
- `target_fsmap`: the target store given as an *fsspec* mapping (see the [initialization](#token) above)
E.g.:
```python
path_to_dataset = "/mnt/lustre02/work/ik1017/CMIP6/data/CMIP6/ScenarioMIP/DKRZ/MPI-ESM1-2-HR/ssp370/r1i1p1f1/Amon/tas/gn/v20190710/"
path_to_dataset = "/work/ik1017/CMIP6/data/CMIP6/ScenarioMIP/DKRZ/MPI-ESM1-2-HR/ssp370/r1i1p1f1/Amon/tas/gn/v20190710/"
mfs_towrite=[path_var +filename for filename in os.listdir(path_to_dataset)]
container.mf_dataset=container.open_mf_dataset(mfs_towrite, "pr")
container.mf_dataset=container.open_mf_dataset(openmf, "pr", target_fsmap)
container.mf_dataset
```
%% Cell type:markdown id:6736e17c-c4e5-4393-8ef9-13590c2397fe tags:
### Grib input
If you want to use GRIB (`grb`) input files, you can pass `cfgrib` as an **engine** for `xarray`:
```python
omo = openmf.open_mfdataset_optimize(list_of_grib_files, "pr", target_fsmap, xarray_kwargs=dict(engine="cfgrib"))
```
%% Cell type:markdown id:76171636-1f6c-453b-942e-c62d2b49467d tags:
<a class="anchor" id="write"></a>
## Writing to swift
After we have created the target mapper and opened the dataset, we can **write** it into the cloud. The conversion to `zarr` is done on the fly. We can specify all necessary configuration options within the `write` function:
```python
def write_zarr(
    self,
    fsmap,
    mf_dset,
    varname,
    chunkdim="time",
    target_mb=0,
    startchunk=0,
    validity_check=False,
    maxretries=3,
    trusted=True,
)
```
The function needs
- a target store `fsmap` as a *fsspec mapping*
- the input xarray dataset `mf_dset`
- the variable name `varname` which should be used to rechunk
The function allows you
- to set `chunkdim`, the *dimension* used for chunking. No dimension other than `time` is possible yet.
- to set `target_mb`, the target size of a data chunk in megabytes. A *chunk* corresponds to an object in the swift object storage. It has limitations on both sides: chunks smaller than 10 MB are not efficient, while sizes larger than 2 GB are not supported.
- to set `startchunk`. If the write process was interrupted, e.g. because your dataset is very large, you can specify at which chunk the write process should restart.
- to set the number of *retries* (`maxretries`) if the transfer is interrupted.
- to set `validity_check=True`, which validates the transfer after the data has been completely transferred. This checks whether the data in the chunks is equal to the input data.
E.g.
```python
from tzis import tzis
outstore=tzis.write_zarr(
    omo.target_fsmap,
    omo.mf_dset,
    omo.varname,
    verbose=True,
    target_mb=0
)
```
%% Cell type:markdown id:9f045270-f61d-450d-8bc5-dd9a725c7dfb tags:
The output `outstore` of `write_zarr` is a new variable for the output **zarr storage**. Packages like `xarray` which use `zarr` can identify and open the *consolidated* dataset from the cloud with that store. The target mapper can be changed afterwards while `outstore` still points to the written `os_name`.
%% Cell type:markdown id:30fda29b-036a-4a8c-bf35-0421b1cad34e tags:
### Overwriting or appending?
`write_zarr()` **appends** data by default if possible. It calls `xarray`'s `to_zarr()` function *for each chunk*. Before a chunk is written, it is checked whether there is already a chunk for exactly the **slice** of the dataset that should be written. If so, the chunk is skipped. Therefore, recalling `write_zarr` only overwrites chunks if they cover a different slice of the source dataset.
In order to skip chunks, you can set `startchunk`. The function will then jump to `startchunk` and start writing from there.
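For example, a sketch for resuming an interrupted transfer; the chunk index is hypothetical and depends on where your previous run stopped:
```python
# resume the write at chunk 100 and allow a few retries per chunk
outstore = tzis.write_zarr(
    omo.target_fsmap,
    omo.mf_dset,
    omo.varname,
    startchunk=100,
    maxretries=3
)
```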
%% Cell type:markdown id:e33d8816-18bc-4cff-86b9-5cfac67de7de tags:
### Writing another variable from the same dataset
1. Define another store by using a different `os_name`:
```python
omo.target_fsmap= swifthandling.get_swift_mapper(
    token["OS_STORAGE_URL"],
    token["OS_AUTH_TOKEN"],
    container_name,
    os_name=new_prefix_for_new_variable
)
```
2. Set another variable name `varname`:
```python
omo.varname=varname
```
3. Write to swift:
```python
tzis.write_zarr(omo.target_fsmap, omo.mf_dset, omo.varname)
```
%% Cell type:markdown id:5aed5c16-bfee-49d9-8693-5d3a38893bee tags:
### Writing another dataset into the same container
You do not have to log in to the same store and the same container a second time. You can reuse the token and the target mapper. Just restart at [opening the source dataset](#source).
%% Cell type:markdown id:cb4d8781-5314-4a55-a301-1300b4a94667 tags:
## Options and configuration for the zarr output
%% Cell type:markdown id:5230a651-4f6d-4c12-a0d1-bb9bb790877d tags:
### Memory and chunk size
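The chunk size is controlled via the `chunkdim` and `target_mb` arguments of `write_zarr` described [above](#write). A minimal sketch, assuming the dataset from the previous sections and a hypothetical target size of 100 MB per chunk:
```python
outstore = tzis.write_zarr(
    omo.target_fsmap,
    omo.mf_dset,
    omo.varname,
    chunkdim="time",   # chunking is only supported along "time" so far
    target_mb=100      # aim for roughly 100 MB per chunk/object (hypothetical value)
)
```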
%% Cell type:markdown id:e494d109-82fa-448b-ac0d-ce4f77565949 tags:
### Compression
[From Zarr docs:](https://zarr.readthedocs.io/en/v2.10.2/tutorial.html#compressors)
> If you don’t specify a compressor, by default Zarr uses the [Blosc](https://github.com/Blosc) compressor. Blosc is generally very fast and can be configured in a variety of ways to improve the compression ratio for different types of data. Blosc is in fact a *“meta-compressor”*, which means that it can use a number of different compression algorithms internally to compress the data. A list of the internal compression libraries available within Blosc can be obtained via:
```python
from numcodecs import blosc
blosc.list_compressors()
['blosclz', 'lz4', 'lz4hc', 'snappy', 'zlib', 'zstd']
```
> The default compressor can be changed by setting the value of the zarr.storage.default_compressor variable, e.g.:
```python
import zarr.storage
from numcodecs import Zstd, Blosc
# switch to using Zstandard
zarr.storage.default_compressor = Zstd(level=1)
```
> A number of different compressors can be used with Zarr. A separate package called [NumCodecs](http://numcodecs.readthedocs.io/) is available which provides a common interface to various compressor libraries including Blosc, Zstandard, LZ4, Zlib, BZ2 and LZMA. Different compressors can be provided via the compressor keyword argument accepted by all array creation functions.
%% Cell type:markdown id:afa8df5d-0101-4ed9-9063-f7ce1ba404c9 tags:
### Attributes
*Attributes* of the dataset are handled via `xarray` in a dictionary attached to the opened dataset. You can **add** or **delete** attributes just like items of a dictionary:
```python
#add an attribute
omo.attrs["new_attribute"]="New value of attribute"
print(omo.attrs["new_attribute"])
#delete the attribute
del omo.attrs["new_attribute"]
```
%% Cell type:markdown id:b4e19440-3fd9-406f-80e2-752070e2e060 tags:
<a class="anchor" id="access"></a>
## Access and use your Zarr dataset
1. You can open the *consolidated zarr datasets* with `xarray` using a URL-prefix-like string constructed as
```python
zarrinput=OS_STORAGE_URL+"/"+os_container+"/"+os_name
xarray.open_zarr(zarrinput, consolidated=True, decode_times=True)
```
This is possible if the container is *public*.
1. If your container is *private*, you have to use a *zarr storage* mapper for which you logged in with credentials first. I.e., you can also do
```python
zarr_dset = xarray.open_zarr(outstore, consolidated=True, decode_times=True)
zarr_dset
```
1. You can download data from the [swiftbrowser](https://swiftbrowser.dkrz.de) manually
%% Cell type:markdown id:c976d55c-502d-47ab-b3ef-67842f6aea11 tags:
### Coordinates
Sometimes, you have to *reset* the coordinates because they get lost in the transfer to zarr:
```python
precoords = set(
["lat_bnds", "lev_bnds", "ap", "b", "ap_bnds", "b_bnds", "lon_bnds"]
)
coords = [x for x in zarr_dset.data_vars.variables if x in precoords]
zarr_dset = zarr_dset.set_coords(coords)
```
%% Cell type:markdown id:91a744a5-eb11-4d21-a2be-9f7ac7284c21 tags:
### Reconvert to NetCDF
The basic reconversion to netCDF can be done with `xarray`:
```python
written.to_netcdf(outputfilename)
```
#### Compression and encoding:
Often, the original netCDF was compressed. You can set different compressions in an **encoding** dictionary. To use `zlib` with compression level 1, you can set:
```python
var_dict = dict(zlib=True, complevel=1)
encoding = {var: var_dict for var in written.data_vars}
```
#### FillValue
`to_netcdf` might write out *FillValue*s for coordinates, which is not CF-compliant. In order to prevent that, set an encoding as follows:
```python
coord_dict = dict(_FillValue=False)
encoding.update({var: coord_dict for var in written.coords})
```
#### Unlimited dimensions
Last but not least, one key element of netCDF is the **unlimited dimension**. You can set it via a *keyword argument* of the `to_netcdf` command. E.g., for rewriting a zarr-CMIP6 dataset into netCDF, consider compression and FillValue in the encoding and run
```python
written.to_netcdf("testcase.nc",
format="NETCDF4_CLASSIC",
unlimited_dims="time",
encoding=encoding)
```
%% Cell type:markdown id:1eff5a21-b2dd-43b6-9e00-88779638b6aa tags:
<a class="anchor" id="swiftstore"></a>
## Swift storage handling with fsspec - `chmod`, `ls`, `rm`, `mv`
The mapper from fsspec comes with a *filesystem* object named `fs` which maps the API calls to the familiar Linux commands so that they become applicable, e.g.:
```python
outstore.fs.ls(outstore.root)
```
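A few more filesystem-style calls are generally available on `fsspec` file systems; whether each of them behaves as expected on the swift implementation is an assumption here:
```python
fs = outstore.fs
fs.ls(outstore.root)        # list objects under the zarr store prefix
fs.exists(outstore.root)    # check whether the prefix exists
fs.du(outstore.root)        # total size of all objects under the prefix
```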
%% Cell type:markdown id:39f029d4-9efd-442d-8262-9ded55cf0a3d tags:
### The Index
`write_zarr` automatically appends an entry to the index file *INDEX.csv* in the parent directory of the store. You should find it via
```python
import os
outstore.fs.ls(os.path.dirname(outstore.root))
```
You can read that directly with
```python
import pandas as pd
index_df=pd.read_csv(os.path.dirname(outstore.root)+"/INDEX.csv")
```
All the *url*s in the *url* column should be openable with xarray, e.g.:
```python
with fopen("zarrsets.txt", "w") as f:
for os_name in all_zarr_datasets:
f.write(OS_STORAGE_URL+"/"+os_container+"/"+os_name)
import xarray as xr
xr.open_zarr(index_df["url"][0], consolidated=True)
```
This enables you to [simply open](#access) the zarr datasets with `xarray` afterwards.
%% Cell type:markdown id:03a2f2e1-a4d7-4016-abe9-224663271a40 tags:
### How to make a container public
- use `swifthandling`:
```python
#with a container and a prefix, you can get the container_name via os
#import os
#container_name=os.path.basename(os.path.dirname(outstore.root))
swifthandling.toggle_public(container_name)
```
This will either make the container of the outstore *public* if it was not, or it will make it *private* by removing all access control lists if it was public. Note that only a container as a whole can be made public or private.
- By hand:
1. Log in at https://swiftbrowser.dkrz.de/login/ .
2. In the line of the target container, click on the arrow on the right side with the red background and click on `share`.
3. Again, click on the arrow on the right side and click on `make public`.
%% Cell type:markdown id:966d03c4-74a0-4f63-87ed-49ba6f4b29ae tags:
### Remove a zarr-`store`, i.e. all objects with the `os_name` prefix
- use `fsspec`:
```python
target_fsmap.fs.rmdir(os_name)
```
- By hand:
1. Log in at https://swiftbrowser.dkrz.de/login/ .
2. Either:
   - On the line of the target container, click on the arrow on the right side and click on `Delete container`, or
   - click on the target container and select the store to be deleted. Click on the arrow on the right side and click on `Delete`.
%% Cell type:code id:56bf4fef-58eb-413f-833a-20594b515fb4 tags:
``` python
```
%% Cell type:code id:2c7aa866-70f2-4fe0-ae02-742671357701 tags:
``` python
```
%% Cell type:markdown id: tags:
# Intake I - find, browse and access `intake-esm` collections
%% Cell type:markdown id: tags:
```{admonition} Overview
:class: dropdown
![Level](https://img.shields.io/badge/Level-Introductory-green.svg)
🎯 **objectives**: Learn how to use `intake` to find, browse and access `intake-esm` ESM-collections
⌛ **time_estimation**: "30min"
☑️ **requirements**: `intake_esm.__version__ >= 2023.4.*`, at least 10GB memory.
© **contributors**: k204210
⚖ **license**:
```
%% Cell type:markdown id: tags:
```{admonition} Agenda
:class: tip
In this part, you learn
1. [Motivation of intake-esm](#motivation)
1. [Features of intake and intake-esm](#features)
1. [Browse through catalogs](#browse)
1. [Data access via intake-esm](#dataaccess)
```
%% Cell type:markdown id: tags:
<a class="anchor" id="motivation"></a>
We follow here the guidance presented by `intake-esm` in its [documentation](https://intake-esm.readthedocs.io/en/latest/user-guide/cmip6-tutorial.html).
## Motivation of intake-esm
> Simulations of the Earth’s climate and weather generate huge amounts of data. These data are often persisted on different storages in a variety of formats (netCDF, zarr, etc...). Finding, investigating, loading these data assets into compute-ready data containers costs time and effort. The data user needs to know what data sets are available, the attributes describing each data set, before loading a specific data set and analyzing it.
> `Intake-esm` addresses these issues by providing necessary functionality for **searching, discovering, data access and data loading**.
%% Cell type:markdown id: tags:
For intake users, many data preparation tasks **are no longer necessary**. They do not need to know:
- 🌍 where data is saved
- 🪧 how data is saved
- 📤 how data should be loaded
but still can search, discover, access and load data of a project.
%% Cell type:markdown id: tags:
<a class="anchor" id="features"></a>
## Features of intake and intake-esm
Intake is a generic **cataloging system** for listing data sources. As a plugin, `intake-esm` is built on top of `intake`, `pandas`, and `xarray` and configures `intake` such that it is able to also **load and process** ESM data.
- display catalogs as clearly structured tables 📄 inside jupyter notebooks for easy investigation
- browse 🔍 through the catalog and select your data without
- being next to the data (e.g. logged in on dkrz's luv)
- knowing the project's data reference syntax i.e. the storage tree hierarchy and path and file name templates
- open climate data in an analysis ready dictionary of `xarray` datasets 🎁
%% Cell type:markdown id: tags:
All required information for searching, accessing and loading the catalog's data is configured within the catalogs:
- 🌍 where data is saved
* users can browse data without knowing the data storage platform including e.g. the root path of the project and the directory syntax
* data of different platforms (cloud or disk) can be combined in one catalog
* on mid term, intake catalogs can be **a single point of access**
- 🪧 how data is saved
* users can work with a *xarray* dataset representation of the data no matter whether it is saved in **grb, netcdf or zarr** format.
* catalogs can contain more information and therefore more search facets than is obvious from the names and paths of the data.
- 📤 how data should be loaded
* users work with an **aggregated** *xarray* dataset representation which merges files/assets to fit the project's data model design.
* with *xarray* and the underlying *dask* library, data which are **larger than the RAM** can be loaded
%% Cell type:markdown id: tags:
In this tutorial, we load a CMIP6 catalog which contains all data from the pool on DKRZ's mistral disk storage.
CMIP6 is the 6th phase of the Coupled Model Intercomparison Project and provides the data basis used in the IPCC AR6.
The CMIP6 catalog contains all data that is published or replicated at the ESGF node at DKRZ.
%% Cell type:markdown id: tags:
<a class="anchor" id="terminology"></a>
## Terminology: **Catalog**, **Catalog file** and **Collection**
We align our wording with `intake`'s [*glossary*](https://intake.readthedocs.io/en/latest/glossary.html) which is still evolving. The names overlap with other definitions, making it difficult to keep track. Here we try to give an overview of the hierarchy of catalog terms:
- a **top level catalog file** 📋 is the **main** catalog of an institution which will be opened first. It contains other project [*catalogs*](#catalog) 📖 📖 📖. Such catalogs can be assigned an [*intake driver*](#intakedriver) which is used to open and load the catalog within the top level catalog file. Technically, a catalog file 📋 <a class="anchor" id="catalogfile"></a>
- is a `.yaml` file
- can be opened with `open_catalog`, e.g.:
```python
intake.open_catalog(["https://dkrz.de/s/intake"])
```
- **intake driver**s, also named **plugin**s, are specified for [*catalogs*](#catalog) because they load specific data sets. There are [many driver](https://intake.readthedocs.io/en/latest/plugin-directory.html) libraries for intake. <a class="anchor" id="intakedriver"></a>
%% Cell type:markdown id: tags:
- a **catalog** 📖 (or collection) is defined by two parts: <a class="anchor" id="catalog"></a>
- a **description** of a group of data sets. It describes how to *load* **assets** of the data set(s) with the specified [driver](#intakedriver). This group forms an entity. E.g., all CMIP6 data sets can be collected in a catalog. <a class="anchor" id="description"></a>
- an **asset** is most often a file. <a class="anchor" id="asset"></a>
- a **collection** of all [assets](#asset) of the data set(s). <a class="anchor" id="collection"></a>
- the collection can be included in the catalog or separately saved in a **data base** 🗂. In the latter case, the catalog references the data base, e.g.:
```json
"catalog_file": "/mnt/lustre02/work/ik1017/Catalogs/dkrz_cmip6_disk.csv.gz"
```
```{note}
The term *collection* is often used synonymically for [catalog](#catalog).
```
%% Cell type:markdown id: tags:
- an *intake-esm* **catalog** 📖 consists of a `.json` file (the **description**) and the underlying data base. The data base is either provided within the `.json` file or as a `.csv.gz` formatted list.
The intake-esm catalog can be opened with intake-esm's function `intake.open_esm_datastore()` where the `.json` part is the argument, e.g:
```python
intake.open_esm_datastore("https://gitlab.dkrz.de/data-infrastructure-services/intake-esm/-/raw/master/esm-collections/cloud-access/dkrz_cmip6_disk.json")
```
%% Cell type:code id: tags:
``` python
#note that intake_esm is imported with `import intake` as a plugin
import intake
#to find out the version of intake-esm, you can do:
import intake_esm
intake_esm.__version__
```
%% Cell type:markdown id: tags:
<a class="anchor" id="browse"></a>
## Open and browse through catalogs
intake (not intake-esm) **opens** catalog files in `yaml` format. These contain information about additional sources: other catalogs/collections which will be loaded with specific *plugins*/*drivers*. The command is `open_catalog`.
<mark> You only need to remember one URL as the *single point of access* for DKRZ's intake catalogs: The DKRZ top level catalog can be accessed via dkrz.de/s/intake . Intake will only follow this *redirect* if a specific parser is activated. This can be done by providing the url in a list.</mark>
%% Cell type:code id: tags:
``` python
#dkrz_catalog=intake.open_catalog(["https://dkrz.de/s/intake"])
#dkrz_catalog=intake.open_catalog(["/pool/data/Catalogs/dkrz_catalog.yaml"])
#
#only for the web page we need to take the original link:
dkrz_catalog=intake.open_catalog(["https://gitlab.dkrz.de/data-infrastructure-services/intake-esm/-/raw/intake2022/esm-collections/cloud-access/dkrz_catalog.yaml"])
```
%% Cell type:code id: tags:
``` python
c6=dkrz_catalog["dkrz_cmip6_disk"]
```
%% Cell type:markdown id: tags:
```{note}
Right now, two versions of the top level catalog file exist: one for accessing the catalog via [cloud](https://gitlab.dkrz.de/data-infrastructure-services/intake-esm/-/raw/master/esm-collections/cloud_access/dkrz_catalog.yaml), one for access via [disk](https://gitlab.dkrz.de/data-infrastructure-services/intake-esm/-/raw/master/esm-collections/disk_access/dkrz_catalog.yaml). They however contain **the same content**.
```
%% Cell type:markdown id: tags:
We can look into the catalog with `print` and `list`
%% Cell type:markdown id: tags:
Over time, many collections have been created. `dkrz_catalog` is a **main** catalog prepared to keep an overview of all other collections. `list` shows all sub **project catalogs** which are available at DKRZ.
%% Cell type:code id: tags:
``` python
list(dkrz_catalog)
```
%% Cell type:markdown id: tags:
All these catalogs are **intake-esm** catalogs. You can find this information via the `_entries` attribute: the line `plugin: ['esm_datastore']` refers to **intake-esm**'s function `open_esm_datastore()`.
%% Cell type:code id: tags:
``` python
print(dkrz_catalog._entries)
```
%% Cell type:markdown id: tags:
The DKRZ ESM-Collections follow a name template:
`dkrz_${project}_${store}[_${auxiliary_catalog}]`
where
- **project** can be one of the *model intercomparison projects*, e.g. `cmip6`, `cmip5`, `cordex`, `era5` or `mpi-ge`.
- **store** is the data store and can be one of:
- `disk`: DKRZ holds a lot of data on a consortial disk space on the file system of the High Performance Computer (HPC) where it is accessible for every HPC user. Working next to the data on the file system will be the fastest way possible.
- `cloud`: A small subset is transferred into DKRZ's cloud in order to test the performance. swift is DKRZ's cloud storage.
- `archive`: A lot of data exists in the tape archive of DKRZ. Before it can be accessed, it has to be retrieved. Therefore, catalogs for `hsm` are limited in functionality but still convenient for data browsing.
- **auxiliary_catalog** can be *grid*
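For example, you can filter the entries of the already opened top level catalog that follow this template for one project:
```python
# entries such as "dkrz_cmip6_disk": project="cmip6", store="disk", no auxiliary catalog
[entry for entry in list(dkrz_catalog) if entry.startswith("dkrz_cmip6")]
```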
%% Cell type:markdown id: tags:
**Why that convention?**:
- **dkrz**: Assume you work with international collections. Then it may become important to know where the data comes from, e.g. if only paths on a local file system are given as the locations of the data.
- **project**: Each project's data standard differs from the others, so that different catalog attributes are required to identify a single asset in a project data base.
- **store**: Intake-esm cannot load data from all stores. Before data from the archive can be accessed, it has to be retrieved. Therefore, the opening function does not work for a catalog merged over all stores.
%% Cell type:markdown id: tags:
**Best practice for naming catalogs**:
- Use small letters for all values
- Do **NOT** use `_` as a separator in values
- Do not repeat values of other attributes ("dkrz_dkrz-dyamond")
%% Cell type:markdown id: tags:
We could directly start to work with **two intake catalogs** at the same time.
Let's have a look into a master catalog of [Pangeo](https://pangeo.io/):
%% Cell type:code id: tags:
``` python
pangeo=intake.open_catalog("https://raw.githubusercontent.com/pangeo-data/pangeo-datastore/master/intake-catalogs/master.yaml")
```
%% Cell type:code id: tags:
``` python
pangeo
```
%% Cell type:code id: tags:
``` python
list(pangeo)
```
%% Cell type:markdown id: tags:
While DKRZ's master catalog has one sublevel, Pangeo's is a nested one. We can access another `yaml` catalog which is also a **parent** catalog by simply:
%% Cell type:code id: tags:
``` python
pangeo.climate
```
%% Cell type:markdown id: tags:
Pangeo's ESM collections are one level deeper in the catalog tree:
%% Cell type:code id: tags:
``` python
list(pangeo.climate)
```
%% Cell type:markdown id: tags:
### The `intake-esm` catalogs
We now look into a catalog which is opened by the plugin `intake-esm`.
> An ESM (Earth System Model) collection file is a `JSON` file that conforms to the ESM Collection Specification. When provided a link/path to an esm collection file, intake-esm establishes a link to a database (`CSV` file) that contains data asset locations and associated metadata (i.e., which experiment, model, etc. they come from).
Since the data base of the CMIP6 ESM Collection is about 100MB in compressed format, it takes up to a minute to load the catalog.
%% Cell type:markdown id: tags:
```{note}
The project catalogs contain only valid and current project data. They are constantly updated.
If your work is based on a catalog and a subset of the data from it, be sure to save that subset so you can later compare your database to the most current catalog.
```
%% Cell type:code id: tags:
``` python
esm_col=dkrz_catalog.dkrz_cmip6_disk
print(esm_col)
```
%% Cell type:markdown id: tags:
`intake-esm` gives us an overview over the content of the ESM collection. The ESM collection is a data base described by specific attributes which are technically columns. Each project data standard is the basis for the columns and used to parse information given by the path and file names.
The pure display of `esm_col` shows us the number of unique values in each column. Since each `uri` refers to one file, we can conclude that the DKRZ-CMIP6 ESM Collection contains about **6.1 million files** in 2022.
%% Cell type:markdown id: tags:
The data base is loaded into an underlying `panda`s dataframe which we can access with `esm_col.df`. `esm_col.df.head()` displays the first rows of the table:
%% Cell type:code id: tags:
``` python
esm_col.df.head()
```
%% Cell type:markdown id: tags:
We can find out details about `esm_col` with the object's attributes. `dict(esm_col.esmcat)` contains all information given in the `JSON` file. We can also focus on some specific attributes.
%% Cell type:code id: tags:
``` python
dict(esm_col.esmcat)["description"]
```
%% Cell type:code id: tags:
``` python
print("What is this catalog about? \n" + esm_col.esmcol_data["description"])
print("What is this catalog about? \n" + dict(esm_col.esmcat)["description"])
#
print("The link to the data base: "+ esm_col.esmcol_data["catalog_file"])
print("The link to the data base: "+ dict(esm_col.esmcat)["catalog_file"])
```
%% Cell type:markdown id: tags:
Advanced: To find out how many datasets are available, we can use pandas functions (drop columns that are irrelevant for a dataset, drop the duplicates, keep one):
%% Cell type:code id: tags:
``` python
cat = esm_col.df.drop(['uri','time_range'],axis=1).drop_duplicates(keep="first")
print(len(cat))
```
%% Cell type:code id: tags:
``` python
def pieplot(gbyelem) :
#groupby, sort and select the ten largest
global c6
size = c6.df.groupby([gbyelem]).size().sort_values(ascending=False)
size10 = size.nlargest(10)
#Sum all others as 10th entry
size10[9] = sum(size[9:])
size10.rename(index={size10.index.values[9]:'all other'},inplace=True)
#return a pie plot
return size10.plot.pie(figsize=(18,8),ylabel='',autopct='%.2f', fontsize=16)
```
%% Cell type:code id: tags:
``` python
pieplot("source_id")
```
%% Cell type:markdown id: tags:
### Browse through the data of the ESM collection
%% Cell type:markdown id: tags:
You will browse the collection technically by setting values for the **column names** of the underlying table. By default, the catalog was loaded with all cmip6 attributes/columns that define the CMIP6 data standard:
%% Cell type:code id: tags:
``` python
esm_col.df.columns
```
%% Cell type:markdown id: tags:
These are configured in the top level catalog so you <mark> do not need to open the catalog to see the columns </mark>
%% Cell type:code id: tags:
``` python
dkrz_catalog._entries["dkrz_cmip6_disk"]._open_args
```
%% Cell type:markdown id: tags:
Most of the time, we want to set more than one attribute for a search. Therefore, we define a query `dict`ionary and use the `search` function of the `esm_col` object. In the following case, we look for temperature at surface in monthly resolution for 3 different experiments:
%% Cell type:code id: tags:
``` python
query = dict(
variable_id="tas",
table_id="Amon",
experiment_id=["piControl", "historical", "ssp370"])
# piControl = pre-industrial control, simulation to represent a stable climate from 1850 for >100 years.
# historical = historical Simulation, 1850-2014
# ssp370 = Shared Socioeconomic Pathways (SSPs) are scenarios of projected socioeconomic global changes. Simulation covers 2015-2100
cat = esm_col.search(**query)
```
%% Cell type:code id: tags:
``` python
cat
```
%% Cell type:markdown id: tags:
We could also use *Wildcards*. For example, in order to find out which ESMs of the institution *MPI-M* have produced data for our subset:
%% Cell type:code id: tags:
``` python
cat.search(source_id="MPI-ES*")
```
%% Cell type:markdown id: tags:
We can find out which models have submitted data for at least one of them by:
%% Cell type:code id: tags:
``` python
cat.unique(["source_id"])
cat.unique()["source_id"]
```
%% Cell type:markdown id: tags:
If we instead look for the models that have submitted data for ALL experiments, we use the `require_all_on` keyword argument:
%% Cell type:code id: tags:
``` python
cat = esm_col.search(require_all_on=["source_id"], **query)
cat.unique(["source_id"])
cat.unique()["source_id"]
```
%% Cell type:markdown id: tags:
Note that only the combination of a `variable_id` and a `table_id` is unique in CMIP6. If you search for `tas` in all tables, you will find many entries more:
%% Cell type:code id: tags:
``` python
query = dict(
variable_id="tas",
# table_id="Amon",
experiment_id=["piControl", "historical", "ssp370"])
cat = esm_col.search(**query)
cat.unique(["table_id"])
cat.unique()["table_id"]
```
%% Cell type:markdown id: tags:
Be careful when you search for specific time slices. Each frequency is connected with an individual name template for the filename. If the data is yearly, you have YYYY-YYYY, whereas you have YYYYMM-YYYYMM for monthly data.
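For example, a sketch of a search on the `time_range` column for a monthly file name pattern; the value used here is hypothetical and only returns results if it actually occurs in the catalog:
```python
# monthly data uses YYYYMM-YYYYMM; yearly data would look like "1850-1869"
cat.search(time_range="185001-186912")
```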
%% Cell type:markdown id: tags:
### How to load more columns
Intake allows loading only a subset of the columns that are inside the **intake-esm** catalog. Since the memory usage of **intake-esm** is high, the default columns are only a subset of all possible columns. Sometimes, other columns are of interest:
If you work remotely away from the data, you can use the **opendap_url**'s to access the subset of interest for all files published at DKRZ. The *opendap_url* is an *additional* column that can also be loaded.
We can define 3 different column name types for the usage of intake catalogs:
1. **Default** attributes which are loaded from the main catalog and which can be seen via `_entries[CATNAME]._open_args`.
2. **Overall** attributes or **template** attributes which should be defined for **ALL** catalogs at DKRZ (exceptions excluded). At DKRZ, we use the newly defined **Cataloonie** scheme template which can be found via `dkrz_catalog.metadata["parameters"]["cataloonie_columns"]`. With these template attributes, there may be redundancy in the columns. They exist to simplify merging catalogs across projects.
3. **Additional** attributes which are not necessary to identify a single asset but helpful for users. You can find these via
`dkrz_catalog.metadata["parameters"]["additional_PROJECT_columns"]`
So, for CMIP6 there are:
%% Cell type:code id: tags:
``` python
dkrz_catalog.metadata["parameters"]["additional_cmip6_columns"]
dkrz_catalog.metadata["parameters"]["additional_cmip6_disk_columns"]
```
%% Cell type:markdown id: tags:
```{tip}
You may find *variable_id*s in the catalog which are not obvious or are abbreviations of a clear variable name. In such cases you would need additional information like a *long_name* of the variable. For CMIP6, we provided the catalog with this `long_name` so you can add it as a column.
```
%% Cell type:markdown id: tags:
So, these are the instructions for opening the catalog with additional columns:
1. create a combination of all your required columns:
%% Cell type:code id: tags:
``` python
cols=dkrz_catalog._entries["dkrz_cmip6_disk"]._open_args["read_csv_kwargs"]["usecols"]+["opendap_url"]
```
%% Cell type:markdown id: tags:
2. open the **dkrz_cmip6_disk** catalog with the `read_csv_kwargs` keyword argument in this way:
%% Cell type:code id: tags:
``` python
esm_col=dkrz_catalog.dkrz_cmip6_disk(read_csv_kwargs=dict(usecols=cols))
```
%% Cell type:markdown id: tags:
- ⭐ The customization of catalog columns allows the highest flexibility for intake users.
- ⭐ In theory, we could add many more columns with additional information because not all of them have to be loaded from the data base.
%% Cell type:markdown id: tags:
```{warning}
The number of columns determines the required memory.
```
%% Cell type:markdown id: tags:
```{tip}
If you work from remote and also want to access the data remotely, load the *opendap_url* column.
```
%% Cell type:markdown id: tags:
<a class="anchor" id="dataaccess"></a>
## Access and load data of the ESM collection
With the power of `xarray`, `intake` can load your subset into a `dict`ionary of datasets. We therefore focus on the data of `MPI-ESM1-2-LR`:
%% Cell type:code id: tags:
``` python
#case insensitive?
query = dict(
variable_id="tas",
table_id="Amon",
source_id="MPI-ESM1-2-LR",
experiment_id="historical")
cat = esm_col.search(**query)
cat
```
%% Cell type:markdown id: tags:
You can find out which column intake uses to access the data via the following keyword:
%% Cell type:code id: tags:
``` python
cat.esmcat.assets.column_name
```
%% Cell type:markdown id: tags:
As we are working with the *_disk* catalog, **uri** contains *paths* to the files on the file system. If you are working from remote, you would have
- to change the catalog's asset column (`cat.esmcat.assets.column_name`) to *opendap_url*,
- to reassign the `format` column from *netcdf* to *opendap*,
as follows:
%% Cell type:code id: tags:
``` python
#cat.esmcat.assets.column_name="opendap_url"
#newdf=cat.df.copy()
#newdf.loc[:,"format"]="opendap"
#cat.df=newdf
```
%% Cell type:markdown id: tags:
**Intake-ESM** natively supports the following data formats or access formats (since opendap is not really a file format):
- netcdf
- opendap
- zarr
You can also open **grb** data, but right now only by specifying xarray's *engine* attribute in the *open* function defined in the following. I.e., it does not make a difference if you specify **grb** as the format.
You can find an example in the *era5* notebook.
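A sketch of how the engine could be passed through, assuming `xarray_open_kwargs` is forwarded to `xarray` as described below and that the catalog contains GRIB assets:
```python
# hypothetical: forward the cfgrib engine to xarray when opening GRIB assets
xr_dict = cat.to_dataset_dict(
    xarray_open_kwargs=dict(engine="cfgrib")
)
```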
%% Cell type:markdown id: tags:
The function to open data is `to_dataset_dict`.
We recommend setting the keyword argument `xarray_open_kwargs` for the chunk size of the variable's data array. Otherwise, `xarray` may choose chunks that are too large. Most often, your data contains a time dimension, so you could set `xarray_open_kwargs={"chunks":{"time":1}}`.
If your collection contains **zarr** formatted data, you need to add another keyword argument `zarr_kwargs`.
**Unfortunately, this has changed in new versions**:
- DEPRECATED: The trick was that you could just specify both; Intake knew from the `format` column which *kwargs* should be taken.
%% Cell type:code id: tags:
``` python
help(cat.to_dataset_dict)
```
%% Cell type:code id: tags:
``` python
xr_dict = cat.to_dataset_dict(
    #xarray_open_kwargs=
    #dict(
    #    chunks=dict(
    #        time=60
    #    ),
    #    consolidated=True,
    #    decode_times=True,
    #    use_cftime=True
    #)
)
xr_dict
```
%% Cell type:markdown id: tags:
`Intake` was able to aggregate many files into only one dataset:
- The `time_range` column was used to **concat** data along the `time` dimension
- The `member_id` column was used to generate a new dimension
The underlying `dask` package will only load the data into memory if needed. Note that attributes which disagree from file to file, e.g. *tracking_id*, are excluded from the dataset.
%% Cell type:markdown id: tags:
How **intake-esm** should open and aggregate the assets is configured in the *aggregation_control* part of the description:
%% Cell type:code id: tags:
``` python
cat.esmcat.aggregation_control.aggregations
```
%% Cell type:markdown id: tags:
Columns can be defined for appending or creating new dimensions. The *options* are keyword arguments for xarray.
The **keys** of the dictionary are built from column values defined in the *aggregation_control* of the **intake-esm** catalog. These determine the **key_template**. The corresponding commands are:
%% Cell type:code id: tags:
``` python
print(cat.esmcat.aggregation_control.groupby_attrs)
#
print(cat.key_template)
```
%% Cell type:markdown id: tags:
You can work with these keys **directly** on the **intake-esm** catalog which will give you an overview over all columns (too long for the web page):
%% Cell type:code id: tags:
``` python
#cat["CMIP.MPI-ESM1-2-HR.historical.Amon.gn"]
```
%% Cell type:markdown id: tags:
If we are only interested in the **first** dataset of the dictionary, we can *pop it out*:
%% Cell type:code id: tags:
``` python
xr_dset = xr_dict.popitem()[1]
xr_dset
```
%% Cell type:markdown id: tags:
### Troubleshooting
The variables are collected in **one** dataset. This requires that **the dimensions and coordinates must be the same over all files**. Otherwise, xarray cannot merge these together.
For CMIP6, most of the variables collected in one **table_id** should be on the same dimensions and coordinates. Unfortunately, there are exceptions:
- a few variables are requested for *time slices* only.
- sometimes models use different dimension names from file to file
Using the [preprocessing](https://tutorials.dkrz.de/tutorial_intake-4-preprocessing-derived-vars.html#use-preprocessing-when-opening-assets-and-creating-datasets) keyword argument can help to rename dimensions before merging.
For Intake providers: the more information on the dimensions and coordinates provided already in the catalog, the better the aggregation control.
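A minimal sketch of such a preprocessing step, assuming a `preprocess` function can be passed to `to_dataset_dict` and using hypothetical dimension names:
```python
def rename_dims(ds):
    # harmonize hypothetical alternative latitude/longitude names before merging
    rename = {"latitude": "lat", "longitude": "lon"}
    return ds.rename({k: v for k, v in rename.items() if k in ds.dims})

xr_dict = cat.to_dataset_dict(preprocess=rename_dims)
```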
%% Cell type:markdown id: tags:
### Pangeo's data store
Let's have a look into Pangeo's ESM Collection as well. This is accessible via cloud from everywhere - you only need internet to load data. We use the same `query` as in the example before.
%% Cell type:code id: tags:
``` python
pangeo.climate.discover()
```
%% Cell type:code id: tags:
``` python
import intake
pangeo_cmip6=intake.open_esm_datastore("https://storage.googleapis.com/cmip6/pangeo-cmip6.json")
cat = pangeo_cmip6.search(**query)
cat
```
%% Cell type:markdown id: tags:
There are differences between the collections because
- Pangeo provides files in *consolidated*, `zarr` formatted datasets which correspond to `zstore` entries in the catalog instead of `path`s or `opendap_url`s.
- The `zarr` datasets are already aggregated over time so there is no need for a `time_range` column
If we now open the data with `intake`, we have to specify keyword arguments as follows:
%% Cell type:code id: tags:
``` python
dset_dict = cat.to_dataset_dict(
zarr_kwargs={"consolidated": True}#, "decode_times": True, "use_cftime": True}
)
```
%% Cell type:code id: tags:
``` python
dset_dict
```
%% Cell type:markdown id: tags:
`dset_dict` and `xr_dict` are the same. You successfully finished the intake tutorial!
%% Cell type:markdown id: tags:
### Making a quick plot
The following line exemplifies the ease of intake's data processing library chain. On the web page, the interactivity will not work because all plots would have to be loaded, which is not feasible.
For more examples, check out the **use cases** on that web page.
%% Cell type:code id: tags:
``` python
import hvplot.xarray
xr_dset["tas"].hvplot.quadmesh(width=600)
xr_dset["tas"].squeeze().hvplot.quadmesh(width=600)
```
%% Cell type:markdown id: tags:
```{seealso}
This tutorial is part of a series on `intake`:
* [Part 1: Introduction](https://data-infrastructure-services.gitlab-pages.dkrz.de/tutorials-and-use-cases/tutorial_intake-1-introduction.html)
* [Part 2: Modifying and subsetting catalogs](https://data-infrastructure-services.gitlab-pages.dkrz.de/tutorials-and-use-cases/tutorial_intake-2-subset-catalogs.html)
* [Part 3: Merging catalogs](https://data-infrastructure-services.gitlab-pages.dkrz.de/tutorials-and-use-cases/tutorial_intake-3-merge-catalogs.html)
* [Part 4: Use preprocessing and create derived variables](https://data-infrastructure-services.gitlab-pages.dkrz.de/tutorials-and-use-cases/tutorial_intake-4-preprocessing-derived-variables.html)
* [Part 5: How to create an intake catalog](https://data-infrastructure-services.gitlab-pages.dkrz.de/tutorials-and-use-cases/tutorial_intake-5-create-esm-collection.html)
- You can also do another [CMIP6 tutorial](https://intake-esm.readthedocs.io/en/latest/user-guide/cmip6-tutorial.html) from the official intake page.
```
%% Cell type:markdown id: tags:
%% Cell type:code id: tags:
``` python
```
%% Cell type:markdown id: tags:
# Intake III - work with two catalogs and merge them
%% Cell type:markdown id: tags:
```{admonition} Overview
:class: dropdown
![Level](https://img.shields.io/badge/Level-Intermediate-orange.svg)
🎯 **objectives**: Learn how to integrate `intake-esm` in your workflow
⌛ **time_estimation**: "60min"
☑️ **requirements**:
- 20GB memory
- `intake_esm.__version__ == 2023.4.*`, at least 10GB memory.
- [pandas](https://pandas.pydata.org/)
- [xarray](http://xarray.pydata.org/en/stable/)
© **contributors**: k204210
⚖ **license**:
```
%% Cell type:markdown id: tags:
```{admonition} Agenda
:class: tip
Based on DKRZ's CMIP6 and CORDEX catalogs and Pangeo's CMIP6 catalog, you will learn in this part,
> how to **merge catalogs** by combining their data bases with *differently formatted assets*.
This includes
1. [Loading catalogs with a user-defined set of attributes](#load)
1. [Comparing meta data for checking for compatibility](#compare)
1. Merging the data bases via [merge](#merge) or [concat](#concat)
1. [Configure a catalog description for accessing datasets across projects](#across)
1. [Make ALL data accessible and consolidate aggregation](#access)
1. [Save the new catalog](#save)
```
%% Cell type:markdown id: tags:
```{admonition} Questions
:class: questions
- how can you find out how **compatible** the catalogs are? Would you have to sanitize the column names?
- what is the overlap? Which of the 1000 datasets of Pangeo are included in DKRZ's?
```
%% Cell type:markdown id: tags:
This tutorial highlights two use cases:
1. Merging two projects (CMIP6 and CORDEX, both from DKRZ)
1. Merging two data bases for the same project (CMIP6 DKRZ and CMIP6 Pangeo)
For each, the ultimate **Goal** of this tutorial is to create a merged catalog which also enables data access to data sets of both catalogs.
%% Cell type:markdown id: tags:
## Case 1: Merge two projects CMIP6 and CORDEX in one catalog
%% Cell type:code id: tags:
``` python
import intake
import pandas as pd
#dkrz_catalog=intake.open_catalog(["https://dkrz.de/s/intake"])
#only for generating the web page we need to take the original link:
dkrz_catalog=intake.open_catalog(["https://gitlab.dkrz.de/data-infrastructure-services/intake-esm/-/raw/master/esm-collections/cloud-access/dkrz_catalog.yaml"])
```
%% Cell type:code id: tags:
``` python
print([entry for entry in list(dkrz_catalog) if "disk" in entry and ("cordex" in entry or "cmip6" in entry)])
```
%% Cell type:markdown id: tags:
<a class="anchor" id="load"></a>
### Load catalogs with default + common columns
Most of the DKRZ catalogs include [cataloonies](https://tutorials.dkrz.de/tutorial_intake-1-2-dkrz-catalogs.html) attributes. This simplifies the merging as you can already merge the catalogs over these columns. The usable columns of the catalogs are stored in the main catalog's metadata and can be displayed and retrieved:
%% Cell type:code id: tags:
``` python
dkrz_catalog.metadata
```
%% Cell type:code id: tags:
``` python
overall_columns=dkrz_catalog.metadata["parameters"]["cataloonie_columns"]["default"]
print(overall_columns)
```
%% Cell type:markdown id: tags:
However, these attributes are not sufficient for finding an individual asset in CORDEX and CMIP6. We need *additional* columns:
%% Cell type:code id: tags:
``` python
cordex_columns=dkrz_catalog._entries["dkrz_cordex_disk"]._open_args["read_csv_kwargs"]["usecols"]
print(cordex_columns)
cmip6_columns=dkrz_catalog._entries["dkrz_cmip6_disk"]._open_args["read_csv_kwargs"]["usecols"]
print(cmip6_columns)
```
%% Cell type:markdown id: tags:
We open both catalogs with the columns that we have found:
%% Cell type:code id: tags:
``` python
cmip6_cat=dkrz_catalog.dkrz_cmip6_disk(read_csv_kwargs=dict(usecols=cmip6_columns+overall_columns))
```
%% Cell type:code id: tags:
``` python
cordex_cat=dkrz_catalog.dkrz_cordex_disk(read_csv_kwargs=dict(usecols=cordex_columns+overall_columns))
```
%% Cell type:markdown id: tags:
We assume that we are interested in the variable **tas** which is the *Near-Surface Temperature*:
%% Cell type:code id: tags:
``` python
cmip6_cat=cmip6_cat.search(variable_id="tas")
cordex_cat=cordex_cat.search(variable_id="tas")
```
%% Cell type:markdown id: tags:
<a class="anchor" id="merge"></a>
### Merge both catalogs
The underlying *DataFrames* have different columns. We add CMIP6 columns to the CORDEX catalog and vice versa so that we can merge:
%% Cell type:code id: tags:
``` python
for cordex_col in list(set(cordex_columns)-set(overall_columns)):
cmip6_cat.df.loc[:,cordex_col]="None"
for cmip6_col in list(set(cmip6_columns)-set(overall_columns)):
cordex_cat.df.loc[:,cmip6_col]="None"
```
%% Cell type:code id: tags:
``` python
for column in overall_columns+cmip6_columns+cordex_columns :
cmip6_cat.df[column]=cmip6_cat.df[column].astype(str)
cordex_cat.df[column]=cordex_cat.df[column].astype(str)
```
%% Cell type:code id: tags:
``` python
overall_df=pd.merge(cmip6_cat.df, cordex_cat.df, on=overall_columns+cmip6_columns+cordex_columns, how="outer")
```
%% Cell type:code id: tags:
``` python
overall_df
```
%% Cell type:markdown id: tags:
We use the cmip6 catalog as the merged catalog and reset the dataframe underneath via:
%% Cell type:code id: tags:
``` python
cmip6_cat._df=overall_df
```
%% Cell type:markdown id: tags:
<a class="anchor" id="across"></a>
### Redefine catalog description
We copy the entire `.json` description file so that we can edit it.
%% Cell type:code id: tags:
``` python
mixed_esmcol_data=dict(cmip6_cat.esmcat).copy()
mixed_esmcol_data["aggregation_control"]=dict(mixed_esmcol_data["aggregation_control"]).copy()
```
%% Cell type:code id: tags:
``` python
mixed_esmcol_data
```
%% Cell type:markdown id: tags:
Let's have a look at these entries. We can subdivide these into two groups:
1. Required to be manually changed:
- **groupby_attrs**: We will change it such that both CMIP6 and CORDEX datasets can be created.
- **attributes** and **default_columns**: The CORDEX ones need to be added
- **aggregation_control**: Must be revised but we can do that afterwards. For now, we will just delete all entries but the one for `variable_id`
2. Other attributes
- **assets**: Will stay the same as there is no difference between the original catalogs
- **catalog_file**: Will be automatically overwritten by Intake when the final catalog is written.
- **Description**, **esmcat_version**, **id**: Is arbitrary
We will start with adding missing **attributes**:
%% Cell type:code id: tags:
``` python
columns_already=[dict(k)["column_name"] for k in mixed_esmcol_data["attributes"]]
```
%% Cell type:code id: tags:
``` python
columns_already
```
%% Cell type:code id: tags:
``` python
for k in dict(cordex_cat.esmcat)["attributes"] :
if dict(k)["column_name"] not in columns_already:
mixed_esmcol_data["attributes"].append(k)
```
%% Cell type:markdown id: tags:
**groupby_attrs**:
The attributes used to build an index for a dataset are defined by the order of attributes in the list **groupby_attrs**. The aggregation methods for CMIP6 datasets and CORDEX datasets *differ*.
We have to redefine this list. Think about the perfect order and arrangement of attributes.
%% Cell type:code id: tags:
``` python
mixed_esmcol_data["aggregation_control"]["groupby_attrs"]=[
"CORDEX_domain",
"driving_model_id",
"activity_id",
"institute_id",
"model_id",
"experiment_id",
"frequency",
"table_id",
"grid_label"
]
```
%% Cell type:markdown id: tags:
**aggregation_control**
For now, we drop all aggregation attributes besides `variable_id` to enable a quick save of the catalog. Note that the grouping only works if there is at least one entry in the `mixed_esmcol_data["aggregation_control"]["aggregations"]` list.
%% Cell type:code id: tags:
``` python
for entry in mixed_esmcol_data["aggregation_control"]["aggregations"]:
if dict(entry)["attribute_name"] != "variable_id" :
mixed_esmcol_data["aggregation_control"]["aggregations"].remove(entry)
```
%% Cell type:code id: tags:
``` python
mixed_esmcol_data
```
%% Cell type:markdown id: tags:
`NaN`s cause trouble in the DataFrame, so we set them to *notset*:
%% Cell type:code id: tags:
``` python
for k in mixed_esmcol_data["aggregation_control"]["groupby_attrs"]:
overall_df[overall_df[k].isna()]="notset"
overall_df[k]=overall_df[k].str.replace("None","notset")
overall_df[k]=overall_df[k].str.replace("nan","notset")
```
%% Cell type:markdown id: tags:
Now, let us open the combined intake catalog:
%% Cell type:code id: tags:
``` python
cmip6andcordex=intake.open_esm_datastore(
obj=dict(
esmcat=mixed_esmcol_data,
df=overall_df
)
)
```
%% Cell type:markdown id: tags:
We write the new catalog to disk via:
%% Cell type:code id: tags:
``` python
cmip6andcordex.serialize("test", catalog_type="file")
```
%% Cell type:markdown id: tags:
We can test if our configuration works by directly opening it:
%% Cell type:code id: tags:
``` python
intake.open_esm_datastore("test.json").search(experiment_id="historical",
source_id="MPI*",
simulation_id="r1i1*")#.to_dataset_dict(cdf_kwargs=dict(chunks=dict(time=1)))
```
%% Cell type:markdown id: tags:
## Case 2: Merge two data bases for CMIP6
Assume you are interested in variable `tas` from table `Amon` from both catalogs.
You would start like this:
%% Cell type:code id: tags:
``` python
pangeo=intake.open_catalog("https://raw.githubusercontent.com/pangeo-data/pangeo-datastore/master/intake-catalogs/master.yaml")
#
print(list(pangeo.climate))
cmip6_pangeo=intake.open_esm_datastore("https://storage.googleapis.com/cmip6/pangeo-cmip6.json")
cmip6_cat=dkrz_catalog.dkrz_cmip6_disk
#
esm_dkrz_tas=cmip6_cat.search(
variable_id="tas",
table_id="Amon"
)
esm_pangeo_tas=cmip6_pangeo.search(
variable_id="tas",
table_id="Amon"
)
```
%% Cell type:code id: tags:
``` python
print(esm_dkrz_tas)
```
%% Cell type:code id: tags:
``` python
print(esm_pangeo_tas)
```
%% Cell type:markdown id: tags:
<a class="anchor" id="compare"></a>
### Compare the Metadata
Let's compare the metadata of both catalogs.
1. Both catalogs follow the [esmcat-specs](https://github.com/NCAR/esm-collection-spec/blob/master/collection-spec/collection-spec.md) which can be seen from the following entry:
%% Cell type:code id: tags:
``` python
print(dict(esm_dkrz_tas.esmcat)["esmcat_version"])
print(dict(esm_pangeo_tas.esmcat)["esmcat_version"])
```
%% Cell type:markdown id: tags:
2. As both catalogs follow the esmcat standard, they have a list of `attributes` which we can compare: Indeed, they have exactly the same attributes/columns. In the following, we use pandas DataFrames for better displays:
%% Cell type:code id: tags:
``` python
import pandas as pd
esm_dkrz_atts_df=pd.DataFrame([dict(k) for k in dict(esm_dkrz_tas.esmcat)["attributes"]])
esm_pangeo_atts_df=pd.DataFrame([dict(k) for k in dict(esm_pangeo_tas.esmcat)["attributes"]])
esm_dkrz_atts_df
```
%% Cell type:code id: tags:
``` python
esm_pangeo_atts_df
```
%% Cell type:code id: tags:
``` python
esm_pangeo_atts_df.equals(esm_dkrz_atts_df)
```
%% Cell type:markdown id: tags:
When working with both catalogs, you would notice that Pangeo's catalog does not use the prefix character 'v' for the values of *version* whereas DKRZ's does. We fix that with:
%% Cell type:code id: tags:
``` python
esm_pangeo_tas.df["version"]= "v" + esm_pangeo_tas.df["version"].astype(str)
```
%% Cell type:markdown id: tags:
3. The data format: The Pangeo catalog contains zarr datasets stored in Google Cloud Storage while DKRZ's catalog allows different formats by providing a column named *format*. When we combine these catalogs, we have to consider the different formats.
%% Cell type:code id: tags:
``` python
esm_pangeo_tas.esmcat.assets.format
```
%% Cell type:code id: tags:
``` python
print(
esm_dkrz_tas.df["format"].unique(),
esm_dkrz_tas.esmcat.assets.format_column_name
)
```
%% Cell type:markdown id: tags:
<a class="anchor" id="concat"></a>
### Combine the databases with the underlying DataFrames
This is a workflow for creating a merged data base:
1. Find all common column names/keys that are in both data bases.
1. Create a filtered catalog by
    - setting the common columns as index in both catalogs and
    - throwing out the indices of one catalog that are already in the other.
1. Concat the filtered catalog with the reference catalog.
Let us start with 1.:
%% Cell type:code id: tags:
``` python
keys = [key
for key in esm_dkrz_tas.df.columns.values
if key in esm_pangeo_tas.df.columns.values
]
keys
```
%% Cell type:markdown id: tags:
We continue with 2.:
> 2. Create a filtered Catalog
We create a multi-index with all common keys with `set_index` and save these in new variables `i1` and `i2`. These can be used as a filter. The `~` sign reverses the condition in the filter:
%% Cell type:code id: tags:
``` python
i1 = esm_pangeo_tas.df.set_index(keys).index
i2 = esm_dkrz_tas.df.set_index(keys).index
esm_pangeo_tas_filtered=esm_pangeo_tas.df[~i1.isin(i2)]
```
%% Cell type:markdown id: tags:
And finally, 3.
> 3. Concat the filtered catalog with the reference catalog.
We use pandas `concat` function and *ignore the indices* of both catalogs.
%% Cell type:code id: tags:
``` python
esm_merged=pd.concat([esm_dkrz_tas.df, esm_pangeo_tas_filtered],
ignore_index=True)
esm_merged
```
%% Cell type:markdown id: tags:
<a class="anchor" id="access"></a>
### Make ALL data accessible and consolidate aggregation
Intake enables loading assets of different formats. For that,
1. the data base must have a column which describes the **format** of the asset.
1. only one column may contain the information on how to **access** the asset. In our example, the `zstore` column and the `uri` column need to be merged into one common column, which we name `uri`.
1. the **assets** entry in the catalog description needs to be adapted to the new configuration.
We start with
1. creating a 'format' column which is *zarr* if there is no entry in *uri* and *netcdf* in all other cases.
%% Cell type:code id: tags:
``` python
esm_merged["format"]="netcdf"
esm_merged.loc[pd.isna(esm_merged["uri"]),"format"]="zarr"
esm_merged.loc[pd.isna(esm_merged["time_range"]),"time_range"]="*"
```
%% Cell type:markdown id: tags:
2. We merge the *zstore* and *uri* columns into one common column *uri*.
```{note}
Whenever you can, avoid looping over the rows with `iterrows` because it is rather slow; a vectorized assignment like the one below is much faster.
```
%% Cell type:code id: tags:
``` python
esm_merged.loc[pd.isna(esm_merged["uri"]),"uri"]=esm_merged["zstore"]
```
%% Cell type:code id: tags:
``` python
del esm_merged["zstore"]
```
%% Cell type:markdown id: tags:
3. We now create a new description.
This will be based on the dkrz catalog. We use that because it has the aggregation over time which we want to maintain.
The *assets* entry now does not have a direct description of the **format** but instead a specification of a **format_column_name**. Also, the *column_name* is *uri* instead of *path*:
%% Cell type:code id: tags:
``` python
new_cat_json=dict(esm_dkrz_tas.esmcat).copy()
```
%% Cell type:code id: tags:
``` python
new_cat_json["assets"]={
"column_name":"uri",
"format_column_name":"format"
}
```
%% Cell type:code id: tags:
``` python
new_cat_json["id"]="Merged dkrz-pangeo cmip6 subset catalog"
```
%% Cell type:markdown id: tags:
In order to make *zarr stores* compatible with the aggregation over time, we have to fill in a dummy value in *time_range*:
%% Cell type:code id: tags:
``` python
esm_merged.loc[pd.isna(esm_merged["time_range"]),"time_range"]="*"
```
%% Cell type:markdown id: tags:
<a class="anchor" id="save"></a>
### Save the new catalog
Let us test the new catalog first. We can open the new catalog by passing a dictionary with both parts to `open_esm_datastore`:
- the catalog description **new_cat_json** (as `esmcat`)
- the data base **esm_merged** (as `df`)
Afterwards, we search for a subset which is contained in both source catalogs:
%% Cell type:code id: tags:
``` python
esm_merged_cat=intake.open_esm_datastore(
dict(
esmcat=new_cat_json,
df=esm_merged
)
)
```
%% Cell type:code id: tags:
``` python
esm_merged_cat_test=esm_merged_cat.search(activity_id="ScenarioMIP",
member_id="r1i1p1f1",
grid_label="gn",
source_id=["MPI-ESM1-2-HR","CAS-ESM2-0"])
```
%% Cell type:markdown id: tags:
Since we have two different formats in the catalog, we have to provide keyword arguments for both formats within the `to_dataset_dict` function.
- `zarr_kwargs={"consolidated":True}` is needed because Pangeo's zarr assets have *consolidated* metadata
- `cdf_kwargs={"chunks":{"time":1}}` configures dask to not use very large arrays
%% Cell type:code id: tags:
``` python
test_dsets=esm_merged_cat_test.to_dataset_dict(
zarr_kwargs={"consolidated":True},
cdf_kwargs={"chunks":{"time":1}}
)
```
%% Cell type:markdown id: tags:
That worked fine. Now we save the catalog with `serialize`. We will separate the catalog into two files, the database `.csv.gz` file and the descriptor `.json` file. We can do that by passing the `catalog_type` keyword argument:
%% Cell type:code id: tags:
``` python
esm_merged_cat.serialize(name="our_catalog", catalog_type="file")
```
%% Cell type:markdown id: tags:
```{seealso}
This tutorial is part of a series on `intake`:
* [Part 1: Introduction](https://data-infrastructure-services.gitlab-pages.dkrz.de/tutorials-and-use-cases/tutorial_intake-1-introduction.html)
* [Part 2: Modifying and subsetting catalogs](https://data-infrastructure-services.gitlab-pages.dkrz.de/tutorials-and-use-cases/tutorial_intake-2-subset-catalogs.html)
* [Part 3: Merging catalogs](https://data-infrastructure-services.gitlab-pages.dkrz.de/tutorials-and-use-cases/tutorial_intake-3-merge-catalogs.html)
* [Part 4: Use preprocessing and create derived variables](https://data-infrastructure-services.gitlab-pages.dkrz.de/tutorials-and-use-cases/tutorial_intake-4-preprocessing-derived-variables.html)
* [Part 5: How to create an intake catalog](https://data-infrastructure-services.gitlab-pages.dkrz.de/tutorials-and-use-cases/tutorial_intake-5-create-esm-collection.html)
- You can also do another [CMIP6 tutorial](https://intake-esm.readthedocs.io/en/latest/user-guide/cmip6-tutorial.html) from the official intake page.
```
%% Cell type:code id: tags:
``` python
```
......
%% Cell type:markdown id: tags:
# Intake IV - preprocessing and derived variables
%% Cell type:markdown id: tags:
```{admonition} Overview
:class: dropdown
![Level](https://img.shields.io/badge/Level-expert-red.svg)
🎯 **objectives**: Learn how to integrate `intake-esm` in your workflow
⌛ **time_estimation**: "30min"
☑️ **requirements**: `intake_esm.__version__ == 2023.4.*`, at least 10GB memory.
- intake I
© **contributors**: k204210
⚖ **license**:
```
%% Cell type:markdown id: tags:
```{admonition} Agenda
:class: tip
Based on DKRZ's CMIP6 catalog, you learn in this part how to
1. [add a **preprocessing** to `to_dataset_dict()`](#preprocessing)
1. [create a derived variable registry](#derived)
```
%% Cell type:code id: tags:
``` python
import intake
#dkrz_catalog=intake.open_catalog(["https://dkrz.de/s/intake"])
#only for generating the web page we need to take the original link:
dkrz_cdp=intake.open_catalog(["https://gitlab.dkrz.de/data-infrastructure-services/intake-esm/-/raw/master/esm-collections/cloud-access/dkrz_catalog.yaml"])
esm_dkrz=dkrz_cdp.dkrz_cmip6_disk
```
%% Cell type:markdown id: tags:
<a class="anchor" id="preprocessing"></a>
## Use Preprocessing when opening assets and creating datasets
When calling intake-esm's `to_dataset_dict` function, we can pass an argument **preprocess**. Its value should be a function which is applied to all assets before they are opened.
```{note}
For CMIP6, a [preprocessing package](https://github.com/jbusecke/cmip6_preprocessing) has been developed for homogenizing and preparing datasets of different ESMs for a grand analysis, featuring
- renaming and setting of coordinates
- adjusting grid values to fit into a common range (0-360 for lon)
```
E.g., if you would like to set some specific variables as coordinates, you can define a [function](https://github.com/jbusecke/cmip6_preprocessing/blob/209041a965984c2dc283dd98188def1dea4c17b3/cmip6_preprocessing/preprocessing.py#L239) which
- receives an xarray dataset as an argument
- returns a new xarray dataset
%% Cell type:code id: tags:
``` python
def correct_coordinates(ds) :
"""converts wrongly assigned data_vars to coordinates"""
ds = ds.copy()
for co in [
"x",
"y",
"lon",
"lat",
"lev",
"bnds",
"lev_bounds",
"lon_bounds",
"lat_bounds",
"time_bounds",
"lat_verticies",
"lon_verticies",
]:
if co in ds.variables:
ds = ds.set_coords(co)
return ds
```
%% Cell type:markdown id: tags:
Now, when you open the dataset dictionary, you provide it for *preprocess*:
%% Cell type:code id: tags:
``` python
cat=esm_dkrz.search(variable_id="tas",
table_id="Amon",
source_id="MPI-ESM1-2-HR",
member_id="r1i1p1f1",
experiment_id="ssp370"
)
test_dsets=cat.to_dataset_dict(
zarr_kwargs={"consolidated":True},
cdf_kwargs={"chunks":{"time":1}},
preprocess=correct_coordinates
)
```
%% Cell type:markdown id: tags:
<a class="anchor" id="derived"></a>
## Derived variables
Most of the following is taken from the [intake-esm tutorial](https://intake-esm.readthedocs.io/en/latest/how-to/define-and-use-derived-variable-registry.html).
> A “derived variable” in this case is a variable that doesn’t itself exist in an intake-esm catalog, but can be computed (i.e., “derived”) from variables that do exist in the catalog. Currently, the derived variable implementation requires variables on the same grid, etc.; i.e., it assumes that all variables involved can be merged within the same dataset. [...] Derived variables could include more sophisticated diagnostic output like aggregations of terms in a tracer budget or gradients in a particular field.
The registry of the derived variables can be connected to the catalog. When users open datasets from such a catalog, the requested derived variables are computed on the fly from the variables that do exist in the catalog, as sketched below.
%% Cell type:code id: tags:
``` python
import intake
import intake_esm
```
%% Cell type:code id: tags:
``` python
from intake_esm import DerivedVariableRegistry
```
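%% Cell type:markdown id: tags:
A minimal sketch of how such a registry could look (the variable name `tas_degC`, the function and the catalog path are made-up placeholders; the keywords follow the linked intake-esm guide):
```python
dvr = DerivedVariableRegistry()

# register a hypothetical derived variable "tas_degC" which is computed from "tas"
@dvr.register(variable="tas_degC", query={"variable_id": ["tas"]})
def tas_in_degc(ds):
    ds["tas_degC"] = ds["tas"] - 273.15
    return ds

# connect the registry when opening a catalog (the path is a placeholder);
# searching for "tas_degC" then triggers the computation in to_dataset_dict()
cat_with_registry = intake.open_esm_datastore(
    "/path/to/any/esm-collection.json", registry=dvr
)
```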
%% Cell type:markdown id: tags:
```{seealso}
This tutorial is part of a series on `intake`:
* [Part 1: Introduction](https://data-infrastructure-services.gitlab-pages.dkrz.de/tutorials-and-use-cases/tutorial_intake-1-introduction.html)
* [Part 2: Modifying and subsetting catalogs](https://data-infrastructure-services.gitlab-pages.dkrz.de/tutorials-and-use-cases/tutorial_intake-2-subset-catalogs.html)
* [Part 3: Merging catalogs](https://data-infrastructure-services.gitlab-pages.dkrz.de/tutorials-and-use-cases/tutorial_intake-3-merge-catalogs.html)
* [Part 4: Use preprocessing and create derived variables](https://data-infrastructure-services.gitlab-pages.dkrz.de/tutorials-and-use-cases/tutorial_intake-4-preprocessing-derived-variables.html)
* [Part 5: How to create an intake catalog](https://data-infrastructure-services.gitlab-pages.dkrz.de/tutorials-and-use-cases/tutorial_intake-5-create-esm-collection.html)
- You can also do another [CMIP6 tutorial](https://intake-esm.readthedocs.io/en/latest/user-guide/cmip6-tutorial.html) from the official intake page.
```
%% Cell type:code id: tags:
``` python
```
......
%% Cell type:markdown id: tags:output_scroll
# Intake V - create intake-esm catalog from scratch
%% Cell type:markdown id: tags:output_scroll
```{admonition} Overview
:class: dropdown
![Level](https://img.shields.io/badge/Level-Intermediate-orange.svg)
🎯 **objectives**: Learn how to create `intake-esm` ESM-collections
⌛ **time_estimation**: "60min"
☑️ **requirements**: `intake_esm.__version__ == 2023.4.*`, at least 10GB memory.
- intake I, intake II
- [pandas](https://pandas.pydata.org/)
- [json](https://de.wikipedia.org/wiki/JavaScript_Object_Notation)
- [xarray](http://xarray.pydata.org/en/stable/)
© **contributors**: k204210
⚖ **license**:
```
%% Cell type:markdown id: tags:
```{admonition} Agenda
:class: tip
In this part, you learn
1. [When to build `intake-esm` collections](#motivation)
1. [How to create a standardized intake-esm catalog from scratch](#create)
1. [How to equip the catalog with attributes and configurations for assets and aggregation](#description)
2. [How to add the collection of assets to the catalog](#database)
1. [How to validate and save the newly created catalog](#validate)
1. [How to configure the catalog to process multivariable assets](#multi)
```
%% Cell type:markdown id: tags:
**Intake** is a cataloging tool for data repositories. It opens catalogs with *driver*s. Drivers can be plug-ins like `intake-esm`.
This tutorial gives insight into the creation of **intake-esm catalogs**. We recommend this specific driver for intake when working with ESM data as the plugin allows loading the data with the widely used and accepted tool `xarray`.
```{note}
This tutorial creates a catalog from scratch. If you work based on another catalog, it might be sufficient for you to look into [intake II - save subset]()
```
%% Cell type:markdown id: tags:
<a class="anchor" id="motivation"></a>
## 1. When should I create an `intake-esm` catalog?
Cataloging your data set with a *static* catalog for *easy access* is beneficial if
- the data set is *stable* 🏔 such that you do not have to update the content of the catalog to make it usable at all
- the data set is *very large* 🗃 such that browsing and accessing data via file system is less performant
- the data set should be *shared* 🔀 with many people such that you cannot use a data base format
%% Cell type:markdown id: tags:
<a class="anchor" id="create"></a>
## 2. Create an intake-esm catalog which complies with the [esmcat-specs](https://github.com/NCAR/esm-collection-spec/blob/master/collection-spec/collection-spec.md)
In order to create a well-defined, helpful catalog, you have to answer the following questions:
- What should be the *search facets* of the catalog?
- How are [assets](#asset) of the catalog combined to a dataset?
- How should `xarray` open the data set?
%% Cell type:markdown id: tags:
For `intake-esm` catalogs, an early [standard](https://github.com/NCAR/esm-collection-spec/blob/master/collection-spec/collection-spec.md) has been developed to ensure compatibility across different `intake-esm` catalogs. We will follow those specs in this tutorial.
In the code example, we will use a python dictionary, but you could also write directly into a file with your favorite editor. We start with a catalog dictionary `intake_esm_catalog` and add the required basic meta data:
%% Cell type:code id: tags:
``` python
intake_esm_catalog={
# we follow the esmcat specs version 0.1.0:
'esmcat_version': '0.1.0',
'id': 'Intake-esmI',
'description': "This is an intake catalog created for the intake tutorial"
}
```
%% Cell type:markdown id: tags:
<a class="anchor" id="description"></a>
### 2.1. Create the description
The description contains all the meta data which is necessary to understand the catalog. That makes the catalog *self-descriptive*. It also includes configuration for intake how to load assets of the data set(s) with the specified driver.
%% Cell type:markdown id: tags:
<a class="anchor" id="defineatts"></a>
#### Define **attributes** of your catalog
The catalog's [collection](#collection) uses attributes to describe the assets. These attributes are defined in the description via python `dict`ionaries and given as a list in the `intake-esm` catalog `.json` file, e.g.:
```json
"attributes": [
{
"column_name": "activity_id",
"vocabulary": "https://raw.githubusercontent.com/WCRP-CMIP/CMIP6_CVs/master/CMIP6_activity_id.json"
},
]
```
and will be accessed by users from the loaded catalog variable `catalog` via:
```python
catalog.esmcat.attributes
```
%% Cell type:markdown id: tags:
Catalog's attributes should allow users to
- effectively **browse**:
- The in-memory representation and the visulation tool for the catalog is a `Pandas` *DataFrame*. By specifying a `column_name`, the columns of the *DataFrame* are generated by the attributes of the catalog.
    - Additionally, the `column_name` of the catalog's attributes can be used as *search facets* - they will be keyword arguments of the `catalog.search()` function (see the sketch below)
- **understand** the content: You can provide information to the attributes, e.g. by specifying a `vocabulary` for all available (or allowed) values of the attribute.
➡ The [collection](#collection) must have values for all defined attributes (see 3.2.)
➡ In other terms: If [assets](#assets) should be integrated into the catalog, they have to be described with these attributes.
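As a small, hypothetical example (assuming the loaded catalog variable `catalog` from above and that `experiment_id` and `variable_id` are among the defined attributes), the search facets are used like this:
```python
# every "column_name" from the attributes list becomes both a DataFrame column
# and a keyword argument of catalog.search()
subset = catalog.search(experiment_id="historical", variable_id="tas")
subset.df.head()
```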
%% Cell type:markdown id: tags:
```{admonition} Best Practise
:class: tip
- The best configuration is reached if all datasets can be *uniquely identified*. I.e., if the users fill out all search facets, they will end up with **only one** dataset.
- Do not supply too many additional columns. Users may be confused when many search fields have similar meanings. Also, the display of the DataFrame should fit into the window width.
```
%% Cell type:markdown id: tags:
**Use case: Catalog for project data on a file system**
Given a directory tree with more than one level, ensure that:
- All files are on the same and deepest directory level.
- Each directory level has the same meaning across the project data. E.g. the deepest directory can have the meaning **version**.
This can easily be done by creating a **directory structure template** and checking the data paths against it.
If that check passes, each directory level can be used as a catalog attribute.
%% Cell type:code id: tags:
``` python
attributes=[]
directory_structure_template="mip_era/activity_id/institution_id/source_id/experiment_id/member_id/table_id/variable_id/grid_label/version"
```
%% Cell type:code id: tags:
``` python
for att in directory_structure_template.split('/'):
attributes.append(
dict(column_name=att,
vocabulary=f"https://raw.githubusercontent.com/WCRP-CMIP/CMIP6_CVs/master/CMIP6_{att}.json"
)
)
intake_esm_catalog["attributes"]=attributes
print(intake_esm_catalog)
```
%% Cell type:markdown id: tags:
```{note}
For data management purposes in general, we highly recommend defining a <i>path_template</i> and a <i>filename_template</i> for a clear directory structure **before storing any data**.
```
%% Cell type:markdown id: tags:
* You can add more attributes from files or by parsing filenames
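A small sketch of parsing additional attributes from a CMIP6-like file name (the example file name and the chosen keys are made up for illustration):
%% Cell type:code id: tags:
``` python
# hypothetical example: deconstruct a file name into additional attributes
fname = "tas_Amon_MPI-ESM1-2-HR_ssp370_r1i1p1f1_gn_201501-201912.nc"
keys = ["variable_id", "table_id", "source_id", "experiment_id",
        "member_id", "grid_label", "time_range"]
file_attributes = dict(zip(keys, fname.rsplit(".", 1)[0].split("_")))
print(file_attributes)
```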
%% Cell type:markdown id: tags:
#### Define the **assets** column of your catalog
The `assets` entry is a python `dict`ionary in the catalog similar to an attribute, e.g.:
```json
"assets": {
"column_name": "path",
"format": "netcdf"
},
```
The assets of a catalog refer to the data source that can be loaded by `intake`. Assets are essential for connecting `intake`'s function of **browsing** with the function of **accessing** the data. The entry contains
- a `column_name` which is associated with the keyword in the [collection](#collection). The value of `column_name` in the collection points at the [asset](#asset) which can be loaded by `xarray`.
- the entry `format` specifies the dataformat of the asset.
```{note}
If you have [assets](#asset) of mixed types, you can substitute `format` by `format_column_name` so that both information for the asset is taken from the [collection](#collection)
```
%% Cell type:code id: tags:
``` python
assets={
"column_name": "path",
"format": "netcdf"
}
intake_esm_catalog["assets"]=assets
print(intake_esm_catalog)
```
%% Cell type:markdown id: tags:
#### Optional: Define **aggregation control** for your data sets
```{note}
If **aggregation_control** is not defined, intake opens one xarray dataset per asset
```
One goal of a catalog is to make access to the data as **analysis ready** as possible. Therefore, `intake-esm` features aggregating multiple [assets](#asset) to a larger single **data set**. If **aggregation_control** is defined in the [catalog](#catalog) and users run the catalog's `to_dataset_dict()` function, a Python dictionary of aggregated xarray datasets is created. The logic for merging and/or concatenating the catalog into datasets has to be configured under aggregation_control.
The implementation works such that the variable's dimensions are either enhanced by a new dimension or an existing dimension is extended with the new data included in the additional [assets](#asset).
%% Cell type:markdown id: tags:
- `aggregation_control` is a `dict`ionary in the [catalog](#catalog). If it is set, three keywords have to be configured:
- `variable_column_name`: In the [collection](#collection), the **variable name** is specified under that column. Intake-esm will aggregate [assets](#asset) with the same name only. Thus, all [assets](#asset) to be combined to a dataset have to include at least one unique variable. If your [assets](#asset) contain more than one data variable and users should be able to subset with intake, check [multi variable assets](#multivar).
- `groupby_attrs`: [assets](#asset) attributed with different values of the `groupby_attrs` **should not be aggregated** to one xarray dataset. E.g., if you have data for different ESMs in one catalog you do not want users to merge them into one dataset. The `groupby_attrs` will be combined to the key of the aggregated dataset in the returned dictionary of `to_dataset_dict()`.
- `aggregations`: Specification of **how** xarray should combine [assets](#asset) with same values of these `groupby_attrs`. <a class="anchor" id="aggregations"></a>
. E.g.:
```json
"aggregation_control": {
"variable_column_name": "variable_id",
"groupby_attrs": [
"activity_id",
"institution_id"
],
"aggregations": [
{
"type": "union",
"attribute_name": "variable_id"
}
]
}
```
%% Cell type:markdown id: tags:
Let's start with defining `variable_column_name` and `groupby_attrs`:
%% Cell type:code id: tags:
``` python
aggregation_control=dict(
variable_column_name="variable_id",
groupby_attrs=[
"activity_id",
"institution_id"
]
)
```
%% Cell type:markdown id: tags:
```{admonition} Best Practise
:class: tip
- A well-defined aggregation control contains **all** [defined attributes](#defineatts)
```
%% Cell type:markdown id: tags:
[**Aggregations**](#aggregations):
*aggregations* is an optional list of dictionaries each of which configures
- on which dimension of the variable the [assets](#assets) should be aggregated
- optionally: what keyword arguments should be passed to xarray's `concat()` and `merge()` functions
for one attribute/column of the catalog given as `attribute_name`.
%% Cell type:markdown id: tags:
A dictionary of the aggregations list is named *aggregation object* and has to include the following specifications:
- `attribute_name`: the column name which is not a *groupby_attr* and should be used for aggregating a single variable over a dimension
- `type`: Can be one of
    - `join_new`
    - `join_existing`
    - `union`
- **optional**: `options`: Keyword arguments for xarray
%% Cell type:markdown id: tags:
The following defines that `variable_id` will be taken for a unique dataset:
%% Cell type:code id: tags:
``` python
aggregation_control["aggregations"]=[dict(
attribute_name="variable_id",
type="union"
)]
```
%% Cell type:markdown id: tags:
Now, we configure intake to use `time_range` for extending the existing dimension `time`. Therefore, we have to add `options` with `"dim": "time"` as a keyword argument for xarray:
%% Cell type:code id: tags:
``` python
aggregation_control["aggregations"].append(
dict(
attribute_name="time_range",
type="join_existing",
options={ "dim": "time", "coords": "minimal", "compat": "override" }
)
)
```
%% Cell type:markdown id: tags:
We can also, kind of retrospectively, combine all *members* of an ensemble along a new dimension of a variable:
%% Cell type:code id: tags:
``` python
aggregation_control["aggregations"].append(
dict(
attribute_name= "member_id",
type= "join_new",
options={ "coords": "minimal", "compat": "override" }
)
)
```
%% Cell type:markdown id: tags:
```{note}
It is not possible to pre-configure `dask` options for `xarray`. Be sure that users of your catalog know if and how to set <b>chunks</b>.
```
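For illustration, a hypothetical call on the users' side (with `catalog` standing for an opened datastore built from this description and the `cdf_kwargs` keyword used as elsewhere in this tutorial):
```python
# users control dask chunking themselves when creating the datasets
dsets = catalog.to_dataset_dict(cdf_kwargs={"chunks": {"time": 1}})
```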
%% Cell type:code id: tags:
``` python
intake_esm_catalog["aggregation_control"]=aggregation_control
print(intake_esm_catalog)
```
%% Cell type:markdown id: tags:
<a class="anchor" id="database"></a>
### 2.2. Create the data base for the catalog
The [collection](#collection) of [assets](#asset) can be specified either
- under `catalog_dict` as a list of dictionaries inside the [catalog](#catalog). One asset including all attribute specifications is saved as an individual dictionary, e.g.:
```json
"catalog_dict": [
{
"filename": "/work/mh0287/m221078/prj/switch/icon-oes/experiments/khwX155/outdata/khwX155_atm_mon_18500101.nc",
"variable": "tas_gmean"
}
]
```
- or under `catalog_file` which refers to a separate `.csv` file, e.g.
```json
"catalog_file": "dkrz_cmip6_disk_netcdf.csv.gz"
```
%% Cell type:markdown id: tags:
### Option A: Catalog_dict implementation
Assuming we would like to create a catalog for all files in `/work/ik1017/CMIP6/data/CMIP6/ScenarioMIP/DKRZ/MPI-ESM1-2-HR/ssp370/r1i1p1f1/Amon/tas/gn/v20190710/`, we can parse the path with our `directory_structure_template`:
%% Cell type:code id: tags:
``` python
trunk="/work/ik1017/CMIP6/data/CMIP6/ScenarioMIP/DKRZ/MPI-ESM1-2-HR/ssp370/r1i1p1f1/Amon/tas/gn/v20190710/"
trunkdict={}
for i,item in enumerate(directory_structure_template.split('/')):
trunkdict[item]=trunk.split('/')[-(len(directory_structure_template.split('/'))-i+1)]
```
%% Cell type:markdown id: tags:
Afterwards, we can associate all files in that directory with these attributes and the additional `time_range` and `path` using os:
%% Cell type:code id: tags:
``` python
import os
filelist=!ls {trunk}
catalog_dict=[]
for asset in filelist:
assetdict={}
assetdict["time_range"]=asset.split('.')[0].split('_')[-1]
assetdict["path"]=trunk+asset
assetdict.update(trunkdict)
catalog_dict.append(assetdict)
```
%% Cell type:markdown id: tags:
Then, we put that dict into the catalog:
%% Cell type:code id: tags:
``` python
intake_esm_catalog["catalog_dict"]=catalog_dict
```
%% Cell type:markdown id: tags:
### Option B: Catalog_file implementation
The `catalog_file` format needs to comply with the following rules:
- all file types that can be opened by pandas are allowed to be set as `catalog_file`
- the `.csv` file needs a header which includes all catalog attributes
An example would be:
```csv
filename,variable
/work/mh0287/m221078/prj/switch/icon-oes/experiments/khwX155/outdata/khwX155_atm_mon_18500101.nc,tas_gmean
```
```{note}
- Note that the *catalog_file* can also live in the cloud i.e. be an URL. You can host both the collection and catalog in the cloud as [DKRZ](https://swiftbrowser.dkrz.de/public/dkrz_a44962e3ba914c309a7421573a6949a6/intake-esm/ ) does.
```
%% Cell type:markdown id: tags:
```{admonition} Best practice
:class: tip
To keep a clear overview, it is best to use the same prefix name for both the catalog and the `catalog_file`.
```
%% Cell type:code id: tags:
``` python
import pandas as pd
catalog_dict_df=pd.DataFrame(catalog_dict)
```
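%% Cell type:markdown id: tags:
If we went for option B, a sketch of turning the in-line dictionary into a `catalog_file` could look as follows (the file name `intake_esm_catalog.csv.gz` is just an example):
%% Cell type:code id: tags:
``` python
# write the data base to a compressed csv file ...
catalog_dict_df.to_csv("intake_esm_catalog.csv.gz", index=False)

# ... and reference it from a copy of the catalog instead of the in-line dictionary
intake_esm_catalog_fileversion = {
    k: v for k, v in intake_esm_catalog.items() if k != "catalog_dict"
}
intake_esm_catalog_fileversion["catalog_file"] = "intake_esm_catalog.csv.gz"
```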
%% Cell type:markdown id: tags:
### Saving a separate data base for [assets](#asset) or use a dictionary in the catalog?
```{tabbed} Advantages catalog_dict
- Only maintain one file which contains both catalog and collection
A `catalog_dict` is suitable for smaller catalogs
```
```{tabbed} Disadvantages catalog_dict
- you cannot easily compress the catalog
- you can only use **one type of data access** for the catalog content. For CMIP6, we can provide access via `netcdf` or via `opendap`. We can create two collections for the same catalog file for covering both use cases.
A separate `catalog_file` is more suitable for larger catalogs
```
%% Cell type:markdown id: tags:
### Use case: Updating the collection for a living project on file system
Solution: Write a **builder** script and run it as a cronjob (automatically and regularly):
A typical builder for a community project contains the following sequence:
1. Create one or more **lists of files** based on a `find` shell command on the data base directory. This type of job is also named *crawler* as it *crawls* through the file system.
1. Read the lists of files and create a `panda`s DataFrame for these files.
1. Parse the file names and file paths and fill the column values. That can easily be done by deconstructing file paths and file names into their parts, assuming you defined a mandatory *path_template* and *filename_template* beforehand.
- Filenames that cannot be parsed should be sorted out
1. The data frame is saved as the final **catalog** as a `.csv` file. You can also compress it to `.csv.gz`.
At DKRZ, we run scripts for project data on disk repeatedly in cronjobs to keep the catalog updated.
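A minimal sketch of such a builder, assuming the `directory_structure_template` from above and a placeholder project root `/path/to/project`:
%% Cell type:code id: tags:
``` python
import os
import pandas as pd

project_root = "/path/to/project"  # placeholder for the data base directory
attributes = directory_structure_template.split("/")  # template defined above

records = []
for root, _dirs, files in os.walk(project_root):
    levels = os.path.relpath(root, project_root).split(os.sep)
    if len(levels) != len(attributes):
        continue  # directory depth does not match the template -> skip
    for fname in files:
        if not fname.endswith(".nc"):
            continue  # file name cannot be parsed -> sort out
        record = dict(zip(attributes, levels))
        record["path"] = os.path.join(root, fname)
        records.append(record)

# save the data frame as the catalog's data base
pd.DataFrame(records).to_csv("project_catalog.csv.gz", index=False)
```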
%% Cell type:markdown id: tags:
#### Builder tool examples
- The NCAR [builder tool](https://github.com/NCAR/intake-esm-datastore/tree/e253f184ccc78906a08f1580282da070b898957a/builders) for community projects like CMIP6 and CMIP5.
- DKRZ builder notebooks (based on NCAR tools) like this [Era5 notebook](https://gitlab.dkrz.de/data-infrastructure-services/intake-esm/-/blob/master/builder/notebooks/dkrz_era5_disk_catalog.ipynb)
%% Cell type:markdown id: tags:
<a class="anchor" id="validate"></a>
## 3. Validate and save the catalog:
If we open the defined catalog with `open_esm_datastore()` and try `to_dataset_dict()`, we can check if our creation is successful. The resulting catalog should give us exactly 1 dataset from 18 assets as we aggregate over time.
%% Cell type:code id: tags:
``` python
import intake
validated_cat=intake.open_esm_datastore(
obj=dict(
df=catalog_dict_df,
esmcat=intake_esm_catalog
)
)
validated_cat
```
%% Cell type:code id: tags:
``` python
validated_cat.to_dataset_dict()
```
%% Cell type:markdown id: tags:
Intake-esm allows writing catalog file(s) with the `serialize()` function. The only argument is the **name** of the catalog which will be used as the file name. It writes the two parts of the catalog either together into one `.json` file:
%% Cell type:code id: tags:
``` python
validated_cat.serialize("validated_cat")
```
%% Cell type:markdown id: tags:
Or into two separate files if we provide `catalog_type="file"` as a second argument. The single `.json` file may become very large, while we can save disk space if we save the data base in a separate `.csv.gz` file:
%% Cell type:code id: tags:
``` python
validated_cat.serialize("validated_cat", catalog_type="file")
```
%% Cell type:markdown id: tags:
<a class="anchor" id="multi"></a>
## 4. Multivariable assets
If an [asset](#asset) contains more than one variable, `intake-esm` also features pre-selection of a variable before loading the data. [Here](https://intake-esm.readthedocs.io/en/latest/user-guide/multi-variable-assets.html) is a user guide on how to configure the collection for that.
1. the *variable_column* of the catalog must contain iterables (`list`, `tuple`, `set`) of values.
2. the user must specify a dictionary of functions for converting values in certain columns into iterables. This is done via the `read_csv_kwargs` argument such that the collection needs to be opened as follows:
```python
import ast
import intake
col = intake.open_esm_datastore(
"multi-variable-collection.json",
    read_csv_kwargs={"converters": {"variable": ast.literal_eval}},
)
col
```
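As a small sketch of what the converter does (the shown cell value is hypothetical), `ast.literal_eval` turns the string representation of a list stored in the data base into a real Python list:
```python
import ast

# a hypothetical cell value of the iterable "variable" column in the data base
cell = "['tas', 'pr', 'psl']"
print(ast.literal_eval(cell))  # -> ['tas', 'pr', 'psl']
```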
%% Cell type:markdown id: tags:
```{seealso}
This tutorial is part of a series on `intake`:
* [Part 1: Introduction](https://data-infrastructure-services.gitlab-pages.dkrz.de/tutorials-and-use-cases/tutorial_intake-1-introduction.html)
* [Part 2: Modifying and subsetting catalogs](https://data-infrastructure-services.gitlab-pages.dkrz.de/tutorials-and-use-cases/tutorial_intake-2-subset-catalogs.html)
* [Part 3: Merging catalogs](https://data-infrastructure-services.gitlab-pages.dkrz.de/tutorials-and-use-cases/tutorial_intake-3-merge-catalogs.html)
* [Part 4: Use preprocessing and create derived variables](https://data-infrastructure-services.gitlab-pages.dkrz.de/tutorials-and-use-cases/tutorial_intake-4-preprocessing-derived-variables.html)
* [Part 5: How to create an intake catalog](https://data-infrastructure-services.gitlab-pages.dkrz.de/tutorials-and-use-cases/tutorial_intake-5-create-esm-collection.html)
- You can also do another [CMIP6 tutorial](https://intake-esm.readthedocs.io/en/latest/user-guide/cmip6-tutorial.html) from the official intake page.
```
%% Cell type:code id: tags:
``` python
```
%% Cell type:code id: tags:
``` python
```
......
%% Cell type:markdown id: tags:
# Advanced *Summer Days* calculation with `xarray` using CMIP6 data
We will show here how to count the annual summer days for a particular geolocation of your choice using the results of a climate model. In particular, we can choose one of the historical or one of the shared socioeconomic pathway (ssp) experiments of the Coupled Model Intercomparison Project [CMIP6](https://pcmdi.llnl.gov/CMIP6/).
Thanks to the data and computer scientists Marco Kulüke, Fabian Wachsmann, Regina Kwee-Hinzmann, Caroline Arnold, Felix Stiehler, Maria Moreno, and Stephan Kindermann at DKRZ for their contribution to this notebook.
%% Cell type:markdown id: tags:
In this use case you will learn the following:
- How to access a dataset from the DKRZ CMIP6 model data archive
- How to count the annual number of summer days for a particular geolocation using this model dataset
- How to visualize the results
You will use:
- [Intake](https://github.com/intake/intake) for finding the data in the catalog of the DKRZ archive
- [Xarray](http://xarray.pydata.org/en/stable/) for loading and processing the data
- [hvPlot](https://hvplot.holoviz.org/index.html) for visualizing the data in the Jupyter notebook and save the plots in your local computer
%% Cell type:markdown id: tags:
## 0. Load Packages
%% Cell type:code id: tags:
``` python
import numpy as np # fundamental package for scientific computing
import pandas as pd # data analysis and manipulation tool
import xarray as xr # handling labelled multi-dimensional arrays
import intake # to find data in a catalog, this notebook explains how it works
from ipywidgets import widgets # to use widgets in the Jupyer Notebook
from geopy.geocoders import Nominatim # Python client for several popular geocoding web services
import folium # visualization tool for maps
import hvplot.pandas # visualization tool for interactive plots
```
%% Cell type:markdown id: tags:
## 1. Which dataset do we need? -> Choose Shared Socioeconomic Pathway, Place, and Year
<a id='selection'></a>
%% Cell type:code id: tags:
``` python
# Produce the widget where we can select what experiment we are interested on
#experiments = {'historical':range(1850, 2015), 'ssp126':range(2015, 2101),
# 'ssp245':range(2015, 2101), 'ssp370':range(2015, 2101), 'ssp585':range(2015, 2101)}
#experiment_box = widgets.Dropdown(options=experiments, description="Select experiment: ", disabled=False,)
#display(experiment_box)
```
%% Cell type:code id: tags:
``` python
# Produce the widget where we can select what geolocation and year are interested on
#print("Feel free to change the default values.")
#place_box = widgets.Text(description="Enter place:", value="Hamburg")
#display(place_box)
#x = experiment_box.value
#year_box = widgets.Dropdown(options=x, description="Select year: ", disabled=False,)
#display(year_box)
```
%% Cell type:code id: tags:
``` python
pb="Hamburg"
yb=2021
eb="ssp370"
%store -r
```
%% Cell type:markdown id: tags:
### 1.1 Find Coordinates of chosen Place
If ambiguous, the most likely coordinates will be chosen, e.g. "Hamburg" results in "Hamburg, 20095, Deutschland", (53.55 North, 10.00 East)
%% Cell type:code id: tags:
``` python
# We use the module Nominatim which gives us the geographical coordinates of the place we selected above
geolocator = Nominatim(user_agent="any_agent")
location = geolocator.geocode(pb)
print(location.address)
print((location.latitude, location.longitude))
```
%% Cell type:markdown id: tags:
### 1.2 Show Place on a Map
%% Cell type:code id: tags:
``` python
# We use the folium package to plot our selected geolocation in a map
m = folium.Map(location=[location.latitude, location.longitude])
tooltip = location.latitude, location.longitude
folium.Marker([location.latitude, location.longitude], tooltip=tooltip).add_to(m)
display(m)
```
%% Cell type:markdown id: tags:
We have defined the place and time. Now, we can search for the climate model dataset.
%% Cell type:markdown id: tags:
## 2. Intake Catalog
### 2.1 Load the Intake Catalog
%% Cell type:code id: tags:
``` python
# Path to master catalog on the DKRZ server
#dkrz_catalog=intake.open_catalog(["https://dkrz.de/s/intake"])
#
#only for the web page we need to take the original link:
parent_col=intake.open_catalog(["https://gitlab.dkrz.de/data-infrastructure-services/intake-esm/-/raw/master/esm-collections/cloud-access/dkrz_catalog.yaml"])
list(parent_col)
# Open the catalog with the intake package and name it "col" as short for "collection"
col=parent_col["dkrz_cmip6_disk"]
```
%% Cell type:markdown id: tags:
### 2.2 Browse the Intake Catalog
In this example we choose the Max Planck Institute Earth System Model in high resolution mode ("MPI-ESM1-2-HR") and the maximum temperature near the surface ("tasmax") as the variable. We also choose an experiment. CMIP6 comprises several kinds of experiments. Each experiment has various simulation members. You can find more information in the [CMIP6 Model and Experiment Documentation](https://pcmdi.llnl.gov/CMIP6/Guide/dataUsers.html#5-model-and-experiment-documentation).
%% Cell type:code id: tags:
``` python
# Store the name of the model we chose in a variable named "climate_model"
climate_model = "MPI-ESM1-2-HR" # here we choose Max-Plack Institute's Earth Sytem Model in high resolution
# This is how we tell intake what data we want
query = dict(
source_id = climate_model, # the model
variable_id = "tasmax", # temperature at surface, maximum
table_id = "day", # daily maximum
experiment_id = eb, # what we selected in the drop down menu,e.g. SSP2.4-5 2015-2100
member_id = "r10i1p1f1", # "r" realization, "i" initialization, "p" physics, "f" forcing
)
# Intake looks for the query we just defined in the catalog of the CMIP6 data pool at DKRZ
col.df["uri"]=col.df["uri"].str.replace("lustre/","lustre02/")
cat = col.search(**query)
del col
# Show query results
cat.df
```
%% Cell type:markdown id: tags:
### 2.3 Create Dictionary and get Data into one Object
%% Cell type:code id: tags:
``` python
# cdf_kwargs are given to xarray.open_dataset
# cftime is like datetime but extends to all four digit years and many calendar types
dset_dict = cat.to_dataset_dict(cdf_kwargs={"chunks": {"time": -1}, "use_cftime": True})
```
%% Cell type:code id: tags:
``` python
# get all data into one object
for key, value in dset_dict.items():
model = key.split(".")[2] # extract model name from key
tasmax_xr = value["tasmax"].squeeze() # extract variable from dataset
```
%% Cell type:code id: tags:
``` python
tasmax_xr
```
%% Cell type:markdown id: tags:
## 3. Select Year and Look at (Meta) Data
%% Cell type:code id: tags:
``` python
tasmax_year_xr = tasmax_xr.sel(time=str(yb))
# Let's have a look at the xarray data array
tasmax_year_xr
```
%% Cell type:markdown id: tags:
We see not only the numbers, but also information about it, such as long name, units, and the data history. This information is called metadata.
%% Cell type:markdown id: tags:
## 4. Compare Model Grid Cell with chosen Location
%% Cell type:code id: tags:
``` python
# Find nearest model coordinate by finding the index of the nearest grid point
abslat = np.abs(tasmax_year_xr["lat"] - location.latitude)
abslon = np.abs(tasmax_year_xr["lon"] - location.longitude)
c = np.maximum(abslon, abslat)
([xloc], [yloc]) = np.where(c == np.min(c)) # xloc and yloc are the indices of the nearest model grid point
```
%% Cell type:code id: tags:
``` python
# Draw map again
m = folium.Map(location=[location.latitude, location.longitude], zoom_start=8)
tooltip = location.latitude, location.longitude
folium.Marker(
[location.latitude, location.longitude],
tooltip=tooltip,
popup="Location selected by You",
).add_to(m)
#
tooltip = float(tasmax_year_xr["lat"][yloc].values), float(tasmax_year_xr["lon"][xloc].values)
folium.Marker(
[tasmax_year_xr["lat"][yloc], tasmax_year_xr["lon"][xloc]],
tooltip=tooltip,
popup="Model Grid Cell Center",
).add_to(m)
# Define coordinates of model grid cell (just for visualization)
rect_lat1_model = (tasmax_year_xr["lat"][yloc - 1] + tasmax_year_xr["lat"][yloc]) / 2
rect_lon1_model = (tasmax_year_xr["lon"][xloc - 1] + tasmax_year_xr["lon"][xloc]) / 2
rect_lat2_model = (tasmax_year_xr["lat"][yloc + 1] + tasmax_year_xr["lat"][yloc]) / 2
rect_lon2_model = (tasmax_year_xr["lon"][xloc + 1] + tasmax_year_xr["lon"][xloc]) / 2
# Draw model grid cell
folium.Rectangle(
bounds=[[rect_lat1_model, rect_lon1_model], [rect_lat2_model, rect_lon2_model]],
color="#ff7800",
fill=True,
fill_color="#ffff00",
fill_opacity=0.2,
).add_to(m)
m
```
%% Cell type:markdown id: tags:
Climate models have a finite resolution. Hence, models do not provide the data of a particular point, but the mean over a model grid cell. Take this in mind when comparing model data with observed data (e.g. weather stations).
Now, we will visualize the daily maximum temperature time series of the model grid cell.
%% Cell type:markdown id: tags:
## 5. Draw Temperature Time Series and Count Summer days
%% Cell type:markdown id: tags:
The definition of a summer day varies from region to region. According to the [German Weather Service](https://www.dwd.de/EN/ourservices/germanclimateatlas/explanations/elements/_functions/faqkarussel/sommertage.html), "a summer day is a day on which the maximum air temperature is at least 25.0 °C". Depending on the place you selected, you might want to apply a different threshold to calculate the summer days index.
%% Cell type:code id: tags:
``` python
tasmax_year_place_xr = tasmax_year_xr[:, yloc, xloc] - 273.15 # Convert Kelvin to °C
tasmax_year_place_df = pd.DataFrame(index = tasmax_year_place_xr['time'].values,
                                    columns = ['Model Temperature', 'Summer Day Threshold']) # create the dataframe
tasmax_year_place_df.loc[:, 'Model Temperature'] = tasmax_year_place_xr.values # insert model data into the dataframe
tasmax_year_place_df.loc[:, 'Summer Day Threshold'] = 25 # insert the threshold into the dataframe
# Plot data and define title and legend
tasmax_year_place_df.hvplot.line(y=['Model Temperature', 'Summer Day Threshold'],
value_label='Temperature in °C', legend='bottom',
title='Daily maximum Temperature near Surface for '+pb,
height=500, width=620)
```
%% Cell type:markdown id: tags:
As we can see, the maximum daily temperature is highly variable over the year. As we are using the mean temperature of a model grid cell, the number of summer days might be different from what you would expect at a single location.
%% Cell type:code id: tags:
``` python
# Summer days index calculation
no_summer_days_model = tasmax_year_place_xr[tasmax_year_place_xr.load() > 25].size # count the number of summer days
# Print results in a sentence
print("According to the German Weather Service definition, in the " +eb +" experiment the "
+climate_model +" model shows " +str(no_summer_days_model) +" summer days for " + pb
+ " in " + str(yb) +".")
```
%% Cell type:markdown id: tags:
[Try another location and year](#selection)
......
"cell_type": "markdown",
"metadata": {},
"source": [
"This Jupyter notebook is meant to run in the Jupyterhub server of the German Climate Computing Center [DKRZ](https://www.dkrz.de/) which is an [ESGF](https://esgf.llnl.gov/) repository that hosts 4 petabytes of CMIP6 data. Please, choose the Python 3 unstable kernel on the Kernel tab above, it contains all the common geoscience packages. See more information on how to run Jupyter notebooks at DKRZ [here](https://www.dkrz.de/up/systems/mistral/programming/jupyter-notebook). Find there how to run this Jupyter notebook in the DKRZ server out of the Jupyterhub, which will entail that you create the environment accounting for the required package dependencies. Running this Jupyter notebook in your premise, which is also known as [client-side](https://en.wikipedia.org/wiki/Client-side) computing, will also require that you install the necessary packages on you own but it will anyway fail because you will not have direct access to the data pool. Direct access to the data pool is one of the main benefits of the [server-side](https://en.wikipedia.org/wiki/Server-side) data-near computing we demonstrate in this use case."
"This Jupyter notebook is meant to run in the Jupyterhub server of the German Climate Computing Center [DKRZ](https://www.dkrz.de/) which is an [ESGF](https://esgf.llnl.gov/) repository that hosts 4 petabytes of CMIP6 data. Please, choose the Python 3 unstable kernel on the Kernel tab above, it contains all the common geoscience packages. See more information on how to run Jupyter notebooks at DKRZ [here](https://www.dkrz.de/up/systems/mistral/programming/jupyter-notebook). Find there how to run this Jupyter notebook in the DKRZ server out of the Jupyterhub, which will entail that you create the environment accounting for the required package dependencies. Running this Jupyter notebook in your premise, which is also known as [client-side](https://en.wikipedia.org/wiki/Client-side) computing, will also require that you install the necessary packages on you own but it will anyway fail because you will not have direct access to the data pool. Direct access to the data pool is one of the main benefits of the [server-side](https://en.wikipedia.org/wiki/Server-side) data-near computing we demonstrate in this use case. "
]
},
{
......
%% Cell type:markdown id: tags:
# Climate Extremes Indices with CDOs according to the ETCCDI standard
%% Cell type:markdown id: tags:
A **climate index** is a calculated measure for the state and/or variations of the climate system. In the field of meteorology, many definitions for different types of climate indices exist. For example, the German Weather Service defines a **Klimakenntag**: if a climatological parameter exceeds a specific threshold on a given day, that day is counted as the corresponding Klimakenntag.
The expert team ETCCDI has defined a core set of descriptive indices of extremes (Climate Extremes Indices, CEI) in order to
> "gain a **uniform perspective on observed changes in weather and climate extremes**.
These indices have become a standard in the climate science community. They describe particular characteristics of extremes including *frequency, amplitude and persistence*.
%% Cell type:markdown id: tags:
### Learning Objectives
In this notebook, you will learn to
- calculate 4 kinds of CEIs, **absolute**, **threshold**, **duration** and **percentile**-based indices according to the ETCCDI standard with CDOs
- calculate running window percentiles according to the ETCCDI standard with CDOs
%% Cell type:markdown id: tags:
### Requirements
- intake
- cdo
%% Cell type:code id: tags:
``` python
from IPython.display import HTML
HTML('<iframe src="https://slides.com/wachsylon/cdoetccdi/embed" width="576" height="420" scrolling="no" frameborder="0" webkitallowfullscreen mozallowfullscreen allowfullscreen></iframe>')
```
%% Cell type:markdown id: tags:
## 0. Preparation
In the following,
- you will `import` packages required to run `cdo` in python. We use the `cdo` binary from the environment which was used to create the kernel. You might want to change that.
- you will get a time series of the variables `tasmin` and `pr` which can be used to calculate all types of indices. We choose a subset from the `historical` experiment of the recent model intercomparison project *CMIP6* for the time interval 1970-1989.
%% Cell type:code id: tags:
``` python
#set cdo binary to the one installed in the environment of the kernel
import sys
import os
cdobin=os.path.sep.join(sys.executable.split(os.path.sep)[:-1]+["cdo"])
#
#import python cdo
from cdo import *
cdo = Cdo(cdobin)
cdo.debug=True
#This prevents CDO from recreating output files that already exist
cdo.forceOutput = False
```
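%% Cell type:markdown id: tags:
As a quick, optional sanity check, you can ask the python-cdo bindings which CDO version they use and whether the ETCCDI operators are available; `version()` and `operators` are provided by the `Cdo` class.
%% Cell type:code id: tags:
``` python
# Optional check: print the CDO version and confirm the ETCCDI operators are present
print(cdo.version())
print("etccdi_fd" in cdo.operators)
```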
%% Cell type:code id: tags:
``` python
import intake
# Path to master catalog on the DKRZ server
#dkrz_catalog=intake.open_catalog(["https://dkrz.de/s/intake"])
#
#only for the web page we need to take the original link:
dkrz_catalog=intake.open_catalog(["https://gitlab.dkrz.de/data-infrastructure-services/intake-esm/-/raw/master/esm-collections/cloud-access/dkrz_catalog.yaml"])
list(dkrz_catalog)
# Open the catalog with the intake package and name it "col" as short for "collection"
cols=dkrz_catalog._entries["dkrz_cmip6_disk"]._open_args["read_csv_kwargs"]["usecols"]+["opendap_url"]
col=dkrz_catalog.dkrz_cmip6_disk(read_csv_kwargs=dict(usecols=cols))
```
%% Cell type:code id: tags:
``` python
# Store the name of the model we chose in a variable named "climate_model"
source_id = "MPI-ESM1-2-LR" # here we choose the Max Planck Institute's Earth System Model in low resolution
query = dict(
source_id = source_id, # the model
variable_id = ["tasmin","pr"],
table_id = "day", # daily
experiment_id = "historical", #
member_id = "r10i1p1f1", # "r" realization, "i" initialization, "p" physics, "f" forcing
time_range = "19700101-19891231"
)
# Intake looks for the query we just defined in the catalog of the CMIP6 data pool at DKRZ
col.df["uri"]=col.df["uri"].str.replace("lustre/","lustre02/")
cat = col.search(**query)
# Show query results
cat.df
```
%% Cell type:code id: tags:
``` python
#download files
urlpr=cat.df[cat.df["variable_id"] == "pr"]["opendap_url"].values[0].replace("dodsC","fileServer")
urltas=cat.df[cat.df["variable_id"] == "tasmin"]["opendap_url"].values[0].replace("dodsC","fileServer")
!wget {urlpr}
!wget {urltas}
```
%% Cell type:code id: tags:
``` python
#define temporary output files
prorig = cat.df[cat.df["variable_id"] == "pr"]["opendap_url"].values[0].split('/')[-1]
tasorig = cat.df[cat.df["variable_id"] == "tasmin"]["opendap_url"].values[0].split('/')[-1]
prHamburg ="pr_hamburg.nc"
tasminHamburg="tasmin_hamburg.nc"
```
%% Cell type:code id: tags:
``` python
#Select a subarea because memory might not be large enough
#and change units
prep_ts="-sellonlatbox,9,10,53,54 "
prep_pr="-mulc,86400 "+prep_ts
pr=cdo.mulc("1",
input=prep_pr+prorig,
output=prHamburg)
tasmin=cdo.mulc("1",
input=prep_ts+tasorig,
output=tasminHamburg)
```
%% Cell type:markdown id: tags:
## 1. Absolute indices
***
- `txx`, `txn`, `tnx`, `tnn`
- daily temperature range `dtr`, simple daily intensity index `sdii`
- `rx1day`, `rx5day`, `prcptot`
***
%% Cell type:markdown id: tags:
While the other three categories of indices are defined in temporal *units*, the **absolute** indices have the same *units* as the input variable. The **absolute indices** are useful because their values
> can often be related to extreme events that affect human society and the natural environment
Absolute indices are easy to compute with basic CDOs. E.g., the minimum of daily minimum temperature can be calculated with
```bash
cdo yearmin tasmin tnn.nc
```
However, the `etccdi_` indices produce variables named according to the ETCCDI standard. Also, you can specify an output frequency. When it comes to the *precipitation-related* absolute indices, you cannot work with basic CDO operators anymore. In the following, we show you an example for the *highest five-day precipitation sum*:
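%% Cell type:markdown id: tags:
As a minimal sketch, the basic command above can also be run through the python-cdo bindings, assuming the `cdo` object and the `tasmin_hamburg.nc` subset created in the preparation step:
%% Cell type:code id: tags:
``` python
# Minimum of daily minimum temperature per year (tnn) with a basic CDO operator;
# unlike the etccdi_* operators, the output variable keeps its original name
tnn = cdo.yearmin(input=tasminHamburg, output="tnn_hamburg.nc")
```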
%% Cell type:code id: tags:
``` python
#Highest 5-day precipitation sum
#A "!export ..." line only affects a short-lived subshell; set the variable via os.environ instead
os.environ["CDO_TIMESTAT_DATE"] = "last"
#$cdo eca_rx5day,50,freq=year -runsum,5
rx5day="rx5day_hamburg.nc"
rx5day_values = cdo.etccdi_rx5day(input="-runsum,5 "+prHamburg,
output="rx5day_hamburg.nc",
returnCdf=True).variables[
"rx5dayETCCDI"][:]
rx5day_values = rx5day_values.flatten()
```
%% Cell type:code id: tags:
``` python
import matplotlib.pyplot as plt
plt.plot(rx5day_values)
plt.grid()
plt.xlabel("Year")
plt.ylabel("Precipitation sum over 5 days [mm]")
plt.show()
```
%% Cell type:markdown id: tags:
## 2. Threshold exceedances
***
- frost days `fd`, ice days `id`, summer days `su`, tropical nights `tr`
- `r1mm`, `r10mm`, `r20mm`
***
%% Cell type:markdown id: tags:
Threshold-based CEIs are characterized as follows:
> Indices based on the count of days crossing certain **fixed thresholds** (for example, the 0°C threshold as used in the frost days index FD) can also be related to observed impacts, in particular if the thresholds refer to values of physical, hydrological or biological significance.
We can calculate `frost days` with yearly frequency as follows:
%% Cell type:code id: tags:
``` python
#frost days
# $cdo eca_fd,freq=year
fd_values = cdo.etccdi_fd(input=tasminHamburg,
output="fd_hamburg.nc",
returnCdf=True).variables[
"fdETCCDI"][:]
fd_values = fd_values.flatten()
```
%% Cell type:code id: tags:
``` python
import matplotlib.pyplot as plt
plt.plot(fd_values)
plt.grid()
plt.xlabel("Year")
plt.ylabel("Number of Frost days per year")
plt.show()
```
%% Cell type:markdown id: tags:
## 3. Create Percentiles
%% Cell type:markdown id: tags:
In contrast to fixed-threshold indices, some indices use grid-cell-specific percentiles for each calendar day as thresholds. Using only one value per calendar day results in a rather small sample size, which equals the length of the base period in years. In climate science, this base period is usually 20 or 30 years long. In order to construct percentiles that are more *meaningful*, we use a running window around the target day to enlarge the sample size. We will run `ydrunpctl`, which also requires `ydrunmin` and `ydrunmax`.
%% Cell type:markdown id: tags:
The running minimum and running maximum are needed as input files for *ydrunpctl* so that the operator can create **bins** for a histogram. However, CEIs are not based on **bin-sorted** histograms but use a **value-sorted** histogram. Such a histogram can only be created if there is enough memory to save all values. This is only the case if the environment parameter *CDO_PCTL_NBINS* is set to
```bash
nbins="$((windowsize*(end_year-start_year+1)*2+2))"
```
which is in our case
```bash
nbins="$((5*(1989-1970+1)*2+2))"
nbins=202
#and
export CDO_PCTL_NBINS=202
```
Whether this is possible depends on the system.
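%% Cell type:markdown id: tags:
As a sketch of how to do this from within the notebook: the sample size can be computed in Python and exported via `os.environ`, so that the CDO processes started by the python-cdo bindings inherit it (assuming the 5-day window and the 1970-1989 base period used here).
%% Cell type:code id: tags:
``` python
# Number of values per calendar day in the running-window sample:
# windowsize * number_of_years * 2 + 2
windowsize = 5
start_year, end_year = 1970, 1989
nbins = windowsize * (end_year - start_year + 1) * 2 + 2
print(nbins)  # 202
# os.environ is inherited by the cdo subprocesses called below
os.environ["CDO_PCTL_NBINS"] = str(nbins)
```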
%% Cell type:markdown id: tags:
Compared to older CDO versions, two new arguments have been introduced for the `ydrun*` operators to match the calculation methods defined by the ETCCDI standard:
- **rm** : The **read_method** can be set to `c` for "circular", which takes into account the last time steps at the beginning of the time period and vice versa. Otherwise, the first and last time steps are not used as often as the other time steps in the calculations.
- **pm** : Since many methods exist to calculate a percentile, CDO allows setting the ***percentileMethod*** in the operator call. ETCCDI recommends the method implemented in the software language R as type 8. This is currently the only option for the argument.
%% Cell type:markdown id: tags:
The resulting command line calls of these operators required for CEIs look like:
```bash
cdo ydrunmin,5,rm=c tasmin_hamburg.nc tasmin_runmin.nc
cdo ydrunmax,5,rm=c tasmin_hamburg.nc tasmin_runmax.nc
cdo ydrunpctl,5,rm=c,pm=r8 tasmin_runmin.nc tasmin_runmax.nc tn10thresh.nc
```
%% Cell type:code id: tags:
``` python
windowsize=5
readMethod="circular"
percentileMethod="rtype8"
```
%% Cell type:code id: tags:
``` python
# $cdo ydrunmin,5,rm=c $tasminMerged $tasminrunmin
cdo.ydrunmin(windowsize,"rm="+readMethod,
input=tasminHamburg,
output="tasmin_runmin.nc")
```
%% Cell type:code id: tags:
``` python
# $cdo ydrunmax,5,rm=c $tasminMerged $tasminrunmax
cdo.ydrunmax(windowsize,"rm="+readMethod,
input=tasminHamburg,
output="tasmin_runmax.nc")
```
%% Cell type:code id: tags:
``` python
#If this environment parameter is set,
#histograms use ordered values instead of bins
#(a "!export ..." line would not reach the cdo subprocesses)
os.environ["CDO_PCTL_NBINS"] = "202"
# $cdo subc,273.15 -ydrunpctl,10,5,pm=r8,rm=c $tasminMerged ${tasminrunmin} ${tasminrunmax} ${tn10thresh}
cdo.ydrunpctl(10,windowsize,"rm="+readMethod,"pm="+percentileMethod,
input=tasminHamburg+" tasmin_runmin.nc tasmin_runmax.nc",
output="tn10thresh.nc")
# $cdo subc,273.15 -ydrunpctl,90,5,pm=r8,rm=c $tasminMerged ${tasminrunmin} ${tasminrunmax} ${tn90thresh}
cdo.ydrunpctl(90,windowsize,"rm="+readMethod,"pm="+percentileMethod,
input=tasminHamburg+" tasmin_runmin.nc tasmin_runmax.nc",
output="tn90thresh.nc")
```
%% Cell type:markdown id: tags:
## 4. Duration indices
***
- cold spell duration index `csdi`, warm spell duration index `wsdi`
- consecutive dry days `cdd`, consecutive wet days `cwd`
- growing season lengths `gsl`
***
%% Cell type:markdown id: tags:
Duration indices are self-explanatory: they characterize periods of extremes. In comparison to the older `eca_` indices, the `etccdi_` indices implemented in `cdo` feature two major changes. They
1. count periods across overlapping years (or months, depending on the output frequency), and the final period gets the time stamp of the last contributing day.
2. allow the percentile threshold file to contain fewer time steps than the original time series file.
The corresponding command line call looks like
```bash
cdo etccdi_csdi,6,freq=year tasmin_hamburg.nc tn10thresh.nc csdi_hamburg.nc
```
%% Cell type:code id: tags:
``` python
#Cold spell duration index (cold wave index)
csdi="csdi_hamburg.nc"
csdiValues = cdo.etccdi_csdi(6,"freq=year",
input=tasminHamburg+" tn10thresh.nc",
output=csdi,
returnCdf=True).variables['csdiETCCDI'][:]
csdiValues = csdiValues.flatten()
```
%% Cell type:code id: tags:
``` python
plt.plot(csdiValues)
plt.show()
```
%% Cell type:code id: tags:
``` python
#Consecutive Wet Days
#Precipitation threshold [mm]:
pt=1
#Minimum number of consecutive days counted for the second output variable:
md=5
#Corresponding command line call:
#!cdo eca_cwd,1,5,freq=year pr_hamburg.nc cwd_hamburg.nc
```
%% Cell type:code id: tags:
``` python
cwd_values = cdo.etccdi_cwd(input=prHamburg,
output="cwdHamburg.nc",
returnCdf=True).variables[
"cwdETCCDI"][:]
cwd_values = cwd_values.flatten()
```
%% Cell type:code id: tags:
``` python
plt.hist(cwd_values,bins= [5.5,6.5,7.5,8.5,
9.5,10.5,11.5,12.5,13.5])
plt.grid()
plt.xlabel("Largest number of consecutive"
" wet days per year")
plt.ylabel("Frequency")
plt.show()
```
%% Cell type:markdown id: tags:
## 5. Percentile based indices
***
tx10p tx90p tn10p tn90p r95p r99p
***
%% Cell type:markdown id: tags:
The reason for choosing mostly percentile thresholds rather than fixed thresholds is that
> "the number of days exceeding percentile thresholds is more evenly distributed in space and is meaningful in every region."
The temperature-related percentile-based indices (*tx90p, tx10p, tn90p, tn10p*) require a special percentile calculation method for years that lie *inside the base period*. For those years, **bootstrapping** must be applied, where the base period is modified: the index year is taken out of the base period and replaced by another year. Then, the percentile as well as the index are calculated for the modified base period. This is repeated with every other year of the base period (29 times for a 30-year base period), so that each year of the base period is counted twice. In the end, the mean of the resulting indices is taken.<br>
Therefore, the operators need input arguments:
1. The window size (5)
2. The start year of the bootstrapping interval (1970)
3. The end year of the bootstrapping interval (1989)
4. The output frequency
The corresponding command line call looks like:
```bash
cdo etccdi_tn10p,5,1970,1989,freq=year tasmin_hamburg.nc tasmin_runmin.nc tasmin_runmax.nc tn10p_hamburg.nc
```
%% Cell type:code id: tags:
``` python
!rm -f tn10p_hamburg.nc
#Set via os.environ so that the cdo call below sees it
os.environ["CDO_PCTL_NBINS"] = "302"
tn10p_values = cdo.etccdi_tn10p(5,1970,1989,"freq=year",
input=tasminHamburg+" tasmin_runmin.nc tasmin_runmax.nc",
output="tn10p_hamburg.nc",
returnCdf=True).variables["tn10pETCCDI"][:]
tn10p_values = tn10p_values.flatten()
```
%% Cell type:code id: tags:
``` python
import matplotlib.pyplot as plt
plt.plot(tn10p_values)
plt.grid()
plt.xlabel("Year")
plt.ylabel("Number of days with tmin < tmin90")
plt.show()
```
%% Cell type:markdown id: tags:
### Used data
- https://doi.org/10.22033/ESGF/CMIP6.6595
%% Cell type:markdown id: tags:
We acknowledge the CMIP community for providing the climate model data, retained and globally distributed in the framework of the ESGF. The CMIP data of this study were replicated and made available for this study by the DKRZ.
%% Cell type:code id: tags:
``` python
```
%% Cell type:markdown id: tags:
# Prepare modeling data for GIS analyses
The most common way to work with climate modeling data sets is storing them in NetCDF or GRIB format and processing them with Xarray. Raster data sets used in GIS applications are usually stored in raster formats such as GeoTIFF, BIL, or IMG and are processed mainly with the help of GDAL-based Python libraries. While modeling data sets are normally georeferenced using grids explicitly described by coordinates and conforming to, e.g., the CF standard, the geographic reference of GIS data sets is described by a coordinate reference system (CRS). These differences imply some steps that need to be undertaken in order to convert data sets from one format to another.
In this notebook we create an aggregated map of pre-industrial Vertically Averaged Sea Water Potential Temperature (thetaot) from CMIP6 and save it as GeoTIFF.
First, we import all necessary libraries.
%% Cell type:code id: tags:
``` python
%matplotlib inline
import intake, requests, aiohttp
import matplotlib.pyplot as plt
import cartopy.crs as ccrs
import numpy as np
import xarray as xr
import cf_xarray as cfxr
import xesmf as xe
import scipy.sparse as sps
# Rioxarray is an experimental interface between Xarray and Rasterio libraries
# made for loading GDAL-readable rasters as xarray datasets and saving xarrays as raster files.
import rioxarray
#print("Using roocs/clisops in version %s" % cl.__version__)
print("Using xESMF in version %s" % xe.__version__)
xr.set_options(display_style = "html");
import warnings
warnings.simplefilter("ignore")
```
%% Cell type:markdown id: tags:
## 1. Find and load the data set
Then we find the data set we need in the DKRZ open catalogue and print its metadata.
%% Cell type:code id: tags:
``` python
# Path to master catalog on the DKRZ server
#dkrz_catalog=intake.open_catalog(["https://dkrz.de/s/intake"])
#
#only for the web page we need to take the original link:
dkrz_catalog=intake.open_catalog(["https://gitlab.dkrz.de/data-infrastructure-services/intake-esm/-/raw/master/esm-collections/cloud-access/dkrz_catalog.yaml"])
# Print DKRZ open catalogues
list(dkrz_catalog)
```
%% Cell type:code id: tags:
``` python
# Open the catalog with the intake package and name it "col" as short for "collection"
cols=dkrz_catalog._entries["dkrz_cmip6_disk"]._open_args["read_csv_kwargs"]["usecols"]+["opendap_url"]
col=dkrz_catalog.dkrz_cmip6_disk(read_csv_kwargs=dict(usecols=cols))
cat = col.search(variable_id = "thetaot",
experiment_id = "historical",
time_range = "185001-194912",
member_id = "r1i1p1f3"
)
cat.df
```
%% Cell type:code id: tags:
``` python
dset = xr.open_dataset(cat.df.uri.values[0], chunks = {"time": 6})
dset["thetaot"]
```
%% Cell type:markdown id: tags:
As we can see, coordinates are stored as arrays of grid cell indices, which is not a common format in the GIS community.
## 2. Calculate pre-industrial mean
We make a subset limiting the data set to pre-industrial era, calculate the mean values per grid cell, and plot the result.
%% Cell type:code id: tags:
``` python
dset["time"] = dset["time"].dt.strftime("%Y%m%d")
pre_industrial = dset["thetaot"].sel(time = slice("1850", "1880"))
pre_industrial_mean = pre_industrial.mean(dim = "time", keep_attrs = True)
pre_industrial_mean.plot()
```
%% Cell type:markdown id: tags:
If we plot the coastlines from cartopy over the dataset, we will see that the coordinate systems of these datasets are... a bit different.
%% Cell type:code id: tags:
``` python
fig = plt.figure(1, figsize = [30, 13])
# Set the projection to use for plotting
ax = plt.subplot(1, 1, 1, projection=ccrs.PlateCarree())
ax.coastlines()
# Pass ax as an argument when plotting. Here we assume the data is in the same coordinate reference system as the projection chosen for plotting
# isel allows to select by indices instead of the time values
pre_industrial_mean.plot.pcolormesh(ax = ax, cmap = "coolwarm")
```
%% Cell type:markdown id: tags:
If we visualize our data set's grid, it becomes obvious that it is far from a standard grid.
%% Cell type:code id: tags:
``` python
import textwrap
def plot_curv_grid(ds, var = "tos"):
lat = cfxr.accessor._get_with_standard_name(ds, "latitude")[0]
lon = cfxr.accessor._get_with_standard_name(ds, "longitude")[0]
if any([i == None for i in [lat, lon]]):
print(ds.attrs["source_id"], ": Cannot identify latitude/longitude.")
return
plt.figure(figsize = (16, 9), dpi=120)
plt.scatter(x = ds[lon], y = ds[lat], s = 0.01)
# x, y = np.meshgrid(ds[lon], ds[lat])
# plt.scatter(x = x, y = y, s = 0.01)
try:
plt.title("\n".join(textwrap.wrap(ds.attrs["source_id"] + "(" + ds.attrs["source"].split("ocean:")[-1].split("\n")[0] + ")", 120)))
except (KeyError, IndexError):
plt.title(ds.attrs["source_id"])
plot_curv_grid(dset)
```
%% Cell type:markdown id: tags:
## 3. Reformat the grid and regrid the data set
As we mentioned before, the coordinates need to be re-defined.
%% Cell type:code id: tags:
``` python
# Later one can just install and use clisops, once functions like this are merged into the master branch
def grid_reformat(ds):
# Define lat, lon, lat_bnds, lon_bnds
lat = ds.lat[:, 0]
lon = ds.lon[0, :]
lat_bnds = np.zeros((lat.shape[0], 2), dtype = "double")
lon_bnds = np.zeros((lon.shape[0], 2), dtype = "double")
# From (N + 1, M + 1) shaped bounds to (N, M, 4) shaped vertices
lat_vertices = cfxr.vertices_to_bounds(
ds.lat_b, ("bnds", "lat", "lon")
).values
lon_vertices = cfxr.vertices_to_bounds(
ds.lon_b, ("bnds", "lat", "lon")
).values
lat_vertices = np.moveaxis(lat_vertices, 0, -1)
lon_vertices = np.moveaxis(lon_vertices, 0, -1)
# From (N, M, 4) shaped vertices to (N, 2) and (M, 2) shaped bounds
lat_bnds[:, 0] = np.min(lat_vertices[:, 0, :], axis = 1)
lat_bnds[:, 1] = np.max(lat_vertices[:, 0, :], axis = 1)
lon_bnds[:, 0] = np.min(lon_vertices[0, :, :], axis = 1)
lon_bnds[:, 1] = np.max(lon_vertices[0, :, :], axis = 1)
# Create dataset
ds_ref = xr.Dataset(
coords = {
"lat": (["lat"], lat.data),
"lon": (["lon"], lon.data),
"lat_bnds": (["lat", "bnds"], lat_bnds.data),
"lon_bnds": (["lon", "bnds"], lon_bnds.data),
},
)
# Add variable attributes to the coordinate variables
ds_ref["lat"].attrs = {
"bounds": "lat_bnds",
"units": "degrees_north",
"long_name": "latitude",
"standard_name": "latitude",
"axis": "Y",
}
ds_ref["lon"].attrs = {
"bounds": "lon_bnds",
"units": "degrees_east",
"long_name": "longitude",
"standard_name": "longitude",
"axis": "X",
}
ds_ref["lat_bnds"].attrs = {
"long_name": "latitude_bounds",
"units": "degrees_north",
}
ds_ref["lon_bnds"].attrs = {
"long_name": "longitude_bounds",
"units": "degrees_east",
}
return ds_ref
```
%% Cell type:code id: tags:
``` python
# Specify a global 1deg target grid
# With clisops
#ds_out = clore.Grid(grid_instructor=(1., )).ds
# Without clisops
ds_out = grid_reformat(xe.util.grid_global(1., 1.))
ds_out
```
%% Cell type:markdown id: tags:
And now we regrid the data set to a normal 1-degree global grid.
%% Cell type:code id: tags:
``` python
# In case of problems, activate ESMF verbose mode
import ESMF
ESMF.Manager(debug = True)
# Regridding methods
#method_list = ["bilinear", "nearest_s2d", "patch", "conservative", "conservative_normed"]
method_list = ["bilinear", "nearest_s2d", "patch"]
# Function to generate the weights
# If grids have problems of degenerated cells near the poles there is the ignore_degenerate option
def regrid(ds_in, ds_out, method, periodic, unmapped_to_nan = True, ignore_degenerate = None):
"""Convenience function for calculating regridding weights"""
ds_in["lat"] = ds_in["latitude"]
ds_in["lon"] = ds_in["longitude"]
return xe.Regridder(ds_in, ds_out, method, periodic = periodic,
ignore_degenerate = ignore_degenerate)
```
%% Cell type:code id: tags:
``` python
%time regridder = regrid(pre_industrial_mean, ds_out, "bilinear", periodic = True, unmapped_to_nan = True, ignore_degenerate = None)
```
%% Cell type:code id: tags:
``` python
regridder
```
%% Cell type:code id: tags:
``` python
regridded_ds = regridder(pre_industrial_mean, keep_attrs = True)
regridded_ds
```
%% Cell type:code id: tags:
``` python
fig = plt.figure(1, figsize = [30, 13])
# Set the projection to use for plotting
ax = plt.subplot(1, 1, 1, projection = ccrs.PlateCarree())
ax.coastlines()
# Pass ax as an argument when plotting. Here we assume the data is in the same coordinate reference system as the projection chosen for plotting
# isel allows to select by indices instead of the time values
regridded_ds.plot.pcolormesh(ax = ax, cmap = "coolwarm")
```
%% Cell type:markdown id: tags:
As we can see, the data has a correct CRS now.
## 4. Write a CRS and save the data set as a GeoTIFF
It is pretty straightforward to write a CRS to an xarray object with rioxarray. Here we use the World Geodetic System 1984 CRS, which is quite common and used by default, for example, in QGIS.
%% Cell type:code id: tags:
``` python
#You might need to rename the axes before
transformed = regridded_ds.rename({"lon":"x", "lat":"y"})
transformed = transformed.rio.write_crs("EPSG:4326")
```
%% Cell type:markdown id: tags:
If we want to re-project the data set to, e.g., Web Mercator, which is used by default in Google Maps and ArcGIS, we do it as follows:
%% Cell type:code id: tags:
``` python
transformed = transformed.rio.reproject("EPSG:3857")
```
%% Cell type:markdown id: tags:
And finally, saving our dataset as a GeoTIFF is as easy as this:
%% Cell type:code id: tags:
``` python
transformed.rio.to_raster("regridded_3857.tif")
```
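%% Cell type:markdown id: tags:
As a quick check that the export worked, the GeoTIFF can be opened again with `rioxarray`; this is just a sketch and reuses the file name written above.
%% Cell type:code id: tags:
``` python
# Re-open the raster and inspect its CRS and shape
check = rioxarray.open_rasterio("regridded_3857.tif")
print(check.rio.crs)   # expected: EPSG:3857
print(check.shape)
```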
%% Cell type:markdown id: tags:
# Advanced quantile calculation for the MPI-ESM1-2-LR CMIP6 ensemble with **xarray**, **pandas** and **hvplot**
We will show how to calculate statistics of the MPI-ESM1-2-LR ensemble published within the Coupled Model Intercomparison Project [CMIP6](https://pcmdi.llnl.gov/CMIP6/). Quantiles of a single-model ensemble allow us to estimate the simulated climate variability. This notebook calculates 5 quantiles for a specified region on the globe and compares them with the global quantiles. Interactive plots will be used for data visualization for easy interpretation.
We will choose one variable of multiple experiments and compare the results of different models. In particular, we analyse the historical experiment in combination with one of the shared socioeconomic pathway (ssp) experiments.
This Jupyter notebook is meant to run in the [Jupyterhub](https://jupyterhub.dkrz.de/hub/login?next=%2Fhub%2Fhome) server of the German Climate Computing Center [DKRZ](https://www.dkrz.de/). The DKRZ hosts the CMIP data pool including 4 petabytes of CMIP6 data. Please, choose the Python 3 unstable kernel on the Kernel tab above, it contains all the common geoscience packages. See more information on how to run Jupyter notebooks at DKRZ [here](https://www.dkrz.de/up/systems/mistral/programming/jupyter-notebook).
Running this Jupyter notebook in your premise, which is also known as [client-side](https://en.wikipedia.org/wiki/Client-side) computing, will require that you install the necessary packages and download data.
%% Cell type:markdown id: tags:
### Learning Objectives
- How to access a dataset from the DKRZ CMIP data pool with `intake-esm`
- How to use `xarray` and `pandas` for data analysis
- How to visualize the results with `hvplot`
%% Cell type:markdown id: tags:
### Requirements
- About 10GB memory
%% Cell type:code id: tags:
``` python
import intake
import pandas as pd
import hvplot.pandas
import hvplot.xarray
import numpy as np
import cartopy.crs as ccrs
```
%% Cell type:markdown id: tags:
### Parameters
Along with the *variable_id*, we can specify
- the number of ensemble members `ens_size` that we want to take into account of our analysis
- a latitude and longitude combination for our region for comparing with global climate variability
%% Cell type:code id: tags:
``` python
# Choose one of
# pr, psl, tas, tasmax, tasmin, clt
variable_id = "tas"
# 1-30
ens_size=29
# lat and lon in deg
regions=dict(
user_region=dict(
latitudes=[45,55],
longitudes=[0,10]
)
)
%store -r
```
%% Cell type:code id: tags:
``` python
regions["global"]=dict(
latitudes=[-90,90],
longitudes=[0,360]
)
```
%% Cell type:code id: tags:
``` python
# get formating done automatically according to style `black`
#%load_ext lab_black
```
%% Cell type:markdown id: tags:
### Initialization
The `intake-esm` software reads *Catalogs* which we use to **find, access and load** the data we are interested in. Daily updated CMIP6 catalogs are provided in DKRZ's cloud [swift](https://swiftbrowser.dkrz.de/public/dkrz_a44962e3ba914c309a7421573a6949a6/intake-esm/).
%% Cell type:code id: tags:
``` python
# Path to master catalog on the DKRZ server
#dkrz_catalog=intake.open_catalog(["https://dkrz.de/s/intake"])
#
#only for the web page we need to take the original link:
parent_col=intake.open_catalog(["https://gitlab.dkrz.de/data-infrastructure-services/intake-esm/-/raw/master/esm-collections/cloud-access/dkrz_catalog.yaml"])
list(parent_col)
```
%% Cell type:code id: tags:
``` python
col=parent_col["dkrz_cmip6_disk"]
```
%% Cell type:markdown id: tags:
We are searching for the ensemble of the ESM *MPI-ESM1-2-LR* for the time span 1850 to 2100. We use the *historical* and *ssp585* experiments for the analysis. For now, we are averaging variables from the table Amon which contains atmospheric (**A**mon) submodel output in monthly (A**mon**) frequency.
%% Cell type:code id: tags:
``` python
query = dict(
variable_id=variable_id,
table_id="Amon",
experiment_id=["historical", "ssp585"], # we have excluded "ssp245" from the list because it would take 15min to finish the nb
source_id=["MPI-ESM1-2-LR"]#, "AWI-CM-1-1-MR"],
)
col.df["uri"]=col.df["uri"].str.replace("lustre/","lustre02/")
cat = col.search(**query)
```
%% Cell type:markdown id: tags:
We open our selection and hold them as `xarray` datasets:
%% Cell type:code id: tags:
``` python
print("Opening the results as xarray dsets...")
xr_dsets=cat.to_dataset_dict(progressbar=True)
```
%% Cell type:markdown id: tags:
### Post-processing
The main post-processing step will be done in the `spatial_weighted_yearly_mean` function. It contains:
1. Select a region from the global model data
1. Calculate a weighted spatial mean. The weights are only valid for a regular shaped lon-lat grid.
1. Average over years
This will be done for the provided *data set, variable_id, lats and lons*
%% Cell type:code id: tags:
``` python
def spatial_weighted_yearly_mean(dset, variable_id, lats, lons):
var = dset.get(variable_id)
# Var zoomed:
var = var.sel(lat=slice(lats[0],lats[1]),
lon=slice(lons[0],lons[1]))
# Get weights
weights = np.cos(np.deg2rad(dset.lat))
# Var weighted
varwg = var.weighted(weights)
# Var global mean:
vargm = varwg.mean(("lon", "lat"))
# Var yearly mean:
vargmym = vargm.groupby("time.year").mean("time")
return vargmym.values
```
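%% Cell type:markdown id: tags:
A short usage example of the function above (a sketch, assuming the datasets opened via intake and the user region from the parameter cell; this triggers the computation for one experiment and may take a moment):
%% Cell type:code id: tags:
``` python
# Spatially weighted yearly means for the first experiment and the user region
example_key = list(xr_dsets.keys())[0]
example_means = spatial_weighted_yearly_mean(
    xr_dsets[example_key], variable_id,
    regions["user_region"]["latitudes"],
    regions["user_region"]["longitudes"])
example_means[:5]
```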
%% Cell type:markdown id: tags:
#### Calculate reference climate for anomalies
As a reference period, we define the 1971 to 2000 time interval. This is only available in the *historical* experiment.
We calculate a 30-yr mean of *spatial weighted yearly mean* arrays of the target variable.
%% Cell type:code id: tags:
``` python
def calculate_historical_gmym(xr_dsets, variable_id, lats, lons):
historical = [key for key in xr_dsets.keys() if "historical" in key][0]
dset_hist = xr_dsets[historical]
dset_hist_thirty = dset_hist.sel(time=dset_hist.time.dt.year.isin(range(1971, 2001)))  # range end is exclusive, so 2001 covers 1971-2000
# 10member
hist_gmym = spatial_weighted_yearly_mean(dset_hist_thirty, variable_id, lats, lons)
return hist_gmym.mean()
```
%% Cell type:markdown id: tags:
### Arrange anomalies results in a DataFrame
For an area plot, `hvplot` works best on `DataFrame`s - the `xarray` extension of `hvplot` converts the data array into a data frame for an area plot anyway, but that takes much longer. We therefore create a table with
- *years* from 1850 to 2100 as index
- *ensembles* as columns
This table is filled with the anomaly of the *spatial weighted yearly means* of our target variable in comparison with the 30yr historical reference period 1971-2000.
%% Cell type:code id: tags:
``` python
def calculate_vargmym(xr_dsets, hist_gmymref, variable_id, lats, lons):
years = [i for i in range(1850,2101)]
vargmympd = pd.DataFrame(index=years, columns=["r"+str(i) for i in range(ens_size)], dtype=float)
vargmympd.index.name = "Year"
for key in xr_dsets:
datatoappend = spatial_weighted_yearly_mean(xr_dsets[key], variable_id, lats,lons) - hist_gmymref
insert_years = list(xr_dsets[key].get(variable_id).groupby("time.year").groups.keys())
for i in range(ens_size):
vargmympd.loc[insert_years, "r"+str(i)] = datatoappend[i]
return vargmympd
```
%% Cell type:markdown id: tags:
### Plotting
#### Areas between quantiles and line for median
The final plot should contain **two areas** and **one line**.
- One area spans the interval between the 1st and the 99th percentile, i.e. it contains 98% of all values of the ensemble
- One area spans the interval between the 25th and the 75th percentile, i.e. it contains 50% of all values of the ensemble
- The line corresponds to the median, i.e. it shows the 50th percentile
With `hvplot`, we can generate three individual plots and combine them with `*`:
%% Cell type:code id: tags:
``` python
def plot_quantiles_area(quantiles,ylabel):
ninetyeight=quantiles.hvplot.area(grid=True,
y="0.01",
y2="0.99",
width=820,
color="aliceblue",
label="98% of all values")
fifty=quantiles.hvplot.area(grid=True,
y="0.25",
y2="0.75",
width=820,
color="aqua",
label="50% of all values")
median=quantiles.hvplot.line(grid=True,
y="0.5",
color="cornflowerblue",
label="median")
comb=(ninetyeight*fifty*median).opts(ylabel=ylabel, legend_position='bottom_right')
return comb
```
%% Cell type:markdown id: tags:
#### Axis label for plotting
We can get metadata for the chosen variable from the files so that the script can generate an axis label automatically. The metadata is stored in a `dict`-like `.attrs` variable of the dataset. The function `get_label` returns the fitting label:
%% Cell type:code id: tags:
``` python
def get_label(xr_dset, variable_id):
lname = xr_dset.get(variable_id).attrs["long_name"]
units = xr_dset.get(variable_id).attrs["units"]
return "Delta " + lname + "[" + units + "]"
```
%% Cell type:markdown id: tags:
### Running the workflow
For both regions, the user-specified one and the global one, we run
1. Calculation of a 30-yr mean historical reference for anomaly calculation
1. Calculation of the anomalies for each year and realization spatially weighted and averaged over the region
1. Calculation of the quantiles
1. Creation of the plots
%% Cell type:code id: tags:
``` python
plots=[]
quantiles=[]
for k,region in regions.items():
lats=region["latitudes"]
lons=region["longitudes"]
ylabel=get_label(xr_dsets[list(xr_dsets.keys())[0]], variable_id)
label="lats: "+str(lats[0])+"-"+str(lats[1])
print("1. Calculate historical reference...")
hist_gmymref = calculate_historical_gmym(xr_dsets, variable_id, lats, lons)
print("2. Calculate spatial and yearly mean of variable "+variable_id+" ...")
var_gmym= calculate_vargmym(xr_dsets, hist_gmymref, variable_id, lats, lons)
print("3. Calculate quantiles ...")
quantile=var_gmym.transpose().quantile([0.01, 0.25, 0.5, 0.75, 0.99]).transpose()
quantiles.append(quantile)
print("4. Creating plots and diagnostic ...")
plots.append(plot_quantiles_area(quantile, ylabel))
plots.append((quantile[0.99]-quantile[0.01]).hvplot.line(
color="red",
grid=True,
ylabel=ylabel,
label="0.99-0.01 Quantile, "+label
))
```
%% Cell type:markdown id: tags:
### Displaying the result
We combine all plots into a single column. The first two plots show the results for the region specified by the user, the second two the results for the global mean.
%% Cell type:code id: tags:
``` python
(plots[0]+plots[1]+plots[2]+plots[3]).cols(1)
```
%% Cell type:markdown id: tags:
The variability of the *regional* average of the yearly mean temperature is about 2-3 °C for mid-latitudes close to Europe, while there is less variability in the global mean. 28 out of 30 members of MPI-ESM1-2-LR indicate that, in a *ssp585 world*,
- Europeans will not experience another colder-than-usual year with a yearly mean temperature below the historical mean **from about 2030 onwards**.
- European climate shifts to a new state, which no longer produces yearly mean temperatures in the range we are used to, **from about 2068**.
- Europeans will not experience another year with a yearly mean temperature less than 2°C above the historical mean **from 2075 onwards**.
The variability of the *global* average of the yearly mean temperature stays within a narrow band of about 0.5 °C width:
- there is no clear change of variability during the climate state transition in the *ssp585* scenario
- a 2°C warmer world is reached in **year 2066**
%% Cell type:markdown id: tags:
### Add-on
Calculating quantiles with `xarray` is also easy. The results can be visualized easily with hvplot.
In the following, we show you how to do that with the historical dataset. We calculate the 98% interval between the 99th and 1st quantile and plot a *quadmesh* plot.
%% Cell type:code id: tags:
``` python
historical_ds=xr_dsets[list(xr_dsets.keys())[0]]
historical_ds=historical_ds.chunk(dict(member_id=-1)).groupby("time.year").mean("time") #.isel(time=slice(0,120)).mean(dim="time")
xr_quantiles=historical_ds.quantile(q=[0.01,0.99],dim="member_id")
xr_quantiles
```
%% Cell type:code id: tags:
``` python
qs=xr_quantiles.sel(quantile=0.99)-xr_quantiles.sel(quantile=0.01).squeeze()
qs
```
%% Cell type:code id: tags:
``` python
#qs.tas.hvplot.contourf(levels=20, geo=True, coastline=True, width=600)
qs.tas.hvplot.quadmesh(width=600)
#qs.hvplot.quadmesh(
# 'lon', 'lat', 'tas', projection=ccrs.Orthographic(-90, 30),
# global_extent=True, frame_height=540, cmap='viridis'#, coastline=True
#)
#hvplot.save(testplot, "testplot.html")
```
%% Cell type:markdown id: tags:
Variability of the yearly mean temperature varies with location. The yearly mean temperature can be simulated more precisely, i.e. it varies less,
- above oceans than above land
- in tropical than in polar regions
%% Cell type:markdown id: tags:
In a next step, these results should be validated. How well simulated is the climate variability?
- compare it with reanalysis statistics
%% Cell type:markdown id: tags:
### Used data
- https://doi.org/10.22033/ESGF/CMIP6.6595
- https://doi.org/10.22033/ESGF/CMIP6.6705
%% Cell type:markdown id: tags:
We acknowledge the CMIP community for providing the climate model data, retained and globally distributed in the framework of the ESGF. The CMIP data of this study were replicated and made available for this study by the DKRZ.
%% Cell type:code id: tags:
``` python
```
%% Cell type:markdown id: tags:
# Plot ESM data on *unstructured* grids with `psyplot`
This notebook introduces you to the `mapplot` function of the package `psyplot` and its plugin `psy_maps`.
It is suitable for plotting maps from data on unstructured grids like those from ICON and FESOM.
We therefore search for the corresponding data in the CMIP6 data pool with intake-esm.
Afterwards, we open a file with `xarray` and configure the opened xarray dataset as well as psyplot for a map plot.
This Jupyter notebook is meant to run in the [Jupyterhub](https://jupyterhub.dkrz.de/hub/login?next=%2Fhub%2Fhome) server of the German Climate Computing Center [DKRZ](https://www.dkrz.de/). The DKRZ hosts the CMIP data pool including 4 petabytes of CMIP6 data. Please, choose the Python 3 unstable kernel on the Kernel tab above, it contains all the common geoscience packages. See more information on how to run Jupyter notebooks at DKRZ [here](https://www.dkrz.de/up/systems/mistral/programming/jupyter-notebook).
Running this Jupyter notebook in your premise, which is also known as [client-side](https://en.wikipedia.org/wiki/Client-side) computing, will require that you install the necessary packages
`intake`, `xarray`, `matplotlib`, `psyplot`, `psy_maps`
and either download the data or use the `opendap_url` column of the intake catalog if available.
%% Cell type:markdown id: tags:
### Learning Objectives
- How to access data on an *unstructured* grid from the DKRZ CMIP data pool with `intake-esm`
- How to subset data with `xarray`
- How to visualize the results with `matplotlib`, `psyplot` and `psy_maps`.
%% Cell type:code id: tags:
``` python
import psyplot.project as psy
import matplotlib as mpl
import xarray as xr
import intake
```
%% Cell type:markdown id: tags:
We open a catalog from the DKRZ swift cloud storage, which is accessible without authentication.
%% Cell type:code id: tags:
``` python
# Path to master catalog on the DKRZ server
#dkrz_catalog=intake.open_catalog(["https://dkrz.de/s/intake"])
#
#only for the web page we need to take the original link:
parent_col=intake.open_catalog(["https://gitlab.dkrz.de/data-infrastructure-services/intake-esm/-/raw/master/esm-collections/cloud-access/dkrz_catalog.yaml"])
list(parent_col)
```
%% Cell type:code id: tags:
``` python
col=parent_col["dkrz_cmip6_disk"]
col.df["uri"]=col.df["uri"].str.replace("lustre/","lustre02/")
```
%% Cell type:markdown id: tags:
In this example, we aim at plotting the **Sea Surface Temperature** `tos` of the upper boundary of the liquid ocean, including temperatures below sea-ice and floating ice shelves from the earth system model **AWI-CM-1-1-MR**.
We therefore search for `tos` in the catalog for monthly frequency. We only use one realization `r1i1p1f1` of one experiment only.
%% Cell type:code id: tags:
``` python
tos=col.search(source_id="AWI-CM-1-1-MR",
experiment_id="ssp370",
variable_id="tos",
table_id="Omon",
member_id="r1i1p1f1")
```
%% Cell type:code id: tags:
``` python
tos.df["uri"].to_list()[0]
```
%% Cell type:markdown id: tags:
We now open the file on the mistral lustre file system. Note that if you work remotely, you could try to use `opendap_url` instead of `path`.
%% Cell type:code id: tags:
``` python
dset = xr.open_dataset(tos.df["uri"].to_list()[0],
decode_cf=True,
chunks={"time":1},
lock=False)
dset
```
%% Cell type:markdown id: tags:
In order to make `tos` plottable, we set the following configuration.
- The `CDI_grid_type` is a keyword for `psyplot`. It must match the *grid type* of the source model.
- Coordinates are not fully recognized by `xarray` so that we have to add some manually (version from Dec 2020).
%% Cell type:code id: tags:
``` python
dset["tos"]["CDI_grid_type"]="unstructured"
coordlist=["vertices_latitude", "vertices_longitude", "lat_bnds", "lon_bnds"]
dset=dset.set_coords([coord for coord in dset.data_vars if coord in coordlist])
```
%% Cell type:markdown id: tags:
The following is based on the [psyplot example](https://psyplot.readthedocs.io/projects/psy-maps/en/latest/examples/example_ugrid.html#gallery-examples-example-ugrid-ipynb). We set a resolution for the land sea mask `lsm` and a color map via `cmap`.
%% Cell type:code id: tags:
``` python
psy.rcParams['plotter.maps.xgrid'] = False
psy.rcParams['plotter.maps.ygrid'] = False
mpl.rcParams['figure.figsize'] = [10, 8.]
```
%% Cell type:code id: tags:
``` python
def plot_unstructured():
iconplot11=psy.plot.mapplot(
dset, name="tos", cmap='rainbow',
clabel=dset["tos"].description,
stock_img=True, lsm='50m')
```
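%% Cell type:markdown id: tags:
The function above only defines the plot; calling it draws the global map (a small addition for completeness, reusing the settings from the cell above).
%% Cell type:code id: tags:
``` python
# Draw the global sea surface temperature map on the unstructured FESOM grid
plot_unstructured()
```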
%% Cell type:markdown id: tags:
We now do the same with a smaller subset to highlight the fine resolution and the structure of the AWI ocean model FESOM.
We first *subset* the data because otherwise plotting takes too long. We choose indices of dimensions with the `xarray` function `isel`. We select a single time step and focus on a region around Ireland. We have to save the data to an intermediate file `test.nc` because otherwise we receive an error.
%% Cell type:code id: tags:
``` python
dset["lat"]=dset.lat.load()
dset["lon"]=dset.lon.load()
dset2 = dset.isel(time=slice(1,2)).where( (dset.lon > -10. ) &
(dset.lon < 50. ) &
(dset.lat > 40. ) &
(dset.lat < 70. ), drop=True).drop("time_bnds")
```
%% Cell type:code id: tags:
``` python
dset2
```
%% Cell type:code id: tags:
``` python
dset2.to_netcdf("test.nc")
```
%% Cell type:code id: tags:
``` python
dset2.close()
```
%% Cell type:code id: tags:
``` python
dset=xr.open_dataset("test.nc",
decode_cf=True)
```
%% Cell type:code id: tags:
``` python
dset["tos"]["CDI_grid_type"]="unstructured"
coordlist=["vertices_latitude", "vertices_longitude", "lat_bnds", "lon_bnds"]
dset=dset.set_coords([coord for coord in dset.data_vars if coord in coordlist])
```
%% Cell type:code id: tags:
``` python
psy.plot.mapplot(
dset, name="tos", cmap='rainbow',
lonlatbox='Ireland',
clabel=dset["tos"].description,
stock_img=True,
lsm='50m',
datagrid=dict(c='b', lw=0.2)).show()
```
%% Cell type:code id: tags:
``` python
dset.close()
```
%% Cell type:markdown id: tags:
### Used data
- [Semmler et al., 2019: AWI AWI-CM1.1MR model output prepared for CMIP6](https://doi.org/10.22033/ESGF/CMIP6.2803)
%% Cell type:markdown id: tags:
We acknowledge the CMIP community for providing the climate model data, retained and globally distributed in the framework of the ESGF. The CMIP data of this study were replicated and made available for this study by the DKRZ.