Commit 7f6141a4 authored by Fabian Wachsmann

Updated tuts

parent d4e67769
Pipeline #19326 passed in 18 minutes and 36 seconds
%% Cell type:markdown id: tags:
# Intake I - find, browse and access `intake-esm` collections
%% Cell type:markdown id: tags:
```{admonition} Overview
:class: dropdown
![Level](https://img.shields.io/badge/Level-Introductory-green.svg)
🎯 **objectives**: Learn how to use `intake` to find, browse and access `intake-esm` ESM-collections
⌛ **time_estimation**: "30min"
☑️ **requirements**: `intake_esm.__version__ >= 2021.8.17`, at least 10GB memory.
© **contributors**: k204210
⚖ **license**:
```
%% Cell type:markdown id: tags:
```{admonition} Agenda
:class: tip
In this part, you learn
1. [Motivation of intake-esm](#motivation)
1. [Features of intake and intake-esm](#features)
1. [Browse through catalogs](#browse)
1. [Data access via intake-esm](#dataaccess)
```
%% Cell type:markdown id: tags:
<a class="anchor" id="motivation"></a>
We follow here the guidance presented by `intake-esm` on its [repository](https://intake-esm.readthedocs.io/en/latest/user-guide/cmip6-tutorial.html).
## Motivation of intake-esm
> Simulations of the Earth’s climate and weather generate huge amounts of data. These data are often persisted on different storages in a variety of formats (netCDF, zarr, etc...). Finding, investigating, loading these data assets into compute-ready data containers costs time and effort. The data user needs to know what data sets are available, the attributes describing each data set, before loading a specific data set and analyzing it.
> `Intake-esm` addresses these issues by providing necessary functionality for **searching, discovering, data access and data loading**.
%% Cell type:markdown id: tags:
For intake users, many data preparation tasks **are no longer necessary**. They do not need to know:
- 🌍 where data is saved
- 🪧 how data is saved
- 📤 how data should be loaded

but can still search, discover, access and load a project's data.
%% Cell type:markdown id: tags:
<a class="anchor" id="features"></a>
## Features of intake and intake-esm
Intake is a generic **cataloging system** for listing data sources. As a plugin, `intake-esm` is built on top of `intake`, `pandas`, and `xarray` and configures `intake` such that it is also able to **load and process** ESM data.
- display catalogs as clearly structured tables 📄 inside jupyter notebooks for easy investigation
- browse 🔍 through the catalog and select your data without
    - being next to the data (e.g. logged in on dkrz's luv)
    - knowing the project's data reference syntax, i.e. the storage tree hierarchy and the path and file name templates
- open climate data in an analysis-ready dictionary of `xarray` datasets 🎁
%% Cell type:markdown id: tags:
All required information for searching, accessing and loading the catalog's data is configured within the catalogs:
- 🌍 where data is saved
    * users can browse data without knowing the data storage platform, including e.g. the root path of the project and the directory syntax
    * data of different platforms (cloud or disk) can be combined in one catalog
    * in the mid term, intake catalogs can be **a single point of access**
- 🪧 how data is saved
    * users can work with an *xarray* dataset representation of the data no matter whether it is saved in **GRIB, netCDF or zarr** format.
    * catalogs can contain more information, and therefore more search facets, than is obvious from the names and paths of the data.
- 📤 how data should be loaded
    * users work with an **aggregated** *xarray* dataset representation which merges files/assets in a way that fits the project's data model design.
    * with *xarray* and the underlying *dask* library, data **larger than the available RAM** can be loaded
%% Cell type:markdown id: tags:
In this tutorial, we load a CMIP6 catalog which contains all data from the pool on DKRZ's mistral disk storage.
CMIP6 is the 6th phase of the Coupled Model Intercomparison Project and forms the data basis used in the IPCC AR6.
The CMIP6 catalog contains all data that is published or replicated at the ESGF node at DKRZ.
<a class="anchor" id="terminology"></a> <a class="anchor" id="terminology"></a>
## Terminology: **Catalog**, **Catalog file** and **Collection** ## Terminology: **Catalog**, **Catalog file** and **Collection**
We align our wording with `intake`'s [*glossary*](https://intake.readthedocs.io/en/latest/glossary.html) which is still evolving. The names overlap with other definitions, making it difficult to keep track. Here we try to give an overview of the hierarchy of catalog terms: We align our wording with `intake`'s [*glossary*](https://intake.readthedocs.io/en/latest/glossary.html) which is still evolving. The names overlap with other definitions, making it difficult to keep track. Here we try to give an overview of the hierarchy of catalog terms:
- a **top level catalog file** 📋 is the **main** catalog of an institution which will be opened first. It contains other project [*catalogs*](#catalog) 📖 📖 📖. Such catalogs can be assigned an [*intake driver*](#intakedriver) which is used to open and load the catalog within the top level catalog file. Technically, a catalog file 📋 is <a class="anchor" id="catalogfile"></a> - a **top level catalog file** 📋 is the **main** catalog of an institution which will be opened first. It contains other project [*catalogs*](#catalog) 📖 📖 📖. Such catalogs can be assigned an [*intake driver*](#intakedriver) which is used to open and load the catalog within the top level catalog file. Technically, a catalog file 📋 is <a class="anchor" id="catalogfile"></a>
- is a `.yaml` file - is a `.yaml` file
- can be opened with `open_catalog`, e.g.: - can be opened with `open_catalog`, e.g.:
```python ```python
intake.open_catalog(["https://dkrz.de/s/intake"]) intake.open_catalog(["https://dkrz.de/s/intake"])
``` ```
- **intake driver**s also named **plugin**s are specified for [*catalogs*](#catalog) becaues they load specific data sets. <a class="anchor" id="intakedriver"></a> - **intake driver**s also named **plugin**s are specified for [*catalogs*](#catalog) becaues they load specific data sets. <a class="anchor" id="intakedriver"></a>
%% Cell type:markdown id: tags:
- a **catalog** 📖 (or collection) is defined by two parts: <a class="anchor" id="catalog"></a>
    - a **description** of a group of data sets. It describes how to *load* **assets** of the data set(s) with the specified [driver](#intakedriver). This group forms an entity. E.g., all CMIP6 data sets can be collected in a catalog. <a class="anchor" id="description"></a>
        - an **asset** is most often a file. <a class="anchor" id="asset"></a>
    - a **collection** of all [assets](#asset) of the data set(s). <a class="anchor" id="collection"></a>
        - the collection can be included in the catalog or saved separately in a **database** 🗂. In the latter case, the catalog references the database, e.g.:
    ```json
    "catalog_file": "/mnt/lustre02/work/ik1017/Catalogs/dkrz_cmip6_disk.csv.gz"
    ```
```{note}
The term *collection* is often used synonymously for [catalog](#catalog).
```
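%% Cell type:markdown id: tags:
To make the catalog/database split concrete, here is a hypothetical miniature of such a `CSV` database, read with `pandas` much like `intake-esm` does under the hood. All column values and paths are made up for illustration; the real `dkrz_cmip6_disk.csv.gz` has the same tabular shape with millions of rows, one row per asset.
%% Cell type:code id: tags:
``` python
import io
import pandas as pd

# hypothetical two-asset database: one row per asset (file)
csv = io.StringIO(
    "activity_id,source_id,experiment_id,variable_id,time_range,uri\n"
    "CMIP,MPI-ESM1-2-HR,historical,tas,185001-185912,/work/demo/tas_185001-185912.nc\n"
    "CMIP,MPI-ESM1-2-HR,historical,tas,186001-186912,/work/demo/tas_186001-186912.nc\n"
)
df = pd.read_csv(csv)
# both assets belong to one dataset: all columns except time_range/uri agree
print(len(df))  # 2
```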
%% Cell type:markdown id: tags:
- an *intake-esm* catalog 📖 is a `.json` file and can be opened with intake-esm's function `intake.open_esm_datastore()`, e.g.:
    ```python
    intake.open_esm_datastore("https://gitlab.dkrz.de/data-infrastructure-services/intake-esm/-/raw/master/esm-collections/cloud-access/dkrz_cmip6_disk.json")
    ```
%% Cell type:code id: tags:
``` python
#note that intake_esm is imported with `import intake` as a plugin
import intake
```
%% Cell type:markdown id: tags:
<a class="anchor" id="browse"></a>
## Open and browse through catalogs
We begin with using only *intake* functions for catalogs. Afterwards, we continue with concrete *intake-esm* utilities.
intake **opens** catalog files in `yaml` format. These contain information about additional sources: other catalogs/collections which will be loaded with specific *plugins*/*drivers*. The command is `open_catalog`.
<mark> You only need to remember one URL as the *single point of access* for DKRZ's intake catalogs: The DKRZ top level catalog can be accessed via dkrz.de/s/intake . Intake will only follow this *redirect* if a specific parser is activated. This can be done by providing the URL in a list.</mark>
%% Cell type:code id: tags:
``` python
#dkrz_catalog=intake.open_catalog(["https://dkrz.de/s/intake"])
#
#only for the web page we need to take the original link:
dkrz_catalog=intake.open_catalog(["https://gitlab.dkrz.de/data-infrastructure-services/intake-esm/-/raw/master/esm-collections/cloud-access/dkrz_catalog.yaml"])
```
%% Cell type:code id: tags:
``` python
dkrz_catalog.dkrz_nextgems_disk.df["uri"].values[0]
```
%% Cell type:code id: tags:
``` python
!ls /work/ka1081/Catalogs/dyamond-nextgems.json
```
%% Cell type:markdown id: tags:
```{note}
Right now, two versions of the top level catalog file exist: one for accessing the catalog via [cloud](https://gitlab.dkrz.de/data-infrastructure-services/intake-esm/-/raw/master/esm-collections/cloud_access/dkrz_catalog.yaml), one for access via [disk](https://gitlab.dkrz.de/data-infrastructure-services/intake-esm/-/raw/master/esm-collections/disk_access/dkrz_catalog.yaml). However, they contain **the same content**.
```
%% Cell type:markdown id: tags:
We can look into the catalog with `print` and `list`.
%% Cell type:markdown id: tags:
Over time, many collections have been created. `dkrz_catalog` is a **main** catalog prepared to keep an overview of all other collections. `list` shows all sub **project catalogs** which are available at DKRZ.
%% Cell type:code id: tags:
``` python
list(dkrz_catalog)
```
%% Cell type:markdown id: tags:
All these catalogs are **intake-esm** catalogs. You can find this information via the `_entries` attribute. The line `plugin: ['esm_datastore']` refers to **intake-esm**'s function `open_esm_datastore()`.
%% Cell type:code id: tags:
``` python
print(dkrz_catalog._entries)
```
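%% Cell type:markdown id: tags:
For orientation, an entry in such a top level `yaml` catalog file looks roughly like the sketch below. The field values are shortened and illustrative, not copied from the actual DKRZ catalog; only the driver name `esm_datastore` is taken from the `_entries` output above.
``` yaml
sources:
  dkrz_cmip6_disk:
    description: "CMIP6 data at DKRZ"
    driver: esm_datastore   # resolved to intake-esm's open_esm_datastore()
    args:
      # path or URL of the intake-esm catalog (.json) to be opened
      obj: "https://gitlab.dkrz.de/data-infrastructure-services/intake-esm/-/raw/master/esm-collections/cloud-access/dkrz_cmip6_disk.json"
```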
%% Cell type:markdown id: tags:
The DKRZ ESM collections follow a name template:
`dkrz_${project}_${store}[_${auxiliary_catalog}]`
where
- **project** can be one of the *model intercomparison projects*, e.g. `cmip6`, `cmip5`, `cordex`, `era5` or `mpi-ge`.
- **store** is the data store and can be one of:
    - `disk`: DKRZ holds a lot of data on a consortial disk space on the file system of the High Performance Computer (HPC), where it is accessible for every HPC user. Working next to the data on the file system is the fastest way possible.
    - `cloud`: A small subset is transferred into DKRZ's cloud in order to test the performance. swift is DKRZ's cloud storage.
    - `archive`: A lot of data exists in the tape archive of DKRZ. Before it can be accessed, it has to be retrieved. Therefore, catalogs for `hsm` are limited in functionality but still convenient for data browsing.
- **auxiliary_catalog** can be *grid*
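%% Cell type:markdown id: tags:
The template can be unpacked mechanically. The helper below is not part of any DKRZ tooling, just a minimal sketch of the convention; it relies on catalog values themselves containing no `_`.
%% Cell type:code id: tags:
``` python
def parse_dkrz_catalog_name(name):
    """Split e.g. 'dkrz_cmip6_disk_grid' into its template components."""
    parts = name.split("_")
    if len(parts) < 3 or parts[0] != "dkrz":
        raise ValueError(f"not a DKRZ catalog name: {name!r}")
    info = {"project": parts[1], "store": parts[2]}
    if len(parts) > 3:
        info["auxiliary_catalog"] = parts[3]
    return info

print(parse_dkrz_catalog_name("dkrz_cmip6_disk"))
# {'project': 'cmip6', 'store': 'disk'}
```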
%% Cell type:markdown id: tags:
**Why that convention?**:
- **dkrz**: Assume you work with international collections. Then it may become important that you know where the data comes from, e.g. if only paths on a local file system are given as the locations of the data.
- **project**: Projects' data standards differ from each other, so different catalog attributes are required to identify a single asset in a project database.
- **store**: Intake-esm cannot load data from all stores. Before data from the archive can be accessed, it has to be retrieved. Therefore, the loading function does not work for a catalog merged over all stores.
%% Cell type:markdown id: tags:
**Best practice for naming catalogs**:
- Use lowercase letters for all values
- Do **NOT** use `_` as a separator in values
- Do not repeat values of other attributes ("dkrz_dkrz-dyamond")
%% Cell type:markdown id: tags:
We could directly start to work with **two intake catalogs** at the same time.
Let's have a look into a master catalog of [Pangeo](https://pangeo.io/):
%% Cell type:code id: tags:
``` python
pangeo=intake.open_catalog("https://raw.githubusercontent.com/pangeo-data/pangeo-datastore/master/intake-catalogs/master.yaml")
```
%% Cell type:code id: tags:
``` python
pangeo
```
%% Cell type:code id: tags:
``` python
list(pangeo)
```
%% Cell type:markdown id: tags:
While DKRZ's master catalog has one sublevel, Pangeo's is a nested one. We can access another `yaml` catalog which is also a **parent** catalog by simply:
%% Cell type:code id: tags:
``` python
pangeo.climate
```
%% Cell type:markdown id: tags:
Pangeo's ESM collections are one level deeper in the catalog tree:
%% Cell type:code id: tags:
``` python
list(pangeo.climate)
```
%% Cell type:markdown id: tags:
### The `intake-esm` catalogs
We now look into a catalog which is opened by the plugin `intake-esm`.
> An ESM (Earth System Model) collection file is a `JSON` file that conforms to the ESM Collection Specification. When provided a link/path to an esm collection file, intake-esm establishes a link to a database (`CSV` file) that contains data asset locations and associated metadata (i.e., which experiment and model they come from).
Since the database of the CMIP6 ESM Collection is about 100MB in compressed format, it takes up to a minute to load the catalog.
%% Cell type:markdown id: tags:
```{note}
The project catalogs contain only valid and current project data. They are constantly updated.
If your work is based on a catalog and a subset of the data from it, be sure to save that subset so you can later compare your database to the most current catalog.
```
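%% Cell type:markdown id: tags:
Saving such a subset can be done with plain `pandas`, since a (searched) catalog exposes its database as a dataframe via `.df`. The dataframe below is a made-up stand-in for `cat.df` so the snippet is self-contained:
%% Cell type:code id: tags:
``` python
import pandas as pd

# made-up stand-in for the dataframe of a searched catalog (cat.df)
subset = pd.DataFrame({
    "source_id": ["MPI-ESM1-2-HR"],
    "variable_id": ["tas"],
    "uri": ["/work/demo/tas_185001-185912.nc"],
})
# persist today's subset; later it can be compared against a fresh catalog
subset.to_csv("my_catalog_subset.csv", index=False)
reloaded = pd.read_csv("my_catalog_subset.csv")
print(reloaded.equals(subset))  # True
```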
%% Cell type:code id: tags:
``` python
esm_col=dkrz_catalog.dkrz_cmip6_disk
print(esm_col)
```
%% Cell type:markdown id: tags:
`intake-esm` gives us an overview over the content of the ESM collection. The ESM collection is a database described by specific attributes which are technically columns. Each project data standard is the basis for the columns and used to parse information given by the path and file names.
Simply displaying `esm_col` shows us the number of unique values in each column. Since each `uri` refers to one file, we can conclude that the DKRZ-CMIP6 ESM Collection contains **6.1 million files** in 2022.
%% Cell type:markdown id: tags:
The database is loaded into an underlying `pandas` dataframe which we can access with `col.df`. `col.df.head()` displays the first rows of the table:
%% Cell type:code id: tags:
``` python
esm_col.df.head()
```
%% Cell type:markdown id: tags:
We can find out details about `esm_col` with the object's attributes. `esm_col.esmcol_data` contains all information given in the `JSON` file. We can also focus on some specific attributes.
%% Cell type:code id: tags:
``` python
#esm_col.esmcol_data
```
%% Cell type:code id: tags:
``` python
print("What is this catalog about? \n" + esm_col.esmcol_data["description"])
#
print("The link to the database: " + esm_col.esmcol_data["catalog_file"])
```
%% Cell type:markdown id: tags:
Advanced: To find out how many datasets are available, we can use pandas functions (drop columns that are irrelevant for a dataset, drop the duplicates, keep one):
%% Cell type:code id: tags:
``` python
cat = esm_col.df.drop(columns=['uri','time_range']).drop_duplicates(keep="first")
print(len(cat))
```
%% Cell type:markdown id: tags:
### Browse through the data of the ESM collection
%% Cell type:markdown id: tags:
You will browse the collection technically by setting values for the **column names** of the underlying table. Per default, the catalog was loaded with all cmip6 attributes/columns that define the CMIP6 data standard:
%% Cell type:code id: tags:
``` python
esm_col.df.columns
```
%% Cell type:markdown id: tags:
These are configured in the top level catalog so you <mark> do not need to open the catalog to see the columns</mark>.
%% Cell type:code id: tags:
``` python
dkrz_catalog._entries["dkrz_cmip6_disk"]._open_args
```
%% Cell type:markdown id: tags:
Most of the time, we want to set more than one attribute for a search. Therefore, we define a query `dict`ionary and use the `search` function of the `esm_col` object. In the following case, we look for near-surface air temperature in monthly resolution for 3 different experiments:
%% Cell type:code id: tags:
``` python
query = dict(
    variable_id="tas",
    table_id="Amon",
    experiment_id=["piControl", "historical", "ssp370"])
# piControl = pre-industrial control, simulation to represent a stable climate from 1850 for >100 years.
# historical = historical simulation, 1850-2014
# ssp370 = Shared Socioeconomic Pathways (SSPs) are scenarios of projected socioeconomic global changes. Simulation covers 2015-2100
cat = esm_col.search(**query)
```
%% Cell type:code id: tags:
``` python
cat
```
%% Cell type:markdown id: tags: %% Cell type:markdown id: tags:
We could also use *wildcards*, for example to find out which ESMs of the institution *MPI-M* have produced data for our subset:
%% Cell type:code id: tags:
``` python
cat.search(source_id="MPI-ES*")
```
%% Cell type:markdown id: tags:
We can find out which models have submitted data for at least one of them by:
%% Cell type:code id: tags:
``` python
cat.unique(["source_id"])
```
%% Cell type:markdown id: tags:
If we instead look for the models that have submitted data for ALL experiments, we use the `require_all_on` keyword argument:
%% Cell type:code id: tags:
``` python
cat = esm_col.search(require_all_on=["source_id"], **query)
cat.unique(["source_id"])
```
%% Cell type:markdown id: tags:
Note that only the combination of a `variable_id` and a `table_id` is unique in CMIP6. If you search for `tas` in all tables, you will find many more entries:
%% Cell type:code id: tags:
``` python
query = dict(
    variable_id="tas",
#    table_id="Amon",
    experiment_id=["piControl", "historical", "ssp370"])
cat = esm_col.search(**query)
cat.unique(["table_id"])
```
%% Cell type:markdown id: tags:
Be careful when you search for specific time slices: each frequency is connected with an individual filename template. Yearly data uses YYYY-YYYY, whereas monthly data uses YYYYMM-YYYYMM.
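To illustrate (with a hypothetical mini data base, not the real catalog), you could select all monthly entries whose `time_range` covers a given month by parsing these templates:
%% Cell type:code id: tags:
``` python
import pandas as pd

# A toy catalog data base; the time_range values follow the templates
# mentioned above (monthly: YYYYMM-YYYYMM, yearly: YYYY-YYYY):
df = pd.DataFrame({
    "variable_id": ["tas", "tas", "co2mass"],
    "time_range": ["185001-186912", "187001-188912", "1850-1869"],
})

# Keep only monthly entries (six-digit bounds), then check which ranges
# cover January 1860 (186001) by comparing start and end as integers:
monthly = df[df["time_range"].str.match(r"^\d{6}-\d{6}$")].copy()
bounds = monthly["time_range"].str.split("-", expand=True).astype(int)
covers_186001 = monthly[(bounds[0] <= 186001) & (bounds[1] >= 186001)]
print(covers_186001["time_range"].tolist())  # ['185001-186912']
```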
%% Cell type:markdown id: tags:
### How to load more columns
If you work remotely, away from the data, you can use the **opendap_url**s to access the subset of interest for all files published at DKRZ. The opendap_url is an *additional* column that can also be loaded.
We can distinguish three types of column names in intake catalogs:
1. **Default** attributes, which are loaded from the main catalog and can be seen via `_entries[CATNAME]._open_args`.
2. **Overall** or **template** attributes, which should be defined for **ALL** catalogs at DKRZ (exceptions excluded). At DKRZ, we use the newly defined **Cataloonie** scheme template, which can be found via `dkrz_catalog.metadata["parameters"]["cataloonie_columns"]`.
3. **Additional** attributes, which are not necessary to identify a single asset but are helpful for users. You can find these via
`dkrz_catalog.metadata["parameters"]["additional_PROJECT_columns"]`
For CMIP6, these are:
%% Cell type:code id: tags:
``` python
dkrz_catalog.metadata["parameters"]["additional_cmip6_columns"]
```
%% Cell type:markdown id: tags:
```{tip}
You may find *variable_id*s in the catalog which are not obvious or which are abbreviations rather than clear variable names. In such cases you need additional information like the *long_name* of the variable. For CMIP6, we provided the catalog with `long_name`, so you can add it as a column.
```
%% Cell type:markdown id: tags:
There is a lot of redundancy in the columns. This is because they exist to conform to other kinds of standards, which simplifies merging catalogs across projects.
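As a sketch of such a merge (the project names and columns here are invented for illustration, not the actual template attributes), restricting two catalog data bases to their shared columns lets you concatenate them into one cross-project data base:
%% Cell type:code id: tags:
``` python
import pandas as pd

# Two hypothetical per-project data bases sharing some template columns:
cmip6 = pd.DataFrame({"project": ["CMIP6"], "simulation_id": ["historical"], "uri": ["/work/a.nc"]})
cordex = pd.DataFrame({"project": ["CORDEX"], "simulation_id": ["evaluation"], "uri": ["/work/b.nc"]})

# Restrict both to the shared columns and stack them:
shared = ["project", "simulation_id", "uri"]
merged = pd.concat([cmip6[shared], cordex[shared]], ignore_index=True)
print(merged["project"].tolist())  # ['CMIP6', 'CORDEX']
```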
%% Cell type:markdown id: tags:
Here is how to open the catalog with additional columns:
1. Create a combination of all your required columns:
%% Cell type:code id: tags:
``` python
cols=dkrz_catalog._entries["dkrz_cmip6_disk"]._open_args["csv_kwargs"]["usecols"]+["opendap_url"]
```
%% Cell type:markdown id: tags:
2. Open the **dkrz_cmip6_disk** catalog with the `csv_kwargs` keyword argument in this way:
%% Cell type:code id: tags:
``` python
esm_col=dkrz_catalog.dkrz_cmip6_disk(csv_kwargs=dict(usecols=cols))
```
%% Cell type:markdown id: tags:
```{warning}
The number of columns determines the required memory.
```
%% Cell type:markdown id: tags:
```{tip}
If you work from remote and also want to access the data remotely, load the *opendap_url* column.
```
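In plain Python, step 1 above boils down to combining column lists without duplicates; a sketch with invented column names (the real lists come from the catalog's `_open_args` and metadata):
%% Cell type:code id: tags:
``` python
# Stand-ins for the catalog's default columns and the additional
# columns you want on top (names are illustrative only):
default_cols = ["variable_id", "table_id", "uri"]
extra_cols = ["opendap_url", "long_name", "uri"]

# dict.fromkeys keeps the first occurrence of each name and preserves order:
cols = list(dict.fromkeys(default_cols + extra_cols))
print(cols)  # ['variable_id', 'table_id', 'uri', 'opendap_url', 'long_name']
```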
%% Cell type:markdown id: tags:
<a class="anchor" id="dataaccess"></a>
## Access and load data of the ESM collection
With the power of `xarray`, `intake` can load your subset into a `dict`ionary of datasets. We therefore focus on the data of `MPI-ESM1-2-HR`:
%% Cell type:code id: tags:
``` python
query = dict(
    variable_id="tas",
    table_id="Amon",
    source_id="MPI-ESM1-2-HR",
    experiment_id="historical")
cat = esm_col.search(**query)
cat
```
%% Cell type:markdown id: tags:
You can find out which column intake uses to access the data via the following keyword:
%% Cell type:code id: tags:
``` python
print(cat.path_column_name)
```
%% Cell type:markdown id: tags:
As we are working with the *_disk* catalog, **uri** contains *paths* to the files on the file system. If you are working from remote, you would have
- to change the catalog's attribute `path_column_name` to *opendap_url*
- to reassign the `format` column from *netcdf* to *opendap*
as follows:
%% Cell type:code id: tags:
``` python
#cat.path_column_name="opendap_url"
#newdf=cat.df.copy()
#newdf.loc[:,"format"]="opendap"
#cat.df=newdf
```
%% Cell type:markdown id: tags:
**Intake-ESM** natively supports the following data formats, or rather access formats, since opendap is not really a file format:
- netcdf
- opendap
- zarr
You can also open **grb** data, but currently only by specifying xarray's *engine* attribute in the *open* function defined below; i.e., it makes no difference whether you specify **grb** as the format.
You can find an example in the *era5* notebook.
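A sketch of how that could look (untested assumption: `cdf_kwargs` is passed on to `xarray.open_dataset`, and the `cfgrib` engine is installed; `cat` stands for a hypothetical catalog holding grb assets):
%% Cell type:code id: tags:
``` python
# Request the cfgrib engine in addition to the chunking hint:
cdf_kwargs = {"chunks": {"time": 1}, "engine": "cfgrib"}
# dset_dict = cat.to_dataset_dict(cdf_kwargs=cdf_kwargs)  # hypothetical
print(sorted(cdf_kwargs))  # ['chunks', 'engine']
```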
%% Cell type:markdown id: tags:
The function to open data is `to_dataset_dict`.
We recommend setting the keyword argument `cdf_kwargs` for the chunk size of the variable's data array; otherwise, `xarray` may choose chunks that are too large. Most often, your data contains a time dimension, so you could set `cdf_kwargs={"chunks":{"time":1}}`.
If your collection contains **zarr**-formatted data, you need to add another keyword argument, `zarr_kwargs`. The trick is: you can just specify both. Intake knows from the `format` column which *kwargs* should be taken.
%% Cell type:code id: tags:
``` python
xr_dict = cat.to_dataset_dict(cdf_kwargs=dict(chunks=dict(time=1)),
                              zarr_kwargs=dict(consolidated=True,
                                               decode_times=True,
                                               use_cftime=True)
                             )
xr_dict
```
%% Cell type:markdown id: tags:
`Intake` was able to aggregate many files into only one dataset:
- The `time_range` column was used to **concat** data along the `time` dimension
- The `member_id` column was used to generate a new dimension
The underlying `dask` package will only load the data into memory if needed.
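This laziness can be sketched with a plain `dask` array standing in for a variable of the dataset:
%% Cell type:code id: tags:
``` python
import dask.array as da

# A small chunked array; creating it computes no data values yet:
arr = da.ones((4, 4), chunks=(2, 2))

lazy_mean = arr.mean()               # only builds a task graph
result = float(lazy_mean.compute())  # computation happens here
print(result)  # 1.0
```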
%% Cell type:markdown id: tags:
The **keys** of the dictionary are made from column values defined in the *aggregation_control* of the **intake-esm** catalog. These determine the **key_template**. The corresponding commands are:
%% Cell type:code id: tags:
``` python
print(cat.esmcol_data["aggregation_control"]["groupby_attrs"])
#
print(cat.key_template)
```
%% Cell type:markdown id: tags:
You can work with these keys **directly** on the **intake-esm** catalog, which will give you an overview over all columns (too long for the web page):
%% Cell type:code id: tags:
``` python
#cat["CMIP.MPI-ESM1-2-HR.historical.Amon.gn"]
```
%% Cell type:markdown id: tags:
If we are only interested in the **first** dataset of the dictionary, we can *pop it out*:
%% Cell type:code id: tags:
``` python
xr_dset = xr_dict.popitem()[1]
xr_dset
```
%% Cell type:markdown id: tags:
#### Pangeo's data store
Let's have a look into Pangeo's ESM collection as well. It is accessible via cloud from everywhere - you only need internet to load data. We use the same `query` as in the example before.
%% Cell type:code id: tags:
``` python
pangeo_cmip6=pangeo.climate.cmip6_gcs
cat = pangeo_cmip6.search(**query)
cat
```
%% Cell type:markdown id: tags:
There are differences between the collections because
- Pangeo provides files as *consolidated*, `zarr`-formatted datasets, which correspond to `zstore` entries in the catalog instead of `path`s or `opendap_url`s.
- The `zarr` datasets are already aggregated over time, so there is no need for a `time_range` column.
If we now open the data with `intake`, we have to specify keyword arguments as follows:
%% Cell type:code id: tags:
``` python
dset_dict = cat.to_dataset_dict(
    zarr_kwargs={"consolidated": True, "decode_times": True, "use_cftime": True}
)
```
%% Cell type:code id: tags:
``` python
dset_dict
```
%% Cell type:markdown id: tags:
`dset_dict` and `xr_dict` contain the same datasets. You have successfully completed the intake tutorial!
%% Cell type:markdown id: tags:
### Making a quick plot
The following line exemplifies the ease of intake's data-processing library chain. On the web page, the interactivity will not work, as all plots would have to be loaded, which is not feasible.
For more examples, check out the **use cases** on this web page.
%% Cell type:code id: tags:
``` python
import hvplot.xarray
xr_dset["tas"].hvplot.quadmesh(width=600)
```
%% Cell type:markdown id: tags:
```{seealso}
This tutorial is part of a series on `intake`:
* [Part 1: Introduction](https://data-infrastructure-services.gitlab-pages.dkrz.de/tutorials-and-use-cases/tutorial_intake-1-introduction.html)
* [Part 2: Modifying and subsetting catalogs](https://data-infrastructure-services.gitlab-pages.dkrz.de/tutorials-and-use-cases/tutorial_intake-2-subset-catalogs.html)
* [Part 3: Merging catalogs](https://data-infrastructure-services.gitlab-pages.dkrz.de/tutorials-and-use-cases/tutorial_intake-3-merge-catalogs.html)
* [Part 4: Use preprocessing and create derived variables](https://data-infrastructure-services.gitlab-pages.dkrz.de/tutorials-and-use-cases/tutorial_intake-4-preprocessing-derived-variables.html)
* [Part 5: How to create an intake catalog](https://data-infrastructure-services.gitlab-pages.dkrz.de/tutorials-and-use-cases/tutorial_intake-5-create-esm-collection.html)
* You can also do another [CMIP6 tutorial](https://intake-esm.readthedocs.io/en/latest/user-guide/cmip6-tutorial.html) from the official intake page.
```
%% Cell type:markdown id: tags:
%% Cell type:code id: tags:
``` python
```
%% Cell type:markdown id: tags:
# Intake II - modifying intake-esm data bases and saving new catalogs
%% Cell type:markdown id: tags:
```{admonition} Overview
:class: dropdown
![Level](https://img.shields.io/badge/Level-Intermediate-orange.svg)
🎯 **objectives**: Learn how to integrate `intake-esm` into your workflow
⌛ **time_estimation**: "40min"
☑️ **requirements**:
- intake I
- [pandas](https://pandas.pydata.org/)
© **contributors**: k204210
⚖ **license**:
```
%% Cell type:markdown id: tags:
```{admonition} Agenda
:class: tip full-width
Based on DKRZ's CMIP6 catalog, in this part you will learn how to
1. [Modify the data base of the catalog](#modify)
    - How to rename values in a column
1. [Make complex searches](#complex)
    - Which member was produced most often?
1. [Save subset catalogs](#save)
```
%% Cell type:markdown id: tags:
The data base of intake-esm catalogs is processed as a *pandas DataFrame*. In sections 1 and 2, you will learn to modify this data base and make complex searches with **pandas commands**. The tutorial covers some examples - if you aim at a deeper understanding of pandas, we recommend the extensive pandas tutorials in its own documentation.
```{note}
[Pandas](https://pandas.pydata.org/docs/user_guide/index.html) is a powerful data-analysis tool, and its *DataFrame* class enables users to process table-like data fast and intuitively.
```
%% Cell type:code id: tags:
``` python
import intake
#dkrz_cdp=intake.open_catalog(["https://dkrz.de/s/intake"])
#
#only for generating the web page we need to take the original link:
dkrz_cdp=intake.open_catalog(["https://gitlab.dkrz.de/data-infrastructure-services/intake-esm/-/raw/master/esm-collections/cloud-access/dkrz_catalog.yaml"])
esm_dkrz=dkrz_cdp.dkrz_cmip6_disk
```
%% Cell type:markdown id: tags:
<a class="anchor" id="modify"></a>
## Modify the data base
Assume you want to rename a short name like `tas` into a long name like `Temperature`.
We define a dictionary for renaming:
%% Cell type:code id: tags:
``` python
rename_dict={"tas":"Temperature",
             "pr":"Precipitation"}
```
%% Cell type:markdown id: tags:
For all items in the rename dictionary, we reset the value in the underlying DataFrame *inplace*. We iterate over the dictionary with `.items`, which returns key and value separately for each entry.
With the `.loc` attribute of the DataFrame, we can access a slice of it. The first argument of `loc` is the row indexer, the second is the column indexer.
- The row index condition is: all rows where the variable equals the key of our dictionary, e.g. `variable_id == "tas"`. In general terms, that is `esm_dkrz.df["variable_id"]==short_name`
- The column index is easy: just `variable_id`
Therefore, our code looks like:
%% Cell type:code id: tags:
``` python
for short_name,long_name in rename_dict.items():
    esm_dkrz.df.loc[esm_dkrz.df["variable_id"]==short_name, "variable_id"]=long_name
```
%% Cell type:markdown id: tags:
Now, you can search "*Temperature*":
%% Cell type:code id: tags:
``` python
esm_dkrz.search(variable_id="Temperature")
```
%% Cell type:markdown id: tags:
```{warning}
In this example, we changed `variable_id`s which are predefined by the CMIP6 data standard for good reasons. Do NOT consider renaming as good practice. Do NOT share a catalog with renamed variables.
```
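A safer alternative (a sketch using a toy data base): work on a *copy* and use pandas' `replace`, so the original data base stays untouched:
%% Cell type:code id: tags:
``` python
import pandas as pd

df = pd.DataFrame({"variable_id": ["tas", "pr", "psl"]})  # toy data base
rename_dict = {"tas": "Temperature", "pr": "Precipitation"}

# Rename only in the copy; values without a mapping (psl) stay as-is:
renamed = df.copy()
renamed["variable_id"] = renamed["variable_id"].replace(rename_dict)
print(renamed["variable_id"].tolist())  # ['Temperature', 'Precipitation', 'psl']
print(df["variable_id"].tolist())       # ['tas', 'pr', 'psl'] - unchanged
```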
%% Cell type:markdown id: tags:
<a class="anchor" id="complex"></a>
### Complex searches, e.g. for combinations of attributes
If you want to know the *"preferred"* ensemble member assignment of one specific ESM, or of all ESMs, you can find out by **grouping** the underlying data base with `groupby`. It takes columns as arguments to group by, i.e. `groupby` creates a new combined index from all unique combinations of the specified columns. We save the returned object, as we continue to work with the grouped dataframe:
%% Cell type:code id: tags:
``` python
grouped_df=esm_dkrz.df.groupby(["source_id","member_id"])
```
%% Cell type:markdown id: tags:
```{note}
If you access the underlying `df` variable directly, it is no longer in the context of the catalog; instead, it is only the DataFrame which corresponds to the catalog's data base.
```
%% Cell type:markdown id: tags:
The returned `grouped_df` is only the start of an operation. Pandas calls this workflow [split-apply-combine](https://xarray.pydata.org/en/stable/user-guide/groupby.html). We started by **splitting** the dataframe into **groups**.
- **apply**: We now calculate the number of entries per group, which can easily be done with `size()`.
- **combine**: Afterwards, we *reset the index* with `reset_index(name="counts")`: the columns which were used for grouping, and which were indexes of the grouped DataFrame, become regular columns again, and the group sizes end up in a new column named *counts*. The return of `reset_index` is a regular DataFrame.
%% Cell type:code id: tags:
``` python
grouped_df_size=grouped_df.size().reset_index(name='counts')
grouped_df_size
```
%% Cell type:markdown id: tags:
With that DataFrame, we can already display statistics of ensemble members for specific sources. If we would like to know the occurrences of specific ensemble members of the source *MPI-ESM1-2-HR* only, we can subselect this source from the data very easily. Our condition `grouped_df_size["source_id"]=="MPI-ESM1-2-HR"` is simply put in brackets behind the DataFrame:
%% Cell type:code id: tags:
``` python
grouped_df_size_mpi=grouped_df_size[grouped_df_size["source_id"]=="MPI-ESM1-2-HR"]
```
%% Cell type:markdown id: tags:
The DataFrame has [plot options](https://xarray.pydata.org/en/stable/user-guide/plotting.html) which allow us to directly create a figure from our data. One nice way to plot the **counts** of ensemble members for MPI-ESM1-2-HR is the `bar()` plot:
```{note}
We will use [hvplot](https://hvplot.holoviz.org/user_guide/Plotting.html) because it can create interactive figures. The same plot can be created without the `.hvplot` part of the command. You can also check out other plot plugins for Pandas.
```
%% Cell type:code id: tags: %% Cell type:code id: tags:
``` python ``` python
import hvplot.pandas import hvplot.pandas
grouped_df_size_mpi.hvplot.bar(x="member_id") grouped_df_size_mpi.hvplot.bar(x="member_id")
``` ```
%% Cell type:markdown id: tags:
But how does it look in general, for all sources? For that, we will do another **split-apply-combine** to get our target measure:
- **split**: for all unique members (`groupby("member_id")`), ...
- **apply**: ... we calculate the mean of `counts` (`mean("counts")`)
- **combine**: ... and sort the values with `sort_values("counts", ascending=False)`
We can do that in one line:
%% Cell type:code id: tags:
``` python
grouped_df_member=grouped_df_size.groupby("member_id").mean("counts").sort_values("counts", ascending=False)
```
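%% Cell type:markdown id: tags:
The same split-apply-combine chain can be checked on a small, invented DataFrame. The sketch below uses the explicit column-selection form `["counts"]`, which returns a Series of per-member means (the source and member values are made up for illustration):
%% Cell type:code id: tags:
``` python
import pandas as pd

# invented stand-in for grouped_df_size: occurrence counts per (source, member)
toy = pd.DataFrame({
    "source_id": ["A", "A", "B", "B", "B"],
    "member_id": ["r1i1p1f1", "r2i1p1f1", "r1i1p1f1", "r2i1p1f1", "r3i1p1f1"],
    "counts": [4, 2, 8, 6, 1],
})

# split by member, apply the mean of counts, combine sorted in descending order
toy_member = (
    toy.groupby("member_id")["counts"]
    .mean()
    .sort_values(ascending=False)
)
print(toy_member)
```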
%% Cell type:markdown id: tags:
As there are thousands of unique ensemble member values, we should only plot some, e.g. the top 10:
%% Cell type:code id: tags:
``` python
import hvplot.pandas
grouped_df_member.iloc[0:10,:].hvplot.bar()
```
%% Cell type:markdown id: tags:
<a class="anchor" id="subset"></a>
## Save a catalog subset as a new catalog
```{admonition} Tip
:class: Tip
We highly recommend that you save the subset of the catalog which you use in your analysis. Catalogs are often not as stable as they should be. With a local copy, you can check whether the original source has changed.
```
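%% Cell type:markdown id: tags:
One way to run such a check (a sketch with invented rows; in practice you would compare your saved `esm_subset.df` against the freshly re-opened catalog's `.df`) is `DataFrame.equals` plus an outer merge to see what changed:
%% Cell type:code id: tags:
``` python
import pandas as pd

# invented rows: a saved local snapshot vs. a freshly loaded remote table
local = pd.DataFrame({"uri": ["/work/a.nc", "/work/b.nc"], "variable_id": ["tas", "pr"]})
remote = pd.DataFrame({"uri": ["/work/a.nc", "/work/c.nc"], "variable_id": ["tas", "pr"]})

# True only if shape, labels and values all match
unchanged = local.equals(remote)

# an outer merge with indicator shows which rows were added or removed
diff = local.merge(remote, how="outer", indicator=True)
changed_rows = diff[diff["_merge"] != "both"]
print("catalog unchanged:", unchanged)
```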
If we want to save our subset catalog, which only contains metadata, we save the search result:
%% Cell type:code id: tags:
``` python
esm_subset=esm_dkrz.search(variable_id="Temperature")
```
%% Cell type:markdown id: tags:
For saving, intake-esm provides the `serialize()` function. Its only argument is the **name** of the catalog, which is used as the filename. By default, it writes the two parts of the catalog together into a single `.json` file:
%% Cell type:code id: tags:
``` python
esm_subset.serialize("esm_subset")
```
%% Cell type:markdown id: tags:
Or into two separate files if we provide `catalog_type="file"` as a second argument. The single `esm_subset.json` is over 100 MB. If we save the data base in a separate `.csv.gz` file, we reduce that to 2 MB:
%% Cell type:code id: tags:
``` python
esm_subset.serialize("esm_subset", catalog_type="file")
```
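%% Cell type:markdown id: tags:
The size difference comes largely from compression: catalog columns are highly repetitive, which gzip exploits very well. A small stdlib-only sketch (the payload and numbers are illustrative, not the actual catalog):
%% Cell type:code id: tags:
``` python
import gzip

# a repetitive, CSV-like payload similar in spirit to a catalog's data base
csv_text = "\n".join("CMIP6,MPI-ESM1-2-HR,Amon,tas,r%di1p1f1" % i for i in range(10000))
raw = csv_text.encode("utf-8")
compressed = gzip.compress(raw)

print(f"raw: {len(raw)} bytes, gzipped: {len(compressed)} bytes")
```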
%% Cell type:markdown id: tags:
Now, we can open the catalog from disk:
%% Cell type:code id: tags:
``` python
intake.open_esm_datastore("esm_subset.json")
```
%% Cell type:markdown id: tags:
```{seealso}
This tutorial is part of a series on `intake`:
* [Part 1: Introduction](https://data-infrastructure-services.gitlab-pages.dkrz.de/tutorials-and-use-cases/tutorial_intake-1-introduction.html)
* [Part 2: Modifying and subsetting catalogs](https://data-infrastructure-services.gitlab-pages.dkrz.de/tutorials-and-use-cases/tutorial_intake-2-subset-catalogs.html)
* [Part 3: Merging catalogs](https://data-infrastructure-services.gitlab-pages.dkrz.de/tutorials-and-use-cases/tutorial_intake-3-merge-catalogs.html)
* [Part 4: Use preprocessing and create derived variables](https://data-infrastructure-services.gitlab-pages.dkrz.de/tutorials-and-use-cases/tutorial_intake-4-preprocessing-derived-variables.html)
* [Part 5: How to create an intake catalog](https://data-infrastructure-services.gitlab-pages.dkrz.de/tutorials-and-use-cases/tutorial_intake-5-create-esm-collection.html)
* You can also do another [CMIP6 tutorial](https://intake-esm.readthedocs.io/en/latest/user-guide/cmip6-tutorial.html) from the official intake page.
```
%% Cell type:markdown id: tags:
# Intake IV - preprocessing and derived variables
%% Cell type:markdown id: tags:
```{admonition} Overview
:class: dropdown
![Level](https://img.shields.io/badge/Level-expert-red.svg)
🎯 **objectives**: Learn how to integrate `intake-esm` in your workflow
⌛ **time_estimation**: "30min"
☑️ **requirements**:
- intake I
© **contributors**: k204210
⚖ **license**:
```
%% Cell type:markdown id: tags:
```{admonition} Agenda
:class: tip full-width
Based on DKRZ's CMIP6 catalog, you will learn in this part how to
1. [add **preprocessing** to `to_dataset_dict()`](#preprocessing)
1. [create a derived variable registry](#derived)
```
%% Cell type:code id: tags:
``` python
import intake
#dkrz_cdp=intake.open_catalog(["https://dkrz.de/s/intake"])
#only for generating the web page we need to use the original link:
dkrz_cdp=intake.open_catalog(["https://gitlab.dkrz.de/data-infrastructure-services/intake-esm/-/raw/master/esm-collections/cloud-access/dkrz_catalog.yaml"])
esm_dkrz=dkrz_cdp.dkrz_cmip6_disk
```
%% Cell type:code id: tags:
``` python
# levante uri to mistral uri:
esm_dkrz.df["uri"]=esm_dkrz.df["uri"].str.replace("lustre/","lustre02/")
```
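%% Cell type:markdown id: tags:
`Series.str.replace` rewrites the substring in every entry of the column. A toy example (the paths are invented) shows the effect:
%% Cell type:code id: tags:
``` python
import pandas as pd

# invented paths; the real column holds the catalog's file locations
uris = pd.Series([
    "/mnt/lustre/work/ik1017/file1.nc",
    "/mnt/lustre/work/ik1017/file2.nc",
])
# element-wise substring replacement; regex=False makes it a literal match
fixed = uris.str.replace("lustre/", "lustre02/", regex=False)
print(fixed.tolist())
```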
%% Cell type:markdown id: tags:
<a class="anchor" id="preprocessing"></a>
## Use Preprocessing when opening assets and creating datasets
When calling intake-esm's `to_dataset_dict` function, we can pass an argument **preprocess**. Its value should be a function which is applied to all assets before they are opened.
```{note}
For CMIP6, a [preprocessing package](https://github.com/jbusecke/cmip6_preprocessing) has been developed for homogenizing and preparing datasets of different ESMs for a grand analysis, featuring
- renaming and setting of coordinates
- adjusting grid values to fit into a common range (0-360 for lon)
```
E.g., if you would like to set some specific variables as coordinates, you can define a [function](https://github.com/jbusecke/cmip6_preprocessing/blob/209041a965984c2dc283dd98188def1dea4c17b3/cmip6_preprocessing/preprocessing.py#L239) which
- receives an xarray dataset as an argument
- returns a new xarray dataset
%% Cell type:code id: tags:
``` python
def correct_coordinates(ds):
    """Converts wrongly assigned data_vars to coordinates."""
    ds = ds.copy()
    for co in [
        "x",
        "y",
        "lon",
        "lat",
        "lev",
        "bnds",
        "lev_bounds",
        "lon_bounds",
        "lat_bounds",
        "time_bounds",
        "lat_verticies",
        "lon_verticies",
    ]:
        if co in ds.variables:
            ds = ds.set_coords(co)
    return ds
```
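%% Cell type:markdown id: tags:
A quick way to see what such a function does is to apply it to a tiny synthetic dataset (a sketch; it assumes `xarray` and `numpy` are available and repeats the function with a shortened coordinate list):
%% Cell type:code id: tags:
``` python
import numpy as np
import xarray as xr

# synthetic dataset in which "lon" wrongly arrives as a data variable
ds = xr.Dataset({
    "tas": ("x", np.array([280.0, 281.5])),
    "lon": ("x", np.array([0.0, 1.0])),
})

def correct_coordinates(ds):
    """Converts wrongly assigned data_vars to coordinates (shortened list)."""
    ds = ds.copy()
    for co in ["x", "y", "lon", "lat", "lev", "bnds"]:
        if co in ds.variables:
            ds = ds.set_coords(co)
    return ds

fixed = correct_coordinates(ds)
print(sorted(fixed.data_vars))  # "lon" is no longer listed as a data variable
```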
%% Cell type:markdown id: tags:
Now, when you open the dataset dictionary, you provide it for *preprocess*:
%% Cell type:code id: tags:
``` python
cat=esm_dkrz.search(variable_id="tas",
                    table_id="Amon",
                    source_id="MPI-ESM1-2-HR",
                    member_id="r1i1p1f1",
                    experiment_id="ssp370"
                   )
test_dsets=cat.to_dataset_dict(
    zarr_kwargs={"consolidated":True},
    cdf_kwargs={"chunks":{"time":1}},
    preprocess=correct_coordinates
)
```
%% Cell type:markdown id: tags:
<a class="anchor" id="derived"></a>
## Derived variables
Most of the following is taken from the [intake-esm tutorial](https://intake-esm.readthedocs.io/en/latest/how-to/define-and-use-derived-variable-registry.html).
> A “derived variable” in this case is a variable that doesn’t itself exist in an intake-esm catalog, but can be computed (i.e., “derived”) from variables that do exist in the catalog. Currently, the derived variable implementation requires variables on the same grid, etc.; i.e., it assumes that all variables involved can be merged within the same dataset. [...] Derived variables could include more sophisticated diagnostic output like aggregations of terms in a tracer budget or gradients in a particular field.
The registry of derived variables can be connected to the catalog. When users open datasets from such a catalog, intake-esm loads the required input variables and computes the derived variables on the fly.
%% Cell type:code id: tags:
``` python
import intake
import intake_esm
```
%% Cell type:code id: tags:
``` python
from intake_esm import DerivedVariableRegistry
```
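%% Cell type:markdown id: tags:
A derivation function simply takes a dataset, adds the new variable, and returns the dataset. Below is a minimal sketch: the function itself is ordinary Python (demonstrated on a pandas DataFrame standing in for an xarray Dataset), while the commented lines show roughly how registration looks with `DerivedVariableRegistry`; the variable names and query are invented for illustration:
%% Cell type:code id: tags:
``` python
import pandas as pd

def tas_to_degc(ds):
    """Derive near-surface air temperature in degC from tas in Kelvin."""
    ds = ds.copy()
    ds["tas_degC"] = ds["tas"] - 273.15
    return ds

# With intake-esm, registration is sketched roughly like this:
#   dvr = DerivedVariableRegistry()
#   @dvr.register(variable="tas_degC", query={"variable_id": "tas"})
#   def tas_to_degc(ds): ...
#   cat = intake.open_esm_datastore("esm_subset.json", registry=dvr)

# stand-in "dataset" to demonstrate the derivation itself
ds = pd.DataFrame({"tas": [273.15, 300.0]})
derived = tas_to_degc(ds)
print(derived["tas_degC"].tolist())
```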
%% Cell type:markdown id: tags:
```{seealso}
This tutorial is part of a series on `intake`:
* [Part 1: Introduction](https://data-infrastructure-services.gitlab-pages.dkrz.de/tutorials-and-use-cases/tutorial_intake-1-introduction.html)
* [Part 2: Modifying and subsetting catalogs](https://data-infrastructure-services.gitlab-pages.dkrz.de/tutorials-and-use-cases/tutorial_intake-2-subset-catalogs.html)
* [Part 3: Merging catalogs](https://data-infrastructure-services.gitlab-pages.dkrz.de/tutorials-and-use-cases/tutorial_intake-3-merge-catalogs.html)
* [Part 4: Use preprocessing and create derived variables](https://data-infrastructure-services.gitlab-pages.dkrz.de/tutorials-and-use-cases/tutorial_intake-4-preprocessing-derived-variables.html)
* [Part 5: How to create an intake catalog](https://data-infrastructure-services.gitlab-pages.dkrz.de/tutorials-and-use-cases/tutorial_intake-5-create-esm-collection.html)
* You can also do another [CMIP6 tutorial](https://intake-esm.readthedocs.io/en/latest/user-guide/cmip6-tutorial.html) from the official intake page.
```
%% Cell type:code id: tags:
``` python
```