Commit b502c15c authored by Fabian Wachsmann's avatar Fabian Wachsmann

Updated tuts

%% Cell type:markdown id: tags:
# Intake I part 2 - DKRZ catalog scheme, strategy and services
%% Cell type:markdown id: tags:
```{admonition} Overview
:class: dropdown
![Level](https://img.shields.io/badge/Level-Introductory-green.svg)
🎯 **objectives**: Learn which `intake-esm` ESM-collections DKRZ offers
⌛ **time_estimation**: "15min"
☑️ **requirements**: None
© **contributors**: k204210
⚖ **license**:
```
%% Cell type:markdown id: tags:
```{admonition} Agenda
:class: tip
In this part, you learn
1. [DKRZ intake-esm catalog schema](#examples)
1. [DKRZ intake-esm catalogs for project data](#examples)
1. [Catalog dependencies on different stores](#stores)
1. [Workflow at Levante for collecting and merging catalogs into main catalog](#workflow)
```
%% Cell type:markdown id: tags:
<a class="anchor" id="examples"></a>
## DKRZ intake-esm catalog strategy and schema
DKRZ catalogs aim at using one common scheme for their attributes so that **combining** catalogs and working with multiple catalogs at the same time becomes easy. In collaboration with NextGEMS scientists, we agreed on attribute names that DKRZ intake-esm catalogs should be equipped with. The resulting scheme is named the *cataloonies scheme*.
```{note}
The cataloonies scheme is not a formal standard; it is evolving and will be adapted to use cases. It is mainly influenced by ICON output and the CMIP standard. If you have suggestions, please contact us.
```
- As a result, you will find **redundant** attributes with the same meaning in project catalogs, e.g.:
- source_id, model_id, model
- member_id, ensemble_member, simulation_id
- Which of these attributes are loaded into the python workflow can be set (see intake-1).
- You will find only **one version** of each *atomic dataset* in each catalog: the most recent one available in the store. An atomic dataset is identified by a unique combination of values of all catalog attributes except the time range; it covers the entire time span of the simulation.
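To work with such redundant attributes across catalogs, column names can be harmonized after loading. A minimal offline sketch with *pandas*; the toy table and the synonym mapping are hypothetical:

```python
import pandas as pd

# Hypothetical mapping of redundant attribute names to one common scheme
synonyms = {
    "model": "source_id",
    "model_id": "source_id",
    "ensemble_member": "simulation_id",
    "member_id": "simulation_id",
}

# Toy catalog table using CMIP5-style column names
cmip5_like = pd.DataFrame({"model": ["MPI-ESM"], "ensemble_member": ["r1i1p1"]})

# Rename the columns to the cataloonies-style names
harmonized = cmip5_like.rename(columns=synonyms)
print(list(harmonized.columns))  # ['source_id', 'simulation_id']
```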
%% Cell type:markdown id: tags:
<a class="anchor" id="examples"></a>
### The cataloonies scheme
ESM script developers, project scientists and data managers together defined attribute names that DKRZ intake-esm catalogs should be equipped with. One benefit resulting from the composition of this working group is that these attribute names can be used throughout the research data life cycle: at the earliest point, for model raw output, but also at the latest point, for data that is standardized and published.
The integration of intake-esm catalog generation into ESM run scripts is planned. This will enable the use of intake, and with it the easy usage and configuration of the Python software stack for data processing, from the beginning of the data life cycle.
Already existing catalogs will be provided with the newly defined attributes. For some, the values will fall back to *None*, as there is no easy way to retrieve them without looking into the asset itself, which is technically not feasible given the amount of published project data. For CMIP6-like projects, we can take missing information from the *cmor-mip-tables*, which represent the data standard of the project.
%% Cell type:code id: tags:hide-input
``` python
import pandas as pd

cataloonies_raw = [
    ["#", "Attribute name and column name", "Examples", "Description", "Comments"],
    [1, "project", "DYAMON-WINTER", "The project in which the data was produced", ""],
    [2, "institution_id", "MPIM-DWD-DKRZ", "The institution that runs the simulations", ""],
    [3, "source_id", "ICON-SAP-5km", "The Earth System Model which produced the simulations", ""],
    [4, "experiment_id", "DW-CPL / DW-ATM", "The short term for the experiment that was run", ""],
    [5, "simulation_id", "dpp1234", "The simulation/member/realization of the ensemble.", ""],
    [6, "realm", "atm / oce", "The submodel of the ESM which produces the output.", ""],
    [7, "frequency", "PT1h or 1hr – style", "The frequency of the output", "ICON uses ISO format"],
    [8, "time_reduction", "mean / inst / timmax / …", "The method used for sampling and averaging along time. The same as the time part of cell_methods.", ""],
    [9, "grid_label", "gn", "A clear description of the grid for distinguishing between native and regridded grids.", ""],
    [10, "grid_id", "", "A specific identifier of the grid.", "we might need more than one (e.g. horizontal + vertical)"],
    [11, "variable_id", "tas", "The CMIP short term of the variable.", ""],
    [12, "level_type", "pressure_level, atmosphere_level", "The vertical axis type used for the variable.", ""],
    [13, "time_min", 1800, "The minimal time value covered by the asset.", ""],
    [14, "time_max", 1900, "The maximal time value covered by the asset.", ""],
    [15, "format", "netcdf/zarr/…", "The format of the asset.", ""],
    [16, "uri", "url, path-to-file", "The uri used to open and load the asset.", ""],
    [17, "(time_range)", "start-end", "Combination of time_min and time_max.", ""],
]
pd.DataFrame(cataloonies_raw[1:], columns=cataloonies_raw[0])[cataloonies_raw[0][1:-1]]
```
%%%% Output: execute_result
Attribute name and column name ... Description
0 project ... The project in which the data was produced
1 institution_id ... The institution that runs the simulations
2 source_id ... The Earth System Model which produced the simu...
3 experiment_id ... The short term for the experiment that was run
4 simulation_id ... The simulation/member/realization of the ensem...
5 realm ... The submodel of the ESM which produces the out...
6 frequency ... The frequency of the output
7 time_reduction ... The method used for sampling and averaging alo...
8 grid_label ... A clear description for the grid for distingus...
9 grid_id ... A specific identifier of the grid.
10 variable_id ... The CMIP short term of the variable.
11 level_type ... The vertical axis type used for the variable.
12 time_min ... The minimal time value covered by the asset.
13 time_max ... The maximal time value covered by the asset.
14 format ... The format of the asset.
15 uri ... The uri used to open and load the asset.
16 (time_range) ... Combination of time_min and time_max.
[17 rows x 3 columns]
%% Cell type:markdown id: tags:
<a class="anchor" id="examples"></a>
## DKRZ intake-esm catalogs for community project data
%% Cell type:markdown id: tags:
### Jobs we do for you
- We **make all catalogs available**
- under `/pool/data/Catalogs/` for logged-in HPC users
- in the [cloud](https://gitlab.dkrz.de/data-infrastructure-services/intake-esm/-/tree/master/esm-collections)
- We **create and update** the content of the projects' catalogs regularly by running scripts which are executed automatically, so-called _cronjobs_. We set the update frequency so that the data of the project is updated sufficiently quickly.
- The updated catalog __replaces__ the outdated one.
- The updated catalog is __uploaded__ to the DKRZ swift cloud
- We plan to provide a catalog that tracks data which is __removed__ by the update.
%% Cell type:markdown id: tags:
### The data bases of project catalogs
**Creation of the `.csv.gz` table :**
1. A file list is created based on a `find` shell command on the project directory in the data pool.
1. For the column values, filenames and paths are parsed according to the project's `path_template` and `filename_template`. These templates are constructed with the attribute values requested and required by the project.
- Filenames that cannot be parsed are sorted out
1. If more than one version is found for a dataset, only the most recent one is kept.
1. Depending on the project, additional columns can be created by adding project's specifications.
- E.g., for CMIP6, we added an `OpenDAP` column which allows users to access data from everywhere via `http`
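The steps above can be sketched as follows. The `path_template` and the example paths are hypothetical and much simpler than real project templates:

```python
import pandas as pd

# Hypothetical path template: which attribute each path component encodes
path_template = ["mip_era", "institution_id", "source_id", "experiment_id", "version"]

paths = [
    "CMIP6/MPI-M/MPI-ESM1-2-HR/historical/v20190710/tas.nc",
    "CMIP6/MPI-M/MPI-ESM1-2-HR/historical/v20210101/tas.nc",  # newer version
    "README.txt",  # cannot be parsed -> sorted out
]

records = []
for path in paths:
    parts = path.split("/")[:-1]  # drop the filename component
    if len(parts) != len(path_template):
        continue  # step 2: unparsable entries are sorted out
    record = dict(zip(path_template, parts))
    record["uri"] = path
    records.append(record)

df = pd.DataFrame(records)

# Step 3: keep only the most recent version of each dataset
dataset_cols = [c for c in path_template if c != "version"]
df = df.sort_values("version").groupby(dataset_cols, as_index=False).last()
print(df["version"].tolist())  # ['v20210101']
```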
%% Cell type:markdown id: tags:
<a class="anchor" id="examples"></a>
As of 2022, we offer project data for
```{tabbed} CMIP6-like projects
- 💾 **projects** :
- [CMIP6](https://c6de.dkrz.de/)
- [PalMod2](https://www.palmod.de/)
- **Data location**:
- is hosted within the [CMIP Data Pool](https://cmip-data-pool.dkrz.de), accessible for all DKRZ users under `/pool/data/PROJECT`
- **Attributes**:
- the catalog's attributes are explained [here](https://goo.gl/v1drZl)
- `source_id` and `experiment_id` are the most important attributes. A unique `source_id` refers to exactly one model; different institutions can use it, but it is the same model. An `experiment_id` can be found in only *one activity*.
- values of `member_id` are rather arbitrary. Some members might never be published, others have been retracted. There is no guarantee that *r1i1p1f1* is available.
- **Variable definition**:
- a **unique variable** is a combination of `table_id` and `variable_id`. A variable can look different from one table to another, i.e. it can have different dimensions for monthly frequency than for daily frequency.
```
```{tabbed} CMIP5-like projects
- 💾 **projects** :
- CMIP5:
- [CORDEX](https://is-enes-data.github.io/cordex_archive_specifications.pdf)
- **Data location**:
- CMIP5:
- Only a small subset is still available on the pool's common and shared disk resources due to a lack of disk storage capacity. Most of the data, however, has been archived and can be accessed via `jblob`.
- CORDEX:
- On disk, CORDEX data are disseminated across different storage projects
- **Attributes**:
- In comparison to CMIP6-like projects, such projects build on the older CMIP5 standard. Therefore, some attributes have different names.
- Regional model data includes additional attributes in comparison to CMIP5: `CORDEX_domain`, `driving_model_id` and `rcm_version_id`.
- **Variable definition**:
- A **unique variable** is a combination of `mip_table` and `variable`. A variable can look different from one table to another, i.e. it can have different dimensions for monthly frequency than for daily frequency.
```
```{tabbed} ESM-raw-output-near projects
- 💾 **projects** :
- [DYAMOND](https://easy.gems.dkrz.de/DYAMOND/index.html)
- [NextGEMS](https://easy.gems.dkrz.de/DYAMOND/NextGEMS/index.html)
- [ERA5](https://docs.dkrz.de/doc/dataservices/finding_and_accessing_data/era_data/index.html):
- **Data location**:
- For disk projects, mostly linked under `/pool/data`
- **Attributes**:
- We try to use the cataloonies scheme for the catalogs of all other projects. For reanalysis products, it cannot be fulfilled entirely, as the data is available in GRIB format.
```
%% Cell type:markdown id: tags:
<a class="anchor" id="stores"></a>
## Catalog dependencies on different stores
DKRZ's catalog naming convention distinguishes between the different storage formats as long as data access to stores like the archive is either not possible or very different from disk access. The special characteristics of the storages are explained in the following.
%% Cell type:markdown id: tags:
```{tabbed} Disk
- **Way of data access**:
- `uri` contains *paths* on levante's lustre filesystem
- **User requirements for data access**:
- users must be **logged in** to levante to *access* the data (exception: the `opendap_url` column, see [introduction]())
- **Provider requirements**:
- paths of a valid catalog must be *readable for everyone*
```
```{tabbed} Cloud
- **Way of data access**:
- `uri` contains *links* to datasets in DKRZ's swift cloud storage which can be opened with `xarray` and therefore with `intake`
- If the asset's `format` is *zarr* (specified in the `format` column), use `zarr_kwargs` in the `to_dataset_dict()` function
- **User requirements for data access**:
- None. Users can access the data from everywhere given a sufficient internet connection
- **Provider requirements**:
- links in a valid catalog must point at datasets in **open containers**
```
```{tabbed} Archive
- **Way of data access**:
- `uri` is empty, i.e. no direct data access via intake is possible. If the catalog contains a `jblob_file` column, users can however download the data via *jblob* on levante (see next point).
- **User requirements for data access**:
- users must be **logged in** to levante to *access* the data. After loading the module on levante via `module load jblob`, an example command for CMIP5 is `jblob --cmip5-file DSET`, where `DSET` is a value of `jblob_file`, e.g. `cmip5.output1.BCC.bcc-csm1-1.abrupt4xCO2.fx.atmos.fx.r0i0p0.v1.areacella.areacella_fx_bcc-csm1-1_abrupt4xCO2_r0i0p0.nc`
- **Provider requirements**:
- None
```
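Whether extra keyword arguments are needed when loading thus depends on the asset's `format`. A small, hypothetical helper illustrating the distinction; the keyword names (`zarr_kwargs`, `cdf_kwargs`) follow older intake-esm releases, newer ones use `xarray_open_kwargs`, so check your installed version:

```python
def open_kwargs(asset_format):
    """Return keyword arguments for to_dataset_dict() depending on the
    asset format (hypothetical helper; values are typical choices)."""
    if asset_format == "zarr":
        # zarr stores in the cloud are usually consolidated
        return {"zarr_kwargs": {"consolidated": True}}
    # netCDF assets: open lazily with dask chunks
    return {"cdf_kwargs": {"chunks": {"time": 1}}}

print(open_kwargs("zarr"))
print(open_kwargs("netcdf"))
```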
%% Cell type:markdown id: tags:
<a class="anchor" id="workflow"></a>
## Preparing project catalogs for DKRZ's main catalog
1. Use attributes of existing catalogs and/or templates in `/pool/data/Catalogs/Templates`, including at least `uri`, `format` and `project`.
1. Set permissions to *readable for everyone* for
- the data referenced in the catalog under `uri`
- the catalog itself
1. Use the naming convention for DKRZ catalogs (`dkrz_PROJECT_STORE`) for your catalog
1. Link the catalog via `ln -s PATH-TO-YOUR-CATALOG /pool/data/Catalogs/Candidates/YOUR-CATALOG`
%% Cell type:markdown id: tags:
Your catalog will then be caught by a cronjob which
1. tests your catalog
- against the catalog naming convention
- open, search and load
- for disk catalogs: are all `uri` values *readable*?
1. merges or creates your catalog
- if a catalog for the specified project exists in `/pool/data/Catalogs/`, the two will be merged if possible. Entries of your catalog are merged if they are not duplicates.
- else, your catalog will be written to `/work/ik1017/Catalogs` and a link will be set in `/pool/data/Catalogs/`
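A minimal sketch of the first test stage; the regular expression for the naming convention is our assumption, not the exact cronjob code:

```python
import os
import re

def follows_naming_convention(name):
    """Check the dkrz_PROJECT_STORE[_AUXILIARY] pattern (sketch).
    Values use small letters and no underscores, as required above."""
    pattern = r"dkrz_[a-z0-9-]+_(disk|cloud|archive)(_[a-z0-9-]+)?"
    return re.fullmatch(pattern, name) is not None

def unreadable_uris(uris):
    """For disk catalogs: return all referenced paths that are not readable."""
    return [uri for uri in uris if not os.access(uri, os.R_OK)]

print(follows_naming_convention("dkrz_cmip6_disk"))    # True
print(follows_naming_convention("dkrz_dkrz-dyamond"))  # False: store part missing
```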
%% Cell type:markdown id: tags:
```{seealso}
This tutorial is part of a series on `intake`:
* [Part 1: Introduction](https://data-infrastructure-services.gitlab-pages.dkrz.de/tutorials-and-use-cases/tutorial_intake-1-introduction.html)
* [Part 2: Modifying and subsetting catalogs](https://data-infrastructure-services.gitlab-pages.dkrz.de/tutorials-and-use-cases/tutorial_intake-2-subset-catalogs.html)
* [Part 3: Merging catalogs](https://data-infrastructure-services.gitlab-pages.dkrz.de/tutorials-and-use-cases/tutorial_intake-3-merge-catalogs.html)
* [Part 4: Use preprocessing and create derived variables](https://data-infrastructure-services.gitlab-pages.dkrz.de/tutorials-and-use-cases/tutorial_intake-4-preprocessing-derived-variables.html)
* [Part 5: How to create an intake catalog](https://data-infrastructure-services.gitlab-pages.dkrz.de/tutorials-and-use-cases/tutorial_intake-5-create-esm-collection.html)
- You can also do another [CMIP6 tutorial](https://intake-esm.readthedocs.io/en/latest/user-guide/cmip6-tutorial.html) from the official intake page.
```
%% Cell type:markdown id: tags:
%% Cell type:code id: tags:
``` python
```
%% Cell type:markdown id: tags:
# Intake I - find, browse and access `intake-esm` collections
%% Cell type:markdown id: tags:
```{admonition} Overview
:class: dropdown
![Level](https://img.shields.io/badge/Level-Introductory-green.svg)
🎯 **objectives**: Learn how to use `intake` to find, browse and access `intake-esm` ESM-collections
⌛ **time_estimation**: "30min"
☑️ **requirements**: `intake_esm.__version__ >= 2021.8.17`
© **contributors**: k204210
⚖ **license**:
```
%% Cell type:markdown id: tags:
```{admonition} Agenda
:class: tip
In this part, you learn
1. [Motivation of intake-esm](#motivation)
1. [Features of intake and intake-esm](#features)
1. [Browse through catalogs](#browse)
1. [Data access via intake-esm](#dataaccess)
```
%% Cell type:markdown id: tags:
<a class="anchor" id="motivation"></a>
We follow here the guidance presented by `intake-esm` on its [repository](https://intake-esm.readthedocs.io/en/latest/user-guide/cmip6-tutorial.html).
## Motivation of intake-esm
> Simulations of the Earth’s climate and weather generate huge amounts of data. These data are often persisted on different storages in a variety of formats (netCDF, zarr, etc...). Finding, investigating, loading these data assets into compute-ready data containers costs time and effort. The data user needs to know what data sets are available, the attributes describing each data set, before loading a specific data set and analyzing it.
> `Intake-esm` addresses these issues by providing necessary functionality for **searching, discovering, data access and data loading**.
%% Cell type:markdown id: tags:
For intake users, many data preparation tasks **are no longer necessary**. They do not need to know:
- 🌍 where data is saved
- 🪧 how data is saved
- 📤 how data should be loaded
but can still search, discover, access and load the data of a project.
%% Cell type:markdown id: tags:
<a class="anchor" id="features"></a>
## Features of intake and intake-esm
Intake is a generic **cataloging system** for listing data sources. As a plugin, `intake-esm` is built on top of `intake`, `pandas`, and `xarray`, and configures `intake` such that it is also able to **load and process** ESM data:
- display catalogs as clearly structured tables 📄 inside jupyter notebooks for easy investigation
- browse 🔍 through the catalog and select your data without
- being next to the data (e.g. logged in on DKRZ's levante)
- knowing the project's data reference syntax, i.e. the storage tree hierarchy and the path and file name templates
- open climate data in an analysis ready dictionary of `xarray` datasets 🎁
%% Cell type:markdown id: tags:
All required information for searching, accessing and loading the catalog's data is configured within the catalogs:
- 🌍 where data is saved
* users can browse data without knowing the data storage platform, including e.g. the root path of the project and the directory syntax
* data of different platforms (cloud or disk) can be combined in one catalog
* in the mid term, intake catalogs can become **a single point of access**
- 🪧 how data is saved
* users can work with a *xarray* dataset representation of the data no matter whether it is saved in **grb, netcdf or zarr** format.
* catalogs can contain more information, and therefore more search facets, than is obvious from the names and paths of the data.
- 📤 how data should be loaded
* users work with an **aggregated** *xarray* dataset representation which merges files/assets in a way perfectly fitted to the project's data model design.
* with *xarray* and the underlying *dask* library, data **larger than the RAM** can be loaded
%% Cell type:markdown id: tags:
In this tutorial, we load a CMIP6 catalog which contains all data from the pool on DKRZ's mistral disk storage.
CMIP6 is the 6th phase of the Coupled Model Intercomparison Project and forms the data basis of the IPCC AR6.
The CMIP6 catalog contains all data that is published or replicated at the ESGF node at DKRZ.
%% Cell type:code id: tags:
``` python
#note that intake_esm is imported with `import intake` as a plugin
import intake
```
%% Cell type:markdown id: tags:
<a class="anchor" id="terminology"></a>
## Terminology: **Catalog**, **Catalog file** and **Collection**
We align our wording with `intake`'s [*glossary*](https://intake.readthedocs.io/en/latest/glossary.html), which is still evolving. The terms overlap with other definitions, which makes them difficult to keep apart. Here we give an overview of the hierarchy of catalog terms:
- a **top level catalog file** 📋 is the **main** catalog of an institution and is opened first. It contains other project [*catalogs*](#catalog) 📖 📖 📖. Such catalogs can be assigned an [*intake driver*](#intakedriver) which is used to open and load the catalog within the top level catalog file. Technically, a catalog file 📋 <a class="anchor" id="catalogfile"></a>
- is a `.yaml` file
- can be opened with `open_catalog`, e.g.:
```python
intake.open_catalog("https://gitlab.dkrz.de/data-infrastructure-services/intake-esm/-/raw/master/esm-collections/cloud-access/dkrz_catalog.yaml")
```
- **intake drivers**, also named **plugins**, are specified for [*catalogs*](#catalog) because they load specific data sets. <a class="anchor" id="intakedriver"></a>
%% Cell type:markdown id: tags:
- a **catalog** 📖 (or collection) is defined by two parts: <a class="anchor" id="catalog"></a>
- a **description** of a group of data sets. It describes how to *load* **assets** of the data set(s) with the specified [driver](#intakedriver). This group forms an entity. E.g., all CMIP6 data sets can be collected in a catalog. <a class="anchor" id="description"></a>
- an **asset** is most often a file. <a class="anchor" id="asset"></a>
- a **collection** of all [assets](#asset) of the data set(s). <a class="anchor" id="collection"></a>
- the collection can be included in the catalog or separately saved in a **data base** 🗂. In the latter case, the catalog references the data base, e.g.:
```json
"catalog_file": "/mnt/lustre02/work/ik1017/Catalogs/dkrz_cmip6_disk.csv.gz"
```
```{note}
The term *collection* is often used synonymically for [catalog](#catalog).
```
%% Cell type:markdown id: tags:
- an *intake-esm* catalog 📖 is a `.json` file and can be opened with intake-esm's function `intake.open_esm_datastore()`, e.g.:
```python
intake.open_esm_datastore("https://gitlab.dkrz.de/data-infrastructure-services/intake-esm/-/raw/master/esm-collections/cloud-access/dkrz_cmip6_disk.json")
```
%% Cell type:markdown id: tags:
<a class="anchor" id="browse"></a>
## Open and browse through catalogs
We begin with using only *intake* functions for catalogs. Afterwards, we continue with concrete *intake-esm* utilities.
intake **opens** catalog files in `yaml` format. These contain information about additional sources: other catalogs/collections which will be loaded with specific *plugins*/*drivers*. The command is `open_catalog`.
<mark> You only need to remember one URL as the *single point of access* for DKRZ's intake catalogs: The DKRZ top level catalog can be accessed via dkrz.de/s/intake . Since intake does not follow the redirect, you can use **requests** to resolve that link:</mark>
%% Cell type:code id: tags:
``` python
#import requests
#r=requests.get("http://dkrz.de/s/intake")
#print(r.url)
```
%% Cell type:code id: tags:
``` python
#dkrz_catalog=intake.open_catalog(r.url)
dkrz_catalog=intake.open_catalog("https://gitlab.dkrz.de/data-infrastructure-services/intake-esm/-/raw/master/esm-collections/cloud-access/dkrz_catalog.yaml")
```
%% Cell type:markdown id: tags:
```{note}
Right now, two versions of the top level catalog file exist: one for accessing the catalogs via the [cloud](https://gitlab.dkrz.de/data-infrastructure-services/intake-esm/-/raw/master/esm-collections/cloud_access/dkrz_catalog.yaml), one for access via [disk](https://gitlab.dkrz.de/data-infrastructure-services/intake-esm/-/raw/master/esm-collections/disk_access/dkrz_catalog.yaml). They contain, however, **the same content**.
```
%% Cell type:markdown id: tags:
We can look into the catalog with `print` and `list`:
%% Cell type:code id: tags:
``` python
list(dkrz_catalog)
```
%% Cell type:code id: tags:
``` python
print(dkrz_catalog.yaml())
```
%% Cell type:markdown id: tags:
Over time, many collections have been created. `dkrz_catalog` is a **main** catalog prepared to keep an overview of all other collections. `list` shows all **project catalogs** which are available at DKRZ.
All these catalogs are **intake-esm** catalogs.
%% Cell type:markdown id: tags:
The DKRZ ESM-Collections follow a name template:
`dkrz_${project}_${store}[_${auxiliary_catalog}]`
where
- **project** refers to the data project, e.g. a *model intercomparison project*; one of `cmip6`, `cmip5`, `cordex`, `era5` or `mpi-ge`.
- **store** is the data store and can be one of:
- `disk`: DKRZ holds a lot of data on a consortial disk space on the file system of the High Performance Computer (HPC), where it is accessible for every HPC user. If you use this ESM collection, you have to work on the HPC to load the data. Browsing and discovering work independently of your workstation.
- `cloud`: A small subset is transferred into DKRZ's cloud in order to test the performance. swift is DKRZ's cloud storage.
- `archive`: A lot of data exists in DKRZ's tape archive. Before it can be accessed, it has to be retrieved. Therefore, catalogs for `hsm` are limited in functionality but still convenient for data browsing.
- **auxiliary_catalog** can be *grid*
%% Cell type:markdown id: tags:
**Best practice for naming catalogs**:
- Use small letters for all values
- Do **NOT** use `_` as a separator in values
- Do not repeat values of other attributes ("dkrz_dkrz-dyamond")
%% Cell type:markdown id: tags:
Let's have a look into a master catalog of [Pangeo](https://pangeo.io/):
%% Cell type:code id: tags:
``` python
pangeo=intake.open_catalog("https://raw.githubusercontent.com/pangeo-data/pangeo-datastore/master/intake-catalogs/master.yaml")
```
%% Cell type:code id: tags:
``` python
pangeo
```
%% Cell type:code id: tags:
``` python
list(pangeo)
```
%% Cell type:markdown id: tags:
While DKRZ's master catalog has one sublevel, Pangeo's is nested. We can access another `yaml` catalog, which is also a **parent** catalog, by simply:
%% Cell type:code id: tags:
``` python
pangeo.climate
```
%% Cell type:markdown id: tags:
Pangeo's ESM collections are one level deeper in the catalog tree:
%% Cell type:code id: tags:
``` python
list(pangeo.climate)
```
%% Cell type:markdown id: tags:
%% Cell type:markdown id: tags:
### The role of `intake-esm`
We now look into a catalog which is opened by the plugin `intake-esm`.
> An ESM (Earth System Model) collection file is a `JSON` file that conforms to the ESM Collection Specification. When provided a link/path to an esm collection file, intake-esm establishes a link to a database (`CSV` file) that contains data asset locations and associated metadata (i.e., which experiment and model they come from).
```{note}
The project catalogs contain only valid and current project data and are constantly updated.
Since the data base of the CMIP6 ESM collection is about 100MB in compressed format, it takes up to a minute to load the catalog.
If your work is based on a catalog and a subset of its data, be sure to save that subset so that you can later compare your data basis against the most current catalog.
```
%% Cell type:code id: tags:
``` python
esm_col=dkrz_catalog.dkrz_cmip6_disk
print(esm_col)
```
%% Cell type:markdown id: tags:
`intake-esm` gives us an overview of the content of the ESM collection. The ESM collection is a data base described by specific attributes, which are technically columns. Each project's data standard defines the columns and is used to parse the information given by path and file names.
The pure display of `esm_col` shows us the number of unique values in each column. Since each `uri` refers to one file, we can conclude that the DKRZ-CMIP6 ESM collection contains **6.08 million files** as of 2022.
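Following the advice in the note above, a search result can be persisted for later comparison. An offline sketch with a toy dataframe; with a real catalog you would call `esm_col.search(...)` and save the resulting `subset.df`:

```python
import pandas as pd

# Toy stand-in for the catalog's underlying dataframe
df = pd.DataFrame({
    "source_id": ["MPI-ESM1-2-HR", "MPI-ESM1-2-HR", "AWI-CM-1-1-MR"],
    "variable_id": ["tas", "pr", "tas"],
    "uri": ["a.nc", "b.nc", "c.nc"],
})

# Subset as a search would return it, then persist it for later comparison
subset = df[(df["source_id"] == "MPI-ESM1-2-HR") & (df["variable_id"] == "tas")]
subset.to_csv("my_cmip6_subset.csv", index=False)
print(len(subset))  # 1
```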
%% Cell type:markdown id: tags:
The data base is loaded into an underlying `pandas` dataframe which we can access with `col.df`. `col.df.head()` displays the first rows of the table:
%% Cell type:code id: tags:
``` python