Commit 6465bc05 authored by Fabian Wachsmann

Updated intake 1-2

parent 87adf72c
Pipeline #17545 passed with stage
in 26 seconds
%% Cell type:markdown id: tags:
# Intake I - find, browse and access `intake-esm` collections
%% Cell type:markdown id: tags:
```{admonition} Overview
:class: dropdown
![Level](https://img.shields.io/badge/Level-Introductory-green.svg)
🎯 **objectives**: Learn which `intake-esm` ESM collections DKRZ offers
⌛ **time_estimation**: "15min"
☑️ **requirements**: None
© **contributors**: k204210
⚖ **license**:
```
%% Cell type:markdown id: tags:
```{admonition} Agenda
:class: tip
In this part, you learn about:
1. [DKRZ intake-esm catalog schema](#examples)
1. [DKRZ intake-esm catalogs for project data](#examples)
1. [Catalog dependencies on different stores](#stores)
1. [Workflow at Levante for collecting and merging catalogs into main catalog](#workflow)
```
%% Cell type:markdown id: tags:
<a class="anchor" id="examples"></a>
## DKRZ intake-esm catalog strategy and schema
DKRZ catalogs aim to use one common scheme for their attributes so that **combining** catalogs and working with multiple catalogs at the same time is easy. In collaboration with NextGEMS scientists, we agreed on attribute names that DKRZ intake-esm catalogs should provide. The resulting scheme is named the *cataloonies scheme*.
```{note}
The cataloonies scheme is not a formal standard; it keeps evolving and adapting to use cases. It is influenced mainly by ICON output and the CMIP standard. If you have suggestions, please contact us.
```
- As a result, project catalogs contain **redundant** attributes that have the same meaning, e.g.:
    - `source_id`, `model_id`, `model`
    - `member_id`, `ensemble_member`, `simulation_id`
- You can configure which of these attributes are loaded into the Python workflow (see intake-1).
- Each catalog contains only **one version** of each *atomic dataset*: the most recent one available in the store. An atomic dataset is defined by a unique combination of values for all catalog attributes except time, because it covers the entire time span of the simulation.
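
When several catalogs are combined, the redundant attribute names listed above have to be mapped onto one common name first. The following is a minimal sketch of that idea in pandas, not DKRZ's implementation: the `ALIASES` mapping and the `harmonize` helper are hypothetical, the example rows are made up, and choosing `source_id` and `simulation_id` as the common names is an assumption.

```python
import pandas as pd

# Hypothetical alias table built from the attribute groups listed above.
# The target names (source_id, simulation_id) are an assumption.
ALIASES = {
    "model_id": "source_id",
    "model": "source_id",
    "member_id": "simulation_id",
    "ensemble_member": "simulation_id",
}


def harmonize(df: pd.DataFrame) -> pd.DataFrame:
    """Rename aliased columns to the common name unless that name is already taken."""
    taken = set(df.columns)
    renames = {}
    for old, new in ALIASES.items():
        if old in df.columns and new not in taken:
            renames[old] = new
            taken.add(new)
    return df.rename(columns=renames)


# Made-up one-row tables that mimic two differently named catalogs.
cmip5_like = pd.DataFrame({"model": ["MPI-ESM-LR"], "ensemble_member": ["r1i1p1"]})
cmip6_like = pd.DataFrame({"source_id": ["MPI-ESM1-2-LR"], "member_id": ["r1i1p1f1"]})

combined = pd.concat([harmonize(cmip5_like), harmonize(cmip6_like)], ignore_index=True)
print(combined.columns.tolist())
```

After harmonizing, both tables share the same column names, so the concatenated table has no half-empty duplicate columns.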
%% Cell type:markdown id: tags:
<a class="anchor" id="examples"></a>
### The cataloonies scheme
ESM script developers, project scientists and data managers jointly defined attribute names that DKRZ intake-esm catalogs should provide. Because all of these groups were involved, the attribute names can be used throughout the research data life cycle: at the earliest point, for raw model output, and at the latest point, for standardized and published data.
Integrating intake-esm catalog generation into ESM run scripts is planned. This will make intake, and with it the Python software stack for data processing, easy to use and configure from the very beginning of the data life cycle.
Existing catalogs will be extended with the newly defined attributes. For some attributes, the values will fall back to *None* because they can only be retrieved by opening the asset, which is not feasible given the amount of published project data. For CMIP6-like projects, we can take missing information from the *cmor-mip-tables*, which represent the data standard of the project.
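
Extending an existing catalog table with new attributes that fall back to *None* could look like the sketch below. It assumes the catalog is available as a pandas DataFrame. `CATALOONIES_COLUMNS` is a shortened, illustrative subset of the attribute names, and `add_missing_attributes` is a hypothetical helper, not part of intake-esm.

```python
import pandas as pd

# Shortened, illustrative subset of the cataloonies attribute names.
CATALOONIES_COLUMNS = ["project", "source_id", "experiment_id", "grid_label", "uri", "format"]


def add_missing_attributes(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy in which every missing cataloonies column is added and set to None."""
    out = df.copy()
    for column in CATALOONIES_COLUMNS:
        if column not in out.columns:
            out[column] = None
    return out


# Made-up legacy catalog that only provides three of the attributes.
legacy = pd.DataFrame(
    {"project": ["CMIP5"], "uri": ["/path/to/file.nc"], "format": ["netcdf"]}
)
updated = add_missing_attributes(legacy)
print(updated.columns.tolist())
```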
%% Cell type:code id: tags:
``` python
import pandas as pd

# First row: column headers. Each following row describes one cataloonies attribute.
# Rows without a comment carry an explicit None so that all rows have the same length.
cataloonies_raw = [
    ["#", "Attribute name and column name", "Examples", "Description", "Comments"],
    [1, "project", "DYAMOND-WINTER", "The project in which the data was produced", None],
    [2, "institution_id", "MPIM-DWD-DKRZ", "The institution that runs the simulations", None],
    [3, "source_id", "ICON-SAP-5km", "The Earth System Model which produced the simulations", None],
    [4, "experiment_id", "DW-CPL / DW-ATM", "The short term for the experiment that was run", None],
    [5, "simulation_id", "dpp1234", "The simulation/member/realization of the ensemble.", None],
    [6, "realm", "atm / oce", "The submodel of the ESM which produces the output.", None],
    [7, "frequency", "PT1h or 1hr – Style", "The frequency of the output", "ICON uses ISO format"],
    [8, "time_reduction", "mean / inst / timmax /…", "The method used for sampling and averaging along time. The same as the time part of cell_methods.", None],
    [9, "grid_label", "gn", "A clear description for the grid for distinguishing between native and regridded grids.", None],
    [10, "grid_id", "", "A specific identifier of the grid.", "we might need more than one (e.g. horizontal + vertical)"],
    [11, "variable_id", "tas", "The CMIP short term of the variable.", None],
    [12, "level_type", "pressure_level, atmosphere_level", "The vertical axis type used for the variable.", None],
    [13, "time_min", 1800, "The minimal time value covered by the asset.", None],
    [14, "time_max", 1900, "The maximal time value covered by the asset.", None],
    [15, "format", "netcdf/zarr/…", "The format of the asset.", None],
    [16, "uri", "url,path-to-file", "The uri used to open and load the asset.", None],
    [17, "(time_range)", "start-end", "Combination of time_min and time_max.", None],
]
header, rows = cataloonies_raw[0], cataloonies_raw[1:]
# Display every column except the running number and the comments.
pd.DataFrame(rows, columns=header)[header[1:-1]]
```
%%%% Output: execute_result
Attribute name and column name ... Description
0 project ... The project in which the data was produced
1 institution_id ... The institution that runs the simulations
2 source_id ... The Earth System Model which produced the simu...
3 experiment_id ... The short term for the experiment that was run
4 simulation_id ... The simulation/member/realization of the ensem...
5 realm ... The submodel of the ESM which produces the out...
6 frequency ... The frequency of the output
7 time_reduction ... The method used for sampling and averaging alo...
8 grid_label ... A clear description for the grid for distingui...
9 grid_id ... A specific identifier of the grid.
10 variable_id ... The CMIP short term of the variable.
11 level_type ... The vertical axis type used for the variable.
12 time_min ... The minimal time value covered by the asset.
13 time_max ... The maximal time value covered by the asset.
14 format ... The format of the asset.
15 uri ... The uri used to open and load the asset.
16 (time_range) ... Combination of time_min and time_max.
[17 rows x 3 columns]
%% Cell type:markdown id: tags:
<a class="anchor" id="examples"></a>
## DKRZ intake-esm catalogs for community project data
%% Cell type:markdown id: tags:
<a class="anchor" id="examples"></a>
As of 2022, we offer project data for:
```{tabbed} CMIP6-like projects
- 💾 **projects**:
    - [CMIP6](https://c6de.dkrz.de/)
    - [PalMod2](https://www.palmod.de/)
- **Data location**:
    - The data is hosted in the [cmip-data-pool](cmip-data-pool.dkrz.de), which is accessible to all DKRZ users under `/pool/data/PROJECT`.
- **Attributes**:
    - The catalog's attributes are explained [here](https://goo.gl/v1drZl).
    - `source_id` and `experiment_id` are the most important attributes. A `source_id` always refers to one and the same model, even if several institutions use it. An `experiment_id` belongs to exactly *one activity*.
    - Values of `member_id` are rather arbitrary. Some members might never be published, others have been retracted. There is no guarantee that *r1i1p1f1* is available.
- **Variable definition**:
    - A **unique variable** is a combination of `table_id` and `variable_id`. A variable can look different from one table to another. For example, a variable can have different dimensions for monthly frequency than for daily frequency.
```
```{tabbed} CMIP5-like projects
- 💾 **projects**:
    - CMIP5
    - [CORDEX](https://is-enes-data.github.io/cordex_archive_specifications.pdf)
    - MPI-GE
- **Data location**:
    - CMIP5: Only a small subset is still available on the pool's common and shared disk resource because disk storage capacity is limited. Most of the data has been archived and can be accessed via `jblob`.
    - CORDEX: On disk, CORDEX data is distributed across different storage projects.
- **Attributes**:
    - In contrast to CMIP6-like projects, these projects build on the older CMIP5 standard. Therefore, some attributes have different names.
    - Regional model data includes additional attributes in comparison to CMIP5: `CORDEX_domain`, `driving_model_id` and `rcm_version_id`.
- **Variable definition**:
    - A **unique variable** is a combination of `mip_table` and `variable`. A variable can look different from one table to another. For example, a variable can have different dimensions for monthly frequency than for daily frequency.
```
```{tabbed} ESM-raw-output-near projects
- 💾 **projects**:
    - [DYAMOND](https://easy.gems.dkrz.de/DYAMOND/index.html)
    - [NextGEMS](https://easy.gems.dkrz.de/DYAMOND/NextGEMS/index.html)
    - [ERA5](https://docs.dkrz.de/doc/dataservices/finding_and_accessing_data/era_data/index.html)
- **Data location**:
    - For disk projects, the data is mostly linked under `/pool/data`.
- **Attributes**:
    - We try to use the cataloonies scheme for the catalogs of all other projects. For reanalysis products, it cannot be fully applied because the data is available in GRIB format.
```
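
To illustrate the variable definition for CMIP6-like catalogs, the sketch below counts unique variables in a small catalog table. The rows are invented for illustration and do not come from a DKRZ catalog; only the column names `table_id` and `variable_id` are taken from the description above.

```python
import pandas as pd

# Made-up catalog rows; only the column names follow the CMIP6-like catalogs.
catalog = pd.DataFrame(
    {
        "source_id": ["MPI-ESM1-2-LR"] * 3,
        "table_id": ["Amon", "day", "Amon"],
        "variable_id": ["tas", "tas", "pr"],
    }
)

# A unique variable is the pair (table_id, variable_id), not variable_id alone.
unique_variables = catalog[["table_id", "variable_id"]].drop_duplicates()
n_ids = catalog["variable_id"].nunique()
print(len(unique_variables), "unique variables for", n_ids, "variable_id values")
```

Here `tas` appears in two tables and therefore counts as two unique variables. For CMIP5-like catalogs, the same idea would apply to the columns `mip_table` and `variable`.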
%% Cell type:markdown id: tags:
<a class="anchor" id="stores"></a>
## Catalog dependencies on different stores
DKRZ's catalog naming convention distinguishes between storage types for as long as data access to stores such as the archive is either not possible or very different from disk access. The particularities of each store are explained below.
%% Cell type:markdown id: tags:
```{tabbed} Disk
- **Way of data access**:
    - `uri` contains *paths* on levante's Lustre filesystem.
- **User requirements for data access**:
    - Users must be **logged in** to levante to *access* the data (exception: the `opendap_url` column, see [introduction]()).
- **Provider requirements**:
    - The paths in a valid catalog must be *readable for everyone*.
```
```{tabbed} Cloud
- **Way of data access**:
    - `uri` contains *links* to datasets in DKRZ's Swift cloud storage, which can be opened with `xarray` and therefore with `intake`.
    - If the asset's format is *zarr* (specified in the `format` column), use `zarr_kwargs` in the `to_dataset_dict()` function.
- **User requirements for data access**:
    - None. Users can access the data from anywhere, provided their internet connection is sufficient.
- **Provider requirements**:
    - The links in a valid catalog must point to datasets in **open containers**.
```
```{tabbed} Archive
- **Way of data access**:
    - `uri` is empty, i.e. no direct data access via intake is possible. If the catalog contains a `jblob_file` column, users can nevertheless download the data via *jblob* on levante (see next point).
- **User requirements for data access**:
    - Users must be **logged in** to levante to *access* the data. Load the module on levante via `module load jblob`. For CMIP5, an example command is `jblob --cmip5-file DSET`, where `DSET` is a value of `jblob_file`, e.g. `cmip5.output1.BCC.bcc-csm1-1.abrupt4xCO2.fx.atmos.fx.r0i0p0.v1.areacella.areacella_fx_bcc-csm1-1_abrupt4xCO2_r0i0p0.nc`.
- **Provider requirements**:
    - None
```
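
For archive catalogs, the download command can be assembled from a value of the `jblob_file` column. The sketch below only builds the command string for the CMIP5 case described in the Archive tab and does not execute it. `jblob_command` is a hypothetical helper; the flag `--cmip5-file` and the example value are the ones quoted above.

```python
import shlex


def jblob_command(jblob_file: str) -> str:
    """Build the shell command that downloads one CMIP5 file via jblob."""
    # shlex.quote protects the value in case it contains shell-special characters.
    return "jblob --cmip5-file " + shlex.quote(jblob_file)


dset = (
    "cmip5.output1.BCC.bcc-csm1-1.abrupt4xCO2.fx.atmos.fx.r0i0p0.v1."
    "areacella.areacella_fx_bcc-csm1-1_abrupt4xCO2_r0i0p0.nc"
)
print(jblob_command(dset))
```

The printed command is meant to be run on levante after `module load jblob`.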
%% Cell type:markdown id: tags:
<a class="anchor" id="workflow"></a>
## Preparing project catalogs for DKRZ's main catalog
1. Use the attributes of existing catalogs and/or the templates in `/pool/data/Catalogs/Templates`. At a minimum, provide `uri`, `format` and `project`.
1. Set permissions to *readable for everyone* for
    - the data referenced in the catalog under `uri`
    - the catalog itself
1. Name your catalog according to the naming convention for DKRZ catalogs: `dkrz_PROJECT_STORE`.
1. Link the catalog via `ln -s PATH-TO-YOUR-CATALOG /pool/data/Catalogs/Candidates/YOUR-CATALOG`.
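
Before linking a catalog, the first and third step can be checked locally. The following is a minimal sketch under stated assumptions, not the check DKRZ runs: `check_candidate` is a hypothetical helper, and the regular expression is only one possible reading of `dkrz_PROJECT_STORE`, assuming that the project part consists of letters, digits and hyphens and that the store is one of `disk`, `cloud` or `archive`.

```python
import re

import pandas as pd

# Minimum set of attributes named in the first step above.
REQUIRED_COLUMNS = {"uri", "format", "project"}
# Assumed reading of dkrz_PROJECT_STORE; not an official pattern.
NAME_PATTERN = re.compile(r"^dkrz_[A-Za-z0-9-]+_(disk|cloud|archive)$")


def check_candidate(name: str, df: pd.DataFrame) -> list:
    """Return a list of problems; an empty list means both checks passed."""
    problems = []
    if not NAME_PATTERN.match(name):
        problems.append(f"name '{name}' does not match dkrz_PROJECT_STORE")
    missing = REQUIRED_COLUMNS - set(df.columns)
    if missing:
        problems.append(f"missing columns: {sorted(missing)}")
    return problems


# Made-up catalog table that lacks the `project` column.
table = pd.DataFrame({"uri": ["/work/xy/file.nc"], "format": ["netcdf"]})
print(check_candidate("dkrz_myproject_disk", table))
print(check_candidate("myproject", table))
```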
%% Cell type:markdown id: tags:
Your catalog will then be picked up by a cronjob which
1. tests your catalog
    - against the catalog naming convention
    - by opening, searching and loading it
    - for disk catalogs: by checking that all `uri` values are *readable*
1. merges or creates your catalog
    - If a catalog for the specified project already exists in `/pool/data/Catalogs/`, the two will be merged if possible: entries of your catalog are added unless they are duplicates.
    - Otherwise, your catalog will be written to `/work/ik1017/Catalogs` and a link will be set in `/pool/data/Catalogs/`.
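
Two of the steps above, the readability test and the duplicate-free merge, can be sketched with pandas. This is an illustration under assumptions, not the cronjob's actual code: both helper functions are hypothetical, a duplicate is assumed to be a row that is identical in all columns, and `os.access` only tests readability for the current user rather than for everyone.

```python
import os

import pandas as pd


def unreadable_uris(df: pd.DataFrame) -> list:
    """Return all `uri` values that the current user cannot read."""
    return [uri for uri in df["uri"] if not os.access(uri, os.R_OK)]


def merge_catalogs(existing: pd.DataFrame, candidate: pd.DataFrame) -> pd.DataFrame:
    """Append the candidate rows to the existing table and drop exact duplicates."""
    merged = pd.concat([existing, candidate], ignore_index=True)
    return merged.drop_duplicates(ignore_index=True)


# Made-up tables; the row with uri "/b" occurs in both.
existing = pd.DataFrame({"project": ["X", "X"], "uri": ["/a", "/b"], "format": ["netcdf"] * 2})
candidate = pd.DataFrame({"project": ["X", "X"], "uri": ["/b", "/c"], "format": ["netcdf"] * 2})

merged = merge_catalogs(existing, candidate)
print(merged["uri"].tolist())
```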
%% Cell type:markdown id: tags:
```{seealso}
This tutorial is part of a series on `intake`:
* [Part 1: Introduction](https://data-infrastructure-services.gitlab-pages.dkrz.de/tutorials-and-use-cases/tutorial_intake-1-introduction.html)
* [Part 2: Modifying and subsetting catalogs](https://data-infrastructure-services.gitlab-pages.dkrz.de/tutorials-and-use-cases/tutorial_intake-2-subset-catalogs.html)
* [Part 3: Merging catalogs](https://data-infrastructure-services.gitlab-pages.dkrz.de/tutorials-and-use-cases/tutorial_intake-3-merge-catalogs.html)
* [Part 4: Use preprocessing and create derived variables](https://data-infrastructure-services.gitlab-pages.dkrz.de/tutorials-and-use-cases/tutorial_intake-4-preprocessing-derived-variables.html)
* [Part 5: How to create an intake catalog](https://data-infrastructure-services.gitlab-pages.dkrz.de/tutorials-and-use-cases/tutorial_intake-5-create-esm-collection.html)
* You can also work through another [CMIP6 tutorial](https://intake-esm.readthedocs.io/en/latest/user-guide/cmip6-tutorial.html) from the official intake-esm documentation.
```
%% Cell type:markdown id: tags:
%% Cell type:code id: tags:
``` python
```