Commit 789bbf48 authored by Fabian Wachsmann

updated intake

parent 7984c1cf
%% Cell type:markdown id: tags:
# Intake I - find, browse and access `intake-esm` collections
%% Cell type:markdown id: tags:
```{admonition} Overview
:class: dropdown
![Level](https://img.shields.io/badge/Level-Introductory-green.svg)
🎯 **objectives**: Learn how to use `intake` to find, browse and access `intake-esm` ESM-collections
⌛ **time_estimation**: "30min"
☑️ **requirements**: None
© **contributors**: k204210
⚖ **license**:
```
%% Cell type:markdown id: tags:
```{admonition} Agenda
:class: tip
In this part, you learn
1. [Motivation of intake-esm](#motivation)
1. [Features of intake and intake-esm](#features)
1. [Browse through catalogs](#browse)
1. [Data access via intake-esm](#dataaccess)
```
%% Cell type:markdown id: tags:
<a class="anchor" id="motivation"></a>
We follow here the guidance presented by `intake-esm` in its [documentation](https://intake-esm.readthedocs.io/en/latest/user-guide/cmip6-tutorial.html).
## Motivation of intake-esm
> Simulations of the Earth’s climate and weather generate huge amounts of data. These data are often persisted on different storages in a variety of formats (netCDF, zarr, etc...). Finding, investigating, loading these data assets into compute-ready data containers costs time and effort. The data user needs to know what data sets are available, the attributes describing each data set, before loading a specific data set and analyzing it.
> `Intake-esm` addresses these issues by providing necessary functionality for **searching, discovering, data access and data loading**.
%% Cell type:markdown id: tags:
For intake users, many data preparation tasks **are no longer necessary**. They do not need to know:
- 🌍 where data is saved
- 🪧 how data is saved
- 📤 how data should be loaded
but still can search, discover, access and load data of a project.
%% Cell type:markdown id: tags:
<a class="anchor" id="features"></a>
## Features of intake and intake-esm
Intake is a generic **cataloging system** for listing data sources. As a plugin, `intake-esm` is built on top of `intake`, `pandas`, and `xarray` and configures `intake` such that it is able to also **load and process** ESM data.
- display catalogs as clearly structured tables 📄 inside jupyter notebooks for easy investigation
- browse 🔍 through the catalog and select your data without
- being next to the data (e.g. logged in on dkrz's luv)
- knowing the project's data reference syntax i.e. the storage tree hierarchy and path and file name templates
- open climate data in an analysis ready dictionary of `xarray` datasets 🎁
%% Cell type:markdown id: tags:
All required information for searching, accessing and loading the catalog's data is configured within the catalogs:
- 🌍 where data is saved
* users can browse data without knowing the data storage platform including e.g. the root path of the project and the directory syntax
* Data of different platforms (cloud or disk) can be combined in one catalog
* In the medium term, intake catalogs can be **a single point of access**
- 🪧 how data is saved
* users can work with a *xarray* dataset representation of the data no matter whether it is saved in **grb, netcdf or zarr** format.
* catalogs can contain more information and therefore more search facets than are obvious from the names and paths of the data.
- 📤 how data should be loaded
* users work with an **aggregated** *xarray* dataset representation which merges files/assets perfectly fitted to the project's data model design.
* with *xarray* and the underlying *dask* library, data which are **larger than the RAM** can be loaded
%% Cell type:markdown id: tags:
In this tutorial, we load a CMIP6 catalog which contains all data from the pool on DKRZ's mistral disk storage.
CMIP6 is the 6th phase of the Coupled Model Intercomparison Project and builds the data base used in the IPCC AR6.
The CMIP6 catalog contains all data that is published or replicated at the ESGF node at DKRZ.
%% Cell type:code id: tags:
``` python
#note that intake_esm is imported with `import intake` as a plugin
import intake
```
%% Cell type:markdown id: tags:
<a class="anchor" id="terminology"></a>
## Terminology: **Catalog**, **Catalog file** and **Collection**
We align our wording with `intake`'s [*glossary*](https://intake.readthedocs.io/en/latest/glossary.html) which is still evolving. The names overlap with other definitions, making it difficult to keep track. Here we try to give an overview of the hierarchy of catalog terms:
- a **top level catalog file** 📋 is the **main** catalog of an institution which will be opened first. It contains other project [*catalogs*](#catalog) 📖 📖 📖. Such catalogs can be assigned an [*intake driver*](#intakedriver) which is used to open and load the catalog within the top level catalog file. Technically, a catalog file 📋 <a class="anchor" id="catalogfile"></a>
- is a `.yaml` file
- can be opened with `open_catalog`, e.g.:
```python
intake.open_catalog("https://gitlab.dkrz.de/data-infrastructure-services/intake-esm/-/raw/master/esm-collections/cloud-access/dkrz_catalog.yaml")
```
- **intake driver**s, also named **plugin**s, are specified for [*catalogs*](#catalog) because they load specific data sets. <a class="anchor" id="intakedriver"></a>
%% Cell type:markdown id: tags:
- a **catalog** 📖 (or collection) is defined by two parts: <a class="anchor" id="catalog"></a>
- a **description** of a group of data sets. It describes how to *load* **assets** of the data set(s) with the specified [driver](#intakedriver). This group forms an entity. E.g., all CMIP6 data sets can be collected in a catalog. <a class="anchor" id="description"></a>
- an **asset** is most often a file. <a class="anchor" id="asset"></a>
- a **collection** of all [assets](#asset) of the data set(s). <a class="anchor" id="collection"></a>
- the collection can be included in the catalog or separately saved in a **data base** 🗂. In the latter case, the catalog references the data base, e.g.:
```json
"catalog_file": "/mnt/lustre02/work/ik1017/Catalogs/dkrz_cmip6_disk.csv.gz"
```
```{note}
The term *collection* is often used synonymously with [catalog](#catalog).
```
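The two parts of a catalog can be sketched with a toy example. The description below uses field names from the ESM Collection Specification (`esmcat_version`, `attributes`, `assets`, `catalog_file`); all values, file names and paths are hypothetical and do not refer to a real DKRZ catalog.

```python
import csv
import io
import json

# Hypothetical catalog description. It tells a driver which columns
# describe the data and which column holds the asset location.
description = {
    "esmcat_version": "0.1.0",
    "id": "toy_collection",
    "description": "Toy example, not a real DKRZ catalog",
    "catalog_file": "toy_collection.csv",  # reference to the data base
    "attributes": [{"column_name": "source_id"}, {"column_name": "variable_id"}],
    "assets": {"column_name": "uri", "format": "netcdf"},
}

# Hypothetical data base (the collection): one row per asset, here one file.
database = io.StringIO(
    "source_id,variable_id,uri\n"
    "model-a,tas,/path/to/tas_model-a.nc\n"
    "model-a,pr,/path/to/pr_model-a.nc\n"
)

assets = list(csv.DictReader(database))
asset_column = description["assets"]["column_name"]

print(json.dumps(description, indent=2))
print([row[asset_column] for row in assets])
```

The description stays small, while the data base grows with the number of assets. That is one reason why the data base is often kept in a separate file.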
%% Cell type:markdown id: tags:
- an *intake-esm* catalog 📖 is a `.json` file and can be opened with intake-esm's function `intake.open_esm_datastore()`, e.g.:
```python
intake.open_esm_datastore("https://gitlab.dkrz.de/data-infrastructure-services/intake-esm/-/raw/master/esm-collections/cloud-access/dkrz_cmip6_disk.json")
```
%% Cell type:markdown id: tags:
<a class="anchor" id="browse"></a>
## Open and browse through catalogs
We begin by using only *intake* functions for catalogs. Afterwards, we continue with concrete *intake-esm* utilities.
intake **opens** catalog files in `yaml` format. These contain information about additional sources: other catalogs/collections which will be loaded with specific *plugins*/*drivers*. The command is `open_catalog`:
%% Cell type:code id: tags:
``` python
dkrz_catalog=intake.open_catalog("https://gitlab.dkrz.de/data-infrastructure-services/intake-esm/-/raw/master/esm-collections/cloud-access/dkrz_catalog.yaml")
```
%% Cell type:markdown id: tags:
```{note}
Right now, two versions of the top level catalog file exist: one for accessing the catalog via [cloud](https://gitlab.dkrz.de/data-infrastructure-services/intake-esm/-/raw/master/esm-collections/cloud_access/dkrz_catalog.yaml), and one for access via [disk](https://gitlab.dkrz.de/data-infrastructure-services/intake-esm/-/raw/master/esm-collections/disk_access/dkrz_catalog.yaml). Both contain **the same content**.
```
%% Cell type:markdown id: tags:
We can look into the catalog with `print` and `list`
%% Cell type:code id: tags:
``` python
print(dkrz_catalog.yaml())
```
%% Cell type:code id: tags:
``` python
list(dkrz_catalog)
```
%% Cell type:markdown id: tags:
Over time, many collections have been created. `dkrz_catalog` is a **main** catalog prepared to keep an overview of all other collections. `list` shows all **project catalogs** (sub-catalogs) which are available at DKRZ.
All these catalogs are **intake-esm** catalogs.
Let's have a look into a master catalog of [Pangeo](https://pangeo.io/):
%% Cell type:code id: tags:
``` python
pangeo=intake.open_catalog("https://raw.githubusercontent.com/pangeo-data/pangeo-datastore/master/intake-catalogs/master.yaml")
```
%% Cell type:code id: tags:
``` python
pangeo
```
%% Cell type:code id: tags:
``` python
list(pangeo)
```
%% Cell type:markdown id: tags:
While DKRZ's master catalog has one sublevel, Pangeo's is a nested one. We can access another `yaml` catalog which is also a **parent** catalog by simply:
%% Cell type:code id: tags:
``` python
pangeo.climate
```
%% Cell type:markdown id: tags:
Pangeo's ESM collections are one level deeper in the catalog tree:
%% Cell type:code id: tags:
``` python
list(pangeo.climate)
```
%% Cell type:markdown id: tags:
The DKRZ ESM-Collections follow a name template:
`dkrz_${project}_${store}[_${auxiliary_catalog}]`
where
- **project** is the short name of the project (e.g. a *model intercomparison project*) and can be one of `cmip6`, `cmip5`, `cordex`, `era5` or `mpi-ge`.
- **store** is the data store and can be one of:
- `disk`: DKRZ holds a lot of data on a consortial disk space on the file system of the High Performance Computer (HPC) where it is accessible for every HPC user. If you use this ESM Collection, you have to work on the HPC if you want to load the data. Browsing and discovering work from any workstation.
- `cloud`: A small subset is transferred into DKRZ's cloud in order to test the performance. Swift is DKRZ's cloud storage.
- `archive`: A lot of data exists in the tape archive of DKRZ. Before it can be accessed, it has to be retrieved. Therefore, catalogs for the archive (HSM) are limited in functionality but still convenient for data browsing.
- **auxiliary_catalog** can be *grid*
%% Cell type:markdown id: tags:
**Best practice for naming catalogs**:
- Use lowercase letters for all values
- Do **NOT** use `_` as a separator in values
- Do not repeat values of other attributes ("dkrz_dkrz-dyamond")
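The name template and the best practices can be sketched as two small helpers. Both function names are hypothetical and are not part of `intake` or `intake-esm`. They only illustrate why `_` must not appear inside a value: the name is split at `_` again.

```python
def catalog_name(project, store, auxiliary_catalog=None):
    """Build a name following dkrz_${project}_${store}[_${auxiliary_catalog}]."""
    values = [project, store]
    if auxiliary_catalog is not None:
        values.append(auxiliary_catalog)
    for value in values:
        if "_" in value:
            raise ValueError(f"'_' is reserved as separator: {value!r}")
        if value != value.lower():
            raise ValueError(f"use lowercase letters only: {value!r}")
    return "_".join(["dkrz"] + values)


def split_catalog_name(name):
    """Split a catalog name back into its template values."""
    parts = name.split("_")
    if len(parts) not in (3, 4) or parts[0] != "dkrz":
        raise ValueError(f"name does not follow the template: {name!r}")
    keys = ["institution", "project", "store", "auxiliary_catalog"]
    return dict(zip(keys, parts))


print(catalog_name("cmip6", "disk"))
print(split_catalog_name("dkrz_mpi-ge_disk"))
```

A value such as `mpi_ge` would be rejected here, because splitting the resulting name at `_` would no longer return the original values.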
%% Cell type:markdown id: tags:
### The role of `intake-esm`
We now look into a catalog which is opened by the plugin `intake-esm`.
> An ESM (Earth System Model) collection file is a `JSON` file that conforms to the ESM Collection Specification. When provided a link/path to an esm collection file, intake-esm establishes a link to a database (`CSV` file) that contains data assets locations and associated metadata (i.e., which experiment, model, they come from).
Since the data base of the CMIP6 ESM Collection is about 100MB in compressed format, it takes up to a minute to load the catalog.
%% Cell type:code id: tags:
``` python
esm_col=dkrz_catalog.dkrz_cmip6_disk
print(esm_col)
```
%% Cell type:markdown id: tags:
`intake-esm` gives us an overview of the content of the ESM collection. The ESM collection is a data base described by specific attributes which are technically columns. The columns are derived from the project's data standard, which is also used to parse the information given by the path and file names.
The plain display of `esm_col` shows us the number of unique values in each column. Since each `uri` refers to one file, we can conclude that the DKRZ-CMIP6 ESM Collection contains **6.08 million files** in 2022.
%% Cell type:markdown id: tags:
The data base is loaded into an underlying `pandas` dataframe which we can access with `esm_col.df`. `esm_col.df.head()` displays the first rows of the table:
%% Cell type:code id: tags:
``` python
esm_col.df.head()
```
%% Cell type:markdown id: tags:
We can find out details about `esm_col` with the object's attributes. `esm_col.esmcol_data` contains all information given in the `JSON` file. We can also focus on some specific attributes.
%% Cell type:code id: tags:
``` python
#esm_col.esmcol_data
```
%% Cell type:code id: tags:
``` python
print("What is this catalog about? \n" + esm_col.esmcol_data["description"])
#
print("The link to the data base: "+ esm_col.esmcol_data["catalog_file"])
```
%% Cell type:markdown id: tags:
Advanced: To find out how many datasets are available, we can use pandas functions (drop columns that are irrelevant for a dataset, drop the duplicates, keep one):
%% Cell type:code id: tags:
``` python
# drop the per-file columns, then keep one row per dataset
cat = esm_col.df.drop(columns=["uri", "time_range"]).drop_duplicates(keep="first")
print(len(cat))
```
%% Cell type:markdown id: tags:
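The counting step above can be tried on a toy table without loading the large catalog. This is a minimal sketch; the rows are invented and only mimic the structure of the data base, where a dataset that is split over several files appears in several rows.

```python
import pandas as pd

# Invented rows: the `tas` dataset of model-a is split over two files.
df = pd.DataFrame(
    {
        "source_id": ["model-a", "model-a", "model-a", "model-b"],
        "variable_id": ["tas", "tas", "pr", "tas"],
        "time_range": ["185001-189912", "190001-194912", "185001-194912", "185001-194912"],
        "uri": ["/p/a1.nc", "/p/a2.nc", "/p/a3.nc", "/p/b1.nc"],
    }
)

# Columns that differ between files of the same dataset are dropped first,
# so that duplicates collapse into one row per dataset.
datasets = df.drop(columns=["uri", "time_range"]).drop_duplicates(keep="first")

n_files = len(df)
n_datasets = len(datasets)
print(n_files, n_datasets)
```

Which columns have to be dropped depends on the catalog: every column whose value changes from file to file within one dataset must be removed before deduplicating.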
### Browse through the data of the ESM collection
%% Cell type:markdown id: tags:
You browse the collection by setting values for the **column names** of the underlying table. By default, the catalog is loaded with the basic CMIP6 attributes/columns, which you can see with:
%% Cell type:code id: tags:
``` python
esm_col.df.columns
```
%% Cell type:markdown id: tags:
which are the same as configured in the [top level catalog file](#catalogfile):
%% Cell type:code id: tags:
``` python
dkrz_catalog.metadata["parameters"]
```
%% Cell type:markdown id: tags:
If you need more information, e.g. the **long_name**s of the variables, you can
1. check if they are available in the catalog's data base with the metadata specified in the [top level catalog file](#catalogfile) :
%% Cell type:code id: tags:
``` python
dkrz_catalog.metadata["dkrz_cmip6_disk"]["additional_columns"]
dkrz_catalog.metadata["parameters"]["additional_cmip6_columns"]
```
%% Cell type:markdown id: tags:
There is a lot of redundancy in the columns because they exist to conform to other standards. This simplifies merging catalogs across projects.
2. create a combination of all your required columns:
%% Cell type:code id: tags:
``` python
cols=dkrz_catalog.metadata["parameters"]["cmip6_columns"]["default"]+["opendap_url"]
```
%% Cell type:markdown id: tags:
3. open the **dkrz_cmip6_disk** catalog with the `csv_kwargs` keyword argument in this way:
%% Cell type:code id: tags:
``` python
esm_col=dkrz_catalog.dkrz_cmip6_disk(csv_kwargs=dict(usecols=cols))
```
%% Cell type:markdown id: tags:
```{warning}
The number of columns determines the required memory.
```
%% Cell type:markdown id: tags:
```{tip}
If you work from remote and also want to access the data remotely, load the *opendap_url* column.
```
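The effect of `usecols` on memory can be sketched with plain `pandas`, which is what reads the data base. The CSV content below is invented. `usecols` is a regular keyword argument of `pandas.read_csv`, and it is assumed here that `csv_kwargs` is handed on to that function.

```python
import io

import pandas as pd

# Invented data base with one column (`long_name`) we do not need.
csv_text = (
    "source_id,variable_id,long_name,uri\n"
    "model-a,tas,Near-Surface Air Temperature,/path/a_tas.nc\n"
    "model-a,pr,Precipitation,/path/a_pr.nc\n"
)

full = pd.read_csv(io.StringIO(csv_text))
slim = pd.read_csv(io.StringIO(csv_text), usecols=["source_id", "variable_id", "uri"])

# deep=True also counts the memory of the strings themselves
full_bytes = full.memory_usage(deep=True).sum()
slim_bytes = slim.memory_usage(deep=True).sum()
print(full_bytes, slim_bytes)
```

For a table with millions of rows, each additional text column adds noticeably to the memory footprint, so it pays off to load only the columns you search on.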
%% Cell type:markdown id: tags:
Most of the time, we want to set more than one attribute for a search. Therefore, we define a query `dict`ionary and use the `search` function of the `esm_col` object. In the following case, we look for near-surface air temperature (`tas`) in monthly resolution for 3 different experiments:
%% Cell type:code id: tags:
``` python
query = dict(
variable_id="tas",
table_id="Amon",
experiment_id=["piControl", "historical", "ssp370"])
# piControl = pre-industrial control, simulation to represent a stable climate from 1850 for >100 years.
# historical = historical Simulation, 1850-2014
# ssp370 = Shared Socioeconomic Pathways (SSPs) are scenarios of projected socioeconomic global changes. Simulation covers 2015-2100
cat = esm_col.search(**query)
```
%% Cell type:code id: tags:
``` python
cat
```
%% Cell type:markdown id: tags:
We could also use *Wildcards*. For example, in order to find out which ESMs of the institution *MPI-M* have produced data for our subset:
%% Cell type:code id: tags:
``` python
cat.search(source_id="MPI-ES*")
```
%% Cell type:markdown id: tags:
We can find out which models have submitted data for at least one of the experiments with:
%% Cell type:code id: tags:
``` python
cat.unique(["source_id"])
```
%% Cell type:markdown id: tags:
If we instead look for the models that have submitted data for ALL experiments, we use the `require_all_on` keyword argument:
%% Cell type:code id: tags:
``` python
cat = esm_col.search(require_all_on=["source_id"], **query)
cat.unique(["source_id"])
```
%% Cell type:markdown id: tags:
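The difference between "at least one" and "all" experiments can be illustrated in plain Python. This is only a sketch of the selection logic on invented rows, not the implementation used by `intake-esm`.

```python
from collections import defaultdict

# Invented catalog rows: model-b has no piControl data.
rows = [
    {"source_id": "model-a", "experiment_id": "piControl"},
    {"source_id": "model-a", "experiment_id": "historical"},
    {"source_id": "model-a", "experiment_id": "ssp370"},
    {"source_id": "model-b", "experiment_id": "historical"},
    {"source_id": "model-b", "experiment_id": "ssp370"},
]
requested = {"piControl", "historical", "ssp370"}

# Collect the requested experiments found for each model.
found = defaultdict(set)
for row in rows:
    if row["experiment_id"] in requested:
        found[row["source_id"]].add(row["experiment_id"])

# A plain search keeps every model with at least one match ...
any_match = sorted(found)
# ... while requiring all on `source_id` keeps only complete models.
all_match = sorted(model for model, exps in found.items() if exps == requested)

print(any_match)
print(all_match)
```

Here `model-b` drops out of the second list because it lacks `piControl`.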
Note that only the combination of a `variable_id` and a `table_id` is unique in CMIP6. If you search for `tas` in all tables, you will find many more entries:
%% Cell type:code id: tags:
``` python
query = dict(
variable_id="tas",
# table_id="Amon",
experiment_id=["piControl", "historical", "ssp370"])
cat = esm_col.search(**query)
cat.unique(["table_id"])
```
%% Cell type:markdown id: tags:
Be careful when you search for specific time slices. Each frequency is connected with an individual name template for the filename. If the data is yearly, you have YYYY-YYYY whereas you have YYYYMM-YYYYMM for monthly data.
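One way to handle the different templates is to parse the `time_range` string instead of matching it literally. The helpers below are a hypothetical sketch. They cover the yearly and monthly templates named above and assume, in addition, a `YYYYMMDD-YYYYMMDD` template for daily data.

```python
from datetime import datetime

# Length of one date string -> strptime format.
# The 8-digit daily format is an assumption, not taken from the text above.
_FORMATS = {4: "%Y", 6: "%Y%m", 8: "%Y%m%d"}


def parse_time_range(time_range):
    """Return (start, end) as datetime objects for e.g. '185001-201412'."""
    start, end = time_range.split("-")
    if len(start) != len(end) or len(start) not in _FORMATS:
        raise ValueError(f"unsupported time range: {time_range!r}")
    fmt = _FORMATS[len(start)]
    return datetime.strptime(start, fmt), datetime.strptime(end, fmt)


def overlaps(time_range, first_year, last_year):
    """True if the file covers at least one year in [first_year, last_year]."""
    start, end = parse_time_range(time_range)
    return start.year <= last_year and end.year >= first_year


print(parse_time_range("185001-201412"))
print(overlaps("1850-1900", 1890, 1910))
```

Applied to the `time_range` column of the data base, such a function selects files by the years they cover, independent of the frequency of the data.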
%% Cell type:markdown id: tags:
<a class="anchor" id="dataaccess"></a>
## Access and load data of the ESM collection
With the power of `xarray`, `intake` can load your subset into a `dict`ionary of datasets. As an example, we focus on the historical data of `MPI-ESM1-2-HR`:
%% Cell type:code id: tags:
``` python
#case insensitive?
query = dict(
variable_id="tas",
table_id="Amon",
source_id="MPI-ESM1-2-HR",
experiment_id="historical")
cat = esm_col.search(**query)
cat
```
%% Cell type:markdown id: tags:
You can find out which column intake uses to access the data via the following attribute:
%% Cell type:code id: tags:
``` python
print(cat.path_column_name)
```
%% Cell type:markdown id: tags:
As we are working with the *_disk* catalog, **uri** contains *paths* to the files on the file system. If you are working remotely, you would have
- to change `path_column_name` to *opendap_url*.
- to reassign the `format` column from *netcdf* to *opendap*
as follows:
%% Cell type:code id: tags:
``` python
cat.path_column_name="opendap_url"
newdf=cat.df.copy()
newdf.loc[:,"format"]="opendap"
cat.df=newdf
```
%% Cell type:markdown id: tags:
The function that loads the data is `to_dataset_dict`. We also have to set the `chunks` keyword argument for `xarray`, because `xarray` chooses chunks that are too large otherwise. <mark> Note that this keyword argument varies depending on the file access/file format.</mark> If your collection contains `zarr` formatted data, you need a different keyword argument.
%% Cell type:code id: tags:
``` python
xr_dict = cat.to_dataset_dict(cdf_kwargs={"chunks":{"time":1}})
xr_dict
```
%% Cell type:markdown id: tags:
`Intake` was able to aggregate many files into only one dataset:
- The `time_range` column was used to **concat** data along the `time` dimension
- The `member_id` column was used to generate a new dimension
The underlying `dask` package will only load the data into memory if needed.
We can get the `xarray` dataset with python commands:
%% Cell type:code id: tags:
``` python
xr_dset = xr_dict.popitem()[1]
xr_dset
```
%% Cell type:markdown id: tags:
#### Pangeo's data store
Let's have a look into Pangeo's ESM Collection as well. This is accessible via cloud from everywhere - you only need internet to load data. We use the same `query` as in the example before.
%% Cell type:code id: tags:
``` python
pangeo_cmip6=pangeo.climate.cmip6_gcs
cat = pangeo_cmip6.search(**query)
cat
```
%% Cell type:markdown id: tags:
There are differences between the collections because
- Pangeo provides files in *consolidated*, `zarr` formatted datasets which correspond to `zstore` entries in the catalog instead of `path`s or `opendap_url`s.
- The `zarr` datasets are already aggregated over time so there is no need for a `time_range` column
If we now open the data with `intake`, we have to specify keyword arguments as follows:
%% Cell type:code id: tags:
``` python
dset_dict = cat.to_dataset_dict(
zarr_kwargs={"consolidated": True, "decode_times": True, "use_cftime": True}
)
```
%% Cell type:code id: tags:
``` python
dset_dict
```
%% Cell type:markdown id: tags:
`dset_dict` and `xr_dict` hold the same data, loaded from different stores. You successfully completed the intake tutorial!
%% Cell type:markdown id: tags:
```{seealso}
This tutorial is part of a series on `intake`:
* [Part 1: Introduction](https://data-infrastructure-services.gitlab-pages.dkrz.de/tutorials-and-use-cases/tutorial_intake-1-introduction.html)
* [Part 2: Modifying and subsetting catalogs](https://data-infrastructure-services.gitlab-pages.dkrz.de/tutorials-and-use-cases/tutorial_intake-2-subset-catalogs.html)
* [Part 3: Merging catalogs](https://data-infrastructure-services.gitlab-pages.dkrz.de/tutorials-and-use-cases/tutorial_intake-3-merge-catalogs.html)
* [Part 4: Use preprocessing and create derived variables](https://data-infrastructure-services.gitlab-pages.dkrz.de/tutorials-and-use-cases/tutorial_intake-4-preprocessing-derived-variables.html)
* [Part 5: How to create an intake catalog](https://data-infrastructure-services.gitlab-pages.dkrz.de/tutorials-and-use-cases/tutorial_intake-5-create-esm-collection.html)
* You can also do another [CMIP6 tutorial](https://intake-esm.readthedocs.io/en/latest/user-guide/cmip6-tutorial.html) from the official intake-esm documentation.
```