Commit 73852a78 authored by Fabian Wachsmann

Updated intake and atmodat checker

parent d2713103
%% Cell type:markdown id:77da5fe5-80a7-4952-86e4-f39b3f06ddef tags:
## ATMODAT Standard Compliance Checker
This notebook introduces you to the [atmodat checker](https://github.com/AtMoDat/atmodat_data_checker) which contains checks to ensure compliance with the ATMODAT Standard.
> Its core functionality is based on the [IOOS compliance checker](https://github.com/ioos/compliance-checker). The ATMODAT Standard Compliance Checker library makes use of [cc-yaml](https://github.com/cedadev/cc-yaml), which provides a plugin for the IOOS compliance checker that generates check suites from YAML descriptions. Furthermore, the Compliance Check Library is used as the basis to define generic, reusable compliance checks.
In addition, compliance with the **CF Conventions 1.4 or higher** is verified with the [CF checker](https://github.com/cedadev/cf-checker).
%% Cell type:markdown id:edb35c53-dc33-4f1f-a4af-5a8ea69e5dfe tags:
In this notebook, you will learn
- [how to use an environment on DKRZ HPC mistral or levante](#Preparation)
- [how to run checks with the atmodat data checker](#Application)
- [to understand the results of the checker and further analyse it with pandas](#Results)
- [how you could proceed to cure the data with xarray if it does not pass the QC](#Curation)
%% Cell type:markdown id:3abf2250-4b78-4043-82fe-189875d692f2 tags:
### Preparation
On DKRZ's high-performance computers, we provide a `conda` environment that is useful for working with data in DKRZ's CMIP Data Pool.
**Option 1: Activate checker libraries for working with a command-line shell**
If you would like to work with shell commands, you can simply activate the environment. Prior to this, you may have
to load a module with a recent Python interpreter:
```bash
module load python3/unstable
#The following line activates the quality-assurance environment with the checker libraries so that you can execute them with shell commands:
source activate /work/bm0021/conda-envs/quality-assurance
```
%% Cell type:markdown id:dff94c1c-8aa1-42aa-9486-f6d5a6df1884 tags:
**Option 2: Create a kernel with checker libraries to work with jupyter notebooks**
With `ipykernel` you can install a *kernel* which can be used within a jupyter server like [jupyterhub](https://jupyterhub.dkrz.de). `ipykernel` creates the kernel based on the activated environment.
```bash
module load python3/unstable
#The following line activates the quality-assurance environment with the checker libraries so that you can execute them with shell commands:
source activate /work/bm0021/conda-envs/quality-assurance
python -m ipykernel install --user --name qualitychecker --display-name="qualitychecker"
```
If you run this command from within a Jupyter server, you have to restart the Jupyter server afterwards to be able to select the new *qualitychecker* kernel.
%% Cell type:markdown id:95f9ba22-f84c-42e4-9952-ff6ef4f7b86d tags:
**Expert mode**: Running the Jupyter server from a different environment than the one in which atmodat is installed
Make sure that you:
1. Install the `cfunits` package into the Jupyter environment via `conda install cfunits -c conda-forge -p $jupyterenv` and restart the kernel.
1. Add the atmodat environment to the `PATH` environment variable inside the notebook. Otherwise, the notebook's shell does not find the application `run_checks`. You can modify environment variables with the `os` package and its attribute `os.environ`. The environment of the kernel can be found with `sys` and `sys.executable`. The following block sets the environment variable `PATH` accordingly:
%% Cell type:code id:955fcaff-3b3f-4e5e-8c56-59ed90a4bca2 tags:
``` python
import sys
import os
# Append the directory of the kernel's Python executable to PATH so that
# the notebook's shell finds executables (like run_checks) of the kernel's environment:
os.environ["PATH"]=os.environ["PATH"]+":"+os.path.sep.join(sys.executable.split('/')[:-1])
```
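%% Cell type:markdown id: tags:
To verify that the shell now finds the checker, we can look up the executable with Python's standard library (a quick sanity check; it should print a path inside the atmodat environment):
%% Cell type:code id: tags:
``` python
import shutil
# Returns the full path of the run_checks executable if PATH is set correctly, otherwise None
print(shutil.which("run_checks"))
```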
%% Cell type:code id:72c0158e-1fbb-420b-8976-329579e397b9 tags:
``` python
#As long as the installation bug exists, we have to fetch the AtMoDat CVs manually.
#We only clone the repository if no "AtMoDat_CVs" directory exists below the environment's prefix:
if "AtMoDat_CVs" not in [dirpath.split(os.path.sep)[-1]
                         for (dirpath, dirs, files) in os.walk(os.path.sep.join(sys.executable.split('/')[:-2]))]:
    !git clone https://github.com/AtMoDat/AtMoDat_CVs.git {os.path.sep.join(sys.executable.split('/')[:-2])}/lib/python3.9/site-packages/atmodat_checklib/AtMoDat_CVs
```
%% Cell type:markdown id:3d0c7dc2-4e14-4738-92c5-b8c107916656 tags:
### Data to be checked
In this tutorial, we will check a small subset of CMIP6 data which we obtain via `intake`:
%% Cell type:code id:75e90932-4e2f-478c-b7b5-d82b9fd347c9 tags:
``` python
import intake
# Path to the master catalog on the DKRZ server
col_url = "https://gitlab.dkrz.de/data-infrastructure-services/intake-esm/-/raw/master/esm-collections/cloud-access/dkrz_catalog.yaml"
parent_col=intake.open_catalog(col_url)
list(parent_col)
# Open the CMIP6 catalog with the intake package and name it "col" as short for "collection"
col=parent_col["dkrz_cmip6_disk"]
```
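%% Cell type:markdown id: tags:
Before we pick a file, we can check which columns the catalog provides, e.g. whether the file locations are stored in a `path` or a `uri` column (a quick inspection of the catalog's underlying dataframe):
%% Cell type:code id: tags:
``` python
# List the catalog's columns, i.e. the search facets of the project's data standard
col.df.columns
```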
%% Cell type:code id:d30edc41-2561-43b1-879f-5e5d58784e4e tags:
``` python
# We just use the first file from the CMIP6 catalog and copy it to the local disk because we will experiment with it
download_file=col.df["uri"].values[0]
!cp {download_file} ./
```
%% Cell type:code id:47e26721-4281-4acd-9205-2eb77b2ac05a tags:
``` python
exp_file=download_file.split('/')[-1]
exp_file
```
%% Cell type:markdown id:f1476f21-6f58-4430-9602-f18d8fa79460 tags:
### Application
The command `run_checks` can be executed from any directory from within the atmodat conda environment.
The atmodat checker contains two modules:
- one that checks the global attributes for compliance with the ATMODAT standard
- another that performs a standard CF check (building upon the cfchecks library).
%% Cell type:markdown id:365507aa-33a6-42df-9b35-7ead7da006b6 tags:
Show usage instructions of `run_checks`
%% Cell type:code id:76dabfbf-839b-4dca-844c-514cf82f0b66 tags:
``` python
!run_checks -h
```
%% Cell type:markdown id:2c04701c-bc27-4460-b80e-d32daf4a7376 tags:
The results of the performed checks are provided in the checker output directory. By default, `run_checks` assumes write permissions in the path where the atmodat checker is installed. If this is not the case, you must specify an output directory where you have write permissions with `-op output_path`.
In the following block, we set the *output path* to the current working directory, which we get via the bash command `pwd`. We apply `run_checks` to the `exp_file` which we downloaded in the previous chapter.
%% Cell type:code id:c3ef1468-6ce9-4869-a173-2374eca5bc2c tags:
``` python
# Get the current working directory from the shell; the result is a list with one element
cwd=!pwd
cwd=cwd[0]
# Check the file (-f), write results to the output path (-op) and create a short summary (-s)
!run_checks -f {exp_file} -op {cwd} -s
```
%% Cell type:markdown id:13e20408-b6fa-4d39-be02-41db2109c980 tags:
Now, we have a directory `atmodat_checker_output` in the output path. For each run of `run_checks`, a new directory named with a timestamp is created inside it. Additionally, the directory *latest* always contains the output of the most recent run.
%% Cell type:code id:601f3486-91e2-4ff5-9f8e-324f10f799b5 tags:
``` python
!ls {os.path.sep.join([cwd, "atmodat_checker_output"])}
```
%% Cell type:markdown id:fa5ef2a4-a1da-4fa0-873f-902884ea4db6 tags:
As we ran `run_checks` with the option `-s`, one output is the *short_summary.txt* file which we `cat` in the following:
%% Cell type:code id:9f6c38fd-199b-413e-9821-6535235be83c tags:
``` python
output_dir_string=os.path.sep.join(["atmodat_checker_output","latest"])
output_path=os.path.sep.join([cwd, output_dir_string])
!cat {os.path.sep.join([output_path, "short_summary.txt"])}
```
%% Cell type:markdown id:99d2ba16-52c2-4cb6-b82b-226e75463aab tags:
### Results
The short summary contains information about the versions, the timestamp of execution, the ratio of passed checks on attributes, and the errors reported by the CF checker. Note:
- The cfchecks routine only issues a warning/information message if variable metadata are completely missing.
- Zero errors in the cfchecks routine does not necessarily mean that a data file is CF compliant!
We can also have a look into the detailed output, including the exact error messages, in the *long_summary_* files, which are subdivided by severity level.
%% Cell type:code id:9600c713-1203-430b-a4a6-bf70ec441221 tags:
``` python
!cat {os.path.sep.join([output_path,"long_summary_recommended.csv"])}
```
%% Cell type:code id:b9fa72d6-6e5f-433a-81f0-40e4cd5a94cd tags:
``` python
!cat {os.path.sep.join([output_path,"long_summary_mandatory.csv"])}
```
%% Cell type:markdown id:b94a7c75-abc6-4792-aa5f-65467c6522de tags:
We can open the *.csv* files with `pandas` to further analyse the output.
%% Cell type:code id:f02ea2c4-7238-4afd-aef0-565aa5a5787f tags:
``` python
import pandas as pd
recommend_df=pd.read_csv(os.path.sep.join([output_path,"long_summary_recommended.csv"]))
recommend_df
```
%% Cell type:markdown id:6453b4ca-288e-4c49-8c93-da4524ef5792 tags:
There may be **missing** global attributes which are recommended by the *ATMODAT standard*. We can find them with pandas:
%% Cell type:code id:f0a7e6db-f79a-448f-8046-bb4bf3bcef9d tags:
``` python
missing_recommend_atts=list(
recommend_df.loc[recommend_df["Error Message"]=="global attribute is not present"]["Global Attribute"]
)
missing_recommend_atts
```
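%% Cell type:markdown id: tags:
To get a quick overview of how frequent each problem is, we can aggregate the error messages with `pandas` (a small sketch based on the dataframe loaded above):
%% Cell type:code id: tags:
``` python
# Count how often each error message occurs among the recommended checks
recommend_df["Error Message"].value_counts()
```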
%% Cell type:markdown id:06283c25-c5b6-450f-bfe9-d65e8fe26623 tags:
### Curation
Let's try first steps to *cure* the file by adding a missing attribute with `xarray`. We can open the file into an *xarray dataset* with:
%% Cell type:code id:b294cd89-d55c-421f-82e2-4cf42ece7d62 tags:
``` python
import xarray as xr
exp_file_ds=xr.open_dataset(exp_file)
exp_file_ds
```
%% Cell type:markdown id:f02bc09f-94dc-4e0f-b12f-9798549e90e8 tags:
We can **handle and add attributes** via the `dict`-type attribute `.attrs`. Applied on the dataset, it shows all *global attributes* of the file:
%% Cell type:code id:fc0ffe80-4288-4ac3-a599-3239f37f461d tags:
``` python
exp_file_ds.attrs
```
%% Cell type:markdown id:6f61190e-49bc-40da-8b33-30f3debd1895 tags:
We add all missing attributes and set a dummy value for them:
%% Cell type:code id:3fd18adf-fe43-4d47-b565-d082b80b970d tags:
``` python
for att in missing_recommend_atts:
exp_file_ds.attrs[att]="Dummy"
```
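%% Cell type:markdown id: tags:
A quick check confirms that all formerly missing attributes are now present in the dataset:
%% Cell type:code id: tags:
``` python
# Should return True: every formerly missing attribute has been added
all(att in exp_file_ds.attrs for att in missing_recommend_atts)
```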
%% Cell type:markdown id:56e26094-0ad6-42a9-afaf-5c482ee8ca87 tags:
We save the modified dataset with the `to_netcdf` function:
%% Cell type:code id:8050d724-da0d-417a-992e-24bb5aae0c82 tags:
``` python
exp_file_ds.to_netcdf(exp_file+".modified.nc")
```
%% Cell type:markdown id:5794c6ce-fff2-4c6e-8c08-aaf5dd342f8d tags:
Now, let's run `run_checks` again.
Instead of a single file, we can also provide a directory as an argument with the option `-p`. The checker will then find all `.nc` files inside that directory.
%% Cell type:code id:6c3698f7-62a4-4297-bfbf-d6447a0f006a tags:
``` python
!run_checks -p {cwd} -op {cwd} -s
```
%% Cell type:markdown id:c72647ee-7497-42df-ae68-f6a2d4ea87ad tags:
Using the *latest* directory, here is the new summary:
%% Cell type:code id:51d2eff6-2a31-47b7-a706-f2555e03b9c3 tags:
``` python
!cat {os.path.sep.join([output_path,"short_summary.txt"])}
```
%% Cell type:markdown id:1c9205ec-4f5f-4173-bb0d-1896785a9d04 tags:
You can see that the checks no longer fail for the modified file: if you subtract the earlier failures from the new sum of passed checks, the modified file passes all checks.
......
%% Cell type:markdown id: tags:
# Intake I - find, browse and access `intake-esm` collections
%% Cell type:markdown id: tags:
```{admonition} Overview
:class: dropdown
![Level](https://img.shields.io/badge/Level-Introductory-green.svg)
🎯 **objectives**: Learn how to use `intake` to find, browse and access `intake-esm` ESM-collections
⌛ **time_estimation**: "30min"
☑️ **requirements**: None
☑️ **requirements**: `intake_esm.__version__ >= 2021.8.17`
© **contributors**: k204210
⚖ **license**:
```
%% Cell type:markdown id: tags:
```{admonition} Agenda
:class: tip
In this part, you learn
1. [Motivation of intake-esm](#motivation)
1. [Features of intake and intake-esm](#features)
1. [Browse through catalogs](#browse)
1. [Data access via intake-esm](#dataaccess)
```
%% Cell type:markdown id: tags:
<a class="anchor" id="motivation"></a>
We follow here the guidance presented by `intake-esm` on its [repository](https://intake-esm.readthedocs.io/en/latest/user-guide/cmip6-tutorial.html).
## Motivation of intake-esm
> Simulations of the Earth’s climate and weather generate huge amounts of data. These data are often persisted on different storages in a variety of formats (netCDF, zarr, etc...). Finding, investigating, loading these data assets into compute-ready data containers costs time and effort. The data user needs to know what data sets are available, the attributes describing each data set, before loading a specific data set and analyzing it.
> `Intake-esm` addresses these issues by providing necessary functionality for **searching, discovering, data access and data loading**.
%% Cell type:markdown id: tags:
For intake users, many data preparation tasks **are no longer necessary**. They do not need to know:
- 🌍 where data is saved
- 🪧 how data is saved
- 📤 how data should be loaded
but still can search, discover, access and load data of a project.
%% Cell type:markdown id: tags:
<a class="anchor" id="features"></a>
## Features of intake and intake-esm
Intake is a generic **cataloging system** for listing data sources. As a plugin, `intake-esm` is built on top of `intake`, `pandas`, and `xarray` and configures `intake` such that it is able to also **load and process** ESM data.
- display catalogs as clearly structured tables 📄 inside jupyter notebooks for easy investigation
- browse 🔍 through the catalog and select your data without
- being next to the data (e.g. logged in on dkrz's luv)
- knowing the project's data reference syntax i.e. the storage tree hierarchy and path and file name templates
- open climate data in an analysis ready dictionary of `xarray` datasets 🎁
%% Cell type:markdown id: tags:
All required information for searching, accessing and loading the catalog's data is configured within the catalogs:
- 🌍 where data is saved
* users can browse data without knowing the data storage platform including e.g. the root path of the project and the directory syntax
* Data of different platforms (cloud or disk) can be combined in one catalog
* In the mid term, intake catalogs can become **a single point of access**
- 🪧 how data is saved
* users can work with an *xarray* dataset representation of the data no matter whether it is saved in **GRIB, netCDF or zarr** format.
* catalogs can contain more information and therefore more search facets than are obvious from the names and paths of the data.
- 📤 how data should be loaded
* users work with an **aggregated** *xarray* dataset representation which merges files/assets in a way that fits the project's data model design.
* with *xarray* and the underlying *dask* library, data which are **larger than the RAM** can be loaded
%% Cell type:markdown id: tags:
In this tutorial, we load a CMIP6 catalog which contains all data from the pool on DKRZ's mistral disk storage.
CMIP6 is the 6th phase of the Coupled Model Intercomparison Project and provides the data basis for the IPCC AR6.
The CMIP6 catalog contains all data that is published or replicated at the ESGF node at DKRZ.
%% Cell type:code id: tags:
``` python
#note that intake_esm is imported with `import intake` as a plugin
import intake
```
%% Cell type:markdown id: tags:
<a class="anchor" id="terminology"></a>
## Terminology: **Catalog**, **Catalog file** and **Collection**
We align our wording with `intake`'s [*glossary*](https://intake.readthedocs.io/en/latest/glossary.html) which is still evolving. The names overlap with other definitions, making it difficult to keep track. Here we try to give an overview of the hierarchy of catalog terms:
- a **top level catalog file** 📋 is the **main** catalog of an institution which will be opened first. It contains other project [*catalogs*](#catalog) 📖 📖 📖. Such catalogs can be assigned an [*intake driver*](#intakedriver) which is used to open and load the catalog within the top level catalog file. Technically, a catalog file 📋 <a class="anchor" id="catalogfile"></a>
- is a `.yaml` file
- can be opened with `open_catalog`, e.g.:
```python
intake.open_catalog("https://gitlab.dkrz.de/data-infrastructure-services/intake-esm/-/raw/master/esm-collections/cloud-access/dkrz_catalog.yaml")
```
- **intake driver**s, also named **plugin**s, are specified for [*catalogs*](#catalog) because they load specific data sets. <a class="anchor" id="intakedriver"></a>
%% Cell type:markdown id: tags:
- a **catalog** 📖 (or collection) is defined by two parts: <a class="anchor" id="catalog"></a>
- a **description** of a group of data sets. It describes how to *load* **assets** of the data set(s) with the specified [driver](#intakedriver). This group forms an entity. E.g., all CMIP6 data sets can be collected in a catalog. <a class="anchor" id="description"></a>
- an **asset** is most often a file. <a class="anchor" id="asset"></a>
- a **collection** of all [assets](#asset) of the data set(s). <a class="anchor" id="collection"></a>
- the collection can be included in the catalog or separately saved in a **data base** 🗂. In the latter case, the catalog references the data base, e.g.:
```json
"catalog_file": "/mnt/lustre02/work/ik1017/Catalogs/dkrz_cmip6_disk.csv.gz"
```
```{note}
The term *collection* is often used synonymically for [catalog](#catalog).
```
%% Cell type:markdown id: tags:
- an *intake-esm* catalog 📖 is a `.json` file and can be opened with intake-esm's function `intake.open_esm_datastore()`, e.g.:
```python
intake.open_esm_datastore("https://gitlab.dkrz.de/data-infrastructure-services/intake-esm/-/raw/master/esm-collections/cloud-access/dkrz_cmip6_disk.json")
```
%% Cell type:markdown id: tags:
<a class="anchor" id="browse"></a>
## Open and browse through catalogs
We begin with using only *intake* functions for catalogs. Afterwards, we continue with concrete *intake-esm* utilities.
intake **opens** catalog files in `yaml` format. These contain information about additional sources: other catalogs/collections which will be loaded with specific *plugins*/*drivers*. The command is `open_catalog`.
<mark> You only need to remember one URL as the *single point of access* for DKRZ's intake catalogs: The DKRZ top level catalog can be accessed via dkrz.de/s/intake. Since intake does not follow the redirect, you can use **requests** to work with that link:</mark>
%% Cell type:code id: tags:
``` python
import requests
r=requests.get("http://dkrz.de/s/intake")
print(r.url)
```
%% Cell type:code id: tags:
``` python
# Open the top level catalog using the resolved URL
dkrz_catalog=intake.open_catalog(r.url)
```
%% Cell type:markdown id: tags:
```{note}
Right now, two versions of the top level catalog file exist: one for accessing the catalog via [cloud](https://gitlab.dkrz.de/data-infrastructure-services/intake-esm/-/raw/master/esm-collections/cloud_access/dkrz_catalog.yaml), one for access via [disk](https://gitlab.dkrz.de/data-infrastructure-services/intake-esm/-/raw/master/esm-collections/disk_access/dkrz_catalog.yaml). They however contain **the same content**.
```
%% Cell type:markdown id: tags:
We can look into the catalog with `print` and `list`:
%% Cell type:code id: tags:
``` python
print(dkrz_catalog.yaml())
```
%% Cell type:code id: tags:
``` python
list(dkrz_catalog)
```
%% Cell type:markdown id: tags:
Over time, many collections have been created. `dkrz_catalog` is a **main** catalog prepared to keep an overview of all other collections. `list` shows all **project catalogs** which are available at DKRZ.
All these catalogs are **intake-esm** catalogs.
Let's have a look into a master catalog of [Pangeo](https://pangeo.io/):
%% Cell type:code id: tags:
``` python
pangeo=intake.open_catalog("https://raw.githubusercontent.com/pangeo-data/pangeo-datastore/master/intake-catalogs/master.yaml")
```
%% Cell type:code id: tags:
``` python
pangeo
```
%% Cell type:code id: tags:
``` python
list(pangeo)
```
%% Cell type:markdown id: tags:
While DKRZ's master catalog has one sublevel, Pangeo's is nested. We can access another `yaml` catalog, which is itself a **parent** catalog, simply by:
%% Cell type:code id: tags:
``` python
pangeo.climate
```
%% Cell type:markdown id: tags:
Pangeo's ESM collections are one level deeper in the catalog tree:
%% Cell type:code id: tags:
``` python
list(pangeo.climate)
```
%% Cell type:markdown id: tags:
The DKRZ ESM-Collections follow a name template:
`dkrz_${project}_${store}[_${auxiliary_catalog}]`
where
- **project** names the *model intercomparison project*, one of `cmip6`, `cmip5`, `cordex`, `era5` or `mpi-ge`.
- **store** is the data store and can be one of:
- `disk`: DKRZ holds a lot of data on a consortial disk space on the file system of the High Performance Computer (HPC), where it is accessible for every HPC user. If you use this ESM collection, you have to work on the HPC if you want to load the data. Browsing and discovering will work independently of your workstation.
- `cloud`: A small subset is transferred into DKRZ's cloud in order to test the performance. Swift is DKRZ's cloud storage.
- `archive`: A lot of data exists in DKRZ's tape archive. Before it can be accessed, it has to be retrieved. Therefore, catalogs for `hsm` are limited in functionality but still convenient for data browsing.
- **auxiliary_catalog** can be *grid*
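%% Cell type:markdown id: tags:
Since the names follow this template, we can filter the main catalog for a specific project with a plain list comprehension (a sketch; the exact entries depend on the catalog version):
%% Cell type:code id: tags:
``` python
# All CMIP6 collections offered in the DKRZ main catalog, regardless of the store
[name for name in list(dkrz_catalog) if name.startswith("dkrz_cmip6")]
```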
%% Cell type:markdown id: tags:
**Best practice for naming catalogs**:
- Use lowercase letters for all values
- Do **NOT** use `_` as a separator within values
- Do not repeat values of other attributes (e.g. not "dkrz_dkrz-dyamond")
%% Cell type:markdown id: tags:
### The role of `intake-esm`
We now look into a catalog which is opened by the plugin `intake-esm`.
> An ESM (Earth System Model) collection file is a `JSON` file that conforms to the ESM Collection Specification. When provided a link/path to an ESM collection file, intake-esm establishes a link to a database (`CSV` file) that contains data asset locations and associated metadata (i.e., which experiment, model, etc. they come from).
Since the database of the CMIP6 ESM collection is about 100MB in compressed format, it takes up to a minute to load the catalog.
%% Cell type:code id: tags:
``` python
esm_col=dkrz_catalog.dkrz_cmip6_disk
print(esm_col)
```
%% Cell type:markdown id: tags:
`intake-esm` gives us an overview of the content of the ESM collection. The ESM collection is a database described by specific attributes, which are technically columns. Each project's data standard is the basis for the columns and is used to parse information given by the path and file names.
The pure display of `esm_col` shows us the number of unique values in each column. Since each `uri` refers to one file, we can conclude that the DKRZ CMIP6 ESM collection contains **6.08 million files** in 2022.
%% Cell type:markdown id: tags:
The database is loaded into an underlying `pandas` dataframe which we can access with `esm_col.df`. `esm_col.df.head()` displays the first rows of the table:
%% Cell type:code id: tags:
``` python
esm_col.df.head()
```
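%% Cell type:markdown id: tags:
The unique-value counts from the collection display can be reproduced with plain `pandas` on the dataframe (a small sketch; the exact columns depend on the catalog):
%% Cell type:code id: tags:
``` python
# Number of unique values per column; the count for "uri" equals the number of files
esm_col.df.nunique()
```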
%% Cell type:markdown id: tags:
We can find out details about `esm_col` with the object's attributes. `esm_col.esmcol_data` contains all information given in the `JSON` file. We can also focus on some specific attributes.
%% Cell type:code id: tags:
``` python
#esm_col.esmcol_data
```
%% Cell type:code id: tags:
``` python
print("What is this catalog about? \n" + esm_col.esmcol_data["description"])
#
print("The link to the data base: "+ esm_col.esmcol_data["catalog_file"])
```
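%% Cell type:markdown id: tags:
To see which other metadata fields are available, we can list the keys of `esmcol_data` (a quick look; the exact keys follow the ESM Collection Specification):
%% Cell type:code id: tags:
``` python
# All metadata fields provided by the collection's JSON description
list(esm_col.esmcol_data.keys())
```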
%% Cell type:markdown id: tags:
Advanced: To find out how many datasets are available, we can use pandas functions (drop the columns that are irrelevant for identifying a dataset, then drop the duplicates):
%% Cell type:code id: tags:
``` python
cat = esm_col.df.drop(columns=['uri','time_range']).drop_duplicates(keep="first")
print(len(cat))
```
%% Cell type:markdown id: tags:
### Browse through the data of the ESM collection