updated intake

87adf72c · Fabian Wachsmann · 4f599680 · 87adf72c
Commit 87adf72c authored 3 years ago by Fabian Wachsmann
--- a/notebooks/demo/tutorial_intake-1-2-dkrz-catalogs.ipynb
+++ b/notebooks/demo/tutorial_intake-1-2-dkrz-catalogs.ipynb
@@ -71,8 +71,6 @@
   "source": [
    "<a class=\"anchor\" id=\"examples\"></a>\n",
    "\n",
-    "## Closer look into DKRZ intake-esm catalogs for project data\n",
-    "\n",
    "By 2022, we offer you project data for\n",
    "\n",
    "```{tabbed} CMIP6-like projects \n",
@@ -157,6 +155,7 @@
    "\n",
    "- **Way of data access**:\n",
    "    - `uri` contains *links* to datasets in dkrz's swift cloud storage which can be opened with `xarray` and therefore `intake`\n",
+    "    - If the asset's `format` is *zarr* (specified in the `format` column), use `zarr_kwargs` in the `to_dataset_dict()` function\n",
    "- **User requirements for data access**:\n",
    "    - None. Users can access it from everywhere if internet connection is sufficient\n",
    "- **Provider requirements**:\n",
@@ -167,9 +166,9 @@
    "```{tabbed} Archive\n",
    "\n",
    "- **Way of data access**:\n",
-    "    - `uri` is empty i.e. data access is not possible\n",
+    "    - `uri` is empty i.e. no direct data access via intake is possible. If the catalog contains a `jblob_file` column, users can however download the data via *jblob* on levante (see next point).\n",
    "- **User requirements for data access**:\n",
-    "    - users can download the data via the values in the `jblob_file` column. After loading the module on levante via `module load jblob`,  e.g. for CMIP5, an example command is `jblob --cmip5-file DSET` where dset is a value of `jblob_file`, e.g. `cmip5.output1.BCC.bcc-csm1-1.abrupt4xCO2.fx.atmos.fx.r0i0p0.v1.areacella.areacella_fx_bcc-csm1-1_abrupt4xCO2_r0i0p0.nc`\n",
+    "    - users must be **logged in** to levante to *access* the data. After loading the module on levante via `module load jblob`, an example command, e.g. for CMIP5, is `jblob --cmip5-file DSET`, where `DSET` is a value of `jblob_file`, e.g. `cmip5.output1.BCC.bcc-csm1-1.abrupt4xCO2.fx.atmos.fx.r0i0p0.v1.areacella.areacella_fx_bcc-csm1-1_abrupt4xCO2_r0i0p0.nc`\n",
    "- **Provider requirements**:\n",
    "    - None\n",
    "\n",

 %% Cell type:markdown id: tags:

 # Intake I - find, browse and access `intake-esm` collections

 %% Cell type:markdown id: tags:

 ```{admonition} Overview
 :class: dropdown

 ![Level](https://img.shields.io/badge/Level-Introductory-green.svg)


 🎯 **objectives**: Learn which `intake-esm` ESM collections DKRZ offers

 ⌛ **time_estimation**: "15min"

 ☑️ **requirements**: None

 © **contributors**: k204210

 ⚖ **license**:

 ```

 %% Cell type:markdown id: tags:

 ```{admonition} Agenda
 :class: tip

 In this part, you learn

 1. [DKRZ intake-esm catalogs for project data](#examples)
 1. [Catalog dependencies on different stores](#stores)
 1. [Workflow at Levante for collecting and merging catalogs into main catalog](#workflow)

 ```

 %% Cell type:markdown id: tags:

 <a class="anchor" id="examples"></a>

 ## Closer look into DKRZ intake-esm catalogs for project data

 - You will find only **one version** for each *atomic dataset* in each catalog. This is the most recent one available in the store.
 - All catalogs try to follow the *cataloonies schema* for DKRZ catalogs.
 - Across the projects, you will find **redundant** attributes in the catalogs which have the same meaning, e.g.:
    - source_id, model_id, model
    - member_id, ensemble_member, simulation_id
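
The redundant attribute names listed above can be mapped onto one name per group before catalogs from different projects are compared. The following is a minimal sketch in plain Python. The alias groups come from the list above, but the choice of `source_id` and `member_id` as the canonical names and the helper names `canonical_name` and `harmonize` are assumptions made for illustration only.

```python
# Alias groups from the list above. Which name is treated as canonical
# is an assumption made for this sketch.
ALIASES = {
    "source_id": ("source_id", "model_id", "model"),
    "member_id": ("member_id", "ensemble_member", "simulation_id"),
}


def canonical_name(column):
    """Return the canonical attribute name for a catalog column."""
    for canonical, names in ALIASES.items():
        if column in names:
            return canonical
    # Columns without a known alias are passed through unchanged.
    return column


def harmonize(columns):
    """Map every column name of a catalog onto its canonical name."""
    return [canonical_name(column) for column in columns]


print(harmonize(["model", "ensemble_member", "experiment_id"]))
```

With such a mapping, a search written against one project's column names can be translated for another project's catalog.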

 %% Cell type:markdown id: tags:

 <a class="anchor" id="examples"></a>

-## Closer look into DKRZ intake-esm catalogs for project data
-
 By 2022, we offer you project data for

 ```{tabbed} CMIP6-like projects

 - 💾  **projects** :
    - [CMIP6](https://c6de.dkrz.de/)
    - [PalMod2](https://www.palmod.de/)
 - **Data location**:
    - The data is hosted within the [cmip-data-pool](cmip-data-pool.dkrz.de), accessible for all DKRZ users under `/pool/data/PROJECT`
 - **Attributes**:
    - the catalog's attributes are explained [here](https://goo.gl/v1drZl)
    - `source_id` and `experiment_id` are the most important attributes. A unique `source_id` always refers to one and the same model, even if different institutions use it. An `experiment_id` can be found in only *one activity*.
    - values of `member_id` are rather arbitrary. Some members might never be published, others have been retracted. There is no guarantee that *r1i1p1f1* is available.
 - **Variable definition**:
    - a **unique variable** is a combination of `table_id` and `variable_id`. A variable can look different from one table to another. For example, a variable can have different dimensions for monthly frequency than it has for daily frequency.

 ```

 ```{tabbed} CMIP5-like projects

 - 💾  **projects** :
    - CMIP5
    - [CORDEX](https://is-enes-data.github.io/cordex_archive_specifications.pdf)
    - MPI-GE
 - **Data location**:
    - CMIP5:
        - Only a small subset is still available on the pool's common and shared disk resource due to a lack of disk storage capacity, but most of the data has been archived and can be accessed via `jblob`.
    - CORDEX:
        - On disk, CORDEX data are disseminated across different storage projects
 - **Attributes**:
    - In comparison to CMIP6-like projects, such projects build on the older CMIP5 standard. Therefore, some attributes have different names.
    - Regional model data includes additional attributes in comparison to CMIP5: `CORDEX_domain`, `driving_model_id` and `rcm_version_id`.
 - **Variable definition**:
    - A **unique variable** is a combination of `mip_table` and `variable`. A variable can look different from one table to another. For example, a variable can have different dimensions for monthly frequency than it has for daily frequency.
 ```

 ```{tabbed} Other projects

 - 💾  **projects** :
    - [Dyamond]()
    - [ERA5]()
 - **Data location**:
    - For disk projects, mostly linked under `/pool/data`
 - **Attributes**:
    - We try to use the cataloonies schema for catalogs of all other projects. For reanalysis products, it cannot be entirely fulfilled as the data is available in GRIB format.


 ```
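
The variable definition given in the tabs above, where a unique variable is the pair of table and variable name, can be illustrated with a small sketch. The rows below are made-up example records, not entries read from a DKRZ catalog, and `unique_variables` is a hypothetical helper. For CMIP5-like catalogs, the key names `mip_table` and `variable` would be passed instead.

```python
# Illustrative catalog rows (invented for this sketch): the same
# variable_id appears in a monthly and in a daily table.
rows = [
    {"table_id": "Amon", "variable_id": "tas"},
    {"table_id": "day", "variable_id": "tas"},
    {"table_id": "Amon", "variable_id": "tas"},
]


def unique_variables(rows, table_key="table_id", variable_key="variable_id"):
    """Return the distinct (table, variable) pairs found in the rows."""
    return sorted({(row[table_key], row[variable_key]) for row in rows})


# Counting variable_id alone would give one variable;
# counting the pairs gives two.
print(unique_variables(rows))
```

Grouping by the pair rather than by the variable name alone avoids mixing datasets whose dimensions differ between tables.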

 %% Cell type:markdown id: tags:

 <a class="anchor" id="examples"></a>

 ## Catalog dependencies on different stores

 DKRZ's catalog naming convention distinguishes between the different stores as long as data access to stores like the archive is either not possible or very different from disk access. The specifics of each store are explained in the following.

 %% Cell type:markdown id: tags:

 ```{tabbed} Disk

 - **Way of data access**:
    - `uri` contains *paths* on levante's lustre filesystem
 - **User requirements for data access**:
    - users must be **logged in** to levante to *access* the data (exception: opendap_url column, see [introduction]())
 - **Provider requirements**:
    - paths of a valid catalog must be *readable for everyone*

 ```

 ```{tabbed} Cloud

 - **Way of data access**:
    - `uri` contains *links* to datasets in dkrz's swift cloud storage which can be opened with `xarray` and therefore `intake`
+    - If the asset's `format` is *zarr* (specified in the `format` column), use `zarr_kwargs` in the `to_dataset_dict()` function
 - **User requirements for data access**:
    - None. Users can access it from everywhere if internet connection is sufficient
 - **Provider requirements**:
    - links in a valid catalog must point at datasets in **open containers**

 ```

 ```{tabbed} Archive

 - **Way of data access**:
-    - `uri` is empty i.e. data access is not possible
+    - `uri` is empty i.e. no direct data access via intake is possible. If the catalog contains a `jblob_file` column, users can however download the data via *jblob* on levante (see next point).
 - **User requirements for data access**:
-    - users can download the data via the values in the `jblob_file` column. After loading the module on levante via `module load jblob`,  e.g. for CMIP5, an example command is `jblob --cmip5-file DSET` where dset is a value of `jblob_file`, e.g. `cmip5.output1.BCC.bcc-csm1-1.abrupt4xCO2.fx.atmos.fx.r0i0p0.v1.areacella.areacella_fx_bcc-csm1-1_abrupt4xCO2_r0i0p0.nc`
+    - users must be **logged in** to levante to *access* the data. After loading the module on levante via `module load jblob`, an example command, e.g. for CMIP5, is `jblob --cmip5-file DSET`, where `DSET` is a value of `jblob_file`, e.g. `cmip5.output1.BCC.bcc-csm1-1.abrupt4xCO2.fx.atmos.fx.r0i0p0.v1.areacella.areacella_fx_bcc-csm1-1_abrupt4xCO2_r0i0p0.nc`
 - **Provider requirements**:
    - None

 ```
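
The three stores described in the tabs above differ in how a single catalog row leads to data. The sketch below condenses those rules into one hypothetical helper, `access_hint`, which only returns a text hint and opens nothing. The column names `uri`, `format` and `jblob_file`, the `zarr_kwargs` argument and the `jblob --cmip5-file` command are taken from the text above. The value `consolidated=True` is an assumption about how the zarr stores were written, and the `--cmip5-file` option is only documented above for CMIP5, so other projects may need a different option.

```python
import shlex


def access_hint(row):
    """Return a short hint on how to reach the asset described by one catalog row."""
    uri = row.get("uri") or ""
    if uri:
        if row.get("format") == "zarr":
            # Cloud store with zarr assets: hand zarr_kwargs to to_dataset_dict().
            # consolidated=True is an assumption, not a documented requirement.
            return "to_dataset_dict(zarr_kwargs={'consolidated': True})"
        # Disk paths and other openable links need no extra arguments here.
        return "to_dataset_dict()"
    jblob_file = row.get("jblob_file")
    if jblob_file:
        # Archive store: no uri, download via jblob on levante instead.
        return "jblob --cmip5-file " + shlex.quote(jblob_file)
    return "no access route recorded in the catalog"


archive_row = {
    "uri": "",
    "jblob_file": "cmip5.output1.BCC.bcc-csm1-1.abrupt4xCO2.fx.atmos.fx.r0i0p0.v1.areacella.areacella_fx_bcc-csm1-1_abrupt4xCO2_r0i0p0.nc",
}
print(access_hint(archive_row))
```

The printed command still has to be run on levante after `module load jblob`, as described in the Archive tab.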

 %% Cell type:markdown id: tags:

 <a class="anchor" id="examples"></a>

 ## Preparing project catalogs for DKRZ's main catalog

 1. Use attributes of existing catalogs and/or templates in `/pool/data/Catalogs/Templates`, and include at least `uri`, `format` and `project`.
 1. Make the data referenced in the catalog and the catalog readable for everyone.
 1. Use the naming convention for DKRZ catalogs: `dkrz_PROJECT_STORE`.
 1. Link the catalog via `ln -s PATH-TO-YOUR-CATALOG /pool/data/Catalogs/Candidates/YOUR-CATALOG`
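
The naming convention and the link step above can be checked before a catalog is submitted. This is a minimal sketch under stated assumptions: the pattern assumes that `PROJECT` and `STORE` consist of letters, digits and hyphens without underscores, which the text above does not specify, and `parse_catalog_name` and `link_command` are hypothetical helpers. The example path `/work/ab1234/...` is invented. The sketch only builds the `ln -s` command and does not execute it.

```python
import re
from pathlib import PurePosixPath

CANDIDATES = PurePosixPath("/pool/data/Catalogs/Candidates")

# Assumed shape of dkrz_PROJECT_STORE: no underscores inside PROJECT or STORE.
NAME_PATTERN = re.compile(r"^dkrz_(?P<project>[A-Za-z0-9-]+)_(?P<store>[A-Za-z0-9-]+)$")


def parse_catalog_name(filename):
    """Split a catalog file name into (project, store) or raise ValueError."""
    stem = PurePosixPath(filename).name.split(".", 1)[0]
    match = NAME_PATTERN.match(stem)
    if match is None:
        raise ValueError(f"{stem!r} does not follow dkrz_PROJECT_STORE")
    return match["project"], match["store"]


def link_command(catalog_path):
    """Build the ln -s command that links a catalog into the Candidates directory."""
    parse_catalog_name(catalog_path)  # refuse names that break the convention
    target = CANDIDATES / PurePosixPath(catalog_path).name
    return ["ln", "-s", str(catalog_path), str(target)]


print(link_command("/work/ab1234/dkrz_cmip6_disk.json"))
```

Returning the command as a list keeps it ready for `subprocess.run` without shell quoting issues.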

 %% Cell type:markdown id: tags:

 Your catalog will then be picked up by a cron job which

 1. tests your catalog
    - against the catalog naming convention
    - by opening, searching and loading it
    - for disk catalogs: are all `uri` values *readable*?
 1. merges or creates your catalog
    - if a catalog for the specified project exists in `/pool/data/Catalogs/`, they will be merged if possible
    - else, your catalog will be written to `/work/ik1017/Catalogs` and a link will be set in `/pool/data/Catalogs/`
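
A provider can run a rough version of the readability test locally before linking a disk catalog. The sketch below is not the cron job's implementation, which is not shown in this tutorial. It assumes that "readable" means the read bit for "others" is set on each file, and it does not check whether the parent directories are accessible. The helper names `world_readable` and `unreadable_uris` are hypothetical.

```python
import os
import stat


def world_readable(path):
    """Return True if the path exists and its read bit for 'others' is set."""
    try:
        mode = os.stat(path).st_mode
    except OSError:
        # Missing or inaccessible paths count as not readable.
        return False
    return bool(mode & stat.S_IROTH)


def unreadable_uris(uris):
    """Return the uri values that fail the readability check, in input order."""
    return [uri for uri in uris if not world_readable(uri)]
```

Called with the `uri` column of a disk catalog, `unreadable_uris` returns the paths whose permissions still need to be fixed. An empty list means this particular check passes.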

 %% Cell type:markdown id: tags:

 ```{seealso}
 This tutorial is part of a series on `intake`:
 * [Part 1: Introduction](https://data-infrastructure-services.gitlab-pages.dkrz.de/tutorials-and-use-cases/tutorial_intake-1-introduction.html)
 * [Part 2: Modifying and subsetting catalogs](https://data-infrastructure-services.gitlab-pages.dkrz.de/tutorials-and-use-cases/tutorial_intake-2-subset-catalogs.html)
 * [Part 3: Merging catalogs](https://data-infrastructure-services.gitlab-pages.dkrz.de/tutorials-and-use-cases/tutorial_intake-3-merge-catalogs.html)
 * [Part 4: Use preprocessing and create derived variables](https://data-infrastructure-services.gitlab-pages.dkrz.de/tutorials-and-use-cases/tutorial_intake-4-preprocessing-derived-variables.html)
 * [Part 5: How to create an intake catalog](https://data-infrastructure-services.gitlab-pages.dkrz.de/tutorials-and-use-cases/tutorial_intake-5-create-esm-collection.html)

 - You can also do another [CMIP6 tutorial](https://intake-esm.readthedocs.io/en/latest/user-guide/cmip6-tutorial.html) from the official intake-esm documentation.

 ```

 %% Cell type:markdown id: tags:



 %% Cell type:code id: tags:

 ``` python
 ```