From c7dfd64f07c1e31db534b3473e4e7feffb573647 Mon Sep 17 00:00:00 2001
From: Fabian Wachsmann <wachsmann@dkrz.de>
Date: Wed, 22 Jun 2022 13:16:06 +0000
Subject: [PATCH] Setup for ci

---
 .../demo/tutorial_intake-1-introduction.ipynb | 65 +++++++++++++++----
 ..._intake-4-preprocessing-derived-vars.ipynb | 10 ---
 ...nced_summer_days_intake_xarray_cmip6.ipynb |  6 +-
 ...ulate-frost-days_intake-xarray_cmip6.ipynb |  6 +-
 ...se-case_climate-extremes-indices_cdo.ipynb |  8 ++-
 ...vert-nc-to-tiff_rioxarray-xesmf_cmip.ipynb |  6 +-
 ...nsemble-analysis_intake-xarray_cmip6.ipynb |  7 +-
 ...rly-mean-anomaly_xarray-hvplot_cmip6.ipynb |  7 +-
 ...case_plot-unstructured_psyplot_cmip6.ipynb |  7 +-
 9 files changed, 85 insertions(+), 37 deletions(-)
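
Note on the column lists built in the use-case notebooks: the patch concatenates the catalog's default `usecols` with `opendap_url`. Below is a minimal sketch of a helper that builds such a list while skipping names that are already present. The name `extend_columns` and the sample column list are hypothetical, used only for illustration; they are not part of the notebooks or of intake.

```python
def extend_columns(default_cols, extra_cols):
    """Return default_cols followed by any extra_cols not already present.

    Order is preserved and the input lists are not modified.
    """
    # dict.fromkeys keeps first-seen order and drops repeated names
    return list(dict.fromkeys([*default_cols, *extra_cols]))


# Hypothetical default column list, standing in for
# dkrz_catalog._entries["dkrz_cmip6_disk"]._open_args["csv_kwargs"]["usecols"]
default_cols = ["variable_id", "table_id", "source_id", "experiment_id", "uri"]

# "uri" is already a default column, so only "opendap_url" is appended
cols = extend_columns(default_cols, ["opendap_url", "uri"])
print(cols)
```

The resulting list could then be passed as `csv_kwargs=dict(usecols=cols)`, as the notebooks do.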

diff --git a/notebooks/demo/tutorial_intake-1-introduction.ipynb b/notebooks/demo/tutorial_intake-1-introduction.ipynb
index 53ae65c..4ea3584 100644
--- a/notebooks/demo/tutorial_intake-1-introduction.ipynb
+++ b/notebooks/demo/tutorial_intake-1-introduction.ipynb
@@ -425,7 +425,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "The data base is loaded into an underlying `panda`s dataframe which we can access with `col.df`. `col.df.head()` displays the first rows of the table:"
+    "The database is loaded into an underlying `pandas` DataFrame, which we can access with `esm_col.df`. `esm_col.df.head()` displays the first rows of the table:"
    ]
   },
   {
@@ -639,12 +639,14 @@
    "source": [
     "### How to load more columns\n",
     "\n",
-    "If you work remotely away from the data, you can use the **opendap_url**'s to access the subset of interest for all files published at DKRZ. The opendap_url is an *additional* column that can also be loaded.\n",
+    "Intake can load only a subset of the columns contained in the **intake-esm** catalog. Because the memory usage of **intake-esm** is high, the default columns are only a subset of all available columns. Sometimes, other columns are of interest:\n",
+    "\n",
+    "If you work remotely, away from the data, you can use the **opendap_url** values to access the subset of interest for all files published at DKRZ. The *opendap_url* is an *additional* column that can also be loaded.\n",
     "\n",
     "We can define 3 different column name types for the usage of intake catalogs:\n",
     "\n",
     "1. **Default** attributes which are loaded from the main catalog and which can be seen via `_entries[CATNAME]._open_args`.\n",
-    "2. **Overall** attributes or **template** attributes which should be defined for **ALL** catalogs at DKRZ (exceptions excluded). At DKRZ, we use the newly defined **Cataloonie** scheme template which can be found via `dkrz_catalog.metadata[\"parameters\"][\"cataloonie_columns\"]`\n",
+    "2. **Overall** attributes or **template** attributes which should be defined for **ALL** catalogs at DKRZ (exceptions excluded). At DKRZ, we use the newly defined **Cataloonie** scheme template which can be found via `dkrz_catalog.metadata[\"parameters\"][\"cataloonie_columns\"]`. With these template attributes, there may be redundancy in the columns. They exist to simplify merging catalogs across projects.\n",
     "3. **Additional** attributes which are not necessary to identify a single asset but helpful for users. You can find these via\n",
     "\n",
     "`dkrz_catalog.metadata[\"parameters\"][\"additional_PROJECT_columns\"]`\n",
@@ -670,13 +672,6 @@
     "```"
    ]
   },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "There is a lot of redundancy in the columns. That is because they exist to be conform to other kind of standards. This will simplify merging catalogs across projects."
-   ]
-  },
   {
    "cell_type": "markdown",
    "metadata": {},
@@ -711,6 +706,14 @@
     "esm_col=dkrz_catalog.dkrz_cmip6_disk(csv_kwargs=dict(usecols=cols))"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "- ⭐ Customizing the catalog columns gives intake users the greatest flexibility.\n",
+    "- ⭐ In theory, we could add many more columns with additional information because not all of them have to be loaded from the database."
+   ]
+  },
   {
    "cell_type": "markdown",
    "metadata": {},
@@ -750,7 +753,7 @@
     "query = dict(\n",
     "    variable_id=\"tas\",\n",
     "    table_id=\"Amon\",\n",
-    "    source_id=\"MPI-ESM1-2-HR\",\n",
+    "    source_id=\"MPI-ESM1-2-LR\",\n",
     "    experiment_id=\"historical\")\n",
     "cat = esm_col.search(**query)\n",
     "cat"
@@ -846,13 +849,31 @@
     "- The `time_range` column was used to **concat** data along the `time` dimension\n",
     "- The `member_id` column was used to generate a new dimension\n",
     "\n",
-    "The underlying `dask` package will only load the data into memory if needed."
+    "The underlying `dask` package will only load the data into memory if needed. Note that attributes which differ from file to file, e.g. *tracking_id*, are excluded from the dataset."
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
+    "How **intake-esm** should open and aggregate the assets is configured in the *aggregation_control* part of the description:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "print(esm_col.esmcol_data[\"aggregation_control\"][\"aggregations\"])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Columns can be configured either to append along an existing dimension or to create a new dimension. The *options* are keyword arguments passed to xarray.\n",
+    "\n",
     "They **keys** of the dictionary are made with column values defined in the *aggregation_control* of the **intake-esm** catalog. These will determine the **key_template**. The corresponding commands are:"
    ]
   },
@@ -906,7 +927,25 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "#### Pangeo's data store\n",
+    "### Troubleshooting\n",
+    "\n",
+    "The variables are collected in **one** dataset. This requires that **the dimensions and coordinates are the same across all files**. Otherwise, xarray cannot merge them.\n",
+    "\n",
+    "For CMIP6, most of the variables collected in one **table_id** should be on the same dimensions and coordinates. Unfortunately, there are exceptions:\n",
+    "\n",
+    "- a few variables are requested for *time slices* only.\n",
+    "- sometimes models use different dimension names from file to file.\n",
+    "\n",
+    "Using the [preprocessing](https://tutorials.dkrz.de/tutorial_intake-4-preprocessing-derived-vars.html#use-preprocessing-when-opening-assets-and-creating-datasets) keyword argument can help to rename dimensions before merging.\n",
+    "\n",
+    "For Intake providers: the more information on dimensions and coordinates the catalog already provides, the better the aggregation can be controlled."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Pangeo's data store\n",
     "\n",
     "Let's have a look into Pangeo's ESM Collection as well. This is accessible via cloud from everywhere - you only need internet to load data. We use the same `query` as in the example before."
    ]
diff --git a/notebooks/demo/tutorial_intake-4-preprocessing-derived-vars.ipynb b/notebooks/demo/tutorial_intake-4-preprocessing-derived-vars.ipynb
index 0829294..2eaaf6c 100644
--- a/notebooks/demo/tutorial_intake-4-preprocessing-derived-vars.ipynb
+++ b/notebooks/demo/tutorial_intake-4-preprocessing-derived-vars.ipynb
@@ -63,16 +63,6 @@
     "esm_dkrz=dkrz_cdp.dkrz_cmip6_disk"
    ]
   },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "#levante uri to mistral uri:\n",
-    "esm_dkrz.df[\"uri\"]=esm_dkrz.df[\"uri\"].str.replace(\"lustre/\",\"lustre02/\")"
-   ]
-  },
   {
    "cell_type": "markdown",
    "metadata": {},
diff --git a/notebooks/demo/use-case_advanced_summer_days_intake_xarray_cmip6.ipynb b/notebooks/demo/use-case_advanced_summer_days_intake_xarray_cmip6.ipynb
index c0afbad..7b2301e 100644
--- a/notebooks/demo/use-case_advanced_summer_days_intake_xarray_cmip6.ipynb
+++ b/notebooks/demo/use-case_advanced_summer_days_intake_xarray_cmip6.ipynb
@@ -170,8 +170,10 @@
    "outputs": [],
    "source": [
     "# Path to master catalog on the DKRZ server\n",
-    "col_url = \"https://dkrz.de/s/intake\"\n",
-    "parent_col=intake.open_catalog([col_url])\n",
+    "#parent_col=intake.open_catalog([\"https://dkrz.de/s/intake\"])\n",
+    "#\n",
+    "# For the web page only, we need to use the original link:\n",
+    "parent_col=intake.open_catalog([\"https://gitlab.dkrz.de/data-infrastructure-services/intake-esm/-/raw/master/esm-collections/cloud-access/dkrz_catalog.yaml\"])\n",
     "list(parent_col)\n",
     "\n",
     "# Open the catalog with the intake package and name it \"col\" as short for \"collection\"\n",
diff --git a/notebooks/demo/use-case_calculate-frost-days_intake-xarray_cmip6.ipynb b/notebooks/demo/use-case_calculate-frost-days_intake-xarray_cmip6.ipynb
index a8c961d..57f3022 100644
--- a/notebooks/demo/use-case_calculate-frost-days_intake-xarray_cmip6.ipynb
+++ b/notebooks/demo/use-case_calculate-frost-days_intake-xarray_cmip6.ipynb
@@ -93,8 +93,10 @@
    "outputs": [],
    "source": [
     "# Path to master catalog on the DKRZ server\n",
-    "col_url = \"https://dkrz.de/s/intake\"\n",
-    "parent_col=intake.open_catalog([col_url])\n",
+    "#parent_col=intake.open_catalog([\"https://dkrz.de/s/intake\"])\n",
+    "#\n",
+    "# For the web page only, we need to use the original link:\n",
+    "parent_col=intake.open_catalog([\"https://gitlab.dkrz.de/data-infrastructure-services/intake-esm/-/raw/master/esm-collections/cloud-access/dkrz_catalog.yaml\"])\n",
     "list(parent_col)\n",
     "\n",
     "# Open the catalog with the intake package and name it \"col\" as short for \"collection\"\n",
diff --git a/notebooks/demo/use-case_climate-extremes-indices_cdo.ipynb b/notebooks/demo/use-case_climate-extremes-indices_cdo.ipynb
index fe45e23..1b50e5f 100755
--- a/notebooks/demo/use-case_climate-extremes-indices_cdo.ipynb
+++ b/notebooks/demo/use-case_climate-extremes-indices_cdo.ipynb
@@ -91,12 +91,14 @@
    "source": [
     "import intake\n",
     "# Path to master catalog on the DKRZ server\n",
-    "col_url = \"https://dkrz.de/s/intake\"\n",
-    "dkrz_catalog=intake.open_catalog([col_url])\n",
+    "#dkrz_catalog=intake.open_catalog([\"https://dkrz.de/s/intake\"])\n",
+    "#\n",
+    "# For the web page only, we need to use the original link:\n",
+    "dkrz_catalog=intake.open_catalog([\"https://gitlab.dkrz.de/data-infrastructure-services/intake-esm/-/raw/master/esm-collections/cloud-access/dkrz_catalog.yaml\"])\n",
     "list(dkrz_catalog)\n",
     "\n",
     "# Open the catalog with the intake package and name it \"col\" as short for \"collection\"\n",
-    "cols=dkrz_catalog.metadata[\"parameters\"][\"cmip6_columns\"][\"default\"]+[\"opendap_url\"]\n",
+    "cols=dkrz_catalog._entries[\"dkrz_cmip6_disk\"]._open_args[\"csv_kwargs\"][\"usecols\"]+[\"opendap_url\"]\n",
     "col=dkrz_catalog.dkrz_cmip6_disk(csv_kwargs=dict(usecols=cols))"
    ]
   },
diff --git a/notebooks/demo/use-case_convert-nc-to-tiff_rioxarray-xesmf_cmip.ipynb b/notebooks/demo/use-case_convert-nc-to-tiff_rioxarray-xesmf_cmip.ipynb
index 191c789..956cf98 100644
--- a/notebooks/demo/use-case_convert-nc-to-tiff_rioxarray-xesmf_cmip.ipynb
+++ b/notebooks/demo/use-case_convert-nc-to-tiff_rioxarray-xesmf_cmip.ipynb
@@ -54,7 +54,11 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "dkrz_catalog = intake.open_catalog([\"https://dkrz.de/s/intake\"])\n",
+    "# Path to master catalog on the DKRZ server\n",
+    "#dkrz_catalog=intake.open_catalog([\"https://dkrz.de/s/intake\"])\n",
+    "#\n",
+    "# For the web page only, we need to use the original link:\n",
+    "dkrz_catalog=intake.open_catalog([\"https://gitlab.dkrz.de/data-infrastructure-services/intake-esm/-/raw/master/esm-collections/cloud-access/dkrz_catalog.yaml\"])\n",
     "# Print DKRZ open catalogues\n",
     "list(dkrz_catalog)"
    ]
diff --git a/notebooks/demo/use-case_ensemble-analysis_intake-xarray_cmip6.ipynb b/notebooks/demo/use-case_ensemble-analysis_intake-xarray_cmip6.ipynb
index 259793f..c0d5abd 100644
--- a/notebooks/demo/use-case_ensemble-analysis_intake-xarray_cmip6.ipynb
+++ b/notebooks/demo/use-case_ensemble-analysis_intake-xarray_cmip6.ipynb
@@ -119,8 +119,11 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "col_url = \"https://dkrz.de/s/intake\"\n",
-    "parent_col=intake.open_catalog([col_url])\n",
+    "# Path to master catalog on the DKRZ server\n",
+    "#parent_col=intake.open_catalog([\"https://dkrz.de/s/intake\"])\n",
+    "#\n",
+    "# For the web page only, we need to use the original link:\n",
+    "parent_col=intake.open_catalog([\"https://gitlab.dkrz.de/data-infrastructure-services/intake-esm/-/raw/master/esm-collections/cloud-access/dkrz_catalog.yaml\"])\n",
     "list(parent_col)"
    ]
   },
diff --git a/notebooks/demo/use-case_global-yearly-mean-anomaly_xarray-hvplot_cmip6.ipynb b/notebooks/demo/use-case_global-yearly-mean-anomaly_xarray-hvplot_cmip6.ipynb
index 833c913..08b55ca 100644
--- a/notebooks/demo/use-case_global-yearly-mean-anomaly_xarray-hvplot_cmip6.ipynb
+++ b/notebooks/demo/use-case_global-yearly-mean-anomaly_xarray-hvplot_cmip6.ipynb
@@ -96,8 +96,11 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "col_url = \"https://dkrz.de/s/intake\"\n",
-    "parent_col=intake.open_catalog([col_url])\n",
+    "# Path to master catalog on the DKRZ server\n",
+    "#parent_col=intake.open_catalog([\"https://dkrz.de/s/intake\"])\n",
+    "#\n",
+    "# For the web page only, we need to use the original link:\n",
+    "parent_col=intake.open_catalog([\"https://gitlab.dkrz.de/data-infrastructure-services/intake-esm/-/raw/master/esm-collections/cloud-access/dkrz_catalog.yaml\"])\n",
     "list(parent_col)"
    ]
   },
diff --git a/notebooks/demo/use-case_plot-unstructured_psyplot_cmip6.ipynb b/notebooks/demo/use-case_plot-unstructured_psyplot_cmip6.ipynb
index e508cd7..b83a0cc 100644
--- a/notebooks/demo/use-case_plot-unstructured_psyplot_cmip6.ipynb
+++ b/notebooks/demo/use-case_plot-unstructured_psyplot_cmip6.ipynb
@@ -56,8 +56,11 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "col_url = \"https://dkrz.de/s/intake\"\n",
-    "parent_col=intake.open_catalog([col_url])\n",
+    "# Path to master catalog on the DKRZ server\n",
+    "#parent_col=intake.open_catalog([\"https://dkrz.de/s/intake\"])\n",
+    "#\n",
+    "# For the web page only, we need to use the original link:\n",
+    "parent_col=intake.open_catalog([\"https://gitlab.dkrz.de/data-infrastructure-services/intake-esm/-/raw/master/esm-collections/cloud-access/dkrz_catalog.yaml\"])\n",
     "list(parent_col)"
    ]
   },
-- 
GitLab