From c7dfd64f07c1e31db534b3473e4e7feffb573647 Mon Sep 17 00:00:00 2001 From: Fabian Wachsmann <wachsmann@dkrz.de> Date: Wed, 22 Jun 2022 13:16:06 +0000 Subject: [PATCH] Setup for ci --- .../demo/tutorial_intake-1-introduction.ipynb | 65 +++++++++++++++---- ..._intake-4-preprocessing-derived-vars.ipynb | 10 --- ...nced_summer_days_intake_xarray_cmip6.ipynb | 6 +- ...ulate-frost-days_intake-xarray_cmip6.ipynb | 6 +- ...se-case_climate-extremes-indices_cdo.ipynb | 8 ++- ...vert-nc-to-tiff_rioxarray-xesmf_cmip.ipynb | 6 +- ...nsemble-analysis_intake-xarray_cmip6.ipynb | 7 +- ...rly-mean-anomaly_xarray-hvplot_cmip6.ipynb | 7 +- ...case_plot-unstructured_psyplot_cmip6.ipynb | 7 +- 9 files changed, 85 insertions(+), 37 deletions(-) diff --git a/notebooks/demo/tutorial_intake-1-introduction.ipynb b/notebooks/demo/tutorial_intake-1-introduction.ipynb index 53ae65c..4ea3584 100644 --- a/notebooks/demo/tutorial_intake-1-introduction.ipynb +++ b/notebooks/demo/tutorial_intake-1-introduction.ipynb @@ -425,7 +425,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "The data base is loaded into an underlying `panda`s dataframe which we can access with `col.df`. `col.df.head()` displays the first rows of the table:" + "The database is loaded into an underlying `pandas` dataframe which we can access with `esm_col.df`. `esm_col.df.head()` displays the first rows of the table:" ] }, { @@ -639,12 +639,14 @@ "source": [ "### How to load more columns\n", "\n", - "If you work remotely away from the data, you can use the **opendap_url**'s to access the subset of interest for all files published at DKRZ. The opendap_url is an *additional* column that can also be loaded.\n", + "Intake allows loading only a subset of the columns in the **intake-esm** catalog. Since the memory usage of **intake-esm** is high, the default columns are only a subset of all possible columns. 
Sometimes, other columns are of interest:\n", + "\n", + "If you work remotely away from the data, you can use the **opendap_url**'s to access the subset of interest for all files published at DKRZ. The *opendap_url* is an *additional* column that can also be loaded.\n", "\n", "We can define 3 different column name types for the usage of intake catalogs:\n", "\n", "1. **Default** attributes which are loaded from the main catalog and which can be seen via `_entries[CATNAME]._open_args`.\n", - "2. **Overall** attributes or **template** attributes which should be defined for **ALL** catalogs at DKRZ (exceptions excluded). At DKRZ, we use the newly defined **Cataloonie** scheme template which can be found via `dkrz_catalog.metadata[\"parameters\"][\"cataloonie_columns\"]`\n", + "2. **Overall** attributes or **template** attributes which should be defined for **ALL** catalogs at DKRZ (exceptions excluded). At DKRZ, we use the newly defined **Cataloonie** scheme template which can be found via `dkrz_catalog.metadata[\"parameters\"][\"cataloonie_columns\"]`. With these template attributes, there may be redundancy in the columns. They exist to simplify merging catalogs across projects.\n", "3. **Additional** attributes which are not necessary to identify a single asset but helpful for users. You can find these via\n", "\n", "`dkrz_catalog.metadata[\"parameters\"][\"additional_PROJECT_columns\"]`\n", @@ -670,13 +672,6 @@ "```" ] }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "There is a lot of redundancy in the columns. That is because they exist to be conform to other kind of standards. This will simplify merging catalogs across projects." - ] - }, { "cell_type": "markdown", "metadata": {}, @@ -711,6 +706,14 @@ "esm_col=dkrz_catalog.dkrz_cmip6_disk(csv_kwargs=dict(usecols=cols))" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "- The customization of catalog columns allows highest flexibility for intake users. 
\n", + "- In theory, we could add many more columns with additional information because not all have to be loaded from the database." + ] + }, { "cell_type": "markdown", "metadata": {}, @@ -750,7 +753,7 @@ "query = dict(\n", " variable_id=\"tas\",\n", " table_id=\"Amon\",\n", - " source_id=\"MPI-ESM1-2-HR\",\n", + " source_id=\"MPI-ESM1-2-LR\",\n", " experiment_id=\"historical\")\n", "cat = esm_col.search(**query)\n", "cat" @@ -846,13 +849,31 @@ "- The `time_range` column was used to **concat** data along the `time` dimension\n", "- The `member_id` column was used to generate a new dimension\n", "\n", - "The underlying `dask` package will only load the data into memory if needed." + "The underlying `dask` package will only load the data into memory if needed. Note that attributes which disagree from file to file, e.g. *tracking_id*, are excluded from the dataset." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ + "How **intake-esm** should open and aggregate the assets is configured in the *aggregation_control* part of the description:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "print(esm_col.esmcol_data[\"aggregation_control\"][\"aggregations\"])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Columns can be defined for appending or creating new dimensions. The *options* are keyword arguments for xarray.\n", + "\n", "They **keys** of the dictionary are made with column values defined in the *aggregation_control* of the **intake-esm** catalog. These will determine the **key_template**. The corresponding commands are:" ] }, @@ -906,7 +927,25 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "#### Pangeo's data store\n", + "### Troubleshooting\n", + "\n", + "The variables are collected in **one** dataset. This requires that **the dimensions and coordinates are the same across all files**. 
Otherwise, xarray cannot merge these together.\n", + "\n", + "For CMIP6, most of the variables collected in one **table_id** should be on the same dimensions and coordinates. Unfortunately, there are exceptions:\n", + "\n", + "- A few variables are requested for *time slices* only.\n", + "- Sometimes models use different dimension names from file to file.\n", + "\n", + "Using the [preprocessing](https://tutorials.dkrz.de/tutorial_intake-4-preprocessing-derived-vars.html#use-preprocessing-when-opening-assets-and-creating-datasets) keyword argument can help to rename dimensions before merging.\n", + "\n", + "For Intake providers: the more information on the dimensions and coordinates is already provided in the catalog, the better the aggregation control works." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Pangeo's data store\n", "\n", "Let's have a look into Pangeo's ESM Collection as well. This is accessible via cloud from everywhere - you only need internet to load data. We use the same `query` as in the example before." 
] diff --git a/notebooks/demo/tutorial_intake-4-preprocessing-derived-vars.ipynb b/notebooks/demo/tutorial_intake-4-preprocessing-derived-vars.ipynb index 0829294..2eaaf6c 100644 --- a/notebooks/demo/tutorial_intake-4-preprocessing-derived-vars.ipynb +++ b/notebooks/demo/tutorial_intake-4-preprocessing-derived-vars.ipynb @@ -63,16 +63,6 @@ "esm_dkrz=dkrz_cdp.dkrz_cmip6_disk" ] }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "#levante uri to mistral uri:\n", - "esm_dkrz.df[\"uri\"]=esm_dkrz.df[\"uri\"].str.replace(\"lustre/\",\"lustre02/\")" - ] - }, { "cell_type": "markdown", "metadata": {}, diff --git a/notebooks/demo/use-case_advanced_summer_days_intake_xarray_cmip6.ipynb b/notebooks/demo/use-case_advanced_summer_days_intake_xarray_cmip6.ipynb index c0afbad..7b2301e 100644 --- a/notebooks/demo/use-case_advanced_summer_days_intake_xarray_cmip6.ipynb +++ b/notebooks/demo/use-case_advanced_summer_days_intake_xarray_cmip6.ipynb @@ -170,8 +170,10 @@ "outputs": [], "source": [ "# Path to master catalog on the DKRZ server\n", - "col_url = \"https://dkrz.de/s/intake\"\n", - "parent_col=intake.open_catalog([col_url])\n", + "#dkrz_catalog=intake.open_catalog([\"https://dkrz.de/s/intake\"])\n", + "#\n", + "#only for the web page we need to take the original link:\n", + "parent_col=intake.open_catalog([\"https://gitlab.dkrz.de/data-infrastructure-services/intake-esm/-/raw/master/esm-collections/cloud-access/dkrz_catalog.yaml\"])\n", "list(parent_col)\n", "\n", "# Open the catalog with the intake package and name it \"col\" as short for \"collection\"\n", diff --git a/notebooks/demo/use-case_calculate-frost-days_intake-xarray_cmip6.ipynb b/notebooks/demo/use-case_calculate-frost-days_intake-xarray_cmip6.ipynb index a8c961d..57f3022 100644 --- a/notebooks/demo/use-case_calculate-frost-days_intake-xarray_cmip6.ipynb +++ b/notebooks/demo/use-case_calculate-frost-days_intake-xarray_cmip6.ipynb @@ -93,8 +93,10 @@ 
"outputs": [], "source": [ "# Path to master catalog on the DKRZ server\n", - "col_url = \"https://dkrz.de/s/intake\"\n", - "parent_col=intake.open_catalog([col_url])\n", + "#dkrz_catalog=intake.open_catalog([\"https://dkrz.de/s/intake\"])\n", + "#\n", + "#only for the web page we need to take the original link:\n", + "parent_col=intake.open_catalog([\"https://gitlab.dkrz.de/data-infrastructure-services/intake-esm/-/raw/master/esm-collections/cloud-access/dkrz_catalog.yaml\"])\n", "list(parent_col)\n", "\n", "# Open the catalog with the intake package and name it \"col\" as short for \"collection\"\n", diff --git a/notebooks/demo/use-case_climate-extremes-indices_cdo.ipynb b/notebooks/demo/use-case_climate-extremes-indices_cdo.ipynb index fe45e23..1b50e5f 100755 --- a/notebooks/demo/use-case_climate-extremes-indices_cdo.ipynb +++ b/notebooks/demo/use-case_climate-extremes-indices_cdo.ipynb @@ -91,12 +91,14 @@ "source": [ "import intake\n", "# Path to master catalog on the DKRZ server\n", - "col_url = \"https://dkrz.de/s/intake\"\n", - "dkrz_catalog=intake.open_catalog([col_url])\n", + "#dkrz_catalog=intake.open_catalog([\"https://dkrz.de/s/intake\"])\n", + "#\n", + "#only for the web page we need to take the original link:\n", + "dkrz_catalog=intake.open_catalog([\"https://gitlab.dkrz.de/data-infrastructure-services/intake-esm/-/raw/master/esm-collections/cloud-access/dkrz_catalog.yaml\"])\n", "list(dkrz_catalog)\n", "\n", "# Open the catalog with the intake package and name it \"col\" as short for \"collection\"\n", - "cols=dkrz_catalog.metadata[\"parameters\"][\"cmip6_columns\"][\"default\"]+[\"opendap_url\"]\n", + "cols=dkrz_catalog._entries[\"dkrz_cmip6_disk\"]._open_args[\"csv_kwargs\"][\"usecols\"]+[\"opendap_url\"]\n", "col=dkrz_catalog.dkrz_cmip6_disk(csv_kwargs=dict(usecols=cols))" ] }, diff --git a/notebooks/demo/use-case_convert-nc-to-tiff_rioxarray-xesmf_cmip.ipynb b/notebooks/demo/use-case_convert-nc-to-tiff_rioxarray-xesmf_cmip.ipynb index 
191c789..956cf98 100644 --- a/notebooks/demo/use-case_convert-nc-to-tiff_rioxarray-xesmf_cmip.ipynb +++ b/notebooks/demo/use-case_convert-nc-to-tiff_rioxarray-xesmf_cmip.ipynb @@ -54,7 +54,11 @@ "metadata": {}, "outputs": [], "source": [ - "dkrz_catalog = intake.open_catalog([\"https://dkrz.de/s/intake\"])\n", + "# Path to master catalog on the DKRZ server\n", + "#dkrz_catalog=intake.open_catalog([\"https://dkrz.de/s/intake\"])\n", + "#\n", + "#only for the web page we need to take the original link:\n", + "dkrz_catalog=intake.open_catalog([\"https://gitlab.dkrz.de/data-infrastructure-services/intake-esm/-/raw/master/esm-collections/cloud-access/dkrz_catalog.yaml\"])\n", "# Print DKRZ open catalogues\n", "list(dkrz_catalog)" ] diff --git a/notebooks/demo/use-case_ensemble-analysis_intake-xarray_cmip6.ipynb b/notebooks/demo/use-case_ensemble-analysis_intake-xarray_cmip6.ipynb index 259793f..c0d5abd 100644 --- a/notebooks/demo/use-case_ensemble-analysis_intake-xarray_cmip6.ipynb +++ b/notebooks/demo/use-case_ensemble-analysis_intake-xarray_cmip6.ipynb @@ -119,8 +119,11 @@ "metadata": {}, "outputs": [], "source": [ - "col_url = \"https://dkrz.de/s/intake\"\n", - "parent_col=intake.open_catalog([col_url])\n", + "# Path to master catalog on the DKRZ server\n", + "#dkrz_catalog=intake.open_catalog([\"https://dkrz.de/s/intake\"])\n", + "#\n", + "#only for the web page we need to take the original link:\n", + "parent_col=intake.open_catalog([\"https://gitlab.dkrz.de/data-infrastructure-services/intake-esm/-/raw/master/esm-collections/cloud-access/dkrz_catalog.yaml\"])\n", "list(parent_col)" ] }, diff --git a/notebooks/demo/use-case_global-yearly-mean-anomaly_xarray-hvplot_cmip6.ipynb b/notebooks/demo/use-case_global-yearly-mean-anomaly_xarray-hvplot_cmip6.ipynb index 833c913..08b55ca 100644 --- a/notebooks/demo/use-case_global-yearly-mean-anomaly_xarray-hvplot_cmip6.ipynb +++ b/notebooks/demo/use-case_global-yearly-mean-anomaly_xarray-hvplot_cmip6.ipynb @@ -96,8 +96,11 @@ 
"metadata": {}, "outputs": [], "source": [ - "col_url = \"https://dkrz.de/s/intake\"\n", - "parent_col=intake.open_catalog([col_url])\n", + "# Path to master catalog on the DKRZ server\n", + "#dkrz_catalog=intake.open_catalog([\"https://dkrz.de/s/intake\"])\n", + "#\n", + "#only for the web page we need to take the original link:\n", + "parent_col=intake.open_catalog([\"https://gitlab.dkrz.de/data-infrastructure-services/intake-esm/-/raw/master/esm-collections/cloud-access/dkrz_catalog.yaml\"])\n", "list(parent_col)" ] }, diff --git a/notebooks/demo/use-case_plot-unstructured_psyplot_cmip6.ipynb b/notebooks/demo/use-case_plot-unstructured_psyplot_cmip6.ipynb index e508cd7..b83a0cc 100644 --- a/notebooks/demo/use-case_plot-unstructured_psyplot_cmip6.ipynb +++ b/notebooks/demo/use-case_plot-unstructured_psyplot_cmip6.ipynb @@ -56,8 +56,11 @@ "metadata": {}, "outputs": [], "source": [ - "col_url = \"https://dkrz.de/s/intake\"\n", - "parent_col=intake.open_catalog([col_url])\n", + "# Path to master catalog on the DKRZ server\n", + "#dkrz_catalog=intake.open_catalog([\"https://dkrz.de/s/intake\"])\n", + "#\n", + "#only for the web page we need to take the original link:\n", + "parent_col=intake.open_catalog([\"https://gitlab.dkrz.de/data-infrastructure-services/intake-esm/-/raw/master/esm-collections/cloud-access/dkrz_catalog.yaml\"])\n", "list(parent_col)" ] }, -- GitLab