Try to exclude era5

ee9aaf09 · Fabian Wachsmann · 3194eb2f · ee9aaf09
Commit ee9aaf09 authored 2 years ago by Fabian Wachsmann
--- a/notebooks/demo/tutorial_intake-1-3-dkrz-catalogs-era5.ipynb
+++ b/notebooks/demo/tutorial_intake-1-3-dkrz-catalogs-era5.ipynb
@@ -6,7 +6,7 @@
   "source": [
    "# Intake I part 3 - DKRZ Catalogs: ERA5 data\n",
    "\n",
-    "DKRZ intake catalogs cover different projects. This notebook describes ERA5 and the catalog for ERA5 data."
+    "DKRZ intake catalogs cover different projects. This notebook describes the data project ERA5 and the catalog for the ERA5 data."
   ]
  },
  {

 %% Cell type:markdown id: tags:

 # Intake I part 3 - DKRZ Catalogs: ERA5 data

-DKRZ intake catalogs cover different projects. This notebook describes ERA5 and the catalog for ERA5 data.
+DKRZ intake catalogs cover different projects. This notebook describes the data project ERA5 and the catalog for the ERA5 data.

 %% Cell type:markdown id: tags:

 ```{admonition} Overview
 :class: dropdown

 ![Level](https://img.shields.io/badge/Level-Introductory-green.svg)


 🎯 **objectives**: Get to know the ERA5 collection

 ⌛ **time_estimation**: "15min"

 ☑️ **requirements**: `intake_esm.__version__ >= 2021.8.17`, at least 5GB memory.

 © **contributors**: k204210

 ⚖ **license**:

 ```

 %% Cell type:markdown id: tags:

 ```{admonition} Agenda
 :class: tip

 In this part, you learn

 1. [what ERA5 is](#intro)
 1. [how to find the collection](#find)
 1. [browsing through the ERA5 collection](#browse)
 1. [how to load ERA5 data with intake-esm](#access)

 ```

 %% Cell type:markdown id: tags:

 <a class="anchor" id="intro"></a>

 ## ERA5, its features and use cases

 ERA ('ECMWF Re-Analysis') refers to a series of climate reanalysis datasets produced at the [European Centre for Medium-Range Weather Forecasts](http://www.ecmwf.int). Climate reanalyses combine observations with models to generate consistent time series of multiple climate variables. [ERA5 (ERA fifth generation)](https://www.ecmwf.int/en/forecasts/datasets/reanalysis-datasets/era5) is the latest climate reanalysis, produced by the Copernicus Climate Change Service (C3S) at ECMWF. It replaces ERA-Interim and other [predecessor ERA datasets](https://confluence.ecmwf.int/display/CKB/The+family+of+ERA5+datasets?src=contextnavpagetreemode) such as ERA-40, ERA-15 and ERA-20C.

 Contracted by the [German Meteorological Service](https://www.dwd.de/DE/Home/home_node.html), the World Data Centre for Climate (WDCC) at DKRZ is the German distributor of a [selection of these data](https://docs.dkrz.de/doc/dataservices/finding_and_accessing_data/era_data/index.html).

 > ERA5 is a global comprehensive reanalysis, from 1979 to near real time. The period 1959 to 1979 was only recently released and is currently being transferred to DKRZ.

 %% Cell type:markdown id: tags:

 ### Features

 - Spatial resolution is about **31 km** globally
 - Dependent on the parameter, the data are stored on a **reduced Gaussian Grid (N320)** <br> or as **spectral coefficients** (with a triangular truncation of **T639**)
 - Provided on 137/37 different **model/pressure** levels
 - Temporal coverage from **1979 up to today** (1959-1979 newly released)
 - Temporal resolution from hourly, daily to monthly

 ### Use cases

 ERA5 data have a broad range of applications, some of which are

 - forcing of (regional) climate models,
 - evaluation of climate models with reanalysis,
 - comparison of weather observations to data of other scientific fields.

 %% Cell type:markdown id: tags:

 ### Further information

 - [General ERA5 data documentation](https://confluence.ecmwf.int/display/CKB/ERA5:+data+documentation)
 - [List of parameters/codes/definitions from the parameter database by code/table numbers](https://apps.ecmwf.int/codes/grib/param-db)
 - [List of params/codes/defs from the parameter DB by parameter types, incl explanations](https://confluence.ecmwf.int/display/CKB/ERA5%3A+data+documentation#ERA5:datadocumentation-Parameterlistings)
 - [Conversion table for accumulated variables (total precipitation/fluxes)](https://confluence.ecmwf.int/pages/viewpage.action?pageId=197702790)
 - [ERA5 data in DKRZ's /pool/data](https://docs.dkrz.de/doc/dataservices/finding_and_accessing_data/era_data/index.html)

 Please send an email to data [at] dkrz [dot] de and visit the [DKRZ Webpage](https://www.dkrz.de/up/de-services/de-data-management/de-projects_cooperations/de-era/de-era)


 %% Cell type:markdown id: tags:

 <a class="anchor" id="find"></a>

 ## Find and open the collection

 First of all, we need to import the required packages

 %% Cell type:code id: tags:

 ``` python
 import intake
 ```

 %% Cell type:markdown id: tags:

 We use intake to open the main catalog which includes all project catalogs and sub catalogs.

 `intake` **opens** catalogs for data sources given in `yaml` format. These contain information about plugins and sources required for accessing and loading the data. The command is `open_catalog`:

 %% Cell type:code id: tags:

 ``` python
 #dkrz_catalog=intake.open_catalog(["https://dkrz.de/s/intake"])
 #
 #only for the web page we need to take the original link:
 dkrz_catalog=intake.open_catalog(["https://gitlab.dkrz.de/data-infrastructure-services/intake-esm/-/raw/master/esm-collections/cloud-access/dkrz_catalog.yaml"])
 ```

 %% Cell type:markdown id: tags:

 Use `print` and `list` to find out what the catalog contains:

 %% Cell type:code id: tags:

 ``` python
 list(dkrz_catalog)
 ```

 %% Output

    ['dkrz_cmip5_archive',
     'dkrz_cmip5_disk',
     'dkrz_cmip6_cloud',
     'dkrz_cmip6_disk',
     'dkrz_cordex_disk',
     'dkrz_dyamond-winter_disk',
     'dkrz_era5_disk',
     'dkrz_nextgems_disk',
     'dkrz_palmod2_disk']
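
The entry names in the listing appear to follow a `dkrz_<project>_<storage>` pattern. The following minimal sketch groups them by project under that assumption; the pattern is inferred from the names shown above and is not a documented guarantee of the catalog.

```python
# Entry names copied from the catalog listing above.
entries = [
    "dkrz_cmip5_archive",
    "dkrz_cmip5_disk",
    "dkrz_cmip6_cloud",
    "dkrz_cmip6_disk",
    "dkrz_cordex_disk",
    "dkrz_dyamond-winter_disk",
    "dkrz_era5_disk",
    "dkrz_nextgems_disk",
    "dkrz_palmod2_disk",
]

# Assumption: every name has the form "dkrz_<project>_<storage>".
by_project = {}
for name in entries:
    _, project, storage = name.split("_")
    by_project.setdefault(project, []).append(storage)

# ERA5 is only listed with a disk-based collection.
print(by_project["era5"])
```

With the real catalog, `entries` could be replaced by `list(dkrz_catalog)`.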

 %% Cell type:markdown id: tags:

 We now focus on the ERA5 collection

 %% Cell type:code id: tags:

 ``` python
 col=dkrz_catalog.dkrz_era5_disk
 ```

 %% Output

    /sw/spack-levante/mambaforge-4.11.0-0-Linux-x86_64-sobz6z/lib/python3.9/site-packages/intake_esm/utils.py:96: DtypeWarning: Columns (13,14) have mixed types. Specify dtype option on import or set low_memory=False.
      return pd.read_csv(catalog_path, **csv_kwargs), catalog_path

 %% Cell type:markdown id: tags:

 The variable `col` now contains the intake collection that links to DKRZ's /pool/data ERA5 database.

 %% Cell type:code id: tags:

 ``` python
 col.description
 ```

 %% Output

    "This is an ESM collection for ERA5 data accessible on the DKRZ's disk storage system in /work/bk1099/data/"

 %% Cell type:markdown id: tags:

 Now, we print the variable `col` to see information on the properties of the data assets and the associated metadata (e.g. which institution the data come from).

 %% Cell type:code id: tags:

 ``` python
 col
 ```

 %% Output


 %% Cell type:markdown id: tags:

 The ERA5 catalog consists of 16 datasets from about 550k assets/files.

 %% Cell type:markdown id: tags:

 <a class="anchor" id="browse"></a>

 ## ERA5 collection's facets

 The **ERA5 Catalog** enables browsing through the database using **10 search facets**. We can group them into 4 categories:

 *Basic* data information:
 - `era_id`:   Today, only E5 is available.
 - `dataType`: Two data types are available: **An**alysis data are *pure* analysis and only contain intensive data (like temperature). **F**ore**c**ast data contain extensive data (like precipitation) which are accumulated quantities.
 - `uri`:     Corresponds to the path on DKRZ's HPC file system.

 %% Cell type:markdown id: tags:

 Information on the *type of horizontal level*:
 - `level_type`: Three types are available: **model_level**, **pressure_level** or **surface**

 *Temporal* information. The ERA5 database starts in January 1979 (the years back to 1959 are currently being added).
 - `stepType`:            Is the variable accumulated, instantaneous or averaged?
 - `frequency`:           What is the temporal resolution of the data? The database contains hourly, daily and monthly data.
 - `validation_date`:     The date when the analysis is valid.
 - `initialization_date`: The date when the forecast started.

 *Variable* identifier (redundant) and attributes:
 - `code`       : Corresponds to the GRIB code of the variable in the file.
 - `table_id`   : Specifies which GRIB code table is associated with the GRIB code.

 %% Cell type:markdown id: tags:

 If you require more information on the variables, the catalog can be loaded with more columns. You can look up the additional ERA5 columns in the metadata of the main catalog via:

 %% Cell type:code id: tags:

 ``` python
 dkrz_catalog.metadata["parameters"]["additional_era5_columns"]
 ```

 %% Output

    {'default': ['step', 'long_name', 'short_name', 'path', 'units'],
     'type': 'list[str]'}

 %% Cell type:markdown id: tags:

 You can load these into the catalog by providing a keyword argument:

 %% Cell type:code id: tags:

 ``` python
 cols=dkrz_catalog._entries["dkrz_era5_disk"]._open_args["csv_kwargs"]["usecols"]+dkrz_catalog.metadata["parameters"]["additional_era5_columns"]["default"]
 col=dkrz_catalog.dkrz_era5_disk(csv_kwargs=dict(usecols=cols))
 ```

 %% Output

    /sw/spack-levante/mambaforge-4.11.0-0-Linux-x86_64-sobz6z/lib/python3.9/site-packages/intake_esm/utils.py:96: DtypeWarning: Columns (13,14) have mixed types. Specify dtype option on import or set low_memory=False.
      return pd.read_csv(catalog_path, **csv_kwargs), catalog_path

 %% Cell type:markdown id: tags:

 - `short_name` : A short identifier similar to the netCDF variable name.
 - `long_name`  : A longer description of the variable.
 - `units`      : The units of the variable.

 %% Cell type:markdown id: tags:

 We can obtain more information on the individual elements by using e.g.

 %% Cell type:code id: tags:

 ``` python
 col.unique("dataType")
 ```

 %% Output

    {'dataType': {'count': 2, 'values': ['fc', 'an']}}

 %% Cell type:markdown id: tags:

 The ERA5 database has two unique `dataTypes`:
 - **fc** for forecast. All files which contain "12" in their name are *forecast* data.
 - **an** for analysis. All files which contain "00" in their name are *analysis* data.

 %% Cell type:code id: tags:

 ``` python
 col.unique("frequency")
 ```

 %% Output

    {'frequency': {'count': 3, 'values': ['hourly', 'monthly', 'daily']}}

 %% Cell type:markdown id: tags:

 The ERA5 database contains data with **hourly**, **daily** and **monthly** `frequency`.

 %% Cell type:code id: tags:

 ``` python
 col.unique("level_type")
 ```

 %% Output

    {'level_type': {'count': 3,
      'values': ['pressure_level', 'surface', 'model_level']}}

 %% Cell type:markdown id: tags:

 The ERA5 database contains **surface** level (sfc) data. In addition, it contains vertically resolved data at **model levels**
 (ml, 137 levels) and at **pressure levels** (pl, 37 levels).

 %% Cell type:code id: tags:

 ``` python
 col.unique("stepType")
 ```

 %% Output

    {'stepType': {'count': 6,
      'values': ['accum', 'avgid', 'max', 'instant', 'avgad', 'avgua']}}

 %% Cell type:markdown id: tags:

 The ERA5 database covers six `stepType`s. This attribute is parsed from the GRIB attribute `GRIB_stepType`:
 - 'accum'
 - 'max'
 - 'avgua'
 - 'avgad'
 - 'instant'
 - 'avgid'

 %% Cell type:markdown id: tags:

 We can check which combinations of **dataType**, **level_type** and **frequency** exist by using the `groupby` function of the underlying `dataframe`:

 %% Cell type:code id: tags:

 ``` python
 list(col.df.groupby(["dataType", "level_type", "frequency"]).groups.keys())
 ```

 %% Output

    [('an', 'model_level', 'hourly'),
     ('an', 'pressure_level', 'daily'),
     ('an', 'pressure_level', 'hourly'),
     ('an', 'surface', 'daily'),
     ('an', 'surface', 'hourly'),
     ('an', 'surface', 'monthly'),
     ('fc', 'surface', 'daily'),
     ('fc', 'surface', 'hourly'),
     ('fc', 'surface', 'monthly')]
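
To illustrate what the group keys represent, here is a minimal pure-Python sketch that collects the distinct facet combinations from a few rows. The rows are hypothetical toy data, not taken from the ERA5 catalog, and the code only mimics the result of `groupby(...).groups.keys()`.

```python
# Hypothetical toy rows standing in for catalog entries.
rows = [
    {"dataType": "an", "level_type": "surface", "frequency": "hourly"},
    {"dataType": "an", "level_type": "surface", "frequency": "hourly"},
    {"dataType": "fc", "level_type": "surface", "frequency": "daily"},
    {"dataType": "an", "level_type": "model_level", "frequency": "hourly"},
]

facets = ("dataType", "level_type", "frequency")

# A set removes duplicates; sorting gives a stable, readable order.
combinations = sorted({tuple(row[f] for f in facets) for row in rows})

for combination in combinations:
    print(combination)
```

Each tuple corresponds to one group: all rows that share the same three facet values.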

 %% Cell type:markdown id: tags:

 ### Browse through the ERA5 collection

 We can **search** through the intake collection by using its `search` function. E.g., we can search for ERA5 data on *pressure_level* in *hourly* frequency by:

 %% Cell type:code id: tags:

 ``` python
 query=dict(level_type="pressure_level",
           frequency="hourly")
 cat=col.search(**query)
 ```
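
Conceptually, such a query keeps only the catalog rows whose facets match every given value. The sketch below shows this idea on hypothetical toy rows; the helper `search_rows` is made up for illustration and is not how intake-esm implements `search`.

```python
def search_rows(rows, **query):
    """Return the rows whose facets equal every value in the query."""
    return [
        row for row in rows
        if all(row.get(facet) == value for facet, value in query.items())
    ]


# Hypothetical toy rows standing in for catalog entries.
rows = [
    {"level_type": "pressure_level", "frequency": "hourly", "code": 130},
    {"level_type": "pressure_level", "frequency": "daily", "code": 130},
    {"level_type": "surface", "frequency": "hourly", "code": 167},
]

query = dict(level_type="pressure_level", frequency="hourly")
subset = search_rows(rows, **query)
print(subset)
```

Only the first toy row satisfies both conditions, so `subset` has one entry.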

 %% Cell type:markdown id: tags:

 The variable `cat` is a new *sub*-catalog, i.e. a subset of the original catalog.<br>To see the variables contained in this sub-catalog, we print which unique variable *long names* exist:

 %% Cell type:code id: tags:

 ``` python
 cat.unique("long_name")
 ```

 %% Output

    {'long_name': {'count': 16,
      'values': ['Vorticity (relative)',
       'V component of wind',
       'Ozone mass mixing ratio',
       'U component of wind',
       'Relative humidity',
       'Specific humidity',
       'Fraction of cloud cover',
       'Specific rain water content',
       'Specific cloud ice water content',
       'Potential vorticity',
       'Specific cloud liquid water content',
       'Temperature',
       'Divergence',
       'Vertical velocity',
       'Geopotential',
       'Specific snow water content']}}

 %% Cell type:markdown id: tags:

 We can select a specific variable by another `search`, e.g. for *Temperature*.<br>We can also subset the temporal coverage that we are interested in. intake allows using **wildcards** in the search.<br>In the sub-catalog of hourly pressure level data, we can search e.g. for temperature data that are valid for January 1980 using:

 %% Cell type:code id: tags:

 ``` python
 temp_hourly_pl=cat.search(long_name="Temperature",
                         validation_date="1980-01.*")
 ```
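
The value `"1980-01.*"` reads like a regular expression. Assuming the search interprets it that way, the following sketch shows which dates such a pattern selects, using Python's `re` module on hypothetical `validation_date` values.

```python
import re

# Hypothetical validation_date values; the real catalog holds many more.
validation_dates = ["1979-12-31", "1980-01-01", "1980-01-31", "1980-02-01"]

# Same pattern as in the search above: "1980-01" followed by anything.
pattern = re.compile(r"1980-01.*")

january_1980 = [date for date in validation_dates if pattern.match(date)]
print(january_1980)
```

Only the two dates that start with `1980-01` remain in `january_1980`.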

 %% Cell type:markdown id: tags:

 We print the variable's short name:

 %% Cell type:code id: tags:

 ``` python
 temp_hourly_pl.unique("short_name")
 ```

 %% Output

    {'short_name': {'count': 1, 'values': ['t']}}

 %% Cell type:markdown id: tags:

 <a class="anchor" id="access"></a>

 ## Open multiple ERA5 files as `xarray` datasets

 We can open the *entire* selection at once with `to_dataset_dict`. The result will be a `dict`ionary of `xarray` datasets.
 For this, we have to specify a configuration for `xarray` via the `cdf_kwargs` argument:
 ```python
 cdf_kwargs={"engine":"cfgrib",
            "chunks":{
                "time":1
            }
 }
 ```
 While the *engine* indicates what *backend* `xarray` has to use to open the files (*here: cfgrib since the ERA5 data are stored in GRIB format*), we specify `chunks` so that `dask` is used for array handling. This approach **saves memory** and returns *futures* of arrays which are only computed and loaded if needed.<br>This may take a while. We can ignore warnings printed by the underlying `cfgrib` library.

 %% Cell type:code id: tags:

 ``` python
 temp_hourly_pl_xr_dict=temp_hourly_pl.to_dataset_dict(cdf_kwargs={"engine":"cfgrib",
                                          "chunks":{"time":1}
                                          }
                              )
 temp_hourly_pl_xr_dict
 ```

 %% Output

    
    --> The keys in the returned dictionary of datasets are constructed as follows:
    	'table_id.stepType.level_type.frequency'



    {'128.0.instant.pressure_level.hourly': <xarray.Dataset>
     Dimensions:        (time: 24, isobaricInhPa: 37, values: 542080)
     Coordinates:
         number         int64 ...
       * time           (time) datetime64[ns] 1980-01-31 ... 1980-01-31T23:00:00
         step           timedelta64[ns] ...
       * isobaricInhPa  (isobaricInhPa) int64 1000 975 950 925 900 875 ... 7 5 3 2 1
         latitude       (values) float64 dask.array<chunksize=(542080,), meta=np.ndarray>
         longitude      (values) float64 dask.array<chunksize=(542080,), meta=np.ndarray>
         valid_time     (time) datetime64[ns] dask.array<chunksize=(1,), meta=np.ndarray>
     Dimensions without coordinates: values
     Data variables:
         t              (time, isobaricInhPa, values) float32 dask.array<chunksize=(1, 37, 542080), meta=np.ndarray>
     Attributes:
         GRIB_edition:            1
         GRIB_centre:             ecmf
         GRIB_centreDescription:  European Centre for Medium-Range Weather Forecasts
         GRIB_subCentre:          0
         Conventions:             CF-1.7
         institution:             European Centre for Medium-Range Weather Forecasts
         history:                 2022-06-21T18:04:52 GRIB to CDM+CF via cfgrib-0....
         intake_esm_varname:      130
         intake_esm_dataset_key:  128.0.instant.pressure_level.hourly}

 %% Cell type:markdown id: tags:

 <br>The dictionary *temp_hourly_pl_xr_dict* has exactly one entry because *all files* of the sub-catalog temp_hourly_pl have been merged along the time axis. The default configurations that control operations on the sub-catalog can be parsed as follows:

 %% Cell type:code id: tags:

 ``` python
 temp_hourly_pl.esmcol_data["aggregation_control"]
 ```

 %% Output

    {'aggregations': [{'attribute_name': 'code', 'type': 'union'}],
     'variable_column_name': 'code',
     'groupby_attrs': ['table_id', 'stepType', 'level_type', 'frequency']}
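
The `groupby_attrs` explain the dictionary key printed by `to_dataset_dict`: the values of these attributes are joined with dots. The sketch below rebuilds the key from a hypothetical record; it only reproduces the observable key format and makes no claim about intake-esm internals.

```python
# Attribute order taken from the aggregation_control output above.
groupby_attrs = ["table_id", "stepType", "level_type", "frequency"]

# Hypothetical record; the values are chosen to match the key seen earlier.
record = {
    "table_id": 128.0,
    "stepType": "instant",
    "level_type": "pressure_level",
    "frequency": "hourly",
}

# Join the attribute values in order, separated by dots.
dataset_key = ".".join(str(record[attr]) for attr in groupby_attrs)
print(dataset_key)
```

Because `table_id` is shown as the float `128.0`, the key itself contains a dot inside its first component, so splitting the key on `.` does not recover the four attributes.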

 %% Cell type:markdown id: tags:

 <br>Now, let's get our dataset and have a look. We extract the last (and only) entry from `temp_hourly_pl_xr_dict` using the `popitem` method. `popitem` returns a tuple of size 2. The first element (index 0) is the key '128.0.instant.pressure_level.hourly', the second element (index 1) is the dataset:

 %% Cell type:code id: tags:

 ``` python
 temp_hourly_pl_xr_dset=temp_hourly_pl_xr_dict.popitem()[1]
 ```

 %% Cell type:markdown id: tags:

 Please note that `popitem` removes the returned entry, so temp_hourly_pl_xr_dict is an empty dictionary afterwards.
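
The behaviour of `popitem` can be checked on a plain dictionary. The sketch below uses a placeholder string instead of a real dataset and also shows `next(iter(...))` as a non-destructive alternative, in case the dictionary is needed again later.

```python
# Placeholder dictionary; the string stands in for an xarray dataset.
datasets = {"128.0.instant.pressure_level.hourly": "placeholder dataset"}

# Non-destructive access: the dictionary keeps its entry.
key, dset = next(iter(datasets.items()))
print(len(datasets))

# Destructive access: popitem returns (key, value) and removes the entry.
popped_key, popped_dset = datasets.popitem()
print(len(datasets))
```

Both approaches return the same key and value; only `popitem` empties the dictionary.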

 %% Cell type:markdown id: tags:

 *temp_hourly_pl_xr_dset* is an `xarray` dataset. We can take advantage of `xarray` techniques to e.g.
 - **select** the 500hPa level and
 - **calculate** the mean for the selected month January

 %% Cell type:code id: tags:

 ``` python
 t500mean=temp_hourly_pl_xr_dset.sel(isobaricInhPa=500.,
                           method="nearest").mean(dim="time")
 ```

 %% Cell type:markdown id: tags:

 What does the new xarray dataset t500mean look like?

 %% Cell type:code id: tags:

 ``` python
 t500mean
 ```

 %% Output

    <xarray.Dataset>
    Dimensions:        (values: 542080)
    Coordinates:
        number         int64 ...
        step           timedelta64[ns] ...
        isobaricInhPa  int64 500
        latitude       (values) float64 dask.array<chunksize=(542080,), meta=np.ndarray>
        longitude      (values) float64 dask.array<chunksize=(542080,), meta=np.ndarray>
    Dimensions without coordinates: values
    Data variables:
        t              (values) float32 dask.array<chunksize=(542080,), meta=np.ndarray>

 %% Cell type:markdown id: tags:

 We see that the values of the data variable t are given as a `dask.array`.<br>
 Using `compute`, we manually trigger loading the data of this dataset into memory and return a new dataset.

 %% Cell type:code id: tags:

 ``` python
 t500mean.compute()
 ```

 %% Output

    <xarray.Dataset>
    Dimensions:        (values: 542080)
    Coordinates:
        number         int64 0
        step           timedelta64[ns] 00:00:00
        isobaricInhPa  int64 500
        latitude       (values) float64 89.78 89.78 89.78 ... -89.78 -89.78 -89.78
        longitude      (values) float64 0.0 20.0 40.0 60.0 ... 300.0 320.0 340.0
    Dimensions without coordinates: values
    Data variables:
        t              (values) float32 236.8 236.7 236.6 ... 237.5 237.4 237.4

 %% Cell type:markdown id: tags:

 Plotting the data with the `plot` function shows the meridional gradient of 500 hPa temperature (in K) in January 1980. The x-axis is a proxy for the latitude (North->South direction). The figure shows that the mid-atmosphere temperature (500 hPa) increases strongly from the poles towards the Equator.

 %% Cell type:code id: tags:

 ``` python
 t500mean.t.plot()
 ```

 %% Output

    [<matplotlib.lines.Line2D at 0x7fff53d7a610>]