{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Intake I part 3 - DKRZ Catalogs: ERA5 data\n",
"\n",
"DKRZ intake catalogs cover different projects. This notebook describes ERA5 and the catalog for its data."
]
},
{
"cell_type": "markdown",
"metadata": {
"tags": []
},
"source": [
"```{admonition} Overview\n",
":class: dropdown\n",
"\n",
"![Level](https://img.shields.io/badge/Level-Introductory-green.svg)\n",
"\n",
"\n",
"🎯 **objectives**: Get to know the ERA5 collection\n",
"\n",
"⌛ **time_estimation**: \"15min\"\n",
"\n",
"☑️ **requirements**: `intake_esm.__version__ >= 2021.8.17`, at least 5GB memory.\n",
"\n",
"© **contributors**: k204210\n",
"\n",
"⚖ **license**:\n",
"\n",
"```"
]
},
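{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you want to verify the version requirement above first, a small optional check (nothing in the rest of this notebook depends on it) is to print the installed `intake-esm` version:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional: check that the installed intake-esm version fulfills the requirement above\n",
"import intake_esm\n",
"print(intake_esm.__version__)"
]
},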
{
"cell_type": "markdown",
"metadata": {},
"source": [
"```{admonition} Agenda\n",
":class: tip\n",
"\n",
"In this part, you learn\n",
"\n",
"1. [what ERA5 is](#intro)\n",
"1. [how to find the collection](#find)\n",
"1. [browsing through the ERA5 collection](#browse)\n",
"1. [how to load ERA5 data with intake-esm](#access)\n",
" \n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a class=\"anchor\" id=\"intro\"></a>\n",
"\n",
"## ERA5, its features and use cases\n",
"\n",
"ERA ('ECMWF Re-Analysis') refers to a series of climate reanalysis datasets produced at the [European Centre for Medium-Range Weather Forecasts](http://www.ecmwf.int). Climate reanalyses combine observations with models to generatÏe consistent time series of multiple climate variables. [ERA5 (ERA fifth generation)](https://www.ecmwf.int/en/forecasts/datasets/reanalysis-datasets/era5) is the latest climate reanalysis which is produced by Copernicus Climate Change Service (C3S) at ECMWF. It replaces ERA-Interim and other [predecessor ERA datasets](https://confluence.ecmwf.int/display/CKB/The+family+of+ERA5+datasets?src=contextnavpagetreemode) such as, e.g., ERA-40, ERA-15 and ERA-20C.\n",
"\n",
"Contracted by the [German Meteorological Service](https://www.dwd.de/DE/Home/home_node.html), the World Data Centre for Climate (WDCC) at DKRZ is the German distributor of a [selection of these data](https://docs.dkrz.de/doc/dataservices/finding_and_accessing_data/era_data/index.html).\n",
"\n",
"> ERA5 is a global comprehensive reanalysis, from 1979 to near real time. The period 1959 to 1979 was only recently released and is currently being transferred to DKRZ. "
]
},
{
"cell_type": "markdown",
"metadata": {
"jp-MarkdownHeadingCollapsed": true,
"tags": []
},
"source": [
"### Features\n",
"\n",
"- Spatial resolution is about **31 km** globally\n",
"- Dependent on the parameter, the data are stored on a **reduced Gaussian Grid (N320)** <br> or as **spectral coefficients** (with a triangular truncation of **T639**)\n",
"- Provided on 137/37 different **model/pressure** levels\n",
"- Temporal coverage from **1979 up to today** (1959-1979 newly released) \n",
"- Temporal resolution from hourly, daily to monthly\n",
"\n",
"### Use cases\n",
"\n",
"ERA5 data have a broad range of applications, some of which are\n",
"\n",
"- forcing of (regional) climate models,\n",
"- evaluation of climate models with reanalysis,\n",
"- comparison of weather observations to data of other scientific fields."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Further information\n",
"\n",
"- [General ERA5 data documentation](https://confluence.ecmwf.int/display/CKB/ERA5:+data+documentation)\n",
"- [List of parameters/codes/definitions from the parameter database by code/table numbers](https://apps.ecmwf.int/codes/grib/param-db)\n",
"- [List of params/codes/defs from the parameter DB by parameter types, incl explanations](https://confluence.ecmwf.int/display/CKB/ERA5%3A+data+documentation#ERA5:datadocumentation-Parameterlistings)\n",
"- [Conversion table for accumulated variables (total precipitation/fluxes)](https://confluence.ecmwf.int/pages/viewpage.action?pageId=197702790)\n",
"- [ERA5 data in DKRZ's /pool/data](https://docs.dkrz.de/doc/dataservices/finding_and_accessing_data/era_data/index.html)\n",
"\n",
"Please mail to data [at] dkrz [dot] de and visit the [DKRZ Webpage](https://www.dkrz.de/up/de-services/de-data-management/de-projects_cooperations/de-era/de-era)\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a class=\"anchor\" id=\"find\"></a>\n",
"\n",
"## Find and open the collection\n",
"\n",
"First of all, we need to import the required packages"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import intake"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We use intake to open the main catalog which includes all project catalogs and sub catalogs.\n",
"\n",
"`intake` **opens** catalogs for data sources given in `yaml` format. These contain information about plugins and sources required for accessing and loading the data. The command is `open_catalog`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#dkrz_catalog=intake.open_catalog([\"https://dkrz.de/s/intake\"])\n",
"#\n",
"#only for the web page we need to take the original link:\n",
"dkrz_catalog=intake.open_catalog([\"https://gitlab.dkrz.de/data-infrastructure-services/intake-esm/-/raw/master/esm-collections/cloud-access/dkrz_catalog.yaml\"])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Use `print` and `list` to find out what the catalog contains:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"list(dkrz_catalog)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We now focus on the ERA5 collection"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"col=dkrz_catalog.dkrz_era5_disk"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The variable `col` now contains the intake collection that links to DKRZ's /pool/data ERA5 database."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"col.description"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, we print the variable `col` to see information on the data assets properties and associated metadata (e.g. which institution the data come from)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"col"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The ERA5 catalog consists of 16 datasets from about 550k assets/files."
]
},
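{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick cross-check (a minimal sketch, assuming `col` has been opened as above), both numbers can be read from the collection itself: the rows of the underlying dataframe correspond to the assets, and the keys correspond to the datasets."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Number of assets/files referenced by the catalog\n",
"print(len(col.df))\n",
"# Number of datasets the assets are aggregated into\n",
"print(len(col.keys()))"
]
},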
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a class=\"anchor\" id=\"browse\"></a>\n",
"\n",
"## ERA5 collection's facets\n",
"\n",
"The **ERA5 Catalog** enables to browse through the data base using **10 search facets**. We could group them into 4 categories:\n",
"\n",
"*Basic* data information:\n",
"- `era_id`: Today, only E5 is available.\n",
"- `dataType`: Two data types are available: **An**alysis data are *pure* analysis and only contain intensive data (like temperature). **F**ore**c**ast data contain extensive data (like precipitation) which are accumulated quantities.\n",
"- `uri`: Corresponds to the path on DKRZ's HPC file system."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Information on the *type of horizontal level*:\n",
"- `level_type`: Three types are available: **model_level**, **pressure_level** or **surface**\n",
"\n",
"*Temporal* information. The ERA5 database starts in January 1979 (the years until 1959 are currently being added). \n",
"- `stepType`: Is the variable accumulated, instantaneous or averaged?\n",
"- `frequency`: What is the temporal resolution of the data? The database contains hourly, daily and monthly data.\n",
"- `validation_date`: The date when the analysis is valid.\n",
"- `initialization_date`: The date when the forecast started.\n",
"\n",
"*Variable* identifier (redundant) and attributes:\n",
"- `code` : Corresponds to the GRIB code of the variable in the file.\n",
"- `table_id` : Specifies which GRIB code table associated with the Grib code. "
]
},
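{
"cell_type": "markdown",
"metadata": {},
"source": [
"These facets show up as columns of the catalog's underlying `pandas` dataframe. As a small illustration (assuming `col` from above), you can list them directly:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# The search facets correspond to the columns of the underlying dataframe\n",
"list(col.df.columns)"
]
},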
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you require more information on the variables, the catalog can be loaded with more columns. You can find out additional era5 attributes from the main catalog via:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dkrz_catalog.metadata[\"parameters\"][\"additional_era5_columns\"]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can load these into the catalog by providing a keyword argument:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"cols=dkrz_catalog._entries[\"dkrz_era5_disk\"]._open_args[\"csv_kwargs\"][\"usecols\"]+dkrz_catalog.metadata[\"parameters\"][\"additional_era5_columns\"][\"default\"]\n",
"col=dkrz_catalog.dkrz_era5_disk(csv_kwargs=dict(usecols=cols))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- `short_name` : A short identifier similar to the netCDF variable name.\n",
"- `long_name` : A longer description of the variable.\n",
"- `units` : The units of the variable."
]
},
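{
"cell_type": "markdown",
"metadata": {},
"source": [
"For a first impression of these attributes, a small sketch (the selected columns are just an illustration) is to look at a few unique variable entries of the enlarged catalog:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Show a few unique variable descriptions from the enlarged catalog\n",
"col.df[[\"code\", \"table_id\", \"short_name\", \"long_name\", \"units\"]].drop_duplicates().head()"
]
},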
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can obtain more information on the individual elements by using e.g."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"col.unique(\"dataType\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The ERA5 database has two unique `dataTypes`: \n",
"- **fc** for forecast. All files which contain \"12\" in their name are *forecast* data.\n",
"- **an** for analysis. All files which contain \"00\" in their name are *analysis* data."
]
},
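{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you only need one of the two data types, the collection can be subset accordingly. As a small example (not needed for the rest of this notebook), the analysis part is selected via:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Keep only the analysis (an) assets; the original collection col stays unchanged\n",
"col_an=col.search(dataType=\"an\")\n",
"col_an"
]
},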
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"col.unique(\"frequency\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The ERA5 database contains data with **hourly**, **daily** and **monthly** `frequency`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"col.unique(\"level_type\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The ERA5 database contains **surface** level (sfc) data. In addition, it contains vertically resolved data at **model levels** \n",
"(ml, 137 levels) and at **pressure levels** (pl, 37 levels). "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"col.unique(\"stepType\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The ERA5 database covers six `stepType`s. This attribute is parsed from the GRIB attribute `GRIB_stepType`: \n",
"- 'accum'\n",
"- 'max'\n",
"- 'avgua'\n",
"- 'avgad'\n",
"- 'instant'\n",
"- 'avgid'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can check which combinations of **dataType**, **level_type** and **frequency** exist by using the`groupby` function of the underlying `dataframe`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"list(col.df.groupby([\"dataType\", \"level_type\", \"frequency\"]).groups.keys())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Browse through the ERA5 collection\n",
"\n",
"We can **search** through the intake collection by using its `search` function. E.g., we can search for ERA5 data on *pressure_level* in *hourly* frequency by:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"query=dict(level_type=\"pressure_level\",\n",
" frequency=\"hourly\")\n",
"cat=col.search(**query)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The variable `cat` is a new *sub*-catalog i.e. a subset of the original catalog.<br>To see the variables contained in this sub-catalog, we print what unique variable *long names* exists :"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"cat.unique(\"long_name\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can select a specific variable by another `search`, e.g. for *Temperature*.<br>We can also subset the temporal coverage that we are interested in. intake allows using **wildcards** in the search.<br>In the sub-catalog of hourly pressure level data, we can search e.g. for temperature data that are valid for January 1980 using:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"temp_hourly_pl=cat.search(long_name=\"Temperature\",\n",
" validation_date=\"1980-01.*\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We print the variable's short name:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"temp_hourly_pl.unique(\"short_name\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a class=\"anchor\" id=\"access\"></a>\n",
"\n",
"## Open multiple ERA5 files as `xarray` datasets\n",
"\n",
"We can open the *entire* selection at once with `to_dataset_dict`. The result will be a `dict`ionary of `xarray` datasets.\n",
"For this, we have to specify a configuration for `xarray` via the `cdf_kwargs` argument:\n",
"```python\n",
"cdf_kwargs={\"engine\":\"cfgrib\",\n",
" \"chunks\":{\n",
" \"time\":1\n",
" }\n",
"}\n",
"```\n",
"While the *engine* indicates what *backend* `xarray` has to use to open the files (*here: cfgrib since the ERA5 data are stored in GRIB format*), we specify `chunks` so that `dask` is used for array handling. This approach **saves memory** and returns *futures* of arrays which are only computed and loaded if needed.<br>This may take a while. We can ignore warnings printed by the underlying `cfgrib` library. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"temp_hourly_pl_xr_dict=temp_hourly_pl.to_dataset_dict(cdf_kwargs={\"engine\":\"cfgrib\",\n",
" \"chunks\":{\"time\":1}\n",
" } \n",
" )\n",
"temp_hourly_pl_xr_dict"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>The dictionary *temp_hourly_pl_xr_dict* has exactly one entry because *all files* of the sub-catalog temp_hourly_pl have been merged along the time axis. The default configurations that control operations on the sub-catalog can be parsed as follows:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"temp_hourly_pl.esmcol_data[\"aggregation_control\"]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>Now, let's get our dataset and have a look. We extract the last (and only) entry from `temp_hourly_pl_xr_dict` using the `popitem` method. `popitem` returns a tuple of size 2. The first tuple (index 0) contains the key '128.0.instant.pressure_level.hourly', the second tuple (index 1) contains the dataset:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"temp_hourly_pl_xr_dset=temp_hourly_pl_xr_dict.popitem()[1]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Please note that once the method `popitem` is applied upon temp_hourly_pl_xr_dict, it return an empty temp_hourly_pl_xr_dict dictionary."
]
},
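{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can verify this with a quick check:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# After popitem, the dictionary does not contain any entries anymore\n",
"len(temp_hourly_pl_xr_dict)"
]
},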
{
"cell_type": "markdown",
"metadata": {},
"source": [
"*temp_hourly_pl_xr_dset* is an `xarray` dataset. We can take advantage of `xarray` techniques to e.g.\n",
"- **select** the 500hPa level and\n",
"- **calculate** the mean for the selected month January"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"t500mean=temp_hourly_pl_xr_dset.sel(isobaricInhPa=500.,\n",
" method=\"nearest\").mean(dim=\"time\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"How does the new xarray dataset t500mean look like?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"t500mean"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We see that the values of the data variable t are given as dask.array.<br>\n",
"Using `compute`\u001b, we manually trigger loading the data of this dataset into memory and to return a new dataset."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"t500mean.compute()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Plotting the data with the `plot` function shows the zonal gradient of 500 hPa temperature (in K) in January 1980. The x-axis is a proxy for the latitude (North->South direction). The figure reflects mid-atmosphere temperature (500 hPa) strongly increases from the poles towards the Equator."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"t500mean.t.plot()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "taucenv",
"language": "python",
"name": "taucenv"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.4"
}