Commit daaa1fcf authored by Fabian Wachsmann's avatar Fabian Wachsmann

Delete tutorial_find-data_catalogs_intake-esm.ipynb

{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Tutorial on how to find data with a catalog and intake-esm\n",
"\n",
"In this tutorial, we load a CMIP6 catalog which contains all data from the pool on DKRZ's Mistral disk storage.\n",
"CMIP6 is the 6th phase of the Coupled Model Intercomparison Project and forms the data basis for the IPCC AR6.\n",
"The CMIP6 catalog contains all data that is published or replicated at the ESGF node at DKRZ.\n",
"\n",
"- Intake-esm tutorial: https://intake-esm.readthedocs.io/en/latest/notebooks/tutorial.html\n",
"\n",
"For information about CMIP6, we recommend reading \n",
"https://pcmdi.llnl.gov/CMIP6/Guide/dataUsers.html\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Preparation\n",
"First of all, we need to import the required packages:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import intake\n",
"import requests"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The catalog files are uploaded to the cloud. There is a csv.gz file and a json file under\n",
"https://swiftbrowser.dkrz.de/public/dkrz_a44962e3ba914c309a7421573a6949a6/intake-esm/ \n",
"We load the catalog descriptor with intake:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"col_url = \"https://swift.dkrz.de/v1/dkrz_a44962e3ba914c309a7421573a6949a6/intake-esm/mistral-cmip6.json\"\n",
"col = intake.open_esm_datastore(col_url)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The descriptor looks like this:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"requests.get(col_url).json()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It contains information on how to interpret the columns of the catalog.\n",
"The most important attribute is \"catalog_file\", which points to the csv.gz file that holds the actual catalog data."
]
},
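{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a sketch of what such a descriptor contains (a toy example with made-up values, not the real DKRZ descriptor), the JSON names the catalog columns and points to the csv.gz file via \"catalog_file\":"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Toy intake-esm descriptor (illustrative only, not the DKRZ file):\n",
"toy_descriptor = {\n",
"    'esmcat_version': '0.1.0',\n",
"    'id': 'toy-cmip6',\n",
"    'description': 'Toy intake-esm descriptor',\n",
"    'catalog_file': 'toy-cmip6.csv.gz',\n",
"    'attributes': [{'column_name': 'source_id'}, {'column_name': 'experiment_id'}],\n",
"    'assets': {'column_name': 'path', 'format': 'netcdf'}\n",
"}\n",
"print(sorted(toy_descriptor))"
]
},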
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Orientation\n",
"Let's see what is in the intake catalog. The underlying database is given as a pandas DataFrame, which we can access with 'col.df'. 'col.df.head()' shows us the first rows of the table:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"col.df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Each row describes one file. Let us grab one and use the old-school system command \"ncdump -h\" to show the header of the file, which contains its metadata."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"testfile = col.df['path'][0]\n",
"# Commands that start with '!' are system commands that run in a shell.\n",
"# Python variables can be interpolated into such a command by wrapping them in curly braces '{}'.\n",
"!ncdump -h {testfile}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The number of files in the entire catalog:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"collen = len(col.df)\n",
"print(collen)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Advanced: To count only distinct datasets rather than files, we can use pandas functions: drop the columns that vary within a dataset, then drop the duplicates, keeping one row per dataset:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"cat = col.df.drop(columns=['path', 'time_range', 'dcpp_init_year']).drop_duplicates(keep=\"first\")\n",
"print(len(cat))"
]
},
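{
"cell_type": "markdown",
"metadata": {},
"source": [
"The idea can be illustrated on a toy table (hypothetical values, not the real catalog): two files that are time chunks of one and the same dataset collapse into a single row once the file-level columns are dropped."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"# Toy catalog table (hypothetical values): two time chunks of one dataset\n",
"toy = pd.DataFrame({\n",
"    'source_id': ['MPI-ESM1-2-HR', 'MPI-ESM1-2-HR'],\n",
"    'experiment_id': ['historical', 'historical'],\n",
"    'variable_id': ['tas', 'tas'],\n",
"    'time_range': ['185001-186912', '187001-188912'],\n",
"    'path': ['/pool/a.nc', '/pool/b.nc']})\n",
"# Dropping the file-level columns and the duplicates leaves one row per dataset:\n",
"datasets = toy.drop(columns=['path', 'time_range']).drop_duplicates(keep='first')\n",
"print(len(toy), len(datasets))"
]
},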
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Browse"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There are different ways to browse through the catalog. One is provided by the intake-esm package.\n",
"We define a query and specify values for some of the columns.\n",
"In the following case, we look for near-surface air temperature ('tas') in monthly resolution ('Amon') for 3 different experiments:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"query = dict(\n",
" variable_id=\"tas\",\n",
" table_id=\"Amon\",\n",
" experiment_id=[\"piControl\", \"historical\", \"ssp370\"])\n",
"# piControl: pre-industrial control, a simulation representing a stable climate from 1850 onwards for at least 100 years.\n",
"# historical: historical simulation, 1850-2014.\n",
"# ssp370: one of the Shared Socioeconomic Pathways (SSPs), scenarios of projected socioeconomic global changes; the simulation covers 2015-2100.\n",
"cat = col.search(**query)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can find out which models have submitted data for at least one of these experiments:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"cat.unique([\"source_id\"])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If we instead look for the models that have submitted data for ALL experiments, we use the 'require_all_on' keyword:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"cat = col.search(require_all_on=[\"source_id\"], **query)\n",
"cat.unique([\"source_id\"])"
]
},
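{
"cell_type": "markdown",
"metadata": {},
"source": [
"What 'require_all_on' does can be mimicked with plain pandas on a toy table (hypothetical values): keep only the models whose set of experiments covers all requested ones."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"# Toy table (hypothetical values): model 'A' covers all three experiments, 'B' only one.\n",
"experiments = ['piControl', 'historical', 'ssp370']\n",
"toy = pd.DataFrame({\n",
"    'source_id': ['A', 'A', 'A', 'B'],\n",
"    'experiment_id': ['piControl', 'historical', 'ssp370', 'historical']})\n",
"# Keep only the models whose experiments cover all requested ones:\n",
"complete = [model for model, group in toy.groupby('source_id')\n",
"            if set(experiments) <= set(group['experiment_id'])]\n",
"print(complete)"
]
},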
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that only the combination of a variable_id and a table_id is unique in CMIP6. If you search for tas in all tables, you will find many more entries:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"query = dict(\n",
" variable_id=\"tas\",\n",
"# table_id=\"Amon\",\n",
" experiment_id=[\"piControl\", \"historical\", \"ssp370\"])\n",
"cat = col.search(**query)\n",
"cat.unique([\"table_id\"])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can also select sub-DataFrames with pandas functions, for example all data submitted for the activity_id CMIP:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"cat = col.df[col.df[\"activity_id\"] == \"CMIP\"]\n",
"cat.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Be careful when you search for specific time slices. Each frequency is connected with an individual filename template. If the data is yearly, the time range in the filename has the form YYYY-YYYY, whereas monthly data uses YYYYMM-YYYYMM."
]
}
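,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To illustrate (with hypothetical 'time_range' values and a made-up helper 'parse_range'), the two templates can be told apart by the length of the date strings:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical 'time_range' values as they appear in CMIP6 filenames:\n",
"yearly = '1850-1869'\n",
"monthly = '185001-186912'\n",
"\n",
"def parse_range(time_range):\n",
"    # Split 'start-end' and detect monthly (YYYYMM) vs yearly (YYYY) parts\n",
"    start, end = time_range.split('-')\n",
"    if len(start) == 6:\n",
"        return (int(start[:4]), int(start[4:])), (int(end[:4]), int(end[4:]))\n",
"    return (int(start),), (int(end),)\n",
"\n",
"print(parse_range(yearly))\n",
"print(parse_range(monthly))"
]
}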
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 unstable (using the module python3/unstable)",
"language": "python",
"name": "python3_unstable"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.8"
}
},
"nbformat": 4,
"nbformat_minor": 4
}