Commit 95476470 authored by Marco Kulüke's avatar Marco Kulüke

Merge branch 'dev_maria' into 'master'

Extract path of selected year without creating and masking boolean column, and some rewording

See merge request mipdata/tutorials-and-use-cases!3
parents 2514c01b 3405f4fb
@@ -6,9 +6,9 @@
"source": [
"# Calculate a climate index in a server hosting all the climate model data \n",
"\n",
"We will show here how to count the annual summer days for a particular geolocation of your choice using the results of a climate model, in particular, the historical and shared socioeconomic pathway (ssp) experiments of the Coupled Model Intercomparison Project [CMIP6](https://pcmdi.llnl.gov/CMIP6/).\n",
"We will show here how to count the annual summer days for a particular geolocation of your choice using the results of a climate model. In particular, we can choose one of the historical or one of the shared socioeconomic pathway (ssp) experiments of the Coupled Model Intercomparison Project [CMIP6](https://pcmdi.llnl.gov/CMIP6/).\n",
"\n",
"This Jupyter notebook runs in the Jupyterhub server of the German Climate Computing Center [DKRZ](https://www.dkrz.de/) which is an [ESGF](https://esgf.llnl.gov/) repository that hosts and maintains 4 Petabytes of CMIP6 data. Please, choose the ... kernel on the right uper corner of this notebook.\n",
"This Jupyter notebook is meant to run in the Jupyterhub server of the German Climate Computing Center [DKRZ](https://www.dkrz.de/), which is an [ESGF](https://esgf.llnl.gov/) repository that hosts 4 petabytes of CMIP6 data. Please choose the Python 3 unstable kernel from the Kernel tab above; it contains all the packages we need here. Running this Jupyter notebook on the DKRZ server outside of the Jupyterhub requires that you create an environment with the required package dependencies yourself. Running this Jupyter notebook on your own machine would also require you to install the necessary packages, and it would fail anyway because you would not have direct access to the data pool.\n",
"\n",
"Thanks to the data and computer scientists Marco Kulüke, Fabian Wachsmann, Regina Kwee-Hinzmann, Caroline Arnold, Felix Stiehler, Maria Moreno, and Stephan Kindermann at DKRZ for their contribution to this notebook."
]
@@ -17,15 +17,15 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"In this Use Case you will learn the following:\n",
"In this use case you will learn the following:\n",
"- How to access a dataset from the DKRZ CMIP6 model data archive\n",
"- How to count the annual number of summer days for a particular geolocation using this model dataset\n",
"- How to visualize the results\n",
"\n",
"\\\n",
"\n",
"You will use:\n",
"- [Intake](https://github.com/intake/intake) for finding the data in the DKRZ catalog\n",
"- [Xarray](http://xarray.pydata.org/en/stable/) for loading and processing the data in the DKRZ Jupyterhub server\n",
"- [Intake](https://github.com/intake/intake) for finding the data in the catalog of the DKRZ archive\n",
"- [Xarray](http://xarray.pydata.org/en/stable/) for loading and processing the data\n",
"- [hvPlot](https://hvplot.holoviz.org/index.html) for visualizing the data in the Jupyter notebook and saving the plots to your local computer"
]
},
@@ -42,14 +42,14 @@
"metadata": {},
"outputs": [],
"source": [
"import intake # a general interface for loading data from an existing catalog\n",
"#import folium # visualization tool\n",
"import xarray as xr # handling labelled multi-dimensional arrays\n",
"from ipywidgets import widgets # to use widgets in the Jupyer Notebook\n",
"#from geopy.geocoders import Nominatim # Python client for several popular geocoding web services\n",
"import numpy as np # fundamental package for scientific computing\n",
"import pandas as pd # data analysis and manipulation tool\n",
"#import hvplot.pandas # visualization tool"
"import numpy as np # fundamental package for scientific computing\n",
"import pandas as pd # data analysis and manipulation tool\n",
"import xarray as xr # handling labelled multi-dimensional arrays\n",
"import intake # to find data in a catalog, this notebook explains how it works\n",
"from ipywidgets import widgets # to use widgets in the Jupyter Notebook\n",
"from geopy.geocoders import Nominatim # Python client for several popular geocoding web services\n",
"import folium # visualization tool for maps\n",
"import hvplot.pandas # visualization tool for interactive plots"
]
},
{
@@ -67,9 +67,11 @@
"metadata": {},
"outputs": [],
"source": [
"# Produce Widgets\n",
"# Produce the widget where we can select which experiment we are interested in\n",
"\n",
"experiments = {'historical':range(1850, 2015), 'ssp585':range(2015, 2101), 'ssp126':range(2015, 2101), 'ssp245':range(2015, 2101), 'ssp119':range(2015, 2101), 'ssp434':range(2015, 2101), 'ssp460':range(2015, 2101)}\n",
"experiments = {'historical':range(1850, 2015), 'ssp585':range(2015, 2101), 'ssp126':range(2015, 2101), \n",
" 'ssp245':range(2015, 2101), 'ssp119':range(2015, 2101), 'ssp434':range(2015, 2101), \n",
" 'ssp460':range(2015, 2101)}\n",
"experiment_box = widgets.Dropdown(options=experiments, description=\"Select experiment: \", disabled=False,)\n",
"display(experiment_box)"
]
@@ -80,6 +82,8 @@
"metadata": {},
"outputs": [],
"source": [
"# Produce the widget where we can select which geolocation and year we are interested in\n",
"\n",
"place_box = widgets.Text(description=\"Enter place:\")\n",
"display(place_box)\n",
"\n",
@@ -93,9 +97,7 @@
"metadata": {},
"source": [
"### 1.1 Find Coordinates of chosen Place\n",
"If ambiguous, the most likely coordinates will be chose\n",
"\\\n",
"e.g. \"Hamburg\" results in \"Hamburg, 20095, Deutschland\", (53.55 North, 10.00 East)"
"If ambiguous, the most likely coordinates will be chosen, e.g. \"Hamburg\" results in \"Hamburg, 20095, Deutschland\", (53.55 North, 10.00 East)"
]
},
{
@@ -104,6 +106,8 @@
"metadata": {},
"outputs": [],
"source": [
"# The Nominatim module gives us the geographical coordinates of the place we selected above\n",
"\n",
"geolocator = Nominatim(user_agent=\"any_agent\")\n",
"location = geolocator.geocode(place_box.value)\n",
"\n",
@@ -124,6 +128,8 @@
"metadata": {},
"outputs": [],
"source": [
"# We use the folium package to plot our selected geolocation on a map\n",
"\n",
"m = folium.Map(location=[location.latitude, location.longitude])\n",
"tooltip = location.latitude, location.longitude\n",
"folium.Marker([location.latitude, location.longitude], tooltip=tooltip).add_to(m)\n",
@@ -142,10 +148,10 @@
"metadata": {},
"source": [
"## 2. Intake Catalog\n",
"Similar to the shopping catalog at your favorite online bookstore, the intake catalog contains information (e.g. model, variables, and time range) about each dataset (the title, author, and number of pages of the book, for instance) that you can access before loading the data (so thanks to the catalog, you do not need to open the book to know the number of pages of the book, for instance).\n",
"Similar to the shopping catalog at your favorite online bookstore, the intake catalog contains information (e.g. model, variables, and time range) about each dataset (the title, author, and number of pages of the book, for instance) that you can access before loading the data. It means that, thanks to the catalog, you can find where the book is just by using some keywords, and you do not need to hold it in your hand to know its number of pages, for instance.\n",
"\n",
"### 2.1 Load the Intake Catalog\n",
"We load the catalog descriptor with the intake package. The catalog is updated daily."
"We load the catalog descriptor with the intake package. The catalog is updated daily. The catalog descriptor is created by the DKRZ developers who manage the catalog; you do not need to worry much about it, knowing where it is and loading it is enough:"
]
},
{
@@ -157,7 +163,7 @@
"# Path to catalog descriptor on the DKRZ server\n",
"col_url = \"/work/ik1017/Catalogs/mistral-cmip6.json\"\n",
"\n",
"# Open the catalog with the intake package and name it \"col\" as short for collection\n",
"# Open the catalog with the intake package and name it \"col\" as short for \"collection\"\n",
"col = intake.open_esm_datastore(col_url)"
]
},
@@ -180,9 +186,7 @@
"metadata": {},
"source": [
"### 2.2 Browse the Intake Catalog\n",
"In this example we chose the Max-Planck Earth System Model in High Resolution Mode (\"MPI-ESM1-2-HR\") and the maximum temperature near surface (\"tasmax\") as variable.\n",
"\\\n",
"CMIP6 comprises several kind of experiments. Each experiment has various simulation members. More information can be found via the [CMIP6 Model and Experiment Documentation](https://pcmdi.llnl.gov/CMIP6/Guide/dataUsers.html#5-model-and-experiment-documentation)."
"In this example we chose the Max-Planck Earth System Model in High Resolution Mode (\"MPI-ESM1-2-HR\") and the maximum temperature near surface (\"tasmax\") as variable. We also choose an experiment. CMIP6 comprises several kinds of experiments, and each experiment has various simulation members. You can find more information in the [CMIP6 Model and Experiment Documentation](https://pcmdi.llnl.gov/CMIP6/Guide/dataUsers.html#5-model-and-experiment-documentation)."
]
},
{
@@ -191,22 +195,24 @@
"metadata": {},
"outputs": [],
"source": [
"# store the name of the model we chose in a variable named \"climate_model\"\n",
"# Store the name of the model we chose in a variable named \"climate_model\"\n",
"\n",
"climate_model = \"MPI-ESM1-2-LR\" # here we choose the Max-Planck Institute's Earth System Model\n",
"\n",
"# this is how we tell intake what data we want\n",
"# This is how we tell intake what data we want\n",
"\n",
"query = dict(\n",
" source_id = climate_model, # the model \n",
" variable_id = \"tasmax\", # temperature at surface, maximum\n",
" table_id = \"day\", # daily maximum\n",
" experiment_id = experiment_box.label, # what we selected in the drop down menu,for instance, historical 850-2014\n",
" member_id = \"r10i1p1f1\", # \"r\" realization, \"i\" initialization, \"p\" physics, \"f\" forcing\n",
" source_id = climate_model, # the model \n",
" variable_id = \"tasmax\", # temperature at surface, maximum\n",
" table_id = \"day\", # daily maximum\n",
"    experiment_id = experiment_box.label,   # what we selected in the drop-down menu, e.g. SSP2-4.5 for 2015-2100\n",
" member_id = \"r10i1p1f1\", # \"r\" realization, \"i\" initialization, \"p\" physics, \"f\" forcing\n",
")\n",
"\n",
"# intake looks for the query we just defined in the catalog of the CMIP6 data pool at DKRZ\n",
"# Intake looks for the query we just defined in the catalog of the CMIP6 data pool at DKRZ\n",
"cat = col.search(**query)\n",
"\n",
"# show query results\n",
"# Show query results\n",
"cat.df"
]
},
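The variant label passed as `member_id` (e.g. "r10i1p1f1") packs the four indices described in the comment above into one string. As a hypothetical illustration (this helper is not part of the notebook), it can be split with a regular expression:

```python
import re

def parse_member_id(member_id):
    """Split a CMIP6 variant label like 'r10i1p1f1' into its four indices:
    realization (r), initialization (i), physics (p), forcing (f)."""
    match = re.fullmatch(r"r(\d+)i(\d+)p(\d+)f(\d+)", member_id)
    if match is None:
        raise ValueError(f"not a valid variant label: {member_id!r}")
    r, i, p, f = (int(g) for g in match.groups())
    return {"realization": r, "initialization": i, "physics": p, "forcing": f}

print(parse_member_id("r10i1p1f1"))  # → {'realization': 10, 'initialization': 1, 'physics': 1, 'forcing': 1}
```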
@@ -214,7 +220,10 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The result of the query are like the list of results you get when you search for articles in the internet by writing keywords in your search engine (duck duck go, ecosia, google,...). Thanks to intake, we did not need to know the path of each dataset, just selecting some keywords (the model name, the variable,...) was enough to obtain the results. If advance users are still interested in the location of the data inside the DKRZ archive, intake also provides the path and the OpenDAP URL (see the last columns above). Now we will find which file in the dataset contains our selected year so in the next section we can just load that specific file."
"The results of the query are like the list of results you get when you search for articles on the internet by typing keywords into your search engine (DuckDuckGo, Ecosia, Google,...). Thanks to the intake package, we did not need to know the path of each dataset; just selecting some keywords (the model name, the variable,...) was enough to obtain the results. If advanced users are still interested in the location of the data inside the DKRZ archive, intake also provides the path and the OpenDAP URL (see the last columns above). \n",
"\n",
"\n",
"Now we will find which file in the dataset contains our selected year so in the next section we can just load that specific file and not the whole dataset."
]
},
{
@@ -230,17 +239,16 @@
"metadata": {},
"outputs": [],
"source": [
"# copying the cat.df dataframe to a new dataframe, thus further modifications do not affect the original cat.df \n",
"# Create a copy of cat.df, thus further modifications do not affect it \n",
"query_result_df = cat.df.copy() # new dataframe to play with\n",
"\n",
"# each dataset contains many files, extract the initial and final year of each file \n",
"# Each dataset contains many files, extract the initial and final year of each file \n",
"query_result_df[\"start_year\"] = query_result_df[\"time_range\"].str[0:4].astype(int) # add column with start year\n",
"query_result_df[\"end_year\"] = query_result_df[\"time_range\"].str[9:13].astype(int) # add column with end year\n",
"\n",
"# delete the time range column\n",
"query_result_df.drop(columns=[\"time_range\"], inplace = True) # if \"inplace\" is False, .drop() creates a new df\n",
"\n",
"query_result_df.head()"
"# Delete the time range column\n",
"query_result_df.drop(columns=[\"time_range\"], inplace = True) # with \"inplace = False\", .drop() would return a new dataframe and leave this one unchanged\n",
"query_result_df.iloc[0:3]"
]
},
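The string slicing above relies on the CMIP6 filename convention, where `time_range` looks like "YYYYMMDD-YYYYMMDD". A minimal self-contained sketch on a toy dataframe (the column values are made up, not real catalog entries):

```python
import pandas as pd

# Toy stand-in for cat.df with a CMIP6-style time_range column
df = pd.DataFrame({"time_range": ["20150101-20191231", "20200101-20241231"]})

df["start_year"] = df["time_range"].str[0:4].astype(int)   # first 4 characters: start year
df["end_year"] = df["time_range"].str[9:13].astype(int)    # characters 9-12 (after the dash): end year
print(df[["start_year", "end_year"]])
```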
{
@@ -249,20 +257,14 @@
"metadata": {},
"outputs": [],
"source": [
"# create a column labelling the year selection as True or False\n",
"# TO DO: is there a non boolean way that is better? smth like query_result_df[query_result_df['star_year'] == year_box_value]?\n",
"query_result_df[\"selection\"] = (year_box.value >= query_result_df[\"start_year\"]) & (\n",
" year_box.value <= query_result_df[\"end_year\"]\n",
")\n",
"\n",
"selected_path_index = query_result_df.loc[query_result_df[\"selection\"] == True][\n",
" \"path\"\n",
"].index[0]\n",
"# Select the file that contains the year we selected in the drop down menu above, e.g. 2015\n",
"selected_file = query_result_df[(year_box.value >= query_result_df[\"start_year\"]) & (\n",
" year_box.value <= query_result_df[\"end_year\"])]\n",
"\n",
"# select the rows with True in the column \"selection\"\n",
"selected_path = query_result_df[\"path\"][selected_path_index]\n",
"# Path of the file that contains the selected year \n",
"selected_path = selected_file[\"path\"].values[0] \n",
"\n",
"# show path for selected year\n",
"# Show the path of the file that contains the selected year\n",
"selected_path"
]
},
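The boolean mask keeps exactly the row whose [start_year, end_year] interval contains the chosen year. The same selection on a toy dataframe (paths and years invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "path": ["file_2010-2014.nc", "file_2015-2019.nc", "file_2020-2024.nc"],  # hypothetical paths
    "start_year": [2010, 2015, 2020],
    "end_year": [2014, 2019, 2024],
})
year = 2017  # stand-in for year_box.value

# Keep the row(s) whose year interval contains the chosen year
selected = df[(year >= df["start_year"]) & (year <= df["end_year"])]
print(selected["path"].values[0])  # → file_2015-2019.nc
```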
@@ -398,7 +400,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"The definition of a summer day varies from region to region. According to the [German Weather Service](https://www.dwd.de/EN/ourservices/germanclimateatlas/explanations/elements/_functions/faqkarussel/sommertage.html), \"a summer day is a day on which the maximum air temperature is at least 25.0°C\". Depending on the place you selected, you might want to apply a different threshold. "
"The definition of a summer day varies from region to region. According to the [German Weather Service](https://www.dwd.de/EN/ourservices/germanclimateatlas/explanations/elements/_functions/faqkarussel/sommertage.html), \"a summer day is a day on which the maximum air temperature is at least 25.0°C\". Depending on the place you selected, you might want to apply a different threshold to calculate the summer days index. "
]
},
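With the threshold fixed, the counting step reduces to a comparison and a size count, as the next cells do on the model data. A sketch on synthetic daily maxima in Kelvin (the values are invented):

```python
import numpy as np

tasmax_k = np.array([297.0, 299.3, 301.2, 295.5, 300.1])  # synthetic daily maxima in Kelvin
tasmax_c = tasmax_k - 273.15                               # convert Kelvin to °C

summer_days = tasmax_c[tasmax_c > 25].size                 # count days above the 25 °C threshold
print(summer_days)  # → 3
```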
{
@@ -408,14 +410,17 @@
"outputs": [],
"source": [
"tasmax_year_place_xr = tasmax_year_xr[:, yloc, xloc] - 273.15 # Convert Kelvin to °C\n",
"tasmax_year_place_df = pd.DataFrame(index = tasmax_year_place_xr['time'].values, columns = ['Temperature', 'Summer Day Threshold']) # Create Pandas Series\n",
"tasmax_year_place_df = pd.DataFrame(index = tasmax_year_place_xr['time'].values, \n",
" columns = ['Temperature', 'Summer Day Threshold']) # create the dataframe\n",
"\n",
"tasmax_year_place_df.loc[:, 'Model Temperature'] = tasmax_year_place_xr.values # Insert model data into Pandas Series\n",
"tasmax_year_place_df.loc[:, 'Summer Day Threshold'] = 25 # Insert threshold into Pandas series\n",
"tasmax_year_place_df.loc[:, 'Model Temperature'] = tasmax_year_place_xr.values # insert model data into the dataframe\n",
"tasmax_year_place_df.loc[:, 'Summer Day Threshold'] = 25 # insert the threshold into the dataframe\n",
"\n",
"# Plot data and define title and legend\n",
"tasmax_year_place_df.hvplot.line(y=['Model Temperature', 'Summer Day Threshold'], \n",
" value_label='Temperature in °C', legend='bottom', title='Daily maximum Temperature near Surface for ' +place_box.value, height=500, width=620)"
" value_label='Temperature in °C', legend='bottom', \n",
" title='Daily maximum Temperature near Surface for '+place_box.value, \n",
" height=500, width=620)"
]
},
{
@@ -435,7 +440,9 @@
"no_summer_days_model = tasmax_year_place_xr[tasmax_year_place_xr > 25].size # count the number of summer days\n",
"\n",
"# Print results in a sentence\n",
"print(\"According to the German Weather Service definition, in the \" +experiment_box.label +\" experiment the \" +climate_model +\" model shows \" +str(no_summer_days_model) +\" summer days for \" +str(place_box.value) + \" in \" + str(year_box.value) +\".\")"
"print(\"According to the German Weather Service definition, in the \" +experiment_box.label +\" experiment the \" \n",
" +climate_model +\" model shows \" +str(no_summer_days_model) +\" summer days for \" +str(place_box.value) \n",
" + \" in \" + str(year_box.value) +\".\")"
]
},
{