# The ICON data analysis pipeline

This repository contains a collection of *python*-based modules to set up an automatic pipeline for distributed data processing. The pipeline consists of different
steps for data processing (reduction), data visualisation, and uploading the content to a swift cloud object storage.

---
## 1. Pipeline design

The pipeline consists of **6 steps**, some of which are processed in parallel. Currently the following parts are implemented:

1. Checking *swift cloud storage* login credentials:
    * This step checks the validity of the login token for the swift cloud storage.
2. Cloning the frontend repository:
    * The final plots will be displayed on the swift cloud storage with the help of a JavaScript-based web interface. The repository containing the JavaScript frontend will be cloned.
3. Processing single level data:
    * Single level data (2D data) will be remapped, time averages will be applied, and pre-defined variables like the top-of-the-atmosphere net radiation budget will be calculated.
4. Processing multi level data:
    * Multi level data (3D data) will be remapped, and time averages and interpolation to iso-z levels will be applied.
5. Creating the visualisation:
    * The output of steps 3 and 4 will be visualised.
6. Uploading the plots to the swift cloud container.

### 1.2 Adjusting pipeline setup
The pipeline tasks can be adjusted under the ```steps``` entry in the *toml* configuration. For example, ```steps = ['run_2d_reduction', 'run_3d_reduction', 'run_notebook']``` will process the 2D and 3D data as well as apply the visualisation, whereas ```steps = ['run_2d_reduction']``` will only process the 2D data. For more information on the configuration refer to chapter **3** of this README; a short sketch of the entry is given below.
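
A minimal sketch of such an entry, assuming the layout of the example `config.toml` shipped with the repository (the surrounding section and all other keys are omitted here):

```toml
# run the 2D and 3D reduction and create the visualisation
steps = ['run_2d_reduction', 'run_3d_reduction', 'run_notebook']

# alternatively, restrict the pipeline to the 2D reduction only
# steps = ['run_2d_reduction']
```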


See also the figure below for the setup.

| ![](FlowChart.png) |
|:--:| 
| *Pipeline Flow Chart:* Steps of the pipeline. Red circles show steps that make use of distributed HPC resources on mistral. |
---
## 2. Installation & Deployment
There are multiple ways to install the data analysis repository:

#### 2.1 Installing the repository with pip (preferred)
This method is the simplest solution if you do not intend to change the code of the repository but want to install the pipeline directly from gitlab.dkrz.de using pip:
```bash
$: python3 -m pip install --user git+https://gitlab.dkrz.de/m300765/da_pipeline.git
```
Since it's not the default on Mistral yet, make sure that you have loaded a suitable `python3` environment before invoking the pip command. Something like this should do:
```bash
$: module load python3
```
Note that the console script, that is the `da-pipeline` command, will always be installed as `~/.local/bin/da-pipeline`. Hence you might want to add `~/.local/bin` to your `PATH` environment variable. If not already done, add the following line to the `.profile` file in your home directory:
```bash
export PATH=$PATH:$HOME/.local/bin
```
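
After adding this line (and starting a new shell or re-sourcing your `.profile`) you can check that the console script is picked up, for example:
```bash
$: da-pipeline --help
```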
#### 2.2 Creating a new miniconda environment
To install the pipeline program via conda you'll have to clone the repository first:
```bash
$: git clone https://gitlab.dkrz.de/m300765/da_pipeline.git
```
The repository contains a ```Makefile``` that defines steps to create a new *miniconda* environment and install the repository along with all required libraries. For a fresh installation simply type:
```bash
$: make deploy
```
This will create a new conda environment in the `env` directory of the repository. You can also choose another install location for the conda environment by setting the ```PATH_PREFIX``` environment variable, for example:
```bash
$: PATH_PREFIX=$HOME/da_pipeline make deploy
```
will create a new conda environment in `$HOME/da_pipeline/env`. 
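
How you activate the environment afterwards depends on how the Makefile sets it up; assuming it places a self-contained conda installation under the chosen prefix, something along these lines should work (the paths are the defaults from the examples above and may differ on your system):
```bash
# environment created in the repository (default)
$: source env/bin/activate
# or, for the custom PATH_PREFIX example above
$: source $HOME/da_pipeline/env/bin/activate
$: da-pipeline --help
```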

--- 
## 3. Setting up and running the pipeline

The pipeline is configured with the help of a `toml` configuration file. The `toml` format was chosen over more widely used formats like `json` or `yaml` mainly due to its simplicity and readability. For more information on `toml` visit https://toml.io. There are four main parts in the configuration, sketched after the list below:

* General: configuration of a general nature
* Reduction: configuration for the data processing part of the pipeline (steps 3 and 4)
* Visualisation: configuration for the data visualisation
* Swift: configuration for uploading the data visualisation content to the swift object storage
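
A hypothetical skeleton illustrating this structure (the section names and comments here are only placeholders; the authoritative layout is the example `config.toml` linked below):

```toml
# placeholder layout -- consult the example config.toml for the real keys
[General]        # general settings, e.g. which pipeline steps to run
[Reduction]      # settings for the 2D/3D data processing (steps 3 and 4)
[Visualisation]  # settings for the plotting notebook (step 5)
[Swift]          # account/container information for the upload (steps 1 and 6)
```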

An example configuration can be found in [config.toml](https://gitlab.dkrz.de/m300765/da_pipeline/-/blob/master/config.toml) in the main repository, or it can be downloaded directly via this [link to the dkrz gitlab](https://gitlab.dkrz.de/m300765/da_pipeline/-/raw/master/config.toml?inline=false).

The pipeline can be applied using the ```da-pipeline``` command:
```bash
$: da-pipeline --help
usage: run_pipeline [-h] [-a ADDRESS ADDRESS ADDRESS] [-nb NOTE_BOOK] [--begin BEGIN] [--no_slurm] config

Apply data pipeline.

positional arguments:
  config                Toml file containing the pipeline configuration

optional arguments:
  -h, --help            show this help message and exit
  -a ADDRESS ADDRESS ADDRESS, --address ADDRESS ADDRESS ADDRESS
                        Give a tcp address to a distributed scheduler (default: (None, None, None))
  -nb NOTE_BOOK, --note-book NOTE_BOOK, --notebook NOTE_BOOK
                        Chose notebook for visualization (default: ~/workspace/da_pipeline/da_pipeline/scripts/PlotData.ipynb)
  --begin BEGIN         Choose begin time of the job, only applicable with slurm jobs (default: now)
  --no_slurm, --no-slurm, --noslurm
                        Do not submit the pipeline jobs via slurm, rather start a background job (default: False)

```

> **_Note:_** The pipeline will create three scheduler clients for distributed data processing. You can set up these schedulers beforehand using `dask distributed` and inform the pipeline to use them rather than creating new ones. This can be useful for debugging. Read more on dask distributed schedulers here: https://jobqueue.dask.org/en/latest/index.html.
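
A minimal sketch of this workflow, assuming the three values of the `-a/--address` option correspond to the three scheduler clients and that you start the schedulers yourself with the `dask-scheduler` command line tool that ships with `dask.distributed`:
```bash
# start three schedulers, e.g. inside an interactive session on mistral
$: dask-scheduler --port 8786 &
$: dask-scheduler --port 8787 &
$: dask-scheduler --port 8788 &
# point the pipeline at them instead of letting it create its own
$: da-pipeline -a tcp://$(hostname):8786 tcp://$(hostname):8787 tcp://$(hostname):8788 config.toml
```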

In most cases it is enough to configure the `config.toml` file and start the pipeline:
```bash
$: da-pipeline config.toml
Pipeline was set up in the back ground,
check output in /home/mpim/m300765/da_pipeline/output/dpp0014_pipeline.out

```
You might be prompted to create a new token for the swift cloud storage. In this case just type your password for the displayed dkrz account.

The pipeline process itself will be executed in the background. You can observe the pipeline status, which is written to the output file, for example:
```bash
tail -f /home/mpim/m300765/da_pipeline/output/dpp0014_pipeline.out
```
Since the process is running in the background, you can log out and check the status later. After the pipeline has run successfully you can inspect its output in the swift browser. The URL will be displayed in the output file (`/home/mpim/m300765/da_pipeline/output/dpp0014_pipeline.out` in the above example).

---
## 4. Adjusting the visualisation
The visualisation is defined by a [jupyter notebook](http://jupyter.org), which is applied in batch mode using ```papermill```. The original notebook is located in [da_pipeline/scripts/PlotData.ipynb](https://gitlab.dkrz.de/m300765/da_pipeline/-/blob/master/da_pipeline/scripts/PlotData.ipynb). You can adjust this notebook to change the plotting style, add new plots, etc. You can always open the notebook that has been auto-generated by papermill and re-run parts of it later.
 
> **_Note:_** The [papermill library](https://papermill.readthedocs.io/en/latest/usage-parameterize.html) that applies the notebook in batch mode will look for notebook cells tagged with ```injected-parameters``` and replace the ```sim_config_file``` and ```scheduler_address``` variables. Hence, if you're planning to change the notebook or run a whole new one, you should design its content so that it can be configured with these two parameters. Also, the output filenames of the visualisation files should comply with what is expected by the JavaScript frontend (https://gitlab.dkrz.de/m300765/da_pipeline_frontend).
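
If you want to test a modified notebook outside the pipeline, you can run it through papermill by hand and inject the two parameters yourself. A minimal sketch (the parameter values and the output path are placeholders, not the values the pipeline would use):
```bash
$: papermill da_pipeline/scripts/PlotData.ipynb /tmp/PlotData_out.ipynb \
     -p sim_config_file /path/to/sim_config_file \
     -p scheduler_address tcp://localhost:8786
```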
 
 