Commit a36bfa35 authored by Marco Kulüke's avatar Marco Kulüke

dask push

parent 26ab82fc
%% Cell type:markdown id: tags:
# How to use Dask for Climate Data Processing?
This tutorial builds on the skills you learned in the summer days tutorial (provide link).
1. What is Dask?
    1. Parallelism
    1. Overview
    1. `dask.delayed`
2. Which data can be processed with dask?
3. Common mistakes
%% Cell type:markdown id: tags:
## 1. What is Dask?
Dask is an open-source library for parallel computing, written in Python. It is used to process larger-than-memory datasets (e.g. large climate datasets). The full documentation is available at https://docs.dask.org
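%% Cell type:markdown id: tags:
As a minimal sketch of what processing data in pieces looks like, the cell below builds a chunked `dask.array`. The array shape and chunk size are illustrative assumptions, not values from this tutorial. The array is described as a grid of chunks, and the sum is only evaluated when `compute()` is called.
%% Cell type:code id: tags:
```python
import dask.array as da

# Illustrative sizes (assumed): a 4000 x 4000 array split into 1000 x 1000 chunks
x = da.ones((4000, 4000), chunks=(1000, 1000))

# Nothing has been computed yet; x only describes a 4 x 4 grid of chunks
n_blocks = x.numblocks

# compute() runs the chunk-wise sums and combines them into one number
total = x.sum().compute()
print(n_blocks, total)
```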
%% Cell type:markdown id: tags:
### 1A. Parallelism
- use Maria's metaphor
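%% Cell type:markdown id: tags:
The basic idea of parallelism can be sketched without Dask, using only Python's standard library: independent tasks are handed to a pool of workers instead of being run one after another. The function `slow_square` and the pool size of 4 are made up for illustration.
%% Cell type:code id: tags:
```python
import time
from concurrent.futures import ThreadPoolExecutor

def slow_square(x):
    # Hypothetical stand-in for an independent, slow task
    time.sleep(0.1)
    return x * x

data = list(range(8))

# Sequential: the tasks run one after another
sequential = [slow_square(x) for x in data]

# Parallel: up to 4 tasks run at the same time; map keeps the input order
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel = list(pool.map(slow_square, data))

print(sequential == parallel)
```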
%% Cell type:markdown id: tags:
### 1B. Dask Overview
%% Cell type:markdown id: tags:
### 1C. `dask.delayed`
%% Cell type:markdown id: tags:
Let us start with an easy example. First we start a local Dask client with four workers:
%% Cell type:code id: tags:
``` python
from dask.distributed import Client
client = Client(n_workers=4)
client
```
%% Cell type:code id: tags:
``` python
from dask import delayed
import time
```
%% Cell type:code id: tags:
``` python
@delayed
def add(x, y):
    result = x + y
    return result
```
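%% Cell type:markdown id: tags:
A small sketch of what the decorator changes: calling a delayed function does not run it, but returns a lazy object that is evaluated on `compute()`. The `scheduler="synchronous"` argument is chosen here only so that the sketch does not depend on a running `Client`; it is not part of the tutorial's setup.
%% Cell type:code id: tags:
```python
from dask import delayed

@delayed
def add(x, y):
    return x + y

# Calling the decorated function only records the task; no addition happens yet
lazy = add(1, 2)
print(lazy)

# compute() executes the recorded task; the synchronous scheduler
# (an assumption for this sketch) runs it in the current process
value = lazy.compute(scheduler="synchronous")
print(value)
```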
%% Cell type:code id: tags:
``` python
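# Chain of delayed additions: each step depends on the previous one
result = 0
for i in range(0, 10):
    result = add(result, i)

# Alternative: collect the inputs first, then delay a single call to sum
output = []
for i in range(0, 11):
    output.append(i)
result_delayed = delayed(sum)(output)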
```
%% Cell type:code id: tags:
``` python
result_delayed.visualize()
```
%% Cell type:markdown id: tags:
Not Parallel
%% Cell type:code id: tags:
``` python
%%time
def inc(x):
    time.sleep(0.5)
    return x + 1

def double(x):
    time.sleep(0.5)
    return 2 * x

def add(x, y):
    time.sleep(0.5)
    return x + y

data = list(range(10))
output = []
for x in data:
    a = inc(x)
    b = double(x)
    c = add(a, b)
    output.append(c)
total = sum(output)
total
```
%% Cell type:markdown id: tags:
Parallel
%% Cell type:code id: tags:
``` python
result.visualize()

@delayed
def inc(x):
    time.sleep(0.5)
    return x + 1

@delayed
def double(x):
    time.sleep(0.5)
    return 2 * x

@delayed
def add(x, y):
    time.sleep(0.5)
    return x + y

data = list(range(10))
output = []
for x in data:
    a = inc(x)
    b = double(x)
    c = add(a, b)
    output.append(c)
total_delayed = delayed(sum)(output)  # sum is a function too, so it can be delayed as well
%time total_delayed.compute()
```
%% Output
<IPython.core.display.Image object>
%% Cell type:code id: tags:
``` python
total_delayed.visualize()
```
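%% Cell type:markdown id: tags:
When several delayed results are needed, `dask.compute` can evaluate them in a single call, so that a task shared by both graphs is only run once. The following is a minimal sketch; the functions `prepare`, `scale` and `shift` are hypothetical, and the synchronous scheduler is used only to keep the sketch independent of a running `Client`.
%% Cell type:code id: tags:
```python
import dask
from dask import delayed

@delayed
def prepare(x):
    # Hypothetical shared preprocessing step
    return x + 1

@delayed
def scale(x):
    return 2 * x

@delayed
def shift(x):
    return x + 10

shared = prepare(1)      # used by both results below
a = scale(shared)
b = shift(shared)

# One call computes both results; the shared task appears once in the merged graph
a_val, b_val = dask.compute(a, b, scheduler="synchronous")
print(a_val, b_val)
```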
%% Cell type:code id: tags:
``` python
result.compute()
result_delayed.compute()
```
%% Output
45
%% Cell type:code id: tags:
``` python
client.close()
```
%% Cell type:markdown id: tags:
# <font color='darkgreen'>Take home message</font>
Parallelism brings extra complexity, and often it is not necessary for your problem. Before using Dask, you may want to try alternatives:
- use better algorithms or data structures
- use better file formats
- use compiled code
- work on a sample of the data
- profile your code
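%% Cell type:markdown id: tags:
For the last point, Python's built-in `cProfile` module is one way to find out where the time goes before reaching for parallelism. The sketch below profiles a made-up function, `slow_sum`, which stands in for your own code.
%% Cell type:code id: tags:
```python
import cProfile
import io
import pstats

def slow_sum(n):
    # Hypothetical hot spot: a Python-level loop
    total = 0
    for i in range(n):
        total += i * i
    return total

profiler = cProfile.Profile()
profiler.enable()
value = slow_sum(200_000)
profiler.disable()

# Collect the most expensive calls, sorted by cumulative time
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print(report)
```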
%% Cell type:code id: tags:
``` python
```