"total_delayed = delayed(sum)(output) #also delay sum because it is a function\n",
"%time total_delayed.compute()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"total_delayed.visualize()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"result_delayed.compute()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"client.close()"
]
},
{
...
...
@@ -139,13 +230,6 @@
"- sampling\n",
"- profile your code"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
...
...
%% Cell type:markdown id: tags:
# How to use Dask for Climate Data Processing?
This tutorial builds on the skills you learned in the summer days tutorial (provide link).
1. What is Dask?
    1. Parallelism
    1. Overview
    1. `dask.delayed`
2. Which data can be processed with Dask?
3. Common mistakes
%% Cell type:markdown id: tags:
## 1. What is Dask?
Dask is an open-source library for parallel computing written in Python. It is used to process larger-than-memory datasets (e.g. large climate datasets). The full documentation is available at https://docs.dask.org
%% Cell type:markdown id: tags:
### 1A. Parallelism
- use Maria's metaphor
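
As a rough illustration of the idea, here is a minimal sketch that runs the same independent tasks first one after another and then side by side. It deliberately uses the standard library's `concurrent.futures` rather than Dask, and `slow_square` is a hypothetical stand-in for any expensive, independent computation.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_square(x):
    """Stand-in for an expensive, independent task (hypothetical helper)."""
    time.sleep(0.1)
    return x * x


inputs = [1, 2, 3, 4]

# Sequential: tasks run one after another, so the waiting times add up.
start = time.perf_counter()
sequential = [slow_square(x) for x in inputs]
sequential_time = time.perf_counter() - start

# Parallel: four threads each take one task, so the waiting times overlap.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel = list(pool.map(slow_square, inputs))
parallel_time = time.perf_counter() - start

print(f"sequential: {sequential_time:.2f} s, parallel: {parallel_time:.2f} s")
```

Both runs return the same values; only the elapsed time differs, because the tasks do not depend on each other.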
%% Cell type:markdown id: tags:
### 1B. Dask Overview
%% Cell type:markdown id: tags:
### 1C. `dask.delayed`
%% Cell type:markdown id: tags:
Let us start with a simple example:
%% Cell type:code id: tags:
``` python
from dask.distributed import Client
client = Client(n_workers=4)
client
```
%% Cell type:code id: tags:
``` python
from dask import delayed
import time
```
%% Cell type:code id: tags:
``` python
@delayed
def add(x, y):
    result = x + y
    return result
```
%% Cell type:code id: tags:
``` python
result = 0
for i in range(0, 10):
    result = add(result, i)
output = []
for i in range(0, 11):
    output.append(i)
result_delayed = delayed(sum)(output)
```
%% Cell type:code id: tags:
``` python
result_delayed.visualize()
```
%% Cell type:markdown id: tags:
Not Parallel
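
Before running the sequential version, it helps to estimate how long it should take. The sketch below only restates the numbers used in the next cell (a 0.5 s sleep per call, three calls per item, ten items); the variable names are illustrative and not part of the tutorial code.

```python
# Values taken from the sequential cell; the names are illustrative only.
sleep_per_call = 0.5   # seconds that each of inc, double and add sleeps
calls_per_item = 3     # inc, double and add are each called once per item
n_items = 10           # len(data)

# Every call waits for the previous one, so the sleeps simply add up.
expected_serial_seconds = sleep_per_call * calls_per_item * n_items
print(expected_serial_seconds)
```

The wall time reported by `%%time` should land close to this estimate, since the arithmetic itself is negligible next to the sleeps.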
%% Cell type:code id: tags:
``` python
%%time
def inc(x):
    time.sleep(0.5)
    return x + 1

def double(x):
    time.sleep(0.5)
    return 2 * x

def add(x, y):
    time.sleep(0.5)
    return x + y

data = list(range(10))
output = []
for x in data:
    a = inc(x)
    b = double(x)
    c = add(a, b)
    output.append(c)
total = sum(output)
total
```
%% Cell type:markdown id: tags:
Parallel
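
How much faster can the delayed version be? A rough lower bound follows from two limits: the total amount of sleeping divided by the number of threads, and the longest dependency chain (`inc` and `double` can run side by side, but `add` has to wait for both). The sketch below assumes a hypothetical `n_threads` parameter; `Client(n_workers=4)` may start more than one thread per worker, so the real figure depends on the machine. Scheduler overhead and the final `sum` are ignored.

```python
def parallel_lower_bound(n_threads, n_items=10, sleep_per_call=0.5):
    """Rough lower bound in seconds for the delayed graph (a sketch)."""
    # Total sleeping time: inc, double and add once per item.
    total_work = 3 * n_items * sleep_per_call
    # Longest chain: inc/double run side by side, then add.
    critical_path = 2 * sleep_per_call
    return max(critical_path, total_work / n_threads)


for n in (1, 4, 8):
    print(n, parallel_lower_bound(n))
```

With a single thread the bound equals the sequential estimate; adding threads helps until the dependency chain becomes the limit.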
%% Cell type:code id: tags:
``` python
result.visualize()
@delayed
def inc(x):
    time.sleep(0.5)
    return x + 1

@delayed
def double(x):
    time.sleep(0.5)
    return 2 * x

@delayed
def add(x, y):
    time.sleep(0.5)
    return x + y

data = list(range(10))
output = []
for x in data:
    a = inc(x)
    b = double(x)
    c = add(a, b)
    output.append(c)
total_delayed = delayed(sum)(output)  # also delay sum because it is a function
%time total_delayed.compute()
```
%% Output
<IPython.core.display.Image object>
%% Cell type:code id: tags:
``` python
total_delayed.visualize()
```
%% Cell type:code id: tags:
``` python
result.compute()
result_delayed.compute()
```
%% Output
45
%% Cell type:code id: tags:
``` python
client.close()
```
%% Cell type:markdown id: tags:
# <font color='darkgreen'>Take home message</font>
Parallelism brings extra complexity and is often not necessary for your problem. Before using Dask, you may want to try alternatives: