---
title: "Parallelism"
author: "Georgiana Mania, Claudia Frauen, Jan Frederik Engels"
---
# Motivation
* We have a serial code and want to make it faster
* Plan of action:
  * Cut the problem into smaller pieces
  * Use independent compute resources for each piece
* Outlook for next week: the individual computing element no longer
  gets much faster, but there are more of them
## This lecture
* Is mostly about parallelism as a concept
* Next week: Complications due to existing hardware
# Our example problem
:::: {.columns}
::: {.column width="50%"}
* 1d Tsunami equation
* Korteweg–De Vries equation - "PDE modeling waves on shallow water surfaces" [Wikipedia](https://en.wikipedia.org/wiki/Korteweg%E2%80%93De_Vries_equation)
* Discretization kept simple; not numerically accurate
:::
::: {.column width="50%"}
FIXME show plot of a soliton
:::
::::
# Our example problem
```c
while (t < 50) {
    //[...]
    for (i = 0; i < npoints; i++) {
        //[...]
        u[i] = u2[i] - (delta_t/delta_x)*lu(h1,u1,i);
        h[i] = h2[i] - (delta_t/delta_x)*lh(h1,u1,i);
    }
}
```
# Decomposing problem domains
## Our problem domain
FIXME
## Other problem domains
:::: {.columns}
::: {.column width="50%"}
<br>
* Climate model (e.g. ICON) horizontal grid decomposition
:::
::: {.column width="50%"}
FIXME missing image: ICON horizontal grid decomposition
::: {.tiny}
adapted from original image by MPI-M
:::
:::
::::
# Introducing OpenMP
* A popular way to parallelize code
* Pragma-based parallelization API
* You annotate your code with parallel regions and the compiler does the rest
```c++
#pragma omp parallel for
for (int i = 0; i < N; ++i)
    a[i] = 2 * i;
```
# OpenMP threads
* OpenMP uses something called threads
* Wait until next week for a definition
:::: {.columns}
::: {.column width="50%"}
<br>
```c++
int N = 8;
#pragma omp parallel for
for (int i = 0; i < N; ++i)
    a[i] = 2 * i;
```
:::
::: {.column width="50%"}
FIXME missing image
:::
::::
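To see the threads at work, one can print which thread handles which iteration. This is a minimal sketch (not part of the course code) using the OpenMP runtime functions `omp_get_thread_num()` and `omp_get_num_threads()`:

```c++
#include <omp.h>
#include <stdio.h>

int main(void) {
    int a[8];
    #pragma omp parallel for
    for (int i = 0; i < 8; ++i) {
        a[i] = 2 * i;
        // each thread reports the iterations it was assigned
        printf("iteration %d (a[i]=%d) done by thread %d of %d\n",
               i, a[i], omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}
```

Compiled with `gcc -fopenmp` and run with e.g. `OMP_NUM_THREADS=2`, each iteration is reported by exactly one thread.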
# Hands-on Session! {background-color=var(--dark-bg-color) .leftalign}
1. Load the GNU compiler on Levante

   ```bash
   module load gcc
   ```

2. Compile and run the serial example

   ```bash
   gcc main.c -o serial.x -lm
   time ./serial.x # use `time` to check the runtime
   ```

3. Compile and run the example using OpenMP

   ```bash
   gcc -fopenmp main.c -o parallel.x -lm
   export OMP_NUM_THREADS=2
   time ./parallel.x
   ```

4. See next slide!
4. See next slide!
# Hands-on Session! {background-color=var(--dark-bg-color) .leftalign}
4. Now compile/run with

   ```bash
   gcc -fopenmp main.c -o parallel.x -lm -DWRITE_DECOMP
   export OMP_NUM_THREADS=4
   time ./parallel.x
   ```

5. What does the additional output mean?
6. Now uncomment/adapt
   * `schedule(static,100)`
   * `schedule(static,10)`

   and interpret the results (see the sketch below).
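For orientation, here is a self-contained sketch of what such a `schedule` clause does; the array, size, and loop body are made-up stand-ins, the real loop lives in the course's `main.c`:

```c++
#include <stdio.h>

int main(void) {
    const int npoints = 1000;          // made-up problem size
    double u[1000];
    // schedule(static,100): chunks of 100 consecutive iterations are
    // handed out to the threads in round-robin order
    #pragma omp parallel for schedule(static, 100)
    for (int i = 0; i < npoints; ++i)
        u[i] = 2.0 * i;                // stand-in for the real update of u[i], h[i]
    printf("u[%d] = %f\n", npoints - 1, u[npoints - 1]);
    return 0;
}
```

With `schedule(static,10)` the chunks are smaller and the iterations are spread more finely across the threads.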
# Scaling
Bake mini-pancake dessert
:::: {.columns}
::: {.column width="25%"}
FIXME missing image (pancake illustration)
:::
::: {.column width="25%" }
:::{.fragment}
FIXME missing image (pancake illustration)
:::
:::
::: {.column width="25%"}
:::{.fragment}
FIXME missing image (pancake illustration)
:::
:::
::: {.column width="25%"}
:::{.fragment}
FIXME missing image (pancake illustration)
:::
:::
::::
:::{.info .tiny}
Images generated by Pradipta Samanta with DALL-E
:::
## Strong vs weak scaling
:::{.smaller}
Starting with 1 pan, we can (perfectly) scale this by using $P$ pans in two ways:
:::
:::{.fragment}
|Parameter / Scaling type |Strong|Weak|
|-------|---|---|
|Resources <br> (e.g. pans) | $P$| $P$ |
|Total workload <br> (e.g. pancake count)| $N$ | $N \times P$ |
|Workload per worker <br> (e.g. pancakes per pan) | $N/P$ | $N$ |
|Total time | $T_1 \times N/P$ | $T_1 \times N$ |
:::
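:::{.fragment .smaller}
A common way to quantify this is the speedup $S_P$ and the parallel efficiency $E_P$,
$$
S_P = \frac{T(1)}{T(P)}, \qquad E_P = \frac{S_P}{P},
$$
where $T(P)$ is the total time using $P$ resources. With the ideal strong-scaling numbers above, $T(1) = T_1 \times N$ and $T(P) = T_1 \times N / P$, so $S_P = P$ and $E_P = 1$.
:::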
<!--
* What happens if one uses more threads, but keep the problem size?
* Strong scaling
* What happens if one uses more threads, but also increases the problem size by
the same factor?
* Weak scaling
-->
# Aggregate parallel results
## What is happening here?
```c++
int a[] = {2, 4, 6};
int sum = 0;
for (int i = 0; i < N; ++i)
    sum = sum + a[i];
```
## What is happening here?
```c++
int a[] = {2, 4, 6};
int sum = 0;
#pragma omp parallel for
for (int i = 0; i < N; ++i)
    sum = sum + a[i];
```
[comment]: # (Can something go wrong?)
## Solution
```c++
int a[] = {2, 4, 6};
int sum = 0;
#pragma omp parallel for reduction(+:sum)
for (int i = 0; i < N; ++i)
    sum = sum + a[i];
```
For other reduction operations, e.g. `max`, see the
[OpenMP documentation](https://www.openmp.org/spec-html/5.1/openmpsu117.html#x152-1720002.21.5)
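As an illustration, a `max` reduction looks analogous; the data below is made up and not part of the course code:

```c++
#include <stdio.h>

int main(void) {
    const int N = 6;
    int a[] = {2, 4, 6, 12, 8, 10};    // made-up data
    int maxval = a[0];
    // each thread keeps a private running maximum; OpenMP combines
    // them with the max operator at the end of the loop
    #pragma omp parallel for reduction(max:maxval)
    for (int i = 0; i < N; ++i)
        if (a[i] > maxval)
            maxval = a[i];
    printf("max = %d\n", maxval);
    return 0;
}
```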
# Doing stuff wrong
## What is going wrong here?
```c++
temp = 0;
#pragma omp parallel for
for (int i = 0; i < N; ++i) {
    temp = 2 * a[i];
    b[i] = temp + 4;
}
```
## Solution
```c++
temp = 0;
#pragma omp parallel for private(temp)
for (int i = 0; i < N; ++i) {
    temp = 2 * a[i];
    b[i] = temp + 4;
}
```
The problem is called a "data race": without `private(temp)`, all threads read and write the same shared `temp` variable.
## Other common errors
* Race conditions
  * The outcome of the program depends on the relative timing of multiple threads.
* Deadlocks
  * Multiple threads wait for each other's resources, so none of them can proceed (see the sketch below).
* Starvation
  * A thread is blocked indefinitely while waiting for a resource.
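A minimal sketch of how a deadlock can arise (illustrative only, not from the course code): two OpenMP sections take the same two locks in opposite order.

```c++
#include <omp.h>

int main(void) {
    omp_lock_t lock_a, lock_b;
    omp_init_lock(&lock_a);
    omp_init_lock(&lock_b);
    // with two or more threads this can hang, depending on timing:
    // each section ends up holding one lock and waiting for the other
    #pragma omp parallel sections
    {
        #pragma omp section
        {
            omp_set_lock(&lock_a);
            omp_set_lock(&lock_b);
            omp_unset_lock(&lock_b);
            omp_unset_lock(&lock_a);
        }
        #pragma omp section
        {
            omp_set_lock(&lock_b);
            omp_set_lock(&lock_a);
            omp_unset_lock(&lock_a);
            omp_unset_lock(&lock_b);
        }
    }
    omp_destroy_lock(&lock_a);
    omp_destroy_lock(&lock_b);
    return 0;
}
```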
# Finally: A definition of parallelism
<!--"Parallel computing is a type of computation in which many calculations or processes are carried out simultaneously."
Wikipedia
-->
"Parallel computing is the simultaneous use of multiple compute resources to solve a computational problem"
:::{.smaller}
-- *Introduction to Parallel Computing Tutorial, LLNL*
:::
## Types of parallelism
* Data-level parallelism
  * what we've been discussing
* Task-level parallelism
  * Example: Atmosphere-ocean coupling
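A toy sketch of task-level parallelism using OpenMP sections; the two functions are invented stand-ins for independent model components, not a real coupled model:

```c++
#include <stdio.h>

// hypothetical stand-ins for two independent model components
void step_atmosphere(void) { printf("atmosphere step\n"); }
void step_ocean(void)      { printf("ocean step\n"); }

int main(void) {
    // different tasks (not different data chunks) run concurrently
    #pragma omp parallel sections
    {
        #pragma omp section
        step_atmosphere();
        #pragma omp section
        step_ocean();
    }
    return 0;
}
```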
## Precondition for parallel execution
<br>
*"Two consecutive instructions or code segments can be executed in parallel if they are **independent**."*
<br>
:::{.fragment .info .smaller}
What does **dependence** mean?
:::
## Code and data dependence {.leftalign}
:::{.fragment .fade-in fragment-index=1}
* Data dependence - the instructions share the same data
:::
:::{.fragment .fade-in-then-semi-out fragment-index=2}
```c++
a = b;
c = a + b; // flow dependence
```
:::
:::{.fragment .fade-in fragment-index=3}
* Control dependence - execution order decided at runtime
:::
:::{.fragment .fade-in-then-semi-out fragment-index=4}
```c++
for (int i = 1; i < n; i++) {
    if (a[i-1] > a[i]) {
        a[i] = a[i] + 1;
    } else {
        a[i] = 0;
    }
}
```
:::
:::{.fragment .fade-in fragment-index=5}
* Resource dependence - share the same resource
:::
:::{.fragment .fade-in-then-semi-out fragment-index=6}
```c++
a = b;
b = 42; // write after read: if reordered, a would not get the old value of b
```
:::
## Bernstein's parallelism conditions {.alignleft}
For data dependence, use Bernstein's conditions: *"the intersection between read-write set, write-read set and write-write set of instructions is null"*.
:::{.fragment}
```c++
c = a + b; // S1
d = a - b; // S2
```
:::
:::{.smaller}
:::: {.columns}
::: {.column width="50%"}
:::{.fragment}
Read and write sets for S1 and S2:
$$
R_1 = \{a,b\} ; W_1 = \{c\} \\
R_2 = \{a,b\} ; W_2 = \{d\} \\
$$
:::
:::
::: {.column width="50%"}
:::{.fragment}
Bernstein's conditions:
$$
R_1 \cap W_2 = \emptyset \\
W_1 \cap R_2 = \emptyset \\
W_1 \cap W_2 = \emptyset
$$
:::
:::
:::
::::
:::{.fragment}
S1 and S2 can be executed in parallel!
:::
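As a code-level sanity check (a sketch, not from the slides), S1 and S2 can be placed in two OpenMP sections and executed concurrently:

```c++
#include <stdio.h>

int main(void) {
    int a = 1, b = 2, c, d;
    // S1 and S2 only read a and b and write to different variables,
    // so they may run concurrently on two threads
    #pragma omp parallel sections
    {
        #pragma omp section
        c = a + b;   // S1
        #pragma omp section
        d = a - b;   // S2
    }
    printf("c = %d, d = %d\n", c, d);
    return 0;
}
```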
:::{.notes}
How about these two? replace a in S2 with c
```c++
c = a + b; // S1
d = c - b; // S2
```
:::
## Bernstein's parallelism conditions {.alignleft}
:::{.fragment .semi-fade-out}
```c++
c = a + b; // S1
d = a - b; // S2
```
:::
:::{.fragment}
Replace `a` with `c` in statement 2
```c++
c = a + b; // S1
d = c - b; // S2
```
:::
:::{.smaller}
:::: {.columns}
::: {.column width="50%"}
:::{.fragment}
Read and write sets for S1 and S2:
$$
R_1 = \{a,b\} ; W_1 = \{c\} \\
R_2 = \{c,b\} ; W_2 = \{d\} \\
$$
:::
:::
::: {.column width="50%"}
:::{.fragment}
Bernstein's conditions:
$$
R_1 \cap W_2 = \emptyset \\
W_1 \cap R_2 = \{c\} \\
W_1 \cap W_2 = \emptyset
$$
:::
:::
:::
::::
:::{.fragment}
S1 and S2 can NOT be executed in parallel!
:::
## Best practices
* Parallelisation should not change the results! Exceptions to be discussed next week!
# Additional reading
* "Computer Architecture - A Quantitative Approach" - J. Hennessy and D. Patterson
* "Introduction to High Performance Computing for Scientists and Engineers" - G. Hager and G. Wellein
* "Introduction to Parallel Computing Tutorial" - [Online](https://hpc.llnl.gov/documentation/tutorials/introduction-parallel-computing-tutorial), Lawrence Livermore National Laboratory