---
title: "Parallelism"
author: "Georgiana Mania, Claudia Frauen, Jan Frederik Engels"
---

# Motivation

* We have a serial code and want to make it faster
* Plan of action:
  * Cut problem into smaller pieces
  * Use independent compute resources for each piece
* Outlook for next week: individual compute elements no longer get much
  faster, but there are more of them



## This lecture

* Is mostly about parallelism as a concept
* Next week: Complications due to existing hardware


# Our example problem

:::: {.columns}

::: {.column width="50%"}
* 1d Tsunami equation
* Korteweg–De Vries equation - "PDE modeling waves on shallow water surfaces" [Wikipedia](https://en.wikipedia.org/wiki/Korteweg%E2%80%93De_Vries_equation)
* Discretization not numerically accurate
:::

::: {.column width="50%"}
FIXME show plot of a soliton
:::

::::

# Our example problem
```c
while (t < 50) {
    // [...]
    for (i = 0; i < npoints; i++) {
        // [...]
        u[i] = u2[i] - (delta_t/delta_x)*lu(h1, u1, i);
        h[i] = h2[i] - (delta_t/delta_x)*lh(h1, u1, i);
    }
}
```

# Decomposing problem domains
## Our problem domain
FIXME

## Other problem domains

:::: {.columns}

::: {.column width="50%"}

<br>

* Climate model (e.g. ICON) horizontal grid decomposition

:::

::: {.column width="50%"}

![](static/csm_icon_grid_single_noborder_decomp.png){width=100%}
::: {.tiny}

adapted from original image by MPI-M

:::

:::

::::



# Introducing OpenMP
* A popular way to parallelize code
* Pragma-based parallelization API
  * You annotate your code with parallel regions and the compiler does the rest

```c++
#pragma omp parallel for
    for (int i = 0; i < N; ++i)
        a[i] = 2 * i;
```

# OpenMP threads

* OpenMP uses something called threads
  * Wait until next week for a definition

:::: {.columns}

::: {.column width="50%"}

<br>

```c++
const int N = 8;
#pragma omp parallel for
    for (int i = 0; i < N; ++i)
        a[i] = 2 * i;
```

:::

::: {.column width="50%"}

![](static/threads.jpg){width=100%}

:::

::::

# Hands-on Session! {background-color=var(--dark-bg-color) .leftalign}
1. Load the GNU compiler on Levante
```bash
module load gcc
```
2. Compile and run the serial example
```bash
gcc main.c -o serial.x -lm
time ./serial.x   # use `time` to check the runtime
```
3. Compile and run the example using OpenMP
```bash
gcc -fopenmp main.c -o parallel.x -lm
export OMP_NUM_THREADS=2
time ./parallel.x
```

4. See next slide!
# Hands-on Session! {background-color=var(--dark-bg-color) .leftalign}

4. Now compile/run with
```bash
gcc -fopenmp main.c -o parallel.x -lm -DWRITE_DECOMP
export OMP_NUM_THREADS=4 
time ./parallel.x
```
5. What does the additional output mean?

6. Now uncomment/adapt
   * `schedule(static,100)`
   * `schedule(static,10)`

   and interpret the results.

# Scaling

Bake mini-pancake dessert

:::: {.columns}

::: {.column width="25%"}

![](static/pancakes_stack.png){width=100%}

:::

::: {.column width="25%" }

:::{.fragment}
![](static/one_pancake.png){width=100%}
:::
:::

::: {.column width="25%"}
:::{.fragment}
![](static/four_pans_cake.png){width=100%}
:::
:::

::: {.column width="25%"}
:::{.fragment}
![](static/four_pancakes.png){width=100%}
:::
:::

::::

:::{.info .tiny}
Images generated by Pradipta Samanta with DALL-E
:::



## Strong vs weak scaling

:::{.smaller}
Starting with 1 pan, we can scale this perfectly by using $P$ pans in two ways:
:::

:::{.fragment}
|Parameter / Scaling type |Strong|Weak|
|-------|---|---|
|Resources <br> (e.g. pans) | $P$| $P$ |
|Total workload <br> (e.g. pancake count)| $N$ | $N \times P$ |
|Workload per worker <br> (e.g. pancakes per pan) | $N/P$ | $N$ |
|Total time | $T_1 \times N/P$ | $T_1 \times N$ |
:::

<!--
* What happens if one uses more threads, but keep the problem size?
  * Strong scaling
* What happens if one uses more threads, but also increases the problem size by
  the same factor?
  * Weak scaling
  --> 

# Aggregate parallel results
## What is happening here?
```c++
    int a[] = {2, 4, 6};
    int sum = 0;
    for (int i = 0; i < N; ++i)
        sum = sum + a[i];
```
## What is happening here?
```c++
    int a[] = {2, 4, 6};
    int sum = 0;
#pragma omp parallel for
    for (int i = 0; i < N; ++i)
        sum = sum + a[i];
```
[comment]: # (Can something go wrong?)

## Solution
```c++
    int a[] = {2, 4, 6};
    int sum = 0;
#pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; ++i)
        sum = sum + a[i];
```
For other reduction operations like e.g. `max` see
[OpenMP documentation](https://www.openmp.org/spec-html/5.1/openmpsu117.html#x152-1720002.21.5)


# Doing stuff wrong
## What is going wrong here?
```c++
    int temp = 0;
#pragma omp parallel for
    for (int i = 0; i < N; ++i) {
        temp = 2 * a[i];
        b[i] = temp + 4;
    }
```
## Solution
```c++
    int temp = 0;
#pragma omp parallel for private(temp)
    for (int i = 0; i < N; ++i) {
        temp = 2 * a[i];
        b[i] = temp + 4;
    }
```
The problem is called "data race".

## Other common errors
* Race conditions
  * The outcome of a program depends on the relative timing of multiple threads.
* Deadlocks
  * Multiple threads wait for resources held by each other, so none can proceed.
* Starvation
  * A thread is blocked indefinitely while waiting for a resource.

# Finally: A definition of parallelism
<!--"Parallel computing is a type of computation in which many calculations or processes are carried out simultaneously."
Wikipedia
-->

"Parallel computing is the simultaneous use of multiple compute resources to solve a computational problem"

:::{.smaller}
-- *Introduction to Parallel Computing Tutorial*, LLNL
:::


## Types of parallelism

* Data-level parallelism
  * what we've been discussing
* Task-level parallelism
  * Example: Atmosphere ocean coupling


## Precondition for parallel execution

<br>

*"Two consecutive instructions or code segments can be executed in parallel if they are **independent**."* 

<br>

:::{.fragment .info .smaller}
What does **dependence** mean?
:::

## Code and data dependence {.leftalign}


:::{.fragment .fade-in fragment-index=1}
* Data dependence - the instructions share the same data 
:::

:::{.fragment .fade-in-then-semi-out fragment-index=2}
```c++ 
a = b;
c = a + b; //  flow dependence
```
:::

:::{.fragment .fade-in fragment-index=3}
* Control dependence - execution order decided at runtime 
:::

:::{.fragment .fade-in-then-semi-out fragment-index=4}
```c++
for (int i = 1; i < n ; i++) {
  if (a[i-1] > a[i]) { 
    a[i] = a[i] + 1;
   } else {
    a[i] = 0;
   } 
} 
```
:::

:::{.fragment .fade-in fragment-index=5}
* Resource dependence - share the same resource
:::

:::{.fragment .fade-in-then-semi-out fragment-index=6}
```c++
a = b;
b = 42; // write after read: a must read b before it is overwritten
```
:::



## Bernstein's parallelism conditions {.alignleft}
For data dependence, apply Bernstein's conditions: *"the intersection between read-write set, write-read set and write-write set of instructions is null"*.

:::{.fragment}
```c++
c = a + b;  // S1
d = a - b;  // S2
```
:::

:::{.smaller}

:::: {.columns}

::: {.column width="50%"}

:::{.fragment}

Read and write sets for S1 and S2:
$$
R_1 = \{a,b\} ; W_1 = \{c\} \\
R_2 = \{a,b\} ; W_2 = \{d\} \\
$$

:::
:::

::: {.column width="50%"}

:::{.fragment}
Bernstein's conditions:

$$
R_1 \cap W_2 = \emptyset \\
W_1 \cap R_2 = \emptyset \\
W_1 \cap W_2 = \emptyset
$$
:::

:::
:::
::::

:::{.fragment}
S1 and S2 can be executed in parallel!
:::

:::{.notes}
How about these two? replace a in S2 with c
```c++ 
c = a + b;  // S1
d = c - b;  // S2
```
:::

## Bernstein's parallelism conditions {.alignleft}

:::{.fragment .semi-fade-out}
```c++
c = a + b;  // S1
d = a - b;  // S2
```
:::


:::{.fragment}
Replace `a` with `c` in statement 2
```c++
c = a + b;  // S1
d = c - b;  // S2
```
:::

:::{.smaller}

:::: {.columns}

::: {.column width="50%"}

:::{.fragment}

Read and write sets for S1 and S2:
$$
R_1 = \{a,b\} ; W_1 = \{c\} \\
R_2 = \{c,b\} ; W_2 = \{d\} \\
$$

:::
:::

::: {.column width="50%"}

:::{.fragment}
Bernstein's conditions:

$$
R_1 \cap W_2 = \emptyset \\
W_1 \cap R_2 = \{c\} \\
W_1 \cap W_2 = \emptyset
$$
:::

:::
:::
::::

:::{.fragment}
S1 and S2 can NOT be executed in parallel!
:::


## Best practices

* Parallelisation should not change the results! Exceptions to be discussed next week!



# Additional reading

* "Computer Architecture - A Quantitative Approach" - J. Hennessy and D. Patterson
* "Introduction to High Performance Computing for Scientists and Engineers" - G. Hager and G. Wellein
* "Introduction to Parallel Computing Tutorial" - [Online](https://hpc.llnl.gov/documentation/tutorials/introduction-parallel-computing-tutorial), Lawrence Livermore National Laboratory