---
title: "Computing devices"
subtitle: "Complications due to existing hardware"
author: "CF, GM, JFE FIXME"
---
# Recap from last week
* What is parallelism?
* Which type of parallelism did we discuss?
* What is actually done in parallel?
* Which technique did we use for parallelisation?
# Limits of OpenMP
* Can't scale beyond one node
* Can get complicated if architecture is complicated
* FIXME: What else?
The technique is called "shared-memory parallelism".
FIXME show domain decomp, ask about last lecture with reductions in mind,
show discretisation which points to boundary-data exchange
# Beyond shared memory
If we want to scale beyond one node, what do we need in our example?
* Interconnect between nodes (category error?)
* Boundary-data exchange
* Reductions
# Distributed memory parallelism
* The dominating standard is MPI (Message Passing Interface)
* An MPI program has multiple **ranks**, which can run on different nodes.
* The point-to-point mechanism transfers data ("messages") between two ranks.
* Often used for boundary exchange
* A lot of the details are beyond the scope of this lecture.
## Collectives
* Act collectively on all ranks.
* Can do:
* Broadcast
* Gather
* Scatter
* Reductions
* All-to-all (why is all-to-all a bad idea for boundary exchange?)
## Hands-on Session! {background-color=var(--dark-bg-color) .leftalign}
* Find the maximum height of the wave and print it
* Can you also think of a way to find the position of the peak?
## MPI+X
* MPI can be combined with shared memory paradigms, like OpenMP or OpenACC
* Current state of the art
# But what actually does the work?
Existing technologies
* CPU
* GPU
* (NEC Vector)
* ~~FPGA~~
* 🦄?
## Categorization
* MIMD: Multiple instruction, multiple data
* Multiple cores in a CPU-system
* ?
* SIMD: Single instruction, multiple data
* "Warps"? inside GPUs
* Vectorization? inside a CPU core
FIXME: Explain more, maybe graphically
# From the pad
Computing devices (complications by existing hardware)
- What you saw last lecture is called "shared memory parallelism"
- Can't scale beyond one node
- Show limit of OMP (shared memory)
- Introduce distrib-memory parallelism (MPI), ideally with the same example.
- Interconnect exists
- Skim over pt2pt and collectives? Or is that already too much/complex?
- Hybrid, i.e. MPI+OMP or MPI+X
- Hands-on:
- Grab something from the homework and have them do it distrib-mem or even hybrid.
- Do a little more on CPU architecture (what exactly? Especially when not touching memory hierarchy.)
- [GM] Should we discuss some basic concepts like SIMD/SPMD/Flynn's taxonomy?
- Only do MIMD and SIMD
## chunking & hierarchy - GPUs. How?
- ~~Point out the main differences (what is really important, what is "we've explained it like this since the 90s"?)~~
- Explain GPU in terms of MIMD and SIMD
- Point out advantages and disadvantages
- Point out porting strategies
- There's OMP target and OpenACC and CUDA/HIP and GM is gonna add more cool stuff here
- Often memory is not shared between GPU and CPU, so you not only need to port the actual code, but also think about memory management.
- Best practice: Use as few memory transfers as possible (and motivate why)
- Here are links to resources.
- Point out that technical progress will probably make a lot of this knowledge and these best practices obsolete
- Additional stuff:
- Interconnect: Latency bound vs bandwidth bound
- Obscure other architectures:
- NEC Vector
# Motivation
* We have a serial code and want to make it faster
* Plan of action:
* Cut problem into smaller pieces
* Use independent compute resources for each piece
* Outlook for next week: The individual computing element no longer gets much faster, but there are more of them
* FIXME: What else?
## This lecture
* Is mostly about parallelism as a concept
* Next week: Hardware using this concept
# Introduce parallelism
Thinking about it, I think we should not give a theoretical definition here,
but first give the example and explain parallelism there. Eventually, with
task parallelism, we should probably give a real definition and the different flavours.
FIXME
# Our example problem
# Our example problem
:::: {.columns}
::: {.column width="50%"}
* 1d Tsunami equation
* Korteweg–De Vries equation
* Discretization not numerically accurate
* [Wikipedia](https://en.wikipedia.org/wiki/Korteweg%E2%80%93De_Vries_equation)
* FIXME
:::
::: {.column width="50%"}
FIXME show plot of a soliton
:::
::::
# Our example problem
FIXME
show some central loop
# Decomposing problem domains
## Our problem domain
FIXME
## Other problem domains
* ICON's domain decomposition
* maybe something totally different?
FIXME
# Introducing OpenMP
* A popular way to parallelize code
* Pragma-based parallelization API
* You annotate your code with parallel regions and the compiler does the rest
* OpenMP uses something called threads
* Wait until next week for a definition
```c
#pragma omp parallel for
for (int i = 0; i < N; ++i)
a[i] = 2 * i;
```
## Hands-on Session! {background-color=var(--dark-bg-color) .leftalign}
1. Compile and run the example serially. Use `time ./serial.x` to time the execution.
2. Compile and run the example using OpenMP. Use `OMP_NUM_THREADS=2 time ./omp.x` to time the execution.
3. Now add
* `schedule(static,1)`
* `schedule(static,10)`
* `schedule(FIXMEsomethingelse)`
* `schedule(FIXMEsomethingelse)`
and find out how the OpenMP runtime decomposes the problem domain.
# Reductions FIXME title should be more generic
## What is happening here?
```c
int a[] = {2, 4, 6};
int sum = 0;
for (int i = 0; i < 3; ++i)
    sum = sum + a[i];
```
## What is happening here?
```c
int a[] = {2, 4, 6};
int sum = 0;
#pragma omp parallel for
for (int i = 0; i < 3; ++i)
    sum = sum + a[i];
```
[comment]: # (Can something go wrong?)
## Solution
```c
int a[] = {2, 4, 6};
int sum = 0;
#pragma omp parallel for reduction(+:sum)
for (int i = 0; i < 3; ++i)
    sum = sum + a[i];
```
# Doing stuff wrong
## What is going wrong here?
```c
int temp = 0;
#pragma omp parallel for
for (int i = 0; i < N; ++i) {
temp = 2 * a[i];
b[i] = temp + 4;
}
```
## Solution
```c
int temp = 0;
#pragma omp parallel for private(temp)
for (int i = 0; i < N; ++i) {
temp = 2 * a[i];
b[i] = temp + 4;
}
```
The problem is called "data race".
## Other common errors
* Race conditions
* The outcome of a program depends on the relative timing of multiple threads.
* Deadlocks
* Multiple threads wait for a resource that cannot be fulfilled.
* Inconsistency (FIXME: What is that?)
# Finally: A definition of parallelism
"Parallel computing is a type of computation in which many calculations or processes are carried out simultaneously."
Wikipedia
FIXME: Citation correct?!
## Types of parallelism
* Data parallelism (what we've been discussing)
* Task parallelism (Example: Atmosphere ocean coupling)
* Instruction level parallelism (see next week)
# FIXME
* Homework:
* Do something where you run into hardware-constraints (i.e. Numa, too many threads, ...)
* Give some example with race condition or stuff and have them find it.
* Add maybe:
* Are there theoretical concept like Amdahl, which we should explain? (I don't like Amdahl)
* Strong/weak scaling?
## Leftovers from previous talk
Best practices for efficient scaling: