From 4a143f1e939efd69281849a887c51ce788a5f0db Mon Sep 17 00:00:00 2001
From: jfe <git@jfengels.de>
Date: Thu, 30 May 2024 17:10:37 +0200
Subject: [PATCH] Some stuff.

---
 lectures/hardware/slides.qmd | 270 +++++++++++++++++++++++++++++++----
 1 file changed, 240 insertions(+), 30 deletions(-)

diff --git a/lectures/hardware/slides.qmd b/lectures/hardware/slides.qmd
index 93c6df4..9f9bf9a 100644
--- a/lectures/hardware/slides.qmd
+++ b/lectures/hardware/slides.qmd
@@ -1,57 +1,267 @@
 ---
-title: "Example lecture"
-author: "Tobias Kölling, Florian Ziemen, and the teams at MPI-M and DKRZ"
+title: "Computing devices"
+subtitle: "Complications due to existing hardware"
+author: "CF, GM, JFE FIXME"
 ---
-# Preface
+# Recap from last week
+* What is parallelism?
+* Which type of parallelism did we discuss?
+* What is actually done in parallel?
+* Which technique did we use for parallelisation?

-* This is an example lecture for the generic computing skills course
+# Limits of OpenMP
+* Can't scale beyond one node
+* Can get complicated if the node architecture is complicated (e.g. NUMA)
+* FIXME: What else?

-## Idea
-*optimize output for **analysis** *
+The technique we used is called "shared-memory parallelism".

-::: {.smaller}
-(not write throughput)
-:::
+# Beyond shared memory
+If we want to scale beyond one node, what do we need in our example?
+
+FIXME: show the domain decomposition, ask about the last lecture with reductions in mind,
+show the discretisation, which points to boundary-data exchange
+
+# Beyond shared memory
+If we want to scale beyond one node, what do we need in our example?
+
+* An interconnect between the nodes (category error?)
+* Boundary-data exchange
+* Reductions
+
+# Distributed-memory parallelism
+* The dominating standard is MPI (Message Passing Interface).
+* An MPI program has multiple **ranks**, which can run on different nodes.
+* The point-to-point mechanism transfers data ("messages") between two ranks.
+  * Often used for boundary exchange
+* A lot of the details are beyond the scope of this lecture.
+
+## Collectives
+* A collective operation acts on all ranks.
+* Can do:
+  * Broadcast
+  * Gather
+  * Scatter
+  * Reductions
+  * All-to-all (why is all-to-all a bad idea for boundary exchange?)
+
+## Hands-on Session! {background-color=var(--dark-bg-color) .leftalign}
+
+* Find the maximum height of the wave and print it.
+* Can you also think of a way to find the position of the peak?
+
+## MPI+X
+
+* MPI can be combined with shared-memory paradigms such as OpenMP or OpenACC.
+* This is the current state of the art.
+
+# But what actually does the work?
+
+Existing technologies:
+
+* CPU
+* GPU
+* (NEC Vector)
+* ~~FPGA~~
+* 🦄?
+
+## Categorization
+
+* MIMD: Multiple instruction, multiple data
+  * Multiple cores in a CPU system
+  * ?
+* SIMD: Single instruction, multiple data
+  * "Warps"? inside GPUs
+  * Vectorization? inside a CPU core
+
+FIXME: Explain more, maybe graphically
+
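+One way to make the SIMD idea concrete in code (a minimal, illustrative sketch, not taken from our example code; the function name is made up):
+
+```c
+#include <stddef.h>
+
+/* Illustrative only: the same instruction (a multiply-add) is applied to many
+   data elements. `#pragma omp simd` asks the compiler to vectorize the loop,
+   i.e. to process several iterations at once using the CPU's vector units. */
+void scale_add(size_t n, float a, const float *x, float *y) {
+  #pragma omp simd
+  for (size_t i = 0; i < n; ++i)
+    y[i] = a * x[i] + y[i];
+}
+```
+
+Compile with `-fopenmp` or `-fopenmp-simd`; the pragma is only a hint, and compilers often vectorize such loops on their own.
+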
+# From the pad
+Computing devices (complications due to existing hardware)
+- What you saw last lecture is called "shared-memory parallelism"
+  - Can't scale beyond one node
+  - Show the limits of OpenMP (shared memory)
+- Introduce distributed-memory parallelism (MPI), ideally with the same example.
+  - An interconnect exists
+
+- Skim over point-to-point and collectives? Or is that already too much/too complex?
+
+- Hybrid, i.e. MPI+OMP or MPI+X
+
+- Hands-on:
+  - Grab something from the homework and have them do it distributed-memory or even hybrid.
+- Do a little more on CPU architecture (what exactly? Especially when not touching the memory hierarchy.)
+  - [GM] Should we discuss some basic concepts like SIMD/SPMD/Flynn's taxonomy?
+  - Only do MIMD and SIMD

-## chunking & hierarchy
+- GPUs. How?
+  - ~~Point out the main differences (what is really important, and what is "we've explained it like this since the 90s"?)~~
+  - Explain GPUs in terms of MIMD and SIMD
+  - Point out advantages and disadvantages
+  - Point out porting strategies
+    - There are OpenMP target, OpenACC, and CUDA/HIP; GM is going to add more cool stuff here
+    - Often memory is not shared between GPU and CPU, so you not only need to port the actual code but also think about memory management.
+      - Best practice: use as few memory transfers as possible (and motivate why)
+  - Here are links to resources.
+  - Point out that technical progress will probably obsolete a lot of this knowledge and these best practices
+
+
+- Additional stuff:
+  - Interconnect: latency-bound vs. bandwidth-bound
+  - Obscure other architectures:
+    - NEC Vector
+
+# Motivation
+
+* We have a serial code and want to make it faster.
+* Plan of action:
+  * Cut the problem into smaller pieces
+  * Use independent compute resources for each piece
+* Outlook for next week: the individual computing element no longer
+  gets much faster, but there are more of them.
+* FIXME: What else?
+
+
+
+## This lecture
+
+* Is mostly about parallelism as a concept
+* Next week: the hardware using this concept
+
+# Introduce parallelism
+Thinking about it, we should not give a theoretical definition here,
+but first give the example and explain parallelism there. Eventually, with
+task parallelism, we should give a real definition and the different flavours.
+FIXME
+
+# Our example problem

 :::: {.columns}
 ::: {.column width="50%"}
-| Grid | Cells |
-|---------:|--------:|
-| 1° by 1° | 0.06M |
-| 10 km | 5.1M |
-| 5 km | 20M |
-| 1 km | 510M |
-| 200 m | 12750M |
+* 1D tsunami equation
+* Korteweg–De Vries equation
+* The discretization is not numerically accurate
+* [Wikipedia](https://en.wikipedia.org/wiki/Korteweg%E2%80%93De_Vries_equation)
+* FIXME
 :::
 ::: {.column width="50%"}
-| Screen | Pixels |
-|------------:|--------:|
-| VGA | 0.3M |
-| Full HD | 2.1M |
-| MacBook 13' | 4.1M |
-| 4K | 8.8M |
-| 8K | 35.4M |
+FIXME: show a plot of a soliton
 :::
 ::::
-It's **impossible** to look at the entire globe in full resolution.
+
+# Our example problem
+FIXME: show some central loop
+
+# Decomposing problem domains
+## Our problem domain
+FIXME
+
+## Other problem domains
+* ICON's domain decomposition
+* maybe something totally different?
+
+FIXME
+
+# Introducing OpenMP
+* A popular way to parallelize code
+* Pragma-based parallelization API
+  * You annotate your code with parallel regions and the compiler does the rest
+* OpenMP uses something called threads
+  * Wait until next week for a definition
+
+```c
+#pragma omp parallel for
+  for (int i = 0; i < N; ++i)
+    a[i] = 2 * i;
+```
+
+## Hands-on Session! {background-color=var(--dark-bg-color) .leftalign}
+
+1. Compile and run the example serially. Use `time ./serial.x` to time the execution.
+2. Compile and run the example using OpenMP. Use `OMP_NUM_THREADS=2 time ./omp.x` to time the execution.
+3. Now add
+   * `schedule(static,1)`
+   * `schedule(static,10)`
+   * `schedule(FIXMEsomethingelse)`
+   * `schedule(FIXMEsomethingelse)`
+   and find out how the OpenMP runtime decomposes the problem domain.
+
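+A minimal sketch that makes the decomposition visible (illustrative only, not our course example; compile with `-fopenmp`):
+
+```c
+#include <stdio.h>
+#include <omp.h>
+
+int main(void) {
+  const int N = 16;
+  /* Each iteration reports which thread executed it.
+     Change the schedule() clause and compare the mapping. */
+  #pragma omp parallel for schedule(static, 1)
+  for (int i = 0; i < N; ++i)
+    printf("iteration %2d -> thread %d\n", i, omp_get_thread_num());
+  return 0;
+}
+```
+
+Run it with, e.g., `OMP_NUM_THREADS=4 ./a.out` and watch how the iteration-to-thread mapping changes with the schedule.
+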
+# Reductions (FIXME: title should be more generic)
+## What is happening here?
+```c
+  int a[] = {2, 4, 6};
+  int sum = 0;
+  for (int i = 0; i < N; ++i)
+    sum = sum + a[i];
+```
+## What is happening here?
+```c
+  int a[] = {2, 4, 6};
+  int sum = 0;
+#pragma omp parallel for
+  for (int i = 0; i < N; ++i)
+    sum = sum + a[i];
+```
+[comment]: # (Can something go wrong?)
+
+## Solution
+```c
+  int a[] = {2, 4, 6};
+  int sum = 0;
+#pragma omp parallel for reduction(+:sum)
+  for (int i = 0; i < N; ++i)
+    sum = sum + a[i];
+```
+
+# Doing stuff wrong
+## What is going wrong here?
+```c
+  int temp = 0;
+#pragma omp parallel for
+  for (int i = 0; i < N; ++i) {
+    temp = 2 * a[i];
+    b[i] = temp + 4;
+  }
+```
+## Solution
+```c
+  int temp = 0;
+#pragma omp parallel for private(temp)
+  for (int i = 0; i < N; ++i) {
+    temp = 2 * a[i];
+    b[i] = temp + 4;
+  }
+```
+This problem is called a "data race".
+
+## Other common errors
+* Race conditions
+  * The outcome of the program depends on the relative timing of multiple threads.
+* Deadlocks
+  * Multiple threads wait for a resource request that can never be fulfilled.
+* Inconsistency (FIXME: What is that?)

-## Load data at the resolution necessary for the analysis
+# Finally: A definition of parallelism
+"Parallel computing is a type of computation in which many calculations or processes are carried out simultaneously."
+(Wikipedia)
+FIXME: Citation correct?!

-](https://easy.gems.dkrz.de/_images/gorski_f1.jpg)
+## Types of parallelism
+* Data parallelism (what we've been discussing)
+* Task parallelism (example: atmosphere-ocean coupling)
+* Instruction-level parallelism (see next week)

-## Highlight! {background-color=var(--dark-bg-color)}
+# FIXME
+* Homework:
+  * Do something where you run into hardware constraints (e.g. NUMA, too many threads, ...)
+  * Give an example with a race condition (or similar) and have them find it.
+* Add maybe:
+  * Are there theoretical concepts like Amdahl's law that we should explain? (I don't like Amdahl.)
+  * Strong/weak scaling?

-* This slide is either important or has a special purpose.
-* You can use it to ask the audience a question or to start a hands-on session.
 ## Leftovers from previous talk

 Best practices for efficient scaling:
--
GitLab