From 4a143f1e939efd69281849a887c51ce788a5f0db Mon Sep 17 00:00:00 2001
From: jfe <git@jfengels.de>
Date: Thu, 30 May 2024 17:10:37 +0200
Subject: [PATCH] Some stuff.

---
 lectures/hardware/slides.qmd | 270 +++++++++++++++++++++++++++++++----
 1 file changed, 240 insertions(+), 30 deletions(-)

diff --git a/lectures/hardware/slides.qmd b/lectures/hardware/slides.qmd
index 93c6df4..9f9bf9a 100644
--- a/lectures/hardware/slides.qmd
+++ b/lectures/hardware/slides.qmd
@@ -1,57 +1,267 @@
 ---
-title: "Example lecture"
-author: "Tobias Kölling, Florian Ziemen, and the teams at MPI-M and DKRZ"
+title: "Computing devices"
+subtitle: "Complications due to existing hardware"
+author: "CF, GM, JFE FIXME"
 ---
-# Preface
+# Recap from last week
+* What is parallelism?
+* Which type of parallelism did we discuss?
+* What is actually done in parallel?
+* Which technique did we use for parallelisation?

-* This is an example lecture for the generic computing skills course
+# Limits of OpenMP
+* Can't scale beyond one node
+* Can get complicated if the node architecture is complicated (e.g. NUMA)
+* FIXME: What else?

-## Idea
-*optimize output for **analysis** *
+The technique we used is called "shared-memory parallelism".

-::: {.smaller}
-(not write throughput)
-:::
+# Beyond shared memory
+If we want to scale beyond one node, what do we need in our example?
+
+FIXME: show the domain decomposition, ask about the last lecture with reductions in mind,
+show the discretisation, which points to boundary-data exchange
+
+# Beyond shared memory
+If we want to scale beyond one node, what do we need in our example?
+
+* An interconnect between the nodes (category error?)
+* Boundary-data exchange
+* Reductions
+
+# Distributed-memory parallelism
+* The dominating standard is MPI (Message Passing Interface).
+* An MPI program has multiple **ranks**, which can run on different nodes.
+* The point-to-point mechanism transfers data ("messages") between two ranks.
+  * Often used for boundary exchange
+* A lot of the details are beyond the scope of this lecture.
+
+## Collectives
+* A collective operation acts on all ranks.
+* Can do:
+  * Broadcast
+  * Gather
+  * Scatter
+  * Reductions
+  * All-to-all (why is all-to-all a bad idea for boundary exchange?)
+
+## Hands-on Session! {background-color=var(--dark-bg-color) .leftalign}
+
+* Find the maximum height of the wave and print it.
+* Can you also think of a way to find the position of the peak?
+
+## MPI+X
+
+* MPI can be combined with shared-memory paradigms such as OpenMP or OpenACC.
+* This is the current state of the art.
+
+# But what actually does the work?
+
+Existing technologies:
+
+* CPU
+* GPU
+* (NEC Vector)
+* ~~FPGA~~
+* 🦄?
+
+## Categorization
+
+* MIMD: Multiple instruction, multiple data
+  * Multiple cores in a CPU system
+  * ?
+* SIMD: Single instruction, multiple data
+  * "Warps"? inside GPUs
+  * Vectorization? inside a CPU core
+
+FIXME: Explain more, maybe graphically
+
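+One way to make the SIMD idea concrete in code (a minimal, illustrative sketch, not taken from our example code; the function name is made up):
+
+```c
+#include <stddef.h>
+
+/* Illustrative only: the same instruction (a multiply-add) is applied to many
+   data elements. `#pragma omp simd` asks the compiler to vectorize the loop,
+   i.e. to process several iterations at once using the CPU's vector units. */
+void scale_add(size_t n, float a, const float *x, float *y) {
+  #pragma omp simd
+  for (size_t i = 0; i < n; ++i)
+    y[i] = a * x[i] + y[i];
+}
+```
+
+Compile with `-fopenmp` or `-fopenmp-simd`; the pragma is only a hint, and compilers often vectorize such loops on their own.
+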
+# From the pad
+Computing devices (complications due to existing hardware)
+- What you saw last lecture is called "shared-memory parallelism"
+  - Can't scale beyond one node
+  - Show the limits of OpenMP (shared memory)
+- Introduce distributed-memory parallelism (MPI), ideally with the same example.
+  - An interconnect exists
+
+- Skim over point-to-point and collectives? Or is that already too much/too complex?
+
+- Hybrid, i.e. MPI+OMP or MPI+X
+
+- Hands-on:
+  - Grab something from the homework and have them do it distributed-memory or even hybrid.
+- Do a little more on CPU architecture (what exactly? Especially when not touching the memory hierarchy.)
+  - [GM] Should we discuss some basic concepts like SIMD/SPMD/Flynn's taxonomy?
+  - Only do MIMD and SIMD

-## chunking & hierarchy
+- GPUs. How?
+  - ~~Point out the main differences (what is really important, and what is "we've explained it like this since the 90s"?)~~
+  - Explain GPUs in terms of MIMD and SIMD
+  - Point out advantages and disadvantages
+  - Point out porting strategies
+    - There are OpenMP target, OpenACC, and CUDA/HIP; GM is going to add more cool stuff here
+    - Often memory is not shared between GPU and CPU, so you not only need to port the actual code but also think about memory management.
+      - Best practice: use as few memory transfers as possible (and motivate why)
+  - Here are links to resources.
+  - Point out that technical progress will probably obsolete a lot of this knowledge and these best practices
+
+
+- Additional stuff:
+  - Interconnect: latency-bound vs. bandwidth-bound
+  - Obscure other architectures:
+    - NEC Vector
+
+# Motivation
+
+* We have a serial code and want to make it faster.
+* Plan of action:
+  * Cut the problem into smaller pieces
+  * Use independent compute resources for each piece
+* Outlook for next week: the individual computing element no longer
+  gets much faster, but there are more of them.
+* FIXME: What else?
+
+
+
+## This lecture
+
+* Is mostly about parallelism as a concept
+* Next week: the hardware using this concept
+
+# Introduce parallelism
+Thinking about it, we should not give a theoretical definition here,
+but first give the example and explain parallelism there. Eventually, with
+task parallelism, we should give a real definition and the different flavours.
+FIXME
+
+# Our example problem

 :::: {.columns}
 ::: {.column width="50%"}
-| Grid | Cells |
-|---------:|--------:|
-| 1° by 1° | 0.06M |
-| 10 km | 5.1M |
-| 5 km | 20M |
-| 1 km | 510M |
-| 200 m | 12750M |
+* 1D tsunami equation
+* Korteweg–De Vries equation
+* The discretization is not numerically accurate
+* [Wikipedia](https://en.wikipedia.org/wiki/Korteweg%E2%80%93De_Vries_equation)
+* FIXME
 :::
 ::: {.column width="50%"}
-| Screen | Pixels |
-|------------:|--------:|
-| VGA | 0.3M |
-| Full HD | 2.1M |
-| MacBook 13' | 4.1M |
-| 4K | 8.8M |
-| 8K | 35.4M |
+FIXME: show a plot of a soliton
 :::
 ::::
-It's **impossible** to look at the entire globe in full resolution.
+
+# Our example problem
+FIXME: show some central loop
+
+# Decomposing problem domains
+## Our problem domain
+FIXME
+
+## Other problem domains
+* ICON's domain decomposition
+* maybe something totally different?
+
+FIXME
+
+# Introducing OpenMP
+* A popular way to parallelize code
+* Pragma-based parallelization API
+  * You annotate your code with parallel regions and the compiler does the rest
+* OpenMP uses something called threads
+  * Wait until next week for a definition
+
+```c
+#pragma omp parallel for
+  for (int i = 0; i < N; ++i)
+    a[i] = 2 * i;
+```
+
+## Hands-on Session! {background-color=var(--dark-bg-color) .leftalign}
+
+1. Compile and run the example serially. Use `time ./serial.x` to time the execution.
+2. Compile and run the example using OpenMP. Use `OMP_NUM_THREADS=2 time ./omp.x` to time the execution.
+3. Now add
+   * `schedule(static,1)`
+   * `schedule(static,10)`
+   * `schedule(FIXMEsomethingelse)`
+   * `schedule(FIXMEsomethingelse)`
+   and find out how the OpenMP runtime decomposes the problem domain.
+
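+A minimal sketch that makes the decomposition visible (illustrative only, not our course example; compile with `-fopenmp`):
+
+```c
+#include <stdio.h>
+#include <omp.h>
+
+int main(void) {
+  const int N = 16;
+  /* Each iteration reports which thread executed it.
+     Change the schedule() clause and compare the mapping. */
+  #pragma omp parallel for schedule(static, 1)
+  for (int i = 0; i < N; ++i)
+    printf("iteration %2d -> thread %d\n", i, omp_get_thread_num());
+  return 0;
+}
+```
+
+Run it with, e.g., `OMP_NUM_THREADS=4 ./a.out` and watch how the iteration-to-thread mapping changes with the schedule.
+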
+# Reductions (FIXME: title should be more generic)
+## What is happening here?
+```c
+  int a[] = {2, 4, 6};
+  int sum = 0;
+  for (int i = 0; i < N; ++i)
+    sum = sum + a[i];
+```
+## What is happening here?
+```c
+  int a[] = {2, 4, 6};
+  int sum = 0;
+#pragma omp parallel for
+  for (int i = 0; i < N; ++i)
+    sum = sum + a[i];
+```
+[comment]: # (Can something go wrong?)
+
+## Solution
+```c
+  int a[] = {2, 4, 6};
+  int sum = 0;
+#pragma omp parallel for reduction(+:sum)
+  for (int i = 0; i < N; ++i)
+    sum = sum + a[i];
+```
+
+# Doing stuff wrong
+## What is going wrong here?
+```c
+  int temp = 0;
+#pragma omp parallel for
+  for (int i = 0; i < N; ++i) {
+    temp = 2 * a[i];
+    b[i] = temp + 4;
+  }
+```
+## Solution
+```c
+  int temp = 0;
+#pragma omp parallel for private(temp)
+  for (int i = 0; i < N; ++i) {
+    temp = 2 * a[i];
+    b[i] = temp + 4;
+  }
+```
+This problem is called a "data race".
+
+## Other common errors
+* Race conditions
+  * The outcome of the program depends on the relative timing of multiple threads.
+* Deadlocks
+  * Multiple threads wait for a resource request that can never be fulfilled.
+* Inconsistency (FIXME: What is that?)

-## Load data at the resolution necessary for the analysis
+# Finally: A definition of parallelism
+"Parallel computing is a type of computation in which many calculations or processes are carried out simultaneously."
+(Wikipedia)
+FIXME: Citation correct?!

-](https://easy.gems.dkrz.de/_images/gorski_f1.jpg)
+## Types of parallelism
+* Data parallelism (what we've been discussing)
+* Task parallelism (example: atmosphere-ocean coupling)
+* Instruction-level parallelism (see next week)

-## Highlight! {background-color=var(--dark-bg-color)}
+# FIXME
+* Homework:
+  * Do something where you run into hardware constraints (e.g. NUMA, too many threads, ...)
+  * Give an example with a race condition (or similar) and have them find it.
+* Add maybe:
+  * Are there theoretical concepts like Amdahl's law that we should explain? (I don't like Amdahl.)
+  * Strong/weak scaling?

-* This slide is either important or has a special purpose.
-* You can use it to ask the audience a question or to start a hands-on session.
 ## Leftovers from previous talk

 Best practices for efficient scaling:
--
GitLab