Commit 4a143f1e, authored 11 months ago by Jan Frederik Engels
"Some stuff."
Parent: 072dd2ea
1 merge request: !72 Draft: Compute devices lecture
Pipeline #69469 passed (stages: test, build)
1 changed file: lectures/hardware/slides.qmd (+240, −30)
---
title: "Computing devices"
subtitle: "Complications due to existing hardware"
author: "CF, GM, JFE FIXME"
---
# Recap from last week
* What is parallelism?
* Which type of parallelism did we discuss?
* What is actually done in parallel?
* Which technique did we use for parallelisation?

The technique is called "shared-memory parallelism".

# Limits of OpenMP
* Can't scale beyond one node
* Can get complicated if the architecture is complicated
* FIXME: What else?
# Beyond shared memory
If we want to scale beyond one node, what do we need in our example?
FIXME show domain decomp, ask about last lecture with reductions in mind,
show discretisation which points to boundary-data exchange
# Beyond shared memory
If we want to scale beyond one node, what do we need in our example?
* Interconnect between nodes (a category error?)
* Boundary-data exchange
* Reductions
# Distributed memory parallelism
* The dominating standard is MPI (Message Passing Interface)
* An MPI program has multiple **ranks**, which can run on different nodes.
* A point-to-point mechanism transfers data ("messages") between two ranks.
  * Often used for boundary exchange (see the sketch below)
* A lot of the details are beyond the scope of this lecture.
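A minimal point-to-point sketch (illustrative only, not the course code; the file name, variable names, and two-rank setup are made up): rank 0 sends one boundary value to rank 1.

```c
// sketch: send one "boundary" value from rank 0 to rank 1
// build and run with an MPI wrapper, e.g. `mpicc p2p.c && mpirun -np 2 ./a.out`
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double boundary = 0.0;
    if (rank == 0) {
        boundary = 1.5;  // pretend this is our outermost grid cell
        MPI_Send(&boundary, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&boundary, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %f\n", boundary);
    }

    MPI_Finalize();
    return 0;
}
```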
## Collectives
* A collective acts on all ranks together.
* Can do (see the sketch below):
  * Broadcast
  * Gather
  * Scatter
  * Reductions
  * All-to-all (why is all-to-all a bad idea for boundary exchange?)
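A hedged sketch of two collectives (the time step and the per-rank value are invented for illustration): rank 0 broadcasts a parameter, then all ranks reduce their partial results onto rank 0.

```c
// sketch: broadcast a parameter, then sum per-rank partial results on rank 0
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double dt = 0.0;
    if (rank == 0) dt = 0.01;  // only rank 0 knows the time step
    MPI_Bcast(&dt, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    double local = rank * dt;  // some per-rank partial result
    double total = 0.0;
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("total = %f\n", total);

    MPI_Finalize();
    return 0;
}
```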
## Hands-on Session! {background-color=var(--dark-bg-color) .leftalign}
* Find the maximum height of the wave and print it
* Can you also think of a way to find the position of the peak?
## MPI+X
* MPI can be combined with shared-memory paradigms such as OpenMP or OpenACC (see the sketch below)
* This is the current state of the art
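A sketch of what MPI+OpenMP can look like (the loop is invented for illustration): OpenMP threads share the work inside each rank, and MPI reduces across ranks.

```c
// sketch: OpenMP threads within a rank, MPI between ranks
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int provided;
    // request an MPI library that tolerates OpenMP threads
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local_sum = 0.0;
    #pragma omp parallel for reduction(+:local_sum)
    for (int i = 0; i < 1000; ++i)
        local_sum += i;  // thread-level work inside this rank

    double global_sum = 0.0;
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0,
               MPI_COMM_WORLD);
    if (rank == 0) printf("global sum = %f\n", global_sum);

    MPI_Finalize();
    return 0;
}
```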
# But what actually does the work?
Existing technologies
* CPU
* GPU
* (NEC Vector)
* ~~FPGA~~
* 🦄?
## Categorization
* MIMD: Multiple instruction, multiple data
  * Multiple cores in a CPU system
  * ?
* SIMD: Single instruction, multiple data
  * "Warps" inside GPUs?
  * Vectorization inside a CPU core?

FIXME: Explain more, maybe graphically (a code sketch follows below)
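One way to make the distinction concrete (a sketch, not from the course material; the arrays are invented): the outer loop is spread over cores (MIMD-style), while the inner loop is vectorized within one core (SIMD).

```c
// sketch: MIMD across cores (parallel for) + SIMD within one core (simd)
#include <stdio.h>

#define N 512
#define M 64

static float a[N][M], b[N][M], c[N][M];

int main(void) {
    #pragma omp parallel for  // MIMD: threads execute different i-iterations
    for (int i = 0; i < N; ++i) {
        #pragma omp simd      // SIMD: one instruction applied to many j-elements
        for (int j = 0; j < M; ++j)
            c[i][j] = a[i][j] + b[i][j];
    }
    printf("%f\n", c[0][0]);
    return 0;
}
```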
# From the pad
Computing devices (complications due to existing hardware)
- What you saw last lecture is called "shared-memory parallelism"
- Can't scale beyond one node
- Show limit of OMP (shared memory)
- Introduce distributed-memory parallelism (MPI), ideally with the same example.
- Interconnect exists
- Skim over pt2pt and collectives? Or is that already too much/complex?
- Hybrid, i.e. MPI+OMP or MPI+X
- Hands-on:
  - Grab something from the homework and have them do it distributed-memory or even hybrid.
- Do a little more on CPU architecture (what exactly? Especially when not touching the memory hierarchy.)
- [GM] Should we discuss some basic concepts like SIMD/SPMD/Flynn's taxonomy?
  - Only do MIMD and SIMD
## chunking & hierarchy
- GPUs. How?
  - ~~Point out the main differences (what is really important, and what is "we've explained it like this since the 90s"?)~~
  - Explain GPUs in terms of MIMD and SIMD
  - Point out advantages and disadvantages
  - Point out porting strategies
  - There are OMP target and OpenACC and CUDA/HIP, and GM is going to add more cool stuff here
  - Often memory is not shared between GPU and CPU, so you not only need to port the actual code but also have to think about memory management.
  - Best practice: use as few memory transfers as possible (and motivate why; see the sketch after this list)
  - Here are links to resources.
  - Point out that technical progress will probably obsolete a lot of this knowledge and these best practices
- Additional stuff:
  - Interconnect: latency bound vs bandwidth bound
  - Obscure other architectures:
    - NEC Vector
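A sketch of the "few transfers" idea with OpenMP target offload (array, sizes, and loops are invented for illustration): the `target data` region keeps `u` on the device across two kernels instead of copying it back and forth after each one.

```c
// sketch (OpenMP target offload): keep the array on the GPU across two kernels
#include <stdio.h>

#define N 1000000

static double u[N];

int main(void) {
    for (int i = 0; i < N; ++i) u[i] = i;

    // one transfer to the device, one transfer back, two kernels in between
    #pragma omp target data map(tofrom: u[0:N])
    {
        #pragma omp target teams distribute parallel for
        for (int i = 0; i < N; ++i)
            u[i] *= 2.0;

        #pragma omp target teams distribute parallel for
        for (int i = 0; i < N; ++i)
            u[i] += 1.0;
    }

    printf("%f\n", u[N - 1]);
    return 0;
}
```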
# Motivation
* We have a serial code and want to make it faster
* Plan of action:
  * Cut the problem into smaller pieces
  * Use independent compute resources for each piece
* Outlook for next week: the individual computing element no longer gets much faster, but there are more of them
* FIXME: What else?
## This lecture
* Is mostly about parallelism as a concept
* Next week: Hardware using this concept
# Introduce parallelism
Thinking about it, we should not give a theoretical definition here, but first
give the example and explain parallelism there. Eventually, with task
parallelism, we should probably give a real definition and the different flavours.
FIXME
# Our example problem
:::: {.columns}
::: {.column width="50%"}
* 1d Tsunami equation
* Korteweg–De Vries equation (standard form shown below)
* Discretization not numerically accurate
* [Wikipedia](https://en.wikipedia.org/wiki/Korteweg%E2%80%93De_Vries_equation)
* FIXME
:::
::: {.column width="50%"}
FIXME show plot of a soliton
:::
::::
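For reference, one standard form of the Korteweg–De Vries equation (as in the linked Wikipedia article; sign and scaling conventions vary):

$$ \partial_t \phi + \partial_x^3 \phi - 6\,\phi\,\partial_x \phi = 0 $$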
# Our example problem
FIXME
show some central loop
# Decomposing problem domains
## Our problem domain
FIXME
## Other problem domains
* ICON's domain decomposition
* maybe something totally different?
FIXME
# Introducing OpenMP
* A popular way to parallelize code
* Pragma-based parallelization API
* You annotate your code with parallel regions and the compiler does the rest
* OpenMP uses something called threads
* Wait until next week for a definition
```c
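// each thread gets a share of the loop iterations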
#pragma omp parallel for
for (int i = 0; i < N; ++i)
a[i] = 2 * i;
```
## Hands-on Session! {background-color=var(--dark-bg-color) .leftalign}
1. Compile and run the example serially. Use `time ./serial.x` to time the execution.
2. Compile and run the example using OpenMP. Use `OMP_NUM_THREADS=2 time ./omp.x` to time the execution.
3. Now add
   * `schedule(static,1)`
   * `schedule(static,10)`
   * `schedule(FIXMEsomethingelse)`
   * `schedule(FIXMEsomethingelse)`
   and find out how the OpenMP runtime decomposes the problem domain (see the sketch below for where the clause goes).
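For orientation, a sketch of where the clause goes, reusing the loop from the earlier snippet (chunk size chosen arbitrarily):

```c
// sketch: hand out chunks of 10 consecutive iterations to the threads
#pragma omp parallel for schedule(static, 10)
for (int i = 0; i < N; ++i)
    a[i] = 2 * i;
```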
# Reductions FIXME title should be more generic
## What is happening here?
```c
int a[] = {2, 4, 6};
int sum = 0;
for (int i = 0; i < N; ++i)
    sum = sum + a[i];
```
## What is happening here?
```c
int a[] = {2, 4, 6};
int sum = 0;
#pragma omp parallel for
for (int i = 0; i < N; ++i)
    sum = sum + a[i];
```
[comment]: # (Can something go wrong?)
## Solution
```c
int a[] = {2, 4, 6};
int sum = 0;
#pragma omp parallel for reduction(+:sum)
for (int i = 0; i < N; ++i)
    sum = sum + a[i];
```
# Doing stuff wrong
## What is going wrong here?
```c
temp = 0;
#pragma omp parallel for
for (int i = 0; i < N; ++i) {
temp = 2 * a[i];
b[i] = temp + 4;
}
```
## Solution
```c
temp = 0;
#pragma omp parallel for private(temp)
for (int i = 0; i < N; ++i) {
temp = 2 * a[i];
b[i] = temp + 4;
}
```
The problem is called "data race".
## Other common errors
* Race conditions
  * The outcome of the program depends on the relative timing of multiple threads.
* Deadlocks
  * Multiple threads wait for each other's resources, so none of them can proceed (see the sketch below).
* Inconsistency (FIXME: What is that?)
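A hedged sketch of a deadlock (not part of the course material): two threads take two OpenMP locks in opposite order and can end up waiting for each other forever.

```c
// sketch: two threads acquire two locks in opposite order and may wait forever
#include <omp.h>
#include <stdio.h>

int main(void) {
    omp_lock_t lock_a, lock_b;
    omp_init_lock(&lock_a);
    omp_init_lock(&lock_b);

    #pragma omp parallel num_threads(2)
    {
        if (omp_get_thread_num() == 0) {
            omp_set_lock(&lock_a);
            omp_set_lock(&lock_b);   // waits for thread 1 ...
            omp_unset_lock(&lock_b);
            omp_unset_lock(&lock_a);
        } else {
            omp_set_lock(&lock_b);
            omp_set_lock(&lock_a);   // ... which waits for thread 0
            omp_unset_lock(&lock_a);
            omp_unset_lock(&lock_b);
        }
    }
    // only reached if the timing happened to avoid the deadlock
    printf("no deadlock this time\n");
    return 0;
}
```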
# Finally: A definition of parallelism
"Parallel computing is a type of computation in which many calculations or processes are carried out simultaneously."
Wikipedia
FIXME: Citation correct?!
## Types of parallelism
* Data parallelism (what we've been discussing)
* Task parallelism (Example: Atmosphere ocean coupling)
* Instruction level parallelism (see next week)
## Highlight! {background-color=var(--dark-bg-color)}
# FIXME
* Homework:
  * Do something where you run into hardware constraints (e.g. NUMA, too many threads, ...)
  * Give some example with a race condition or similar and have them find it.
* Add maybe:
  * Are there theoretical concepts like Amdahl's law which we should explain? (I don't like Amdahl)
  * Strong/weak scaling?
## Leftovers from previous talk
Best practices for efficient scaling:
...
...