Commit 842a6c97 authored by Dominik Zobel
Parent: 0cf8eafb

First draft for memory-hierarchies

Merge request !74: Memory hierarchies lecture
```diff
@@ -41,7 +41,7 @@ website:
       - "lectures/parallelism/slides.qmd"
       - "lectures/hardware/slides.qmd"
       - "lectures/file-and-data-systems/slides.qmd"
-      # - "lectures/memory-hierarchies/slides.qmd"
+      - "lectures/memory-hierarchies/slides.qmd"
       # - "lectures/student-talks/slides.qmd"
   - section: "Exercises"
     contents:
```
...
---
title: "Memory Hierarchies"
author: "Dominik Zobel and Florian Ziemen"
---
# Memory Hierarchies
- Background
- Why you should care
- How to use it to your advantage
## Questions {.handson .incremental}
- Why not keep everything in memory?
- What to do with really big data?
- How to speed up processing data?
## Access time example (1/2) {.leftalign}
<!--
File `data.py`:
```{.python}
import time
import pickle
import numpy as np


def create_random_data():
    np.random.seed(3922)
    start = time.perf_counter()
    data = np.random.randint(0, 2**20, size=(128, 128, 128, 128))
    end = time.perf_counter()
    print('{:10.5f}: Create "random" data'.format(end-start))
    return data


def create_42_data():
    start = time.perf_counter()
    data = np.full((128, 128, 128, 128), 42)
    end = time.perf_counter()
    print('{:10.5f}: Create "42" data'.format(end-start))
    return data


def store_data(filename, data, dataname):
    start = time.perf_counter()
    with open(filename, 'wb') as outfile:
        pickle.dump(data, outfile)
    end = time.perf_counter()
    print('{:10.5f}: Store "{:s}" data'.format(end-start, dataname))


def load_data(filename, dataname):
    start = time.perf_counter()
    with open(filename, 'rb') as infile:
        data = pickle.load(infile)
    end = time.perf_counter()
    print('{:10.5f}: Load "{:s}" data'.format(end-start, dataname))
    return data


def operate_on_data(data, dataname):
    start = time.perf_counter()
    new_data = data + 1.1
    end = time.perf_counter()
    print('{:10.5f}: Operate on "{:s}" data'.format(end-start, dataname))
    return new_data
```
File `save_it.py`:
```{.python}
from data import *

dataname = 'random'
data = create_random_data()
store_data(filename='temp01.dat',
           data=data, dataname=dataname)

dataname = '42'
data = create_42_data()
store_data(filename='temp02.dat',
           data=data, dataname=dataname)
```
File `from_disk.py`:
```{.python}
from data import *

dataname = 'random'
data = load_data(filename='temp01.dat',
                 dataname=dataname)
new_data = operate_on_data(data=data,
                           dataname=dataname)

dataname = '42'
data = load_data(filename='temp02.dat',
                 dataname=dataname)
new_data = operate_on_data(data=data,
                           dataname=dataname)
```
File `in_memory.py`:
```{.python}
from data import *

dataname = 'random'
data = create_random_data()
new_data = operate_on_data(data=data,
                           dataname=dataname)

dataname = '42'
data = create_42_data()
new_data = operate_on_data(data=data,
                           dataname=dataname)
```
-->
| | Levante (Fixed) | Local (Fixed) | Levante (Random) | Local (Random) |
| -------------------------- | ----------------- | ----------------- | ----------------- | ----------------- |
| Create data | 0.5607 | 0.2280 | 1.6623 | 0.8759 |
| Load data | 0.7605 | 0.8767 | 0.7615 | 0.9225 |
| Store data | 2.2317 | 4.0427 | 2.2327 | 3.4378 |
| Process data | 0.7618 | 0.3792 | 0.7572 | 0.3747 |
:::{.smaller}
Times in seconds for a 2 GB NumPy array ($128 \times 128 \times 128 \times 128$ entries), filled either with a fixed value (42) or with random values
:::
## Access time example (2/2) {.leftalign}
- Reading from and writing to files is comparatively expensive
- If file I/O is necessary during a computation, try to do it asynchronously
- If possible, keep data in memory
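The second point can be sketched with a background thread: the write to disk overlaps with the next computation. This is a minimal illustration, not part of the lecture code; `store_async` and the temporary filename are made up for the example.

```python
import pickle
import threading

import numpy as np


def store_async(filename, data):
    """Start writing `data` to `filename` in a background thread."""
    def _write():
        with open(filename, 'wb') as outfile:
            pickle.dump(data, outfile)
    thread = threading.Thread(target=_write)
    thread.start()
    return thread


data = np.full((256, 256), 42)
writer = store_async('temp_async.dat', data)
result = data + 1.1   # this computation overlaps with the disk write
writer.join()         # wait before reusing the data or reading the file
```

Note that the caller must `join()` the writer thread before reading the file back or mutating the array, otherwise the overlap turns into a race.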
## Techniques
- Caching
- Prefetching
- Branch prediction
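Caching is not only a hardware mechanism; the same idea works in software. A small sketch (illustrative, not from the lecture material) using Python's `functools.lru_cache` to memoize a recursive function:

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def fib(n):
    """Naive exponential recursion becomes linear once results are cached."""
    if n < 2:
        return n
    return fib(n - 1) + fib(n - 2)


print(fib(80))           # finishes instantly; uncached, this would take ages
print(fib.cache_info())  # hit/miss statistics of the software cache
```

Just as with hardware caches, the speedup comes from answering repeated requests for the same data from a fast store instead of recomputing (or refetching) it.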
## Unavailable data
If data is not available in the current memory level, it has to be fetched from the next, slower level:

- register spilling (register -> cache)
- cache miss (cache -> main memory)
- page fault (main memory -> disk)

In each case the access stalls until the missing data has arrived from the level below.
## External Memory Model
Balance the size of the blocks transferred from the next memory level against the latency of fetching them: larger blocks amortize the per-request latency, as long as the extra data is actually used.
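The same trade-off is visible at the file level: small blocks pay the per-request overhead many times over. A rough sketch (file name and sizes are made up for the example):

```python
import os
import time

# Create a throwaway 16 MB test file (hypothetical setup).
path = 'temp_blocks.dat'
with open(path, 'wb') as f:
    f.write(os.urandom(16 * 1024 * 1024))


def read_time(path, block_size):
    """Time to read the whole file in blocks of `block_size` bytes."""
    start = time.perf_counter()
    with open(path, 'rb') as f:
        while f.read(block_size):
            pass
    return time.perf_counter() - start


for block_size in (512, 4096, 1024 * 1024):
    print(f'{block_size:>8} B blocks: {read_time(path, block_size):.4f} s')
os.remove(path)
```

On most systems the smallest block size is clearly slowest, because the number of read requests, not the number of bytes, dominates the cost.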
## Memory access time model (1/3)
| Cache | Access Time | Hit Ratio |
| ------ | ------------ | ---------- |
| $L_1$ | $T_1$ | $H_1$ |
| $L_2$ | $T_2$ | |
- Parallel and serial requests possible
## Memory access time model (2/3)
| Cache | Access Time | Hit Ratio |
| ------ | ------------ | ---------- |
| $L_1$ | $T_1$ | $H_1$ |
| $L_2$ | $T_2$ | $H_2$ |
| $L_3$ | $T_3$ | |
## Memory access time model (3/3)
- Average memory access time $T_{avg,p}$ for parallel access (processor connected to all caches)
:::{.smaller}
\begin{align}
T_{avg,p} &= H_1 T_1 + ((1-H_1)\cdot H_2)\cdot T_2\\
&+ ((1-H_1)\cdot(1-H_2))\cdot T_3
\end{align}
:::
- Average memory access time $T_{avg,s}$ for serial access
:::{.smaller}
\begin{align}
T_{avg,s} &= H_1 T_1 + ((1-H_1)\cdot H_2)\cdot(T_1+T_2)\\
&+ ((1-H_1)\cdot(1-H_2))\cdot(T_1+T_2+T_3)
\end{align}
:::
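Plugging made-up numbers into the two formulas makes the difference concrete (the latencies and hit ratios below are illustrative, not Levante measurements):

```python
# Hypothetical access times in ns and hit ratios.
T1, T2, T3 = 1.0, 10.0, 100.0
H1, H2 = 0.9, 0.8

# Parallel access: every level is reached directly by the processor.
T_avg_p = H1 * T1 + (1 - H1) * H2 * T2 + (1 - H1) * (1 - H2) * T3

# Serial access: the latencies of the levels tried so far accumulate.
T_avg_s = (H1 * T1 + (1 - H1) * H2 * (T1 + T2)
           + (1 - H1) * (1 - H2) * (T1 + T2 + T3))

print(T_avg_p)  # approx. 3.7 ns
print(T_avg_s)  # approx. 4.0 ns
```

With these numbers, serial access costs about 8% more on average, because every miss also pays the latency of the levels it already probed.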
## Overview/Exercise
- Introduction from the base of the pyramid (file access)
  - example with access from disk and directly from memory
  - background on the memory hierarchy, with a focus on today's hardware instead of its history
- Working our way up the pyramid
  - reference values for the Levante CPU
  - example with optimal and sub-optimal memory access, i.e. cache blocking (see `nproma`)
  - OpenMP reduction (hand implementation), reference/continuation from the parallelism lecture
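Cache blocking can be sketched in a few lines, for instance as a blocked matrix transpose; the block size of 64 below is an arbitrary illustration, not a value tuned for Levante:

```python
import numpy as np


def transpose_blocked(a, block=64):
    """Transpose `a` tile by tile, so each tile stays cache-resident."""
    n, m = a.shape
    out = np.empty((m, n), dtype=a.dtype)
    for i in range(0, n, block):
        for j in range(0, m, block):
            # Both the read and the write touch only a small tile,
            # instead of striding across the whole array.
            out[j:j + block, i:i + block] = a[i:i + block, j:j + block].T
    return out


a = np.arange(6 * 8).reshape(6, 8)
assert (transpose_blocked(a, block=4) == a.T).all()
```

The slicing automatically clips partial tiles at the array edges, so the block size does not have to divide the matrix dimensions.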
## Observations
- A hierarchy is needed because of the gap between processor and (main) memory speeds
  (include image)
- Exploit temporal and spatial locality: data and code that are accessed close together in time, or stored close together in memory, can be served from the faster levels
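Spatial locality can be demonstrated directly in NumPy: a C-ordered array stores its rows contiguously, so summing row by row walks memory sequentially, while summing column by column strides across it (the array size is arbitrary):

```python
import time

import numpy as np

a = np.random.rand(4096, 4096)   # C order: rows are contiguous in memory

start = time.perf_counter()
row_sum = sum(float(a[i, :].sum()) for i in range(a.shape[0]))
t_rows = time.perf_counter() - start

start = time.perf_counter()
col_sum = sum(float(a[:, j].sum()) for j in range(a.shape[1]))
t_cols = time.perf_counter() - start

# Column-wise access is typically slower: each element sits in a
# different cache line, 4096 * 8 bytes away from the previous one.
print(f'rows: {t_rows:.3f} s, columns: {t_cols:.3f} s')
```

Both loops compute the same total, so any timing difference is purely an effect of the access pattern.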
## Memory Pyramid
:::{.r-stack}
![](static/pyramid01.png){.fragment width=70% fragment-index=1}
![](static/pyramid02.png){.fragment width=70% fragment-index=2}
![](static/pyramid03.png){.fragment width=70% fragment-index=3}
![](static/pyramid04.png){.fragment width=70% fragment-index=4}
![](static/pyramid05.png){.fragment width=70% fragment-index=5}
![](static/pyramid06.png){.fragment width=70% fragment-index=6}
:::
## Memory Mountain (1/2)
Read throughput is measured while varying the size of the working set and the access stride:

- Influence of block size
- Influence of stride
- Measurements done on a Levante compute node
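The measurement loop behind such a plot can be approximated in Python (the classic memory mountain is measured in C; the sizes, strides, and repeat count here are illustrative only):

```python
import time

import numpy as np


def read_throughput(nbytes, stride, repeats=5):
    """Best-case MB/s when summing int64 elements `stride` apart."""
    data = np.ones(nbytes // 8, dtype=np.int64)
    view = data[::stride]
    best = float('inf')
    for _ in range(repeats):
        start = time.perf_counter()
        view.sum()   # touches one element per stride
        best = min(best, time.perf_counter() - start)
    return view.size * 8 / best / 1e6


for nbytes in (1 << 14, 1 << 20, 1 << 26):      # 16 KB .. 64 MB working set
    for stride in (1, 8, 64):
        mbs = read_throughput(nbytes, stride)
        print(f'size {nbytes:>9} B, stride {stride:>2}: {mbs:10.1f} MB/s')
```

Scanning size and stride over a grid and plotting the throughput surface yields the "mountain": small working sets served from cache form the peak, large strided accesses to main memory form the valley.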
## Memory Mountain (2/2)
:::{.r-stack}
![](static/memory_mountain.png){width=70%}
:::
:::{.smaller}
$\approx$ factor of 20 between the best and worst access patterns
:::
## Different architectures
- The available caches differ between architectures
- Speed and size of the caches vary
- A basic understanding helps in all cases
- Hardware-specific knowledge allows additional fine-tuning
# Resources {.leftalign}
- "Computer Systems: A Programmer's Perspective" by _R. Bryant_ and _D. O'Hallaron_, Pearson