---
title: "Memory Hierarchies"
author: "Dominik Zobel and Florian Ziemen"
---

# Memory Hierarchies

 - Background
 - Why you should care
 - How to use it to your advantage



## Questions {.handson .incremental}

 - Why not keep everything in memory?
 - What to do with really big data?
 - How to speed up processing data?



## Processor speed vs. memory speed

![](static/speed.png){width=70%}

:::{.smaller}
Based on figure from "Computer Architecture" by _J. Hennessy_ and _D. Patterson_
:::



## Access time example (1/2) {.leftalign}

<!--
File `data.py`:

```{.python}
import time
import pickle
import numpy as np


def create_random_data():
    np.random.seed(3922)
    start = time.perf_counter()
    data = np.random.randint(0, 2**20, size=(128, 128, 128, 128))
    end = time.perf_counter()
    print('{:10.5f}:   Create "random" data'.format(end-start))
    return data


def create_42_data():
    start = time.perf_counter()
    data = np.full((128, 128, 128, 128), 42)
    end = time.perf_counter()
    print('{:10.5f}:   Create "42" data'.format(end-start))
    return data


def store_data(filename, data, dataname):
    start = time.perf_counter()
    with open(filename, 'wb') as outfile:
        pickle.dump(data, outfile)

    end = time.perf_counter()
    print('{:10.5f}:   Store "{:s}" data'.format(end-start, dataname))


def load_data(filename, dataname):
    start = time.perf_counter()
    with open(filename, 'rb') as infile:
        data = pickle.load(infile)

    end = time.perf_counter()
    print('{:10.5f}:   Load "{:s}" data'.format(end-start, dataname))
    return data


def operate_on_data(data, dataname):
    start = time.perf_counter()
    new_data = data + 1.1
    end = time.perf_counter()
    print('{:10.5f}:   Operate on "{:s}" data'.format(end-start, dataname))
    return new_data
```

File `save_it.py`:

```{.python}
from data import *

dataname = 'random'
data = create_random_data()
store_data(filename='temp01.dat',
    data=data, dataname=dataname)

dataname = '42'
data = create_42_data()
store_data(filename='temp02.dat',
    data=data, dataname=dataname)
```

File `from_disk.py`:

```{.python}
from data import * 

dataname = 'random'
data = load_data(filename='temp01.dat',
    dataname=dataname)
new_data = operate_on_data(data=data,
    dataname=dataname)

dataname = '42'
data = load_data(filename='temp02.dat',
    dataname=dataname)
new_data = operate_on_data(data=data,
    dataname=dataname)
```

File `in_memory.py`:

```{.python}
from data import *

dataname = 'random'
data = create_random_data()
new_data = operate_on_data(data=data,
    dataname=dataname)

dataname = '42'
data = create_42_data()
new_data = operate_on_data(data=data,
    dataname=dataname)
```
-->

|                            | Levante (Fixed)   | Local (Fixed)     | Levante (Random)  | Local (Random)    |
| -------------------------- | ----------------- | ----------------- | ----------------- | ----------------- |
| Create data                | 0.5607            | 0.2280            | 1.6623            | 0.8759            |
| Load data                  | 0.7605            | 0.8767            | 0.7615            | 0.9225            |
| Store data                 | 2.2317            | 4.0427            | 2.2327            | 3.4378            |
| Process data               | 0.7618            | 0.3792            | 0.7572            | 0.3747            |

:::{.smaller}
Time in seconds using a 2 GB numpy array ($128 \times 128 \times 128 \times 128$) either with a fixed number or random number in each entry
:::



## Access time example (2/2) {.leftalign}

 - Reading/writing to file is rather expensive
 - If necessary during computation, try doing it asynchronously
 - If possible, keep data in memory



## Techniques

 - Caching
 - Prefetching
 - Branch prediction



## Unavailable data

If data is not available in the current memory level

 - register spilling (register -> cache)
 - cache missing     (cache -> main memory)
 - page fault        (main memory -> disk)

Miss description



## External Memory Model

balance block size to get from next level to latency it takes to get it



## Memory access time model (1/3)

| Cache  | Access Time  | Hit Ratio  |
| ------ | ------------ | ---------- |
| $L_1$  | $T_1$        | $H_1$      |
| $L_2$  | $T_2$        |            |

 - Parallel and serial requests possible



## Memory access time model (2/3)

| Cache  | Access Time  | Hit Ratio  |
| ------ | ------------ | ---------- |
| $L_1$  | $T_1$        | $H_1$      |
| $L_2$  | $T_2$        | $H_2$      |
| $L_3$  | $T_3$        |            |



## Memory access time model (3/3)

 - Average memory access time $T_{avg,p}$ for parallel access (processor connected to all caches)

:::{.smaller}
\begin{align}
T_{avg,p} &= H_1 T_1 + ((1-H_1)\cdot H_2)\cdot T_2\\
          &+ ((1-H_1)\cdot(1-H_2))\cdot T_3
\end{align}
:::

 - Average memory access time $T_{avg,s}$ for serial access

:::{.smaller}
\begin{align}
T_{avg,s} &= H_1 T_1 + ((1-H_1)\cdot H_2)\cdot(T_1+T_2)\\
          &+ ((1-H_1)\cdot(1-H_2))\cdot(T_1+T_2+T_3)
\end{align}
:::



## Overview/Exercise

 - Introduction from base of pyramid (file access)
    - example with access from disk and directly from memory
    - background on memory hierarchy with focus on today instead of history

 - Working our way up the pyramid
    - reference values for Levante CPU
    - example with optimal and sub-optimal memory access, i.e. cache blocking (see `nproma`)
    - openmp reduction (hand implementation), reference/continuation from parallelism lecture



## Observations

 - Gap between processor and memory speeds.
   Hierarchy needed because of discrepancy between speed of CPU and (main) memory
   (include image)

 - exploit accessing data and code stored close to each other (temporal and spatial locality)



## Memory Pyramid (upwards)


:::{.r-stack}

![](static/pyramid01.png){.fragment width=70% fragment-index=1}

![](static/pyramid02.png){.fragment width=70% fragment-index=2}

![](static/pyramid03.png){.fragment width=70% fragment-index=3}

![](static/pyramid04.png){.fragment width=70% fragment-index=4}

![](static/pyramid05.png){.fragment width=70% fragment-index=5}

![](static/pyramid06.png){.fragment width=70% fragment-index=6}

:::



## Memory Pyramid (downwards) - check numbers {auto-animate=true}

Order of magnitude on Levante (AMD EPYC 7763)

- Base frequency: 2.45 GHz


## Memory Pyramid (downwards) - check numbers {auto-animate=true}

Order of magnitude on Levante (AMD EPYC 7763)

               Access time (ns)    Capacity
------------   -----------------   ---------
Register       ~0.4                1 KB


:::{.incremental}

 - L1 Cache about 3 times slower

:::


## Memory Pyramid (downwards) - check numbers {auto-animate=true}

Order of magnitude on Levante (AMD EPYC 7763)

               Access time (ns)    Capacity
------------   -----------------   ---------
Register       ~0.4                1 KB
L1 Cache       ~1.1                32 KB

:::{.incremental}

 - Same factors for L2 and L3

:::


## Memory Pyramid (downwards) - check numbers {auto-animate=true}

Order of magnitude on Levante (AMD EPYC 7763)

               Access time (ns)    Capacity
------------   -----------------   ---------
Register       ~0.4                1 KB
L1 Cache       ~1.1                32 KB
L2 Cache       ~3.3                512 KB
L3 Cache       ~12.8               32 MB

:::{.incremental}

 - 256 GB of main memory (default) with a theoretical memory bandwidth of ~200 GB/s

:::



## Memory Pyramid (downwards) - check numbers {auto-animate=true}

Order of magnitude on Levante (AMD EPYC 7763)

               Access time (ns)    Capacity
------------   -----------------   ---------
Register       ~0.4                1 KB
L1 Cache       ~1.1                32 KB
L2 Cache       ~3.3                512 KB
L3 Cache       ~12.8               32 MB
Main Memory    60                  256 GB

:::{.incremental}

 - Fast Data as Flash based file system

:::



## Memory Pyramid (downwards) - check numbers {auto-animate=true}

Order of magnitude on Levante (AMD EPYC 7763)

               Access time (ns)    Capacity
------------   -----------------   ---------
Register       ~0.4                1 KB
L1 Cache       ~1.1                32 KB
L2 Cache       ~3.3                512 KB
L3 Cache       ~12.8               32 MB
Main Memory    60                  256 GB
SSD            100                 1 PB

:::{.incremental}

 - File system at Levante ~130 PB, limited by quota for project

:::



## Memory Pyramid (downwards) - check numbers {auto-animate=true}

Order of magnitude on Levante (AMD EPYC 7763)

               Access time (ns)    Capacity
------------   -----------------   ---------
Register       ~0.4                1 KB
L1 Cache       ~1.1                32 KB
L2 Cache       ~3.3                512 KB
L3 Cache       ~12.8               32 MB
Main Memory    ~60                 256 GB
SSD            ~100                1 PB
Hard disk      ~3000               130 PB

:::{.incremental}

 - Tape backup also possible

:::



## Memory Mountain (1/2)

Describe what is done

 - Influence of block size
 - Influence of stride
 - Measurements done on Levante compute node



## Memory Mountain (2/2)

:::{r-stack}

![](static/memory_mountain.png){width=70%}

:::

:::{.smaller}
$\approx$ Factor 20 between best and worst access
:::



## Different architectures

 - Different caches are available
 - Speed and size of caches varies
 - Basic understanding helps in all cases
 - Hardware-specific knowledge allows additional fine-tuning



# Resources {.leftalign}

 - "Computer Systems: A Programmer's Perspective" by _R. Bryant_ and _D. O'Hallaron_, Pearson
 - "Computer Architecture" by _J. Hennessy_ and _D. Patterson_, O'Reilly