Commit 0cf8eafb authored by Florian Ziemen
Merge branch 'file-and-data-systems' into 'main'

File and Data Systems

See merge request !11
parents d3345ad4 fafdebe3
@@ -40,7 +40,7 @@ website:
 - "lectures/git2/slides.qmd"
 - "lectures/parallelism/slides.qmd"
 - "lectures/hardware/slides.qmd"
-# - "lectures/file-and-data-systems/slides.qmd"
+- "lectures/file-and-data-systems/slides.qmd"
 # - "lectures/memory-hierarchies/slides.qmd"
 # - "lectures/student-talks/slides.qmd"
 - section: "Exercises"
@@ -56,7 +56,7 @@ website:
 - "exercises/git2/exercise.qmd"
 - "exercises/parallelism/parallelism.qmd"
 - "exercises/hardware/hardware.qmd"
-# - "exercises/file_and_data_systems.qmd"
+- "exercises/file-and-data-systems.qmd"
 # - "exercises/memory_hierarchies.qmd"
 # - "exercises/student_talks.qmd"
exercises/file-and-data-systems.qmd
---
title: "File and Data Systems"
---
Create
* 10000 files of 1 MB each
* 100 files of 100 MB each
* 1 file of 10 000 MB
in your `/scratch/` directory on levante.
* Read them once and measure the durations for the read operations
* Read every file 10x and measure the durations for the read operations
* Repeat the read part of the exercise after a few hours.
Discuss your results.
Create a directory `/dev/shm/$USER` (replace `$USER` with your user ID) and repeat the exercise in there. You will probably need about 100 GB of RAM to do this (or simply use a compute node). No delayed repeat is needed here (why?).
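A minimal Python sketch for the timing part, using only the standard library. The directory path, file names, and chunk size below are illustrative choices, not part of the exercise; adjust them to your own `/scratch/` directory and reduce the counts for a quick test.

```python
import os
import time

def write_files(directory, count, size_mb, blocksize=1024**2):
    """Create `count` files of `size_mb` MB each in `directory`."""
    os.makedirs(directory, exist_ok=True)
    block = bytes(blocksize)  # one MB of zeros, written size_mb times per file
    for i in range(count):
        with open(f"{directory}/file_{i:05d}", "wb") as f:
            for _ in range(size_mb):
                f.write(block)

def read_files(directory, blocksize=1024**2):
    """Read every file in `directory` once and return the duration in seconds."""
    start = time.perf_counter()
    for name in sorted(os.listdir(directory)):
        with open(f"{directory}/{name}", "rb") as f:
            while f.read(blocksize):
                pass
    return time.perf_counter() - start

# Hypothetical usage (path made up, replace with your own scratch directory):
# write_files("/scratch/x/x000000/many_small", count=10000, size_mb=1)
# print(read_files("/scratch/x/x000000/many_small"))
```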
---
title: "File and Data Systems"
author: "Florian Ziemen and Karl-Hermann Wieners"
---
# Storing data
* Recording of information in a medium
* Deoxyribonucleic acid (DNA)
* Hand writing
* Magnetic tapes
* Hard disks
# Topics
* Any storage is finite (except for /dev/null and /dev/zero).
* The pros and cons of different storage media
* Indexing of storage / storage metadata
* Challenges of parallel storage access
# Quota and permission
## Quota
* Distributing a scarce resource between users.
* Every user / project gets a specified share.
* Usually no over-commitment.
```bash
/sw/bin/lfsquota /work/bb1153
```
```
Disk quotas for prj 30001639 (pid 30001639):
Filesystem used quota limit grace files quota limit grace
/work 588.4T 595T 595T - 22140190 0 0 -
```
## Permissions
* Am I allowed to read / write a file?
* How about others?
* See `man ls` and `man chmod` for details for a standard file system.
* Other storage systems can have varying ways of controlling access.
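As a small illustration (not from the original slides), the permission bits can also be inspected from Python; `stat.filemode` renders them in the familiar `ls -l` notation. The file name is just an example.

```python
import os
import stat

st = os.stat("slides.qmd")        # any existing file works here
print(stat.filemode(st.st_mode))  # e.g. -rw-r--r--
print(st.st_uid, st.st_gid)       # numerical owner and group IDs
```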
# Properties of storage systems
## Latency
* How long does it take until we get the first bit of data?
* Crucial when opening many small files (e.g. starting Python)
* Less crucial when reading one big file start-to-end.
* Largely determined by moving parts in the storage medium.
## Continuous read / write
* How much data do we get per second when we are reading continuously?
* Important for reading/writing large blobs of data from/into individual files.
## Random read / write
* Mixture of latency and continuous read/write
* Reading many small files / skipping around in files
## Caching
* Keeping data *in memory* for frequent re-use.
* Storage media like disks usually have small caches with better performance characteristics.
* e.g. HDD of 16 TB with 512 MB of RAM cache.
* Operating systems also cache reads.
* Caching writes in RAM is risky because data can be lost in a power failure or system crash.
# Hardware types
![](static/storage-media.jpg)
## Speed vs cost per space
| Device | Latency | Cont. R/W | Rand. R/W | EUR/TB |
|-| - | - | - | - |
| RAM | 10s of ns | 10s of GB/s | 10s of GB/s | ~ 3000 |
| SSD | 100s of $\mu$s | GB/s | GB/s | ~ 100 |
| HDD | ms | 200 MB/s | MB/s | ~ 10 |
| Tape | minutes | 300 MB/s | minimal | ~ 5 |
* All figures based on a quick Google search in 06/2024.
* RAM needs electricity to keep the data (*volatile memory*).
* All but tape usually remain powered in an HPC.
## RAM disk
* Use RAM as if it were a disk
* `tmpfs` filesystems on levante (`/dev/shm`)
* High speed, low volume, lost on reboot.
## Solid-state disk/flash drives
* Non-volatile electronic medium.
- Keeps state (almost) without energy supply.
* High speed, also under random access.
## Hard disk
* Basically a stack of modern record players.
* Stack of magnetic disks with read/write heads.
* Spinning to make every point accessible by heads.
* Good for bigger files, not ideal for random access.
## Tape
* Spool of magnetizable bands.
* Serialized access only.
* Used for backup / long-term storage.
## Hands-on {.handson}
::: {.smaller}
{{< embed timer.ipynb echo=true >}}
:::
Take this set of calls, and measure the write speed for different file sizes on your `/scratch/`
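One possible way to drive the `run_write` function from the embedded notebook over several sizes; the size list and the throughput print are a sketch, not part of the original material.

```python
# Assumes `run_write` and `destination` from the embedded timer.ipynb are defined.
import timeit

sizes = {"10MB": 10 * 1024**2, "100MB": 100 * 1024**2, "1GB": 1024**3}
duration = {}
for label, size in sizes.items():
    duration[label] = timeit.timeit(
        lambda: run_write(f"{destination}/test.{label}", size), number=1
    )
    print(label, f"{size / duration[label] / 1024**2:.0f} MB/s")
```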
# Storage Architectures
> These aren't books in which events of the past are pinned like so many butterflies to a cork. These are the books from which history is derived. There are more than twenty thousand of them; each one is ten feet high, bound in lead, and the letters are so small that they have to be read with a magnifying glass.
from "Small Gods" by Terry Pratchett
## Thou shalt have identifiable data
* There must be a way to reference the data stored on a medium
* Usual means are (symbolic) names or (numerical) identifiers
* Must be determined at time of data storage
* Either implicit, stored with the data or externally
* Medium is "formatted" to provide required infrastructure
## The more the merrier
* Additional information (metadata) may be needed
* Required by the storage architecture
* Support optimized data storage or access
* Defined by users or applications
* Allows indexing of data beyond name or id
* Especially for _Content-Addressed Storage_[^1] [(git2)](/lecture-materials/lectures/git2/slides.html#content-addressable-store)
[^1]: not the same as _Content-Addressable Memory_ [(data structures)](/lecture-materials/lectures/data-structures/slides.html#dictionaries)
# File systems (POSIX)
* Data is organized in "Files"
* Files are grouped in special files, "Directories"
* File data is stored in fixed size blocks
* Focus on consistently managing changing data
## Blocks
* Minimal size of data transfer
* Reduce the effect of latency for random access
* Read-ahead for sequential processing
* "Sweet spot" between single bytes and big blocks
* Usually a multiple of _device blocks_
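As an aside (not in the original slides), the block size and the number of allocated blocks are visible through `os.stat` on Unix-like systems; the file name is just an example.

```python
import os

st = os.stat("slides.qmd")
print(st.st_size)     # logical file size in bytes
print(st.st_blksize)  # preferred I/O block size of the file system
print(st.st_blocks)   # number of 512-byte units actually allocated
```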
## Inodes
* The file's name refers to a metadata table ("inode")
* unique numerical identifier
* times of state change, permissions
* contains actual block locations
:::{.fragment}
```{.bash code-line-numbers=false}
stat slides.qmd
```
:::
:::{.fragment}
```
File: slides.qmd
Size: 6268 Blocks: 16 IO Block: 4194304 regular file
Device: 84f0b5a2h/2230367650d Inode: 144131850904906407 Links: 1
Access: (0644/-rw-r--r--) Uid: (20472/ m221078) Gid: (32054/ mpiscl)
Access: 2024-06-21 15:01:18.000000000 +0200
Modify: 2024-06-21 15:01:18.000000000 +0200
Change: 2024-06-21 17:29:10.000000000 +0200
Birth: 2024-06-21 15:01:18.000000000 +0200
```
:::
## File system in action
```{.bash code-line-numbers=false}
stat --file-system .
```
:::{.fragment}
```
File: "."
ID: 84f0b5a200000000 Namelen: 255 Type: lustre
Block size: 4096 Fundamental block size: 4096
Blocks: Total: 31118373528 Free: 22547214420 Available: 20963365466
Inodes: Total: 2465930855 Free: 2227421415
```
:::
:::{.fragment}
```{.bash code-line-numbers=false}
df --block-size=4096 .
```
:::
:::{.fragment}
```
Filesystem 4K-blocks Used Available Use% Mounted on
10.128.100.149@o2ib2[...]:/home/home 31118373528 8571074099 20963485307 30% /home
```
:::
:::{.fragment}
```{.bash code-line-numbers=false}
df --inodes .
```
:::
:::{.fragment}
```
Filesystem Inodes IUsed IFree IUse% Mounted on
10.128.100.149@o2ib2[...]:/home/home 2465932215 238493275 2227438940 10% /home
```
:::
## Directories
* Directories link names to inode numbers,<br>
i.e. they do not "contain" files
* Links to itself (`.`) and the "parent" directory (`..`)
* UNIX has "root" directory (`/`) as global anchor
* Files identified via path in directory tree
* Tree may embed ("mount") different file systems
## Directories in action
```{.bash code-line-numbers=false}
pwd
```
:::{.fragment}
```
/home/[...]/generic_software_skills/lecture-materials/lectures/file-and-data-systems
```
:::
:::{.fragment}
```{.bash code-line-numbers=false}
ls -lia
```
:::
:::{.fragment}
```
total 24
144131850904906406 drwxr-xr-x 3 m221078 mpiscl 4096 Jun 21 15:01 .
144131850904906358 drwxr-xr-x 16 m221078 mpiscl 4096 Jun 21 15:01 ..
144131850904906407 -rw-r--r-- 1 m221078 mpiscl 6268 Jun 21 15:01 slides.qmd
144131850904906408 drwxr-xr-x 2 m221078 mpiscl 4096 Jun 21 15:01 static
144131850904906413 -rw-r--r-- 1 m221078 mpiscl 1849 Jun 21 15:01 timer.ipynb
```
:::
## Links
* An inode may have more than one name ("hard links")
* The same name in more than one directory
* More than one name in one directory
* inode's life time managed by "link count"
* Link count 0 labels inode and blocks as recyclable
## Links in action
```{.bash code-line-numbers=false}
ln slides.qmd same_same_but_different
```
:::{.fragment}
```
144131850904906407 -rw-r--r-- 2 m221078 mpiscl 6268 Jun 21 15:01 same_same_but_different
144131850904906407 -rw-r--r-- 2 m221078 mpiscl 6268 Jun 21 15:01 slides.qmd
```
:::
:::{.fragment}
```{.bash code-line-numbers=false}
rm slides.qmd
```
:::
:::{.fragment}
```
144131850904906407 -rw-r--r-- 1 m221078 mpiscl 6268 Jun 21 15:01 same_same_but_different
```
:::
:::{.fragment}
```{.bash code-line-numbers=false}
mv same_same_but_different slides.qmd
```
:::
:::{.fragment}
```
144131850904906407 -rw-r--r-- 1 m221078 mpiscl 6268 Jun 21 15:01 slides.qmd
```
:::
## Fragmentation
* Reading is fast if data stays "in line"
* Line breaks when adding blocks later<br>
or re-using blocks of unlinked inodes
* Slowing down due to "jumps" across medium
* Reduced by block range reservations ("Extents")
* _Defragmentation_ tools shuffle inodes but keep IDs
## Hands-on {.handson}
Add a similar function to the previous one for reading, and read the data you just produced. What's your throughput?
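A possible shape for such a reading function, mirroring `run_write` from the notebook; this is a sketch, not the reference solution, and the commented usage assumes the file written in the previous hands-on.

```python
import timeit

def run_read(path, blocksize=1024**2):
    # Read the file in chunks of `blocksize` bytes until end of file.
    with open(path, "rb") as f:
        while f.read(blocksize):
            pass

# Hypothetical usage:
# t = timeit.timeit(lambda: run_read(f"{destination}/test.1GB"), number=1)
# print(f"{1024**3 / t / 1024**2:.0f} MB/s")
```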
# Other architectures
## Object storage
* Data is presented as immutable "objects" ("BLOB")
* Each object has a globally unique identifier (e.g. UUID or hash)
* Objects may be assigned names, grouped in "buckets"
* Generally supports creation ("put") and retrieval ("get"); can support much more (versioning, etc.).
* Focus on data distribution and replication, fast read access
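A toy model of the put/get interface, not any real object-store API: objects are immutable blobs addressed by a hash of their content.

```python
import hashlib

class ToyObjectStore:
    """Minimal in-memory model: put() returns an object ID, get() retrieves the blob."""

    def __init__(self):
        self._objects = {}

    def put(self, data: bytes) -> str:
        object_id = hashlib.sha256(data).hexdigest()  # content-derived identifier
        self._objects[object_id] = data               # objects are stored as immutable blobs
        return object_id

    def get(self, object_id: str) -> bytes:
        return self._objects[object_id]

store = ToyObjectStore()
oid = store.put(b"some model output")
assert store.get(oid) == b"some model output"
```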
## Object storage -- metadata
* Object metadata is stored independently of the data location
* Provide object migration or replication across devices
* Allows optimizations (e.g. databases, indices)
* Metadata, too, may be replicated
* Enables additional or user-defined metadata
## Object storage -- applications
* Cloud storage services
* Parallelized file systems
## In-file storage systems
* Explicit: zip, tar, ...
* Implicit: HDF5, mp4, databases
* Pseudo-Filesystems: git, ...
## In-file storage -- features
* Counterbalance latency
* Decrease blocking loss
* Use storage features in application
* Portable across storage systems
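As a small illustration of an explicit in-file storage system, Python's standard library can treat a zip archive like a miniature file system; the archive and member names below are made up.

```python
import zipfile

# Write two "files" into a single archive on the surrounding file system.
with zipfile.ZipFile("archive.zip", "w") as zf:
    zf.writestr("config.txt", "resolution=0.1deg\n")
    zf.writestr("data/readme.md", "# example payload\n")

# List and read members without unpacking: one file on the outer file
# system, many logical files inside it.
with zipfile.ZipFile("archive.zip") as zf:
    print(zf.namelist())
    print(zf.read("config.txt").decode())
```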
# Redundancy
Protection against
* accidental deletion
* data loss due to hardware failure
* downtimes due to hardware failure
## Backups
* Keep old states of the file system available.
* Need at least as much space as the (compressed version of the) data being backed up.
* Often low-frequency full backups and high-frequency incremental backups
to balance space requirements and restore time
* Ideally at different locations
* Automate them!
## RAID
Combining multiple hard disks into larger / more resilient volumes, often at the controller level.
* RAID 0 distributes the blocks across all disks - more space, but data loss if one fails.
* RAID 1 mirrors one disk on an identical copy.
* RAID 5 is similar to 0, but with one extra disk for (distributed) parity info
* RAID 6 is similar to 5, but with two extra disks for parity info (levante uses 8+2 disks).
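The parity idea behind RAID 5 can be sketched with a bytewise XOR; this is an illustration only, real controllers work on blocks and stripes.

```python
def xor_parity(blocks):
    # Parity block: bytewise XOR of all data blocks (assumed equal length).
    parity = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            parity[i] ^= byte
    return bytes(parity)

data = [b"AAAA", b"BBBB", b"CCCC"]  # three data "disks"
parity = xor_parity(data)

# If one disk fails, XOR of the survivors and the parity reconstructs it.
restored = xor_parity([data[0], data[2], parity])
assert restored == data[1]
```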
## Erasure coding
Similar to RAID, but more flexible in the number of disks (more than two *parity* disks are possible).
* Used in object stores.
* Usually, data is distributed across independent servers for higher availability.
* Requires more computational resources than RAID.
# Lustre as a parallel file system
*What if you are not the only one controlling the FS?*
. . .
The file system becomes an independent system.
## File system via network
* All nodes see the same set of files.
* A set of central servers manages the file system.
* All nodes accessing the Lustre file system run local *clients*.
* Many nodes can write into the same file at the same time (MPI-IO).
* Optimized for high traffic volumes in large files.
## Metadata and storage servers
* The index is spread over a group of Metadata servers (MDS, 8 for /work on levante).
* The files are spread over another group (40 OSS / 160 OST on levante).
* Every directory is tied to one MDS.
* A file is tied to one or more OSTs.
* An OST contains many hard disks.
## The work file system of levante in context
![](static/network-overview.png)
## Striping
![[Zheng et al. 2020](https://doi.org/10.5194/gmd-13-3607-2020) CC-BY-4.0](static/gmd-13-3607-2020-f05-high-res.png){width=80%}
## Striping -- features
* Increased bandwidth by parallel reads
* Eventually limited by network interfaces
* More points of failure in one dataset
* Additional redundancy or error correction
# Shotgun buffet
## IPFS (InterPlanetary File System)
* Content addressable storage on the internet (the hash of the data identifies a *file*).
* Distributed index tables.
* You either provide the data online or pay others to do so.
* Automated methods of caching and replication.
## fsspec
* Python package that provides pseudo-filesystems on various backends.
* Unifies access to various implementations.
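A minimal sketch of the unified interface, using fsspec's in-memory backend so it runs without any external storage; the paths are made up. The same calls (`open`, `ls`, `cat`) are meant to work for other backends such as local files or object stores.

```python
import fsspec

fs = fsspec.filesystem("memory")  # in-memory backend, no external dependencies

# Write through the generic file-system interface.
with fs.open("/demo/data.txt", "wb") as f:
    f.write(b"hello storage")

print(fs.ls("/demo", detail=False))  # list entries by name
print(fs.cat("/demo/data.txt"))      # read the object back as bytes
```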
## Symbolic links (symlinks)
_Symbolic_ links are inodes referring to _paths_ instead of data
```bash
ln -s /path/to/happiness happy
```
```
File: happy -> /path/to/happiness
Size: 18 Blocks: 0 IO Block: 4096 symbolic link
```
## DNA encoding of data
* Repeating nucleotides are problematic
* Using "Trits" referring to previous nucleotide
| Previous | 0 | 1 | 2 |
| - | - | - | - |
| Thymine | A | C | G |
| Guanine | T | A | C |
| Cytosine | G | T | A |
| Adenine | C | G | T |
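A short sketch of the encoding rule in the table above: each trit selects the next nucleotide depending on the previous one, so the same nucleotide never repeats. The starting nucleotide "A" is an arbitrary choice for the example.

```python
# Next nucleotide for trits 0, 1, 2, given the previous nucleotide (table above).
NEXT = {
    "T": "ACG",
    "G": "TAC",
    "C": "GTA",
    "A": "CGT",
}

def encode_trits(trits, previous="A"):
    # Map each trit to a nucleotide that differs from its predecessor.
    out = []
    for trit in trits:
        previous = NEXT[previous][trit]
        out.append(previous)
    return "".join(out)

print(encode_trits([0, 1, 2, 0, 0]))  # -> "CTGTA", no repeated nucleotides
```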
# Further reading
* John Harris, Remo Software: History of Storage from Cave Paintings to Electrons<br>
http://www.remosoftware.com/info/history-of-storage-from-cave-paintings-to-electrons/
Files added:
* lectures/file-and-data-systems/static/gmd-13-3607-2020-f05-high-res.png (113 KiB)
* lectures/file-and-data-systems/static/network-overview.png (121 KiB)
* lectures/file-and-data-systems/static/storage-media.jpg (720 KiB)
%% Cell type:code id:87ef333a-e287-4138-bc63-339da6fe64a3 tags:
``` python
import getpass
import os
import timeit
```
%% Cell type:code id:a32be89d-ebe9-43de-b259-0e96c11ecc56 tags:
``` python
user = getpass.getuser()
destination = f"/scratch/{user[0]}/{user}"
```
%% Cell type:code id:bcf39570-a5bc-4af4-9354-06a801222826 tags:
``` python
def run_write(path, length, blocksize=1024**2):
    # Write `length` bytes of zeros to `path` in chunks of `blocksize` bytes.
    (times, remainder) = divmod(length, blocksize)
    data = bytearray(blocksize)
    with open(path, "wb") as of:
        for i in range(times):
            of.write(data)
        if remainder:
            of.write(bytearray(remainder))
```
%% Cell type:code id:34a59ea9-52c9-4c3f-8664-a5a97882e5e0 tags:
``` python
duration = {}
duration['1GB'] = timeit.timeit(lambda : run_write(f"{destination}/test.1GB", (1024**3)), number = 1)
print(duration)
```
%% Output
{'1GB': 0.4544727400643751}