diff --git a/_quarto.yml b/_quarto.yml index babb008aa838bb155284c1fb755a89c3734c85d8..1d12eb26daa5455f691e66652fba2488ff5c2b65 100644 --- a/_quarto.yml +++ b/_quarto.yml @@ -40,7 +40,7 @@ website: - "lectures/git2/slides.qmd" - "lectures/parallelism/slides.qmd" - "lectures/hardware/slides.qmd" - # - "lectures/file-and-data-systems/slides.qmd" + - "lectures/file-and-data-systems/slides.qmd" # - "lectures/memory-hierarchies/slides.qmd" # - "lectures/student-talks/slides.qmd" - section: "Exercises" @@ -56,7 +56,7 @@ website: - "exercises/git2/exercise.qmd" - "exercises/parallelism/parallelism.qmd" - "exercises/hardware/hardware.qmd" - # - "exercises/file_and_data_systems.qmd" + - "exercises/file-and-data-systems.qmd" # - "exercises/memory_hierarchies.qmd" # - "exercises/student_talks.qmd" diff --git a/exercises/file-and-data-systems.qmd b/exercises/file-and-data-systems.qmd new file mode 100644 index 0000000000000000000000000000000000000000..03181241ae21d76a14e374d78e52955563c3dc22 --- /dev/null +++ b/exercises/file-and-data-systems.qmd @@ -0,0 +1,20 @@ +--- +title: "File and Data Systems" +--- + +Create + +* 10000 files of 1 MB each +* 100 files of 100 MB each +* 1 file of 10 000 MB + +in your `/scratch/` directory on levante. + +* Read them once and measure the durations for the read operations +* Read every file 10x and measure the durations for the read operations +* Repeat the read part of the exercise after a few hours. + +Discuss your results. + +Create a directory `/dev/shm/$USER` (replace `$USER` with your user ID) and repeat the exercise in there - you will probably need about 100 GB RAM to do this (or simply use a compute node) - no delayed repeat needed here (why?). 
+ diff --git a/lectures/file-and-data-systems/slides.qmd b/lectures/file-and-data-systems/slides.qmd new file mode 100644 index 0000000000000000000000000000000000000000..8d72f99f56b0b5dee4dd333db3afa9e010035376 --- /dev/null +++ b/lectures/file-and-data-systems/slides.qmd @@ -0,0 +1,464 @@ +--- +title: "File and Data Systems" +author: "Florian Ziemen and Karl-Hermann Wieners" +--- + +# Storing data + +* Recording of information in a medium + * Deoxyribonucleic acid (DNA) + * Handwriting + * Magnetic tapes + * Hard disks + +# Topics + +* Any storage is finite (except for /dev/null and /dev/zero). +* The pros and cons of different storage media +* Indexing of storage / storage metadata +* Challenges of parallel storage access + +# Quota and permission + +## Quota +* Distributing a scarce resource between users. +* Every user / project gets a specified share. +* Usually no over-commitment. + +```bash +/sw/bin/lfsquota /work/bb1153 +``` +``` +Disk quotas for prj 30001639 (pid 30001639): + Filesystem used quota limit grace files quota limit grace + /work 588.4T 595T 595T - 22140190 0 0 - +``` + +## Permissions + +* Am I allowed to read / write a file? +* How about others? +* See `man ls` and `man chmod` for details for a standard file system. +* Other storage systems can have varying ways of controlling access. + +# Properties of storage systems + +## Latency +* How long does it take until we get the first bit of data? +* Crucial when opening many small files (e.g. starting Python) +* Less crucial when reading one big file start-to-end. +* Largely determined by moving parts in the storage medium. + +## Continuous read / write + +* How much data do we get per second when we are reading continuously? +* Important for reading/writing large blobs of data from/into individual files. 
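The interplay of latency and continuous throughput described above can be illustrated with a very rough cost model: each file access pays the latency once and then streams at the sustained bandwidth. The following is only a sketch; `transfer_time`, the 5 ms latency, and the 200 MB/s bandwidth are illustrative assumptions, not measurements of any particular device or of levante.

```python
def transfer_time(size_bytes, latency_s, bandwidth_bytes_per_s, n_files=1):
    """Rough estimate: every file pays the latency once, then streams its data."""
    return n_files * (latency_s + size_bytes / bandwidth_bytes_per_s)


MB = 10**6
LATENCY = 5e-3        # assumed: 5 ms until the first byte arrives
BANDWIDTH = 200 * MB  # assumed: 200 MB/s sustained read rate

# Same total volume (10 000 MB), split in two different ways.
many_small = transfer_time(1 * MB, LATENCY, BANDWIDTH, n_files=10_000)
one_big = transfer_time(10_000 * MB, LATENCY, BANDWIDTH)

print(f"10000 files of 1 MB: {many_small:.1f} s")
print(f"1 file of 10000 MB:  {one_big:.1f} s")
```

Under these assumed numbers the many-small-files case takes about twice as long as the single large file, because the accumulated latency is as large as the pure streaming time. Caching, parallel access, and file system overhead are deliberately ignored here.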
+ +## Random read / write + +* Mixture of latency and continuous read/write +* Reading many small files / skipping around in files + +## Caching +* Keeping data *in memory* for frequent re-use. +* Usually storage media like disks have small caches with better properties. +* e.g. HDD of 16 TB with 512 MB of RAM cache. +* Operating systems also cache reads. +* Caching writes in RAM is trouble because of the risk of data loss due to power loss / system crash. + +# Hardware types + + + +## Speed vs cost per space + +| Device | Latency | Cont. R/W | Rand. R/W | EUR/TB | +|-| - | - | - | - | +| RAM | 10s of ns | 10s of GB/s | 10s of GB/s | ~ 3000 | +| SSD | 100s of $\mu$s | GB/s | GB/s | ~ 100 | +| HDD | ms | 200 MB/s | MB/s | ~ 10 | +| Tape | minutes | 300 MB/s | minimal | ~ 5 | + + +* All figures based on a quick google search in 06/2024. +* RAM needs electricity to keep the data (*volatile memory*). +* All but tape usually remain powered in an HPC. + +## RAM disk + +* Use RAM as if it were a disk + * `tmpfs` filesystems on levante (`/dev/shm`) +* High speed, low volume, lost on reboot. + +## Solid-state disk/flash drives + +* Non-volatile electronic medium. + - Keeps state (almost) without energy supply. +* High speed, also under random access. + +## Hard disk + +* Basically a stack of modern record players. +* Stack of magnetic disks with read/write heads. +* Spinning to make every point accessible by heads. +* Good for bigger files, not ideal for random access. + +## Tape + +* Spool of magnetizable bands. +* Serialized access only. +* Used for backup / long-term storage. + +## Hands-on {.handson} +::: {.smaller} +{{< embed timer.ipynb echo=true >}} +::: +Take this set of calls, and measure the write speed for different file sizes on your `/scratch/` + +# Storage Architectures + +> These aren't books in which events of the past are pinned like so many butterflies to a cork. These are the books from which history is derived. 
There are more than twenty thousand of them; each one is ten feet high, bound in lead, and the letters are so small that they have to be read with a magnifying glass. + +from "Small Gods" by Terry Pratchett + +## Thou shalt have identifiable data + +* There must be a way to reference the data stored on a medium +* Usual means are (symbolic) names or (numerical) identifiers +* Must be determined at time of data storage +* Either implicit, stored with the data or externally +* Medium is "formatted" to provide required infrastructure + +## The more the merrier + +* Additional information (metadata) may be needed + * Required by the storage architecture + * Support optimized data storage or access + * Defined by users or applications + * Allows indexing of data beyond name or id + * Especially for _Content-Addressed Storage_[^1] [(git2)](/lecture-materials/lectures/git2/slides.html#content-addressable-store) + +[^1]: not the same as _Content-Addressable Memory_ [(data structures)](/lecture-materials/lectures/data-structures/slides.html#dictionaries) + +# File systems (POSIX) + +* Data is organized in "Files" +* Files are grouped in special files, "Directories" +* File data is stored in fixed size blocks +* Focus on consistently managing changing data + +## Blocks + +* Minimal size of data transfer +* Reduce effect of latency by random access +* Read-ahead for sequential processing +* "Sweet spot" between single bytes and big blocks +* Usually a multiple of _device blocks_ + +## Inodes + +* The file's name refers to a metadata table ("inode") + * unique numerical identifier + * times of state change, permissions + * contains actual block locations + +:::{.fragment} +```{.bash code-line-numbers=false} +stat slides.qmd +``` +::: +:::{.fragment} +``` + File: slides.qmd + Size: 6268 Blocks: 16 IO Block: 4194304 regular file +Device: 84f0b5a2h/2230367650d Inode: 144131850904906407 Links: 1 +Access: (0644/-rw-r--r--) Uid: (20472/ m221078) Gid: (32054/ mpiscl) +Access: 
2024-06-21 15:01:18.000000000 +0200 +Modify: 2024-06-21 15:01:18.000000000 +0200 +Change: 2024-06-21 17:29:10.000000000 +0200 + Birth: 2024-06-21 15:01:18.000000000 +0200 +``` +::: + + +## File system in action + +```{.bash code-line-numbers=false} +stat --file-system . +``` +:::{.fragment} +``` + File: "." + ID: 84f0b5a200000000 Namelen: 255 Type: lustre +Block size: 4096 Fundamental block size: 4096 +Blocks: Total: 31118373528 Free: 22547214420 Available: 20963365466 +Inodes: Total: 2465930855 Free: 2227421415 +``` +::: +:::{.fragment} +```{.bash code-line-numbers=false} +df --block-size=4096 . +``` +::: +:::{.fragment} +``` +Filesystem 4K-blocks Used Available Use% Mounted on +10.128.100.149@o2ib2[...]:/home/home 31118373528 8571074099 20963485307 30% /home +``` +::: +:::{.fragment} +```{.bash code-line-numbers=false} +df --inodes . +``` +::: +:::{.fragment} +``` +Filesystem Inodes IUsed IFree IUse% Mounted on +10.128.100.149@o2ib2[...]:/home/home 2465932215 238493275 2227438940 10% /home +``` +::: + +## Directories + +* Directories link names to inode numbers,<br> + ie. they do not "contain" files +* Links to itself (`.`) and the "parent" directory (`..`) +* UNIX has "root" directory (`/`) as global anchor +* Files identified via path in directory tree +* Tree may embed ("mount") different file systems + +## Directories in action + +```{.bash code-line-numbers=false} +pwd +``` +:::{.fragment} +``` +/home/[...]/generic_software_skills/lecture-materials/lectures/file-and-data-systems +``` +::: +:::{.fragment} +```{.bash code-line-numbers=false} +ls -lia +``` +::: +:::{.fragment} +``` +total 24 +144131850904906406 drwxr-xr-x 3 m221078 mpiscl 4096 Jun 21 15:01 . +144131850904906358 drwxr-xr-x 16 m221078 mpiscl 4096 Jun 21 15:01 .. 
+144131850904906407 -rw-r--r-- 1 m221078 mpiscl 6268 Jun 21 15:01 slides.qmd +144131850904906408 drwxr-xr-x 2 m221078 mpiscl 4096 Jun 21 15:01 static +144131850904906413 -rw-r--r-- 1 m221078 mpiscl 1849 Jun 21 15:01 timer.ipynb +``` +::: + +## Links + +* An inode may have more than one name ("hard links") + * More than one directory with same name + * More than one name in one directory +* inode's life time managed by "link count" + * Link count 0 labels inode and blocks as recyclable + +## Links in action + +```{.bash code-line-numbers=false} +ln slides.qmd same_same_but_different +``` +:::{.fragment} +``` +144131850904906407 -rw-r--r-- 2 m221078 mpiscl 6268 Jun 21 15:01 same_same_but_different +144131850904906407 -rw-r--r-- 2 m221078 mpiscl 6268 Jun 21 15:01 slides.qmd +``` +::: +:::{.fragment} +```{.bash code-line-numbers=false} +rm slides.qmd +``` +::: +:::{.fragment} +``` +144131850904906407 -rw-r--r-- 1 m221078 mpiscl 6268 Jun 21 15:01 same_same_but_different +``` +::: +:::{.fragment} +```{.bash code-line-numbers=false} +mv same_same_but_different slides.qmd +``` +::: +:::{.fragment} +``` +144131850904906407 -rw-r--r-- 1 m221078 mpiscl 6268 Jun 21 15:01 slides.qmd +``` +::: + +## Fragmentation + +* Reading is fast if data stays "in line" +* Line breaks when adding blocks later<br> + or re-using blocks of unlinked inodes +* Slowing down due to "jumps" across medium +* Reduced by block range reservations ("Extents") +* _Defragmentation_ tools shuffle inodes but keep IDs + +## Hands-on {.handson} + +Add a similar function to the previous one for reading, and read the data you just produced. What's your throughput? + + + + +# Other architectures + +## Object storage + +* Data is presented as immutable "objects" ("BLOB") +* Each object has a globally unique identifier (eg. UUID or hash) +* Objects may be assigned names, grouped in "buckets" +* Generally supports creation ("put") and retrieval ("get"), can support much more (versioning, etc). 
+* Focus on data distribution and replication, fast read access + + +## Object storage -- metadata + +* Object metadata stored independently of data location + * Provide object migration or replication across devices + * Allows optimizations (e.g. database, indices) + * Metadata, too, may be replicated +* Enables additional or user defined metadata + +## Object storage -- applications + +* Cloud storage services +* Parallelized file systems + +## In-file storage systems + +* Explicit: zip, tar, ... +* Implicit: HDF5, mp4, databases +* Pseudo-Filesystems: git, ... + +## In-file storage -- features + +* Counterbalance latency +* Decrease blocking loss +* Use storage features in application +* Portable across storage systems + + +# Redundancy + +Protection against + +* accidental deletion +* data loss due to hardware failure +* downtimes due to hardware failure + +## Backups +* Keep old states of the file system available. +* Need at least as much space as the (compressed version of the) data being backed up. +* Often low-freq full backups and hi-freq incremental backups + to balance space requirements and restoring time +* Ideally at different locations +* Automate them! + +## RAID + +Combining multiple hard disks into bigger / more secure combinations - often at controller level. + +* RAID 0 distributes the blocks across all disks - more space, but data loss if one fails. +* RAID 1 mirrors one disk on an identical copy. +* RAID 5 is similar to 0, but with one extra disk for (distributed) parity info +* RAID 6 is similar to 5, but with two extra disks for parity info (levante uses 8+2 disks). + + +## Erasure coding + +Similar to RAID, but more flexible with the numbers of disks (more than two *parity* disks are possible). + +* Used in object stores. +* Usually, data is distributed across independent servers for higher availability. +* Requires more computational resources than RAID. 
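The parity idea behind RAID 5 can be sketched with a byte-wise XOR: the parity block is the XOR of all data blocks, so any single missing block equals the XOR of the surviving blocks and the parity. This is a minimal illustration only; `xor_blocks` and the tiny four-byte "disks" are hypothetical, and schemes with more than one parity block (RAID 6, erasure coding) need different codes than plain XOR.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three hypothetical data "disks" holding one block each.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# Simulate losing disk 1 and rebuilding it from the survivors plus parity.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)

print(rebuilt == data[1])
```

The rebuild works because XOR is its own inverse: XOR-ing the parity with all surviving blocks cancels them out and leaves the missing block.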
+ +# Lustre as a parallel file system + +*What if you are not the only one controlling the FS?* + +. . . + +The file system becomes an independent system. + +## File system via network + +* All nodes see the same set of files. +* A set of central servers manages the file system. +* All nodes accessing the Lustre file system run local *clients*. +* Many nodes can write into the same file at the same time (MPI-IO). +* Optimized for high traffic volumes in large files. + +## Metadata and storage servers +* The index is spread over a group of Metadata servers (MDS, 8 for /work on levante). +* The files are spread over another group (40 OSS / 160 OST on levante). +* Every directory is tied to one MDS. +* A file is tied to one or more OSTs. +* An OST contains many hard disks. + +## The work file system of levante in context + + + + + +## Striping + CC-BY-4.0](static/gmd-13-3607-2020-f05-high-res.png){width=80%} + +## Striping -- features + +* Increased bandwidth by parallel reads + * Eventually limited by network interfaces +* More points of failure in one dataset + * Additional redundancy or error correction + + +# Shotgun buffet + +## IPFS (InterPlanetary File System) +* Content addressable storage on the internet (the hash of the data identifies a *file*). +* Distributed index tables. +* You either provide the data online or pay others to do so. +* Automated methods of caching and replication. + + +## fsspec +* Python package that provides pseudo-filesystems on various backends. +* Unifies access to various implementations. 
+ +## Symbolic links (symlinks) + +_Symbolic_ links are inodes referring to _paths_ instead of data + +```bash +ln -s /path/to/happiness happy +``` +``` + File: happy -> /path/to/happiness + Size: 18 Blocks: 0 IO Block: 4096 symbolic link +``` + +## DNA encoding of data + +* Repeating nucleotides are problematic +* Using "Trits" referring to previous nucleotide + +| Previous | 0 | 1 | 2 | +| - | - | - | - | +| Thymine | A | C | G | +| Guanine | T | A | C | +| Cytosine | G | T | A | +| Adenine | C | G | T | + +# Further reading + +* John Harris, Remo Software: History of Storage from Cave Paintings to Electrons<br> + http://www.remosoftware.com/info/history-of-storage-from-cave-paintings-to-electrons/ diff --git a/lectures/file-and-data-systems/static/gmd-13-3607-2020-f05-high-res.pdf b/lectures/file-and-data-systems/static/gmd-13-3607-2020-f05-high-res.pdf new file mode 100644 index 0000000000000000000000000000000000000000..b68280bddb92ab4cf23fe0705ae87649e94730c9 Binary files /dev/null and b/lectures/file-and-data-systems/static/gmd-13-3607-2020-f05-high-res.pdf differ diff --git a/lectures/file-and-data-systems/static/gmd-13-3607-2020-f05-high-res.png b/lectures/file-and-data-systems/static/gmd-13-3607-2020-f05-high-res.png new file mode 100644 index 0000000000000000000000000000000000000000..d7f96fd9d4047fb7154358b19b0d57fce9782bff Binary files /dev/null and b/lectures/file-and-data-systems/static/gmd-13-3607-2020-f05-high-res.png differ diff --git a/lectures/file-and-data-systems/static/network-overview.png b/lectures/file-and-data-systems/static/network-overview.png new file mode 100644 index 0000000000000000000000000000000000000000..66308be8c22eef6ceabf60dc96137c8768a857da Binary files /dev/null and b/lectures/file-and-data-systems/static/network-overview.png differ diff --git a/lectures/file-and-data-systems/static/storage-media.jpg b/lectures/file-and-data-systems/static/storage-media.jpg new file mode 100644 index 
0000000000000000000000000000000000000000..8182a60603acdd82b6deee122a8bb82bc18a0e6c Binary files /dev/null and b/lectures/file-and-data-systems/static/storage-media.jpg differ diff --git a/lectures/file-and-data-systems/timer.ipynb b/lectures/file-and-data-systems/timer.ipynb new file mode 100644 index 0000000000000000000000000000000000000000..f7ccd3c636ef07d49eec531c8165b0984ce53898 --- /dev/null +++ b/lectures/file-and-data-systems/timer.ipynb @@ -0,0 +1,85 @@ +{ + "cells": [ + { + "cell_type": "code", + "execution_count": 1, + "id": "87ef333a-e287-4138-bc63-339da6fe64a3", + "metadata": {}, + "outputs": [], + "source": [ + "import getpass\n", + "import os\n", + "import timeit" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "a32be89d-ebe9-43de-b259-0e96c11ecc56", + "metadata": {}, + "outputs": [], + "source": [ + "user = getpass.getuser()\n", + "destination = f\"/scratch/{user[0]}/{user}\"" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "bcf39570-a5bc-4af4-9354-06a801222826", + "metadata": {}, + "outputs": [], + "source": [ + "def run_write (path, length, blocksize=1024**2):\n", + " (times, remainder) = divmod(length, blocksize)\n", + " data = bytearray(blocksize)\n", + " with open(path, \"wb\") as of:\n", + " for i in range(times):\n", + " of.write(data)\n", + " if remainder:\n", + " of.write(bytearray(remainder))\n" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "34a59ea9-52c9-4c3f-8664-a5a97882e5e0", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{'1GB': 0.4544727400643751}\n" + ] + } + ], + "source": [ + "duration = {}\n", + "duration['1GB'] = timeit.timeit(lambda : run_write(f\"{destination}/test.1GB\", (1024**3)), number = 1)\n", + "print(duration)" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "0 Python 3 (based on the module python3/unstable", + "language": "python", + "name": "python3_unstable" + }, + "language_info": { + 
"codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.10" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +}