Commit 0cf8eafb authored by Florian Ziemen
Merge branch 'file-and-data-systems' into 'main'

File and Data Systems

See merge request !11
parents d3345ad4 fafdebe3
@@ -40,7 +40,7 @@ website:
 - "lectures/git2/slides.qmd"
 - "lectures/parallelism/slides.qmd"
 - "lectures/hardware/slides.qmd"
-# - "lectures/file-and-data-systems/slides.qmd"
+- "lectures/file-and-data-systems/slides.qmd"
 # - "lectures/memory-hierarchies/slides.qmd"
 # - "lectures/student-talks/slides.qmd"
 - section: "Exercises"
@@ -56,7 +56,7 @@ website:
 - "exercises/git2/exercise.qmd"
 - "exercises/parallelism/parallelism.qmd"
 - "exercises/hardware/hardware.qmd"
-# - "exercises/file_and_data_systems.qmd"
+- "exercises/file-and-data-systems.qmd"
 # - "exercises/memory_hierarchies.qmd"
 # - "exercises/student_talks.qmd"
exercises/file-and-data-systems.qmd
---
title: "File and Data Systems"
---
Create
* 10000 files of 1 MB each
* 100 files of 100 MB each
* 1 file of 10 000 MB
in your `/scratch/` directory on levante.
* Read them once and measure the durations for the read operations
* Read every file 10x and measure the durations for the read operations
* Repeat the read part of the exercise after a few hours.
Discuss your results.
Create a directory `/dev/shm/$USER` (replace `$USER` with your user ID) and repeat the exercise in there. You will probably need about 100 GB of RAM to do this (or simply use a compute node). No delayed repeat is needed here (why?).
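A minimal Python sketch for the timing part, using only the standard library. The directory path, file names, and chunk size below are illustrative choices, not part of the exercise; adjust them to your own `/scratch/` directory and reduce the counts for a quick test.

```python
import os
import time

def write_files(directory, count, size_mb, blocksize=1024**2):
    """Create `count` files of `size_mb` MB each in `directory`."""
    os.makedirs(directory, exist_ok=True)
    block = bytes(blocksize)  # one MB of zeros, written size_mb times per file
    for i in range(count):
        with open(f"{directory}/file_{i:05d}", "wb") as f:
            for _ in range(size_mb):
                f.write(block)

def read_files(directory, blocksize=1024**2):
    """Read every file in `directory` once and return the duration in seconds."""
    start = time.perf_counter()
    for name in sorted(os.listdir(directory)):
        with open(f"{directory}/{name}", "rb") as f:
            while f.read(blocksize):
                pass
    return time.perf_counter() - start

# Hypothetical usage (path made up, replace with your own scratch directory):
# write_files("/scratch/x/x000000/many_small", count=10000, size_mb=1)
# print(read_files("/scratch/x/x000000/many_small"))
```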
---
title: "File and Data Systems"
author: "Florian Ziemen and Karl-Hermann Wieners"
---
# Storing data
* Recording of information in a medium
* Deoxyribonucleic acid (DNA)
* Hand writing
* Magnetic tapes
* Hard disks
# Topics
* Any storage is finite (except for /dev/null and /dev/zero).
* The pros and cons of different storage media
* Indexing of storage / storage metadata
* Challenges of parallel storage access
# Quota and permission
## Quota
* Distributing a scarce resource between users.
* Every user / project gets a specified share.
* Usually no over-commitment.
```bash
/sw/bin/lfsquota /work/bb1153
```
```
Disk quotas for prj 30001639 (pid 30001639):
Filesystem used quota limit grace files quota limit grace
/work 588.4T 595T 595T - 22140190 0 0 -
```
## Permissions
* Am I allowed to read / write a file?
* How about others?
* See `man ls` and `man chmod` for details for a standard file system.
* Other storage systems can have varying ways of controlling access.
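As a small illustration (not from the original slides), the permission bits can also be inspected from Python; `stat.filemode` renders them in the familiar `ls -l` notation. The file name is just an example.

```python
import os
import stat

st = os.stat("slides.qmd")        # any existing file works here
print(stat.filemode(st.st_mode))  # e.g. -rw-r--r--
print(st.st_uid, st.st_gid)       # numerical owner and group IDs
```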
# Properties of storage systems
## Latency
* How long does it take until we get the first bit of data?
* Crucial when opening many small files (e.g. starting Python)
* Less crucial when reading one big file start-to-end.
* Largely determined by moving parts in the storage medium.
## Continuous read / write
* How much data do we get per second when we are reading continuously?
* Important for reading/writing large blobs of data from/into individual files.
## Random read / write
* Mixture of latency and continuous read/write
* Reading many small files / skipping around in files
## Caching
* Keeping data *in memory* for frequent re-use.
* Storage media like disks usually have small caches with better performance characteristics.
* e.g. HDD of 16 TB with 512 MB of RAM cache.
* Operating systems also cache reads.
* Caching writes in RAM is risky because data can be lost in a power failure or system crash.
# Hardware types
![](static/storage-media.jpg)
## Speed vs cost per space
| Device | Latency | Cont. R/W | Rand. R/W | EUR/TB |
|-| - | - | - | - |
| RAM | 10s of ns | 10s of GB/s | 10s of GB/s | ~ 3000 |
| SSD | 100s of $\mu$s | GB/s | GB/s | ~ 100 |
| HDD | ms | 200 MB/s | MB/s | ~ 10 |
| Tape | minutes | 300 MB/s | minimal | ~ 5 |
* All figures based on a quick Google search in 06/2024.
* RAM needs electricity to keep the data (*volatile memory*).
* All but tape usually remain powered in an HPC.
## RAM disk
* Use RAM as if it were a disk
* `tmpfs` filesystems on levante (`/dev/shm`)
* High speed, low volume, lost on reboot.
## Solid-state disk/flash drives
* Non-volatile electronic medium.
- Keeps state (almost) without energy supply.
* High speed, also under random access.
## Hard disk
* Basically a stack of modern record players.
* Stack of magnetic disks with read/write heads.
* Spinning to make every point accessible by heads.
* Good for bigger files, not ideal for random access.
## Tape
* Spool of magnetizable bands.
* Serialized access only.
* Used for backup / long-term storage.
## Hands-on {.handson}
::: {.smaller}
{{< embed timer.ipynb echo=true >}}
:::
Take this set of calls, and measure the write speed for different file sizes on your `/scratch/`
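One possible way to drive the `run_write` function from the embedded notebook over several sizes; the size list and the throughput print are a sketch, not part of the original material.

```python
# Assumes `run_write` and `destination` from the embedded timer.ipynb are defined.
import timeit

sizes = {"10MB": 10 * 1024**2, "100MB": 100 * 1024**2, "1GB": 1024**3}
duration = {}
for label, size in sizes.items():
    duration[label] = timeit.timeit(
        lambda: run_write(f"{destination}/test.{label}", size), number=1
    )
    print(label, f"{size / duration[label] / 1024**2:.0f} MB/s")
```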
# Storage Architectures
> These aren't books in which events of the past are pinned like so many butterflies to a cork. These are the books from which history is derived. There are more than twenty thousand of them; each one is ten feet high, bound in lead, and the letters are so small that they have to be read with a magnifying glass.
from "Small Gods" by Terry Pratchett
## Thou shalt have identifiable data
* There must be a way to reference the data stored on a medium
* Usual means are (symbolic) names or (numerical) identifiers
* Must be determined at time of data storage
* Either implicit, stored with the data or externally
* Medium is "formatted" to provide required infrastructure
## The more the merrier
* Additional information (metadata) may be needed
* Required by the storage architecture
* Support optimized data storage or access
* Defined by users or applications
* Allows indexing of data beyond name or id
* Especially for _Content-Addressed Storage_[^1] [(git2)](/lecture-materials/lectures/git2/slides.html#content-addressable-store)
[^1]: not the same as _Content-Addressable Memory_ [(data structures)](/lecture-materials/lectures/data-structures/slides.html#dictionaries)
# File systems (POSIX)
* Data is organized in "Files"
* Files are grouped in special files, "Directories"
* File data is stored in fixed size blocks
* Focus on consistently managing changing data
## Blocks
* Minimal size of data transfer
* Reduce the effect of latency for random access
* Read-ahead for sequential processing
* "Sweet spot" between single bytes and big blocks
* Usually a multiple of _device blocks_
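As an aside (not in the original slides), the block size and the number of allocated blocks are visible through `os.stat` on Unix-like systems; the file name is just an example.

```python
import os

st = os.stat("slides.qmd")
print(st.st_size)     # logical file size in bytes
print(st.st_blksize)  # preferred I/O block size of the file system
print(st.st_blocks)   # number of 512-byte units actually allocated
```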
## Inodes
* The file's name refers to a metadata table ("inode")
* unique numerical identifier
* times of state change, permissions
* contains actual block locations
:::{.fragment}
```{.bash code-line-numbers=false}
stat slides.qmd
```
:::
:::{.fragment}
```
File: slides.qmd
Size: 6268 Blocks: 16 IO Block: 4194304 regular file
Device: 84f0b5a2h/2230367650d Inode: 144131850904906407 Links: 1
Access: (0644/-rw-r--r--) Uid: (20472/ m221078) Gid: (32054/ mpiscl)
Access: 2024-06-21 15:01:18.000000000 +0200
Modify: 2024-06-21 15:01:18.000000000 +0200
Change: 2024-06-21 17:29:10.000000000 +0200
Birth: 2024-06-21 15:01:18.000000000 +0200
```
:::
## File system in action
```{.bash code-line-numbers=false}
stat --file-system .
```
:::{.fragment}
```
File: "."
ID: 84f0b5a200000000 Namelen: 255 Type: lustre
Block size: 4096 Fundamental block size: 4096
Blocks: Total: 31118373528 Free: 22547214420 Available: 20963365466
Inodes: Total: 2465930855 Free: 2227421415
```
:::
:::{.fragment}
```{.bash code-line-numbers=false}
df --block-size=4096 .
```
:::
:::{.fragment}
```
Filesystem 4K-blocks Used Available Use% Mounted on
10.128.100.149@o2ib2[...]:/home/home 31118373528 8571074099 20963485307 30% /home
```
:::
:::{.fragment}
```{.bash code-line-numbers=false}
df --inodes .
```
:::
:::{.fragment}
```
Filesystem Inodes IUsed IFree IUse% Mounted on
10.128.100.149@o2ib2[...]:/home/home 2465932215 238493275 2227438940 10% /home
```
:::
## Directories
* Directories link names to inode numbers,<br>
i.e. they do not "contain" files
* Links to itself (`.`) and the "parent" directory (`..`)
* UNIX has "root" directory (`/`) as global anchor
* Files identified via path in directory tree
* Tree may embed ("mount") different file systems
## Directories in action
```{.bash code-line-numbers=false}
pwd
```
:::{.fragment}
```
/home/[...]/generic_software_skills/lecture-materials/lectures/file-and-data-systems
```
:::
:::{.fragment}
```{.bash code-line-numbers=false}
ls -lia
```
:::
:::{.fragment}
```
total 24
144131850904906406 drwxr-xr-x 3 m221078 mpiscl 4096 Jun 21 15:01 .
144131850904906358 drwxr-xr-x 16 m221078 mpiscl 4096 Jun 21 15:01 ..
144131850904906407 -rw-r--r-- 1 m221078 mpiscl 6268 Jun 21 15:01 slides.qmd
144131850904906408 drwxr-xr-x 2 m221078 mpiscl 4096 Jun 21 15:01 static
144131850904906413 -rw-r--r-- 1 m221078 mpiscl 1849 Jun 21 15:01 timer.ipynb
```
:::
## Links
* An inode may have more than one name ("hard links")
* The same name in more than one directory
* More than one name in one directory
* inode's life time managed by "link count"
* Link count 0 labels inode and blocks as recyclable
## Links in action
```{.bash code-line-numbers=false}
ln slides.qmd same_same_but_different
```
:::{.fragment}
```
144131850904906407 -rw-r--r-- 2 m221078 mpiscl 6268 Jun 21 15:01 same_same_but_different
144131850904906407 -rw-r--r-- 2 m221078 mpiscl 6268 Jun 21 15:01 slides.qmd
```
:::
:::{.fragment}
```{.bash code-line-numbers=false}
rm slides.qmd
```
:::
:::{.fragment}
```
144131850904906407 -rw-r--r-- 1 m221078 mpiscl 6268 Jun 21 15:01 same_same_but_different
```
:::
:::{.fragment}
```{.bash code-line-numbers=false}
mv same_same_but_different slides.qmd
```
:::
:::{.fragment}
```
144131850904906407 -rw-r--r-- 1 m221078 mpiscl 6268 Jun 21 15:01 slides.qmd
```
:::
## Fragmentation
* Reading is fast if data stays "in line"
* Line breaks when adding blocks later<br>
or re-using blocks of unlinked inodes
* Slowing down due to "jumps" across medium
* Reduced by block range reservations ("Extents")
* _Defragmentation_ tools shuffle inodes but keep IDs
## Hands-on {.handson}
Add a similar function to the previous one for reading, and read the data you just produced. What's your throughput?
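A possible shape for such a reading function, mirroring `run_write` from the notebook; this is a sketch, not the reference solution, and the commented usage assumes the file written in the previous hands-on.

```python
import timeit

def run_read(path, blocksize=1024**2):
    # Read the file in chunks of `blocksize` bytes until end of file.
    with open(path, "rb") as f:
        while f.read(blocksize):
            pass

# Hypothetical usage:
# t = timeit.timeit(lambda: run_read(f"{destination}/test.1GB"), number=1)
# print(f"{1024**3 / t / 1024**2:.0f} MB/s")
```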
# Other architectures
## Object storage
* Data is presented as immutable "objects" ("BLOB")
* Each object has a globally unique identifier (e.g. UUID or hash)
* Objects may be assigned names, grouped in "buckets"
* Generally supports creation ("put") and retrieval ("get"); can support much more (versioning, etc.).
* Focus on data distribution and replication, fast read access
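A toy model of the put/get interface, not any real object-store API: objects are immutable blobs addressed by a hash of their content.

```python
import hashlib

class ToyObjectStore:
    """Minimal in-memory model: put() returns an object ID, get() retrieves the blob."""

    def __init__(self):
        self._objects = {}

    def put(self, data: bytes) -> str:
        object_id = hashlib.sha256(data).hexdigest()  # content-derived identifier
        self._objects[object_id] = data               # objects are stored as immutable blobs
        return object_id

    def get(self, object_id: str) -> bytes:
        return self._objects[object_id]

store = ToyObjectStore()
oid = store.put(b"some model output")
assert store.get(oid) == b"some model output"
```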
## Object storage -- metadata
* Object metadata is stored independently of the data location
* Provide object migration or replication across devices
* Allows optimizations (e.g. databases, indices)
* Metadata, too, may be replicated
* Enables additional or user-defined metadata
## Object storage -- applications
* Cloud storage services
* Parallelized file systems
## In-file storage systems
* Explicit: zip, tar, ...
* Implicit: HDF5, mp4, databases
* Pseudo-Filesystems: git, ...
## In-file storage -- features
* Counterbalance latency
* Decrease blocking loss
* Use storage features in application
* Portable across storage systems
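As a small illustration of an explicit in-file storage system, Python's standard library can treat a zip archive like a miniature file system; the archive and member names below are made up.

```python
import zipfile

# Write two "files" into a single archive on the surrounding file system.
with zipfile.ZipFile("archive.zip", "w") as zf:
    zf.writestr("config.txt", "resolution=0.1deg\n")
    zf.writestr("data/readme.md", "# example payload\n")

# List and read members without unpacking: one file on the outer file
# system, many logical files inside it.
with zipfile.ZipFile("archive.zip") as zf:
    print(zf.namelist())
    print(zf.read("config.txt").decode())
```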
# Redundancy
Protection against
* accidental deletion
* data loss due to hardware failure
* downtimes due to hardware failure
## Backups
* Keep old states of the file system available.
* Need at least as much space as the (compressed version of the) data being backed up.
* Often low-frequency full backups and high-frequency incremental backups
to balance space requirements and restore time
* Ideally at different locations
* Automate them!
## RAID
Combining multiple hard disks into larger / more resilient volumes, often at the controller level.
* RAID 0 distributes the blocks across all disks - more space, but data loss if one fails.
* RAID 1 mirrors one disk on an identical copy.
* RAID 5 is similar to 0, but with one extra disk for (distributed) parity info
* RAID 6 is similar to 5, but with two extra disks for parity info (levante uses 8+2 disks).
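The parity idea behind RAID 5 can be sketched with a bytewise XOR; this is an illustration only, real controllers work on blocks and stripes.

```python
def xor_parity(blocks):
    # Parity block: bytewise XOR of all data blocks (assumed equal length).
    parity = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            parity[i] ^= byte
    return bytes(parity)

data = [b"AAAA", b"BBBB", b"CCCC"]  # three data "disks"
parity = xor_parity(data)

# If one disk fails, XOR of the survivors and the parity reconstructs it.
restored = xor_parity([data[0], data[2], parity])
assert restored == data[1]
```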
## Erasure coding
Similar to RAID, but more flexible in the number of disks (more than two *parity* disks are possible).
* Used in object stores.
* Usually, data is distributed across independent servers for higher availability.
* Requires more computational resources than RAID.
# Lustre as a parallel file system
*What if you are not the only one controlling the FS?*
. . .
The file system becomes an independent system.
## File system via network
* All nodes see the same set of files.
* A set of central servers manages the file system.
* All nodes accessing the Lustre file system run local *clients*.
* Many nodes can write into the same file at the same time (MPI-IO).
* Optimized for high traffic volumes in large files.
## Metadata and storage servers
* The index is spread over a group of Metadata servers (MDS, 8 for /work on levante).
* The files are spread over another group (40 OSS / 160 OST on levante).
* Every directory is tied to one MDS.
* A file is tied to one or more OSTs.
* An OST contains many hard disks.
## The work file system of levante in context
![](static/network-overview.png)
## Striping
![[Zheng et al. 2020](https://doi.org/10.5194/gmd-13-3607-2020) CC-BY-4.0](static/gmd-13-3607-2020-f05-high-res.png){width=80%}
## Striping -- features
* Increased bandwidth by parallel reads
* Eventually limited by network interfaces
* More points of failure in one dataset
* Additional redundancy or error correction
# Shotgun buffet
## IPFS (InterPlanetary File System)
* Content addressable storage on the internet (the hash of the data identifies a *file*).
* Distributed index tables.
* You either provide the data online or pay others to do so.
* Automated methods of caching and replication.
## fsspec
* Python package that provides pseudo-filesystems on various backends.
* Unifies access to various implementations.
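A minimal sketch of the unified interface, using fsspec's in-memory backend so it runs without any external storage; the paths are made up. The same calls (`open`, `ls`, `cat`) are meant to work for other backends such as local files or object stores.

```python
import fsspec

fs = fsspec.filesystem("memory")  # in-memory backend, no external dependencies

# Write through the generic file-system interface.
with fs.open("/demo/data.txt", "wb") as f:
    f.write(b"hello storage")

print(fs.ls("/demo", detail=False))  # list entries by name
print(fs.cat("/demo/data.txt"))      # read the object back as bytes
```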
## Symbolic links (symlinks)
_Symbolic_ links are inodes referring to _paths_ instead of data
```bash
ln -s /path/to/happiness happy
```
```
File: happy -> /path/to/happiness
Size: 18 Blocks: 0 IO Block: 4096 symbolic link
```
## DNA encoding of data
* Repeating nucleotides are problematic
* Using "Trits" referring to previous nucleotide
| Previous | 0 | 1 | 2 |
| - | - | - | - |
| Thymine | A | C | G |
| Guanine | T | A | C |
| Cytosine | G | T | A |
| Adenine | C | G | T |
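A short sketch of the encoding rule in the table above: each trit selects the next nucleotide depending on the previous one, so the same nucleotide never repeats. The starting nucleotide "A" is an arbitrary choice for the example.

```python
# Next nucleotide for trits 0, 1, 2, given the previous nucleotide (table above).
NEXT = {
    "T": "ACG",
    "G": "TAC",
    "C": "GTA",
    "A": "CGT",
}

def encode_trits(trits, previous="A"):
    # Map each trit to a nucleotide that differs from its predecessor.
    out = []
    for trit in trits:
        previous = NEXT[previous][trit]
        out.append(previous)
    return "".join(out)

print(encode_trits([0, 1, 2, 0, 0]))  # -> "CTGTA", no repeated nucleotides
```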
# Further reading
* John Harris, Remo Software: History of Storage from Cave Paintings to Electrons<br>
http://www.remosoftware.com/info/history-of-storage-from-cave-paintings-to-electrons/
Files added:
* lectures/file-and-data-systems/static/gmd-13-3607-2020-f05-high-res.png (113 KiB)
* lectures/file-and-data-systems/static/network-overview.png (121 KiB)
* lectures/file-and-data-systems/static/storage-media.jpg (720 KiB)
%% Cell type:code id:87ef333a-e287-4138-bc63-339da6fe64a3 tags:
``` python
import getpass
import os
import timeit
```
%% Cell type:code id:a32be89d-ebe9-43de-b259-0e96c11ecc56 tags:
``` python
user = getpass.getuser()
destination = f"/scratch/{user[0]}/{user}"
```
%% Cell type:code id:bcf39570-a5bc-4af4-9354-06a801222826 tags:
``` python
def run_write(path, length, blocksize=1024**2):
    # Write `length` bytes of zeros to `path` in chunks of `blocksize` bytes.
    (times, remainder) = divmod(length, blocksize)
    data = bytearray(blocksize)
    with open(path, "wb") as of:
        for i in range(times):
            of.write(data)
        if remainder:
            of.write(bytearray(remainder))
```
%% Cell type:code id:34a59ea9-52c9-4c3f-8664-a5a97882e5e0 tags:
``` python
duration = {}
duration['1GB'] = timeit.timeit(lambda : run_write(f"{destination}/test.1GB", (1024**3)), number = 1)
print(duration)
```
%% Output
{'1GB': 0.4544727400643751}