This section summarizes the architecture and guiding principles of RSDOS.
- Insertion
  - insert (single stream object)
  - insert_many (multiple stream objects via an iterator)

  Both methods store data in the “container.” Iterators help manage buffers when dealing with a large number of files, reducing memory overhead.
- Extraction
  - extract (single object)
  - extract_many (multiple objects via an iterator)

  Both methods read data from the container, checking loose storage first, then packed storage.
- Container Abstraction
  - The container should implement insert, insert_many, extract, and extract_many regardless of its underlying storage (loose or packed).
  - Internally, an enum strategy distinguishes between loose and packed storage.
- Naming and Legacy Compatibility
  - “loose” and “packed” are the primary terms; packs remains valid for compatibility with legacy disk-objectstore.
- Packing
  - pack moves objects from loose to packed storage. It uses insert_many for efficiency and avoids repeated DB open/close overhead.
  - repack on packed storage re-packs objects (vacuuming old data with incremental pack IDs).
- Hash Keys
  - Act as both unique IDs (using SHA-256 to avoid duplicates) and checksums to validate object integrity.
  - A cheaper checksum can also be used to verify data integrity for already-identified objects.
- Compression
  - Supports both zlib and zstd (default).
  - Metadata: raw_size is the uncompressed size; size is the compressed size in a packed file.
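The principles above (a single container interface, an enum that tells loose from packed, SHA-256 hash keys, loose-first lookup) can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the `Storage` and `Container` classes below are hypothetical, plain in-memory dictionaries stand in for the on-disk loose directory and the packed files, and compression is left out entirely. It is not the RSDOS implementation.

```python
import hashlib
from enum import Enum
from typing import BinaryIO, Dict, Iterable, Iterator, List


class Storage(Enum):
    """Hypothetical enum naming the two storage strategies."""
    LOOSE = "loose"
    PACKED = "packed"


class Container:
    """In-memory stand-in for a container (assumption: no disk, no DB)."""

    def __init__(self) -> None:
        self._stores: Dict[Storage, Dict[str, bytes]] = {
            Storage.LOOSE: {},
            Storage.PACKED: {},
        }

    def insert(self, stream: BinaryIO) -> str:
        data = stream.read()
        key = hashlib.sha256(data).hexdigest()
        # Identical content yields the same key, so it is stored only once.
        self._stores[Storage.LOOSE].setdefault(key, data)
        return key

    def insert_many(self, streams: Iterable[BinaryIO]) -> List[str]:
        # Streams are consumed one at a time, so only one is read at once.
        return [self.insert(stream) for stream in streams]

    def extract(self, key: str) -> bytes:
        # Loose storage is checked first, then packed storage.
        for storage in (Storage.LOOSE, Storage.PACKED):
            data = self._stores[storage].get(key)
            if data is not None:
                # The hash key doubles as a checksum.
                if hashlib.sha256(data).hexdigest() != key:
                    raise ValueError(f"checksum mismatch for {key}")
                return data
        raise KeyError(key)

    def extract_many(self, keys: Iterable[str]) -> Iterator[bytes]:
        for key in keys:
            yield self.extract(key)

    def pack(self) -> int:
        """Copy all loose objects into packed storage; return how many."""
        loose = self._stores[Storage.LOOSE]
        self._stores[Storage.PACKED].update(loose)
        return len(loose)
```

In this sketch `pack` copies rather than moves, so an object can be found in both stores afterward; whether and when loose copies are removed is not modelled here.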
- The Python API does not expose a context manager for containers because Rust will handle resource cleanup automatically.
- Each I/O call uses its own connection to the embedded DB (sled in v2), allowing safe operations, even in non-blocking contexts (though this is untested).
- From Python, insert and insert_many always write to loose storage; extract and extract_many search both loose and packed.
- pack moves objects from loose to packed, meaning objects might reside in both places afterward.
(Figure: conceptual illustration of how bytes flow across the Python and Rust boundaries.)
RSDOS uses heuristics to decide if data is worth compressing, following recommendations from:
- When is it worth compressing?
- A discussion on compression trade-offs
- Btrfs pre-compression heuristics
The rough decision flow is:
- If a file is very small (e.g., < 850 bytes), do not compress.
- If the file already appears to be zlib/zstd-compressed (by reading the header bytes), do not compress (unless forced to recompress).
- Check the first 512 bytes. If they contain many null bytes (likely binary), treat them as MaybeBinary.
- Otherwise, treat them as large text (MaybeLargeText) and compress if compression is enabled.
When any parsing or heuristic fails, default to “worth compressing.”
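The decision flow above could look roughly like the following Python sketch. Several details are assumptions made only for illustration: the `Kind` enum and function names are hypothetical, the 10% null-byte cut-off is a placeholder for "many null bytes", and the text does not say what happens to MaybeBinary data, so this sketch arbitrarily leaves it uncompressed. The header checks use the zstd frame magic number and the zlib header rule from RFC 1950.

```python
from enum import Enum

SMALL_FILE_LIMIT = 850            # bytes; threshold quoted in the text
SAMPLE_SIZE = 512                 # bytes inspected for the binary check
NULL_BYTE_RATIO = 0.1             # ASSUMPTION: placeholder for "many null bytes"
ZSTD_MAGIC = b"\x28\xb5\x2f\xfd"  # zstd frame magic number


class Kind(Enum):
    SMALL = "Small"
    ALREADY_COMPRESSED = "AlreadyCompressed"
    MAYBE_BINARY = "MaybeBinary"
    MAYBE_LARGE_TEXT = "MaybeLargeText"


def looks_zlib(head: bytes) -> bool:
    # RFC 1950: method 8 (deflate), window bits <= 7,
    # and the 16-bit header must be a multiple of 31.
    if len(head) < 2:
        return False
    cmf, flg = head[0], head[1]
    return cmf & 0x0F == 8 and cmf >> 4 <= 7 and (cmf * 256 + flg) % 31 == 0


def classify(data: bytes) -> Kind:
    if len(data) < SMALL_FILE_LIMIT:
        return Kind.SMALL
    if data.startswith(ZSTD_MAGIC) or looks_zlib(data[:2]):
        return Kind.ALREADY_COMPRESSED
    sample = data[:SAMPLE_SIZE]
    if sample.count(b"\x00") / len(sample) > NULL_BYTE_RATIO:
        return Kind.MAYBE_BINARY
    return Kind.MAYBE_LARGE_TEXT


def worth_compressing(data: bytes, enabled: bool = True, force: bool = False) -> bool:
    if not enabled:
        return False
    try:
        kind = classify(data)
    except Exception:
        return True   # any heuristic failure defaults to "worth compressing"
    if kind is Kind.SMALL:
        return False
    if kind is Kind.ALREADY_COMPRESSED:
        return force  # recompress only when forced
    if kind is Kind.MAYBE_BINARY:
        return False  # ASSUMPTION: the outcome for binary data is not stated
    return True
```

A two-byte header check can produce false positives on arbitrary data, which is one reason a real implementation may inspect more than the header.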
- Loose Storage remains the same. A directory named packs is also recognized as packed.
- Compression:
  - Legacy reads with zlib, new writes with zstd.
  - On migration, you can re-insert everything into the new store to convert to zstd if desired.
- Config: config.json now includes extra fields; missing items use defaults.
- Packed DB:
  - Migrating from a legacy store requires reading all objects from the old database, then reinserting them into the new embedded DB.
  - Carefully handle the difference between size (compressed size) vs. raw_size (uncompressed size).
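To make the size versus raw_size distinction concrete, here is a minimal sketch of migrating one packed object. Everything in it is hypothetical: the `PackedEntry` record and `migrate_entry` helper are invented for illustration, and `zlib.compress` is used as the default target codec only so the example runs on the standard library. A real migration would pass a zstd compressor instead.

```python
import hashlib
import zlib
from dataclasses import dataclass
from typing import Callable


@dataclass
class PackedEntry:
    """Hypothetical metadata record for one packed object."""
    hashkey: str
    raw_size: int   # uncompressed size
    size: int       # compressed size as stored in the pack
    payload: bytes


def migrate_entry(
    legacy_payload: bytes,
    legacy_compressed: bool,
    compress: Callable[[bytes], bytes] = zlib.compress,  # stand-in for zstd
) -> PackedEntry:
    # Legacy objects are read with zlib when they were stored compressed.
    raw = zlib.decompress(legacy_payload) if legacy_compressed else legacy_payload
    payload = compress(raw)
    return PackedEntry(
        hashkey=hashlib.sha256(raw).hexdigest(),  # key is over the raw bytes
        raw_size=len(raw),
        size=len(payload),
        payload=payload,
    )
```

The point of the sketch is that both sizes are recomputed during migration: raw_size comes from the decompressed bytes, while size depends on the codec used for the new pack and will generally differ from the legacy value.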
A dedicated CLI command will assist with migrations and bridging to Python-based AiiDA tools.
(Planned for v2)
The goal is to use io_uring for non-blocking, efficient I/O on supported Linux kernels, thus removing the need for blocking thread pools.
Deprecated (see io_uring above)
Originally, timeouts were planned for large file operations to prevent blocking. With io_uring, blocking becomes less of an issue. Hence, the timeout design has been deprecated.
Deprecated (see io_uring above)
While tokio::fs simulates asynchronous file I/O, it internally uses blocking system calls (run on a thread pool). The shift to io_uring will provide true asynchronous file I/O at the system level.
When exposing Rust implementations to Python via PyO3:
- Python → Rust (Insertion)
  - Wrap Python file-like objects (BinaryIO, StringIO, etc.) in a PyFileLikeObject to create a Rust Reader.
- Rust → Python (Extraction)
  - Reading from RSDOS returns a generic Object<R> (loose or packed). For simplicity, it is converted back to a PyFileLikeObject for Python.
These conversions ensure a smooth streaming interface on both sides.
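As a rough Python-side analogue of what such a wrapper has to do, the sketch below adapts both binary and text file-like objects to a bytes-only, chunked read interface. The `ByteReader` class is hypothetical and exists only to illustrate the idea; the actual conversion happens in Rust, and the UTF-8 encoding of text streams is an assumption.

```python
import io
from typing import IO, Iterator


class ByteReader:
    """Hypothetical adapter: exposes any file-like object as a byte stream."""

    def __init__(self, fileobj: IO, encoding: str = "utf-8") -> None:
        self._fileobj = fileobj
        self._encoding = encoding  # ASSUMPTION: text streams are UTF-8 encoded

    def read(self, size: int = -1) -> bytes:
        chunk = self._fileobj.read(size)
        if isinstance(chunk, str):
            # Text streams count characters, so the encoded chunk
            # may be longer than `size` bytes.
            return chunk.encode(self._encoding)
        return chunk

    def chunks(self, chunk_size: int = 64 * 1024) -> Iterator[bytes]:
        # Stream in bounded pieces instead of loading the whole object.
        while True:
            chunk = self.read(chunk_size)
            if not chunk:
                break
            yield chunk


if __name__ == "__main__":
    for piece in ByteReader(io.BytesIO(b"abcdef")).chunks(4):
        print(piece)
```

Reading in fixed-size chunks is what keeps memory use bounded on both sides of the boundary, regardless of object size.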
- Deduplication: Files with identical content share a single storage instance (thanks to hash-based IDs).
- Compression: Zstd typically outperforms zlib.
- Loose vs. Packed: Loose is faster for small inserts; packing is more efficient for batch storage.
- Excessive allocations for metadata on each read.
- Manual resource management (e.g., container close calls).
- Less efficient DB or compression approach in some cases.
- No explicit close() in RSDOS; Rust’s drop behavior handles cleanup automatically.
- Certain legacy exceptions (FileNotFoundError, NotInitializedError) are replaced by standard Rust error propagation.
- Configuration parameters (e.g., loose_prefix_len, pack_size_target) live in Config rather than container methods.
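A configuration object of this kind might be modelled as below. This is a sketch under stated assumptions, not the RSDOS Config: the default values are placeholders, ignoring unknown keys is a design choice made here, and the `loose_path` helper assumes that loose_prefix_len controls how a hash key is split into a directory prefix and a file name.

```python
import json
from dataclasses import dataclass, fields


@dataclass
class Config:
    """Hypothetical config; field names come from the text, defaults do not."""
    loose_prefix_len: int = 2                # placeholder default
    pack_size_target: int = 4 * 1024 ** 3    # placeholder default, in bytes

    @classmethod
    def from_json(cls, text: str) -> "Config":
        known = {f.name for f in fields(cls)}
        raw = json.loads(text)
        # Missing items fall back to defaults; unknown keys are ignored here.
        return cls(**{k: v for k, v in raw.items() if k in known})


def loose_path(hashkey: str, config: Config) -> str:
    # ASSUMPTION: loose objects are sharded into directories by key prefix.
    n = config.loose_prefix_len
    return f"{hashkey[:n]}/{hashkey[n:]}"
```

For example, `Config.from_json('{"pack_size_target": 1024}')` keeps the default prefix length while overriding the pack size target.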