feat(dcp): dcp optimized s3reader for faster and partial DCP loading #378
base: main
Conversation
Force-pushed from daef051 to 39853e4.
Force-pushed from 39853e4 to 08a815a.
- Update SequentialS3Reader to support partial reads (and add logs)
- New ListOfRangesS3Reader:
  - Coalesces ranges to form chunks of ranges
  - Manages a ranged SequentialS3Reader instance for each chunk
  - Maps each read / readinto / seek request to the matching S3Reader instance
- Integrate this reader into S3StorageReader (forcing ListOfRangesS3Reader for now) via S3ReaderConstructor params for the list of ranges
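The coalescing step described above can be sketched as follows. This is a minimal illustration, not the PR's actual code; the names `coalesce_ranges` and `max_gap_size` are illustrative (the latter mirrors the constructor parameter this PR introduces):

```python
from typing import List, Tuple

def coalesce_ranges(
    ranges: List[Tuple[int, int]], max_gap_size: int
) -> List[Tuple[int, int]]:
    """Merge (start, end) byte ranges whose gap is <= max_gap_size.

    Nearby ranges are merged into one chunk so a single ranged GET can
    serve several items, trading a little wasted transfer for fewer requests.
    """
    if not ranges:
        return []
    ranges = sorted(ranges)
    merged = [ranges[0]]
    for start, end in ranges[1:]:
        last_start, last_end = merged[-1]
        if start - last_end <= max_gap_size:
            # Close enough: extend the current chunk to cover both ranges.
            merged[-1] = (last_start, max(last_end, end))
        else:
            # Gap too large: start a new chunk (and a new stream).
            merged.append((start, end))
    return merged

print(coalesce_ranges([(0, 10), (15, 30), (500, 600)], max_gap_size=100))
# → [(0, 30), (500, 600)]: the first two ranges merge, the third stays separate.
```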
Add DCPListOfRangesConstructor and a dcp_list_of_ranges() factory method to enable DCP range optimization through the reader_constructor parameter. Includes better range-injection logic and support for both direct ListOfRanges usage and DCP optimization. Users can now opt in via: reader_constructor=S3ReaderConstructor.dcp_list_of_ranges()
- Type annotations, missing arguments / return statements, etc.
- Minor logic/name changes in list_of_ranges.py
- Very minor change to fix a mypy error in test_user_agent.py
This commit improves performance of ListOfRangesS3Reader by up to 30% for DCP load:
- Remove the dependency on SequentialS3Reader in favour of self-managed streams
- Implement direct stream management with per-group buffering
- Optimize the read() method to drop the BytesIO buffer, assuming sequential reading
- Enforce non-seekable behaviour to force sequential reading patterns
This implementation is now significantly faster for distributed checkpoint loading patterns while maintaining correctness for sequential access. It relies on the load-ordering optimisation, which enforces sequential reading for read() operations, but it will not work with readinto() operations, since those still have backward-seek patterns.
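The buffer-free sequential read described in this commit can be sketched as below. This is a hedged, self-contained illustration under assumed names (`SequentialRangeReader` is not the PR's class): because DCP load ordering guarantees forward-only reads, each read() can be served straight from the stream without first staging the whole range in a BytesIO buffer.

```python
import io

class SequentialRangeReader:
    """Serves forward-only reads for one byte range, with no staging buffer."""

    def __init__(self, stream: io.BufferedIOBase, start: int, end: int):
        self._stream = stream      # assumed to already be positioned at `start`
        self._position = start
        self._end = end

    def read(self, size: int) -> bytes:
        # Clamp to the range end; no intermediate copy, no backward seeks.
        size = min(size, self._end - self._position)
        data = self._stream.read(size)
        self._position += len(data)
        return data

stream = io.BytesIO(b"0123456789")
reader = SequentialRangeReader(stream, start=0, end=10)
print(reader.read(4))  # → b'0123'
print(reader.read(4))  # → b'4567'
```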
- Rename "list of ranges" / "dcp list of ranges" to "DCP optimized"
- Allow max_gap_size to be passed through via S3ReaderConstructor
Since the reader now does both list-of-ranges handling AND DCP optimisation by exploiting and requiring sequential access, I am renaming it to "dcp optimized" to better reflect its scope.
- Includes import changes for other files
- Updates some missed renames in comments and docstrings
- Use 200MB (an arbitrary value for now) as DEFAULT_MAX_GAP_SIZE
- Place it in dcp_optimized.py for a single source of truth
- Add a dcp_reader_constructor fixture for DCP tests
- Update test_e2e_s3_file_system.py to use the dcp_reader_constructor fixture
- Update the test_e2e_s3_storage_reader.py load-ordering test to also cover the DCP-optimized S3Reader
Force-pushed from 08a815a to 68165e6.
…ibility

Allows PyTorch versions before torch==2.7.0, which implicitly assume the provided stream is seekable, to use our optimisations. Allows Python 3.8 (which uses torch==2.4.1) tests to pass by allowing backward seeks within each LoadItem. This commit essentially offloads PyTorch's BytesIO logic to our reader by reverting to seekable=True, reading each LoadItem into a new internal BytesIO buffer, and handling the read/readinto calls with the extra offset.
- Add item-based buffering with BytesIO for full seekability
- Move streaming logic from read() into _stream_range_data()
- Add _load_item_buffer() for on-demand item loading
- Add _find_item_for_position() with fast paths (check the current/next item first)
- Rewrite read()/readinto() to use buffer operations on current_item_buffer
- Remove seekable() -> False to let PyTorch seek the S3Reader directly
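The `_find_item_for_position()` fast paths mentioned above can be sketched like this (hypothetical, simplified code, not the PR's implementation): sequential DCP loads almost always land in the current or the next item, so those two are checked before any scan.

```python
from typing import List, NamedTuple, Optional

class ItemRange(NamedTuple):
    start: int  # inclusive byte offset
    end: int    # exclusive byte offset

def find_item_for_position(
    items: List[ItemRange], position: int, current_idx: int
) -> Optional[int]:
    """Return the index of the item containing `position`, or None."""
    # Fast path 1: still inside the current item.
    cur = items[current_idx]
    if cur.start <= position < cur.end:
        return current_idx
    # Fast path 2: moved forward into the next item (the common DCP pattern).
    nxt = current_idx + 1
    if nxt < len(items) and items[nxt].start <= position < items[nxt].end:
        return nxt
    # Slow path: linear scan (a real implementation might bisect instead).
    for idx, item in enumerate(items):
        if item.start <= position < item.end:
            return idx
    return None

items = [ItemRange(0, 100), ItemRange(100, 250), ItemRange(300, 400)]
print(find_item_for_position(items, 120, current_idx=0))  # → 1 (next-item fast path)
```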
Replace multi-stream cache with single active stream state.
…essages

- Refactor core functions to remove redundancies
- Add docstrings and error messages
- Add input validation for read/readinto/seek methods
- Use filename instead of item_md.relative_path in prepare_local_plan
- Use easydict for a cleaner approach
- Rename to ItemRange, since each represents the byte range of a ReadItem in a PyTorch DCP LoadPlan
- Minor docstring updates / kwarg removals / comments
- Remove fragile S3 key parsing in both filename extractions
- The only difference should be that path/file/ will return "file" instead of ""
- Add range validation to detect overlapping and invalid ranges
- Add extra information in error messages to help diagnose potential failures
- Add bounds checking in error messages to prevent IndexError
- Fix the read() method to properly reject None/negative sizes
- Remove a duplicate type check in seek()
- Minor docstring / comment updates
)

if not isinstance(constructor, partial):
if isinstance(constructor, DCPOptimizedConstructor):
Same here - this feels pretty janky to me. What's this used for? Just debugging or to actually do something based on it?
User agent - agree this still feels janky.
# Skip ahead if behind target
if current_pos < item.start:
    skip = min(item.start - current_pos, len(chunk))
The logic here isn't obvious
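To make the flagged logic concrete, here is a standalone sketch (illustrative names, not the PR's exact code): a coalesced GET also carries the "gap" bytes between items, so when the stream position is behind the next item's start, part of the incoming network chunk must be discarded before any useful bytes are consumed.

```python
def consume_chunk(chunk: bytes, current_pos: int, item_start: int):
    """Return (useful_bytes, new_position) after discarding gap bytes.

    `min(item_start - current_pos, len(chunk))` caps the skip at the chunk
    size: if the whole chunk is gap, everything is discarded and the next
    chunk continues skipping from the advanced position.
    """
    if current_pos < item_start:
        skip = min(item_start - current_pos, len(chunk))
        chunk = chunk[skip:]
        current_pos += skip
    return chunk, current_pos

# Position 90, item starts at 100: the first 10 bytes of a 32-byte chunk are gap.
data, pos = consume_chunk(b"x" * 32, current_pos=90, item_start=100)
print(len(data), pos)  # → 22 100
```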
"""
return self._position

def close(self) -> None:
Should set _closed
Agree, this aligns with BytesIO methods, but we would then want _closed checks in all methods, which might affect performance.
Alternatively, do not use close() at all and let GC clean up like the other readers; GetObjectStream here does not seem closable, unlike PutObjectStream anyway.
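A hedged sketch of the `_closed` bookkeeping discussed above, mirroring BytesIO semantics: close() sets a flag, and every subsequent operation raises ValueError. Whether the per-call check costs measurable performance (the concern raised above) would need benchmarking; the class and names here are illustrative.

```python
import io

class ClosableReader:
    """Toy reader demonstrating BytesIO-style closed-state checks."""

    def __init__(self, data: bytes):
        self._buffer = io.BytesIO(data)
        self._closed = False

    def _check_closed(self) -> None:
        # The per-method guard under discussion.
        if self._closed:
            raise ValueError("I/O operation on closed reader")

    def read(self, size: int = -1) -> bytes:
        self._check_closed()
        return self._buffer.read(size)

    def close(self) -> None:
        self._closed = True

reader = ClosableReader(b"abc")
reader.close()
try:
    reader.read()
except ValueError as e:
    print(e)  # → I/O operation on closed reader
```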
item_range = self._item_ranges[self._current_item_idx]
local_pos = self._position - item_range.start

assert self._current_item_buffer is not None
We should check that the read doesn't take us outside of the current item
Proposing to add check within item_idx = self._find_item_for_position(self._position) that read doesn't exceed item range.
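The proposed bounds check could look like the following sketch (hypothetical helper, not the PR's code): before serving a read from the current item's buffer, verify the request stays inside that item's range, so an oversized read fails loudly instead of silently spanning into a neighbouring item.

```python
def check_read_within_item(
    position: int, size: int, item_start: int, item_end: int
) -> None:
    """Raise ValueError if [position, position + size) leaves the item range."""
    if not (item_start <= position and position + size <= item_end):
        raise ValueError(
            f"read of {size} bytes at position {position} exceeds item "
            f"range [{item_start}, {item_end})"
        )

check_read_within_item(position=10, size=20, item_start=0, item_end=100)  # ok
try:
    check_read_within_item(position=90, size=20, item_start=0, item_end=100)
except ValueError as e:
    print(e)  # → read of 20 bytes at position 90 exceeds item range [0, 100)
```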
assert self._current_item_buffer is not None
self._current_item_buffer.seek(local_pos)
bytes_read = self._current_item_buffer.readinto(buf)
Similarly, should verify the lengths involved here
ValueError: If seeking to negative position or accessing previous items.
TypeError: If whence is not SEEK_SET or SEEK_CUR.
"""
if not isinstance(offset, int):
There's no bounds checking here to make sure we stay within the current item
I tried to only put the checks in read/readinto and return the errors there.
Perhaps it's better to place the _find_item_for_position check into seek().
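Moving the check into seek() might look like this sketch (illustrative names, assuming the SEEK_SET/SEEK_CUR-only contract from the docstring above): resolve the target position first, then reject targets that leave the current item's range, since the reader only guarantees seekability within an item.

```python
import os

def validate_seek(
    offset: int, whence: int, position: int, item_start: int, item_end: int
) -> int:
    """Resolve and bounds-check a seek target against the current item."""
    if whence == os.SEEK_SET:
        target = offset
    elif whence == os.SEEK_CUR:
        target = position + offset
    else:
        # SEEK_END is unsupported, matching the docstring's contract.
        raise TypeError(f"unsupported whence: {whence}")
    if target < item_start or target > item_end:
        raise ValueError(
            f"seek to {target} leaves current item range [{item_start}, {item_end}]"
        )
    return target

print(validate_seek(10, os.SEEK_CUR, position=50, item_start=0, item_end=100))  # → 60
```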
- Remove the sort (since we pre-sorted in prepare_local_plan)
- Move all imports to the top of s3_file_system
- Rename ranges to item_ranges for DCPOptimizedS3Reader
- Add comments for human-readable error message construction
- Update a wrong test docstring after changing back to seekable
- Minor TODO comments / typing / docstring changes
…structor

- Simplify integration in S3StorageReader by moving logic to the constructor
- Add the runtime_checkable decorator to protocols for isinstance checks
- Add proper PyTorch DCP type annotations (ReadItem, MetadataIndex, _StorageInfo)
- Renames:
  - S3ReaderConstructorProtocolWithSetRanges to DCPS3ReaderConstructorProtocol
  - set_ranges method to set_item_ranges_by_file for clarity
  - _file_ranges field to _item_ranges_by_file to match the method name
…heck

- Use Dict / List instead of dict / list
- Remove a redundant boolean check: if self._item_ranges_by_file...
- Revert some changes back to the previous version in main
- Remove one TODO that we addressed in a previous commit
Move torch.distributed.checkpoint imports under TYPE_CHECKING to prevent importlib_metadata errors on Python 3.9 when DCP functionality is not used.
Description
DCPOptimizedS3Reader optimizes PyTorch Distributed Checkpoint (DCP) partial loading by (1) exploiting sequential access patterns to avoid the BytesIO buffer copy, and (2) fetching only the required byte ranges instead of entire objects. This can improve DCP loading performance by ~10% to 30%, and even more when loading only parts of a checkpoint.
Usage: pass S3ReaderConstructor.dcp_optimized() as the reader_constructor.
Optimized for partial DCP loading where only specific items are needed from large distributed checkpoint files.
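A hedged usage sketch of the opt-in described above. It assumes the s3torchconnector DCP integration exposes S3StorageReader and S3ReaderConstructor.dcp_optimized() as this PR describes; the import paths, region, and S3 path below are illustrative, so check the package docs for exact signatures.

```python
import torch.distributed.checkpoint as dcp
from s3torchconnector import S3ReaderConstructor
from s3torchconnector.dcp import S3StorageReader  # assumed import path

state_dict = {}  # model/optimizer state to load into (elided here)
dcp.load(
    state_dict,
    storage_reader=S3StorageReader(
        region="us-east-1",                 # illustrative value
        path="s3://my-bucket/checkpoint/",  # illustrative value
        # Opt in to the DCP-optimized reader added by this PR:
        reader_constructor=S3ReaderConstructor.dcp_optimized(),
    ),
)
```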
Additional context
Related items
Testing
By submitting this pull request, I confirm that my contribution is made under the terms of BSD 3-Clause License and I agree to the terms of the LICENSE.