[data] Add Dataset.write_datasink_lazy to support intermediate outputs. #52094

Open · wants to merge 5 commits into master

Conversation

@basveeling commented on Apr 8, 2025

This PR proposes extending the Dataset API with Dataset.write_datasink_lazy. I'd be happy to discuss, welcome any comments, and can finalize the PR with additional tests and documentation if there's interest.

    def write_datasink_lazy(
        self,
        datasink: Datasink,
        *,
        prefilter_fn: Optional[Callable[[Block], Block]] = None,
        ray_remote_args: Optional[Dict[str, Any]] = None,
        concurrency: Optional[int] = None,
    ) -> "Dataset":
        """Writes the dataset to a custom :class:`~ray.data.Datasink` lazily,
        while allowing subsequent data operations."""

Why are these changes needed?

Some Ray Data pipelines benefit from writing intermediate outputs, for example:

  1. During development, one may want to cache output from the middle of a DAG and debug subsequent dataset operations against that cache.
  2. Some data operations generate multiple outputs that need to be written to separate datasinks (see the sketch below).

This was partly inspired by smallpond (https://deepseek-ai.github.io/smallpond/) and its ability to handle multiple outputs via Session.wait (https://deepseek-ai.github.io/smallpond/generated/smallpond.dataframe.Session.wait.html#smallpond.dataframe.Session.wait). This PR takes a different approach by providing a write node that passes data through transparently for further processing.
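
As a sketch of use case 2 (IntermediateSink, FinalSink, preprocess_fn, and aggregate_fn are hypothetical placeholders):

    import ray

    ds = ray.data.read_parquet("s3://bucket/raw").map_batches(preprocess_fn)

    # Persist the intermediate blocks to one sink, keep transforming the
    # same blocks, and write the final result to a second sink. The
    # terminal (eager) write_datasink call triggers execution, so both
    # sinks are written in a single pass over the data.
    (
        ds.write_datasink_lazy(IntermediateSink("/tmp/preprocessed"))
        .map_batches(aggregate_fn)
        .write_datasink(FinalSink("/tmp/final"))
    )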

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • TODO: I've included any doc changes needed for https://docs.ray.io/en/master/.
    • TODO I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • TODO I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: basveeling <[email protected]>
@basveeling requested a review from a team as a code owner · April 8, 2025 14:29
@basveeling (Author)

One point to be discussed is if and how this should handle the Datasink.on_write_complete() hook. We could check whether the provided Datasink implements this method and throw an error, or look for ways to retrieve the WriteResult.
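
A minimal sketch of the error-out option (the guard below is hypothetical; it assumes on_write_complete is defined on the ray.data.Datasink base class and only overridden by sinks that need it):

    from ray.data import Datasink

    def _check_sink_supported(datasink: Datasink) -> None:
        # The lazy write path has no place to aggregate WriteResults yet,
        # so reject sinks that rely on the on_write_complete() hook.
        if type(datasink).on_write_complete is not Datasink.on_write_complete:
            raise ValueError(
                "write_datasink_lazy does not support datasinks that "
                "override on_write_complete()."
            )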

@hainesmichaelc added the community-contribution label · Apr 9, 2025
@jcotant1 added the data label · Apr 10, 2025
@richardliaw (Contributor)

This is quite nice, thanks for the contribution! Will take a quick look.

Review comment (Contributor) on the proposed signature:

    self,
    datasink: Datasink,
    *,
    prefilter_fn: Optional[Callable[[Block], Block]] = None,

Could we remove the prefilter_fn for now to maintain consistency with write_datasink?

Comment on lines +21 to +45
    def generate_lazy_write_fn(
        datasink_or_legacy_datasource: Union[Datasink, Datasource],
        prefilter_fn: Optional[Callable[[Block], Block]] = None,
        **write_args,
    ) -> Callable[[Iterator[Block], TaskContext], Iterator[Block]]:
        def fn(blocks: Iterator[Block], ctx: TaskContext) -> Iterator[Block]:
            """Writes the blocks to the given datasink or legacy datasource.

            Outputs the original blocks to be written."""
            # Create a copy of the iterator, so we can return the original blocks.
            it1, it2 = itertools.tee(blocks, 2)
            if isinstance(datasink_or_legacy_datasource, Datasink):
                # Apply the prefilter function to each block before writing.
                if prefilter_fn is not None:
                    it1 = (prefilter_fn(block) if len(block) else block for block in it1)
                ctx.kwargs["_datasink_write_return"] = datasink_or_legacy_datasource.write(
                    it1, ctx
                )
            else:
                datasink_or_legacy_datasource.write(it1, ctx, **write_args)

            return it2

        return fn

Review comment (Contributor):
This isn't much different from generate_write_fn, right?

Comment on lines +66 to +67
    # TODO: figure out how to handle on_write_complete()
    return MapOperator.create(
Review comment (Contributor):
yeah indeed, we'll want to figure this part out (and on_write_failed)

@richardliaw (Contributor)

@raulchen - any thoughts on how to handle this properly? Main questions:

  1. If we do a lazy write, do we still want to generate stats (like collect_stats_fn = generate_collect_write_stats_fn())?
  2. How do we support things like on_write_complete or on_write_failed?

     ray/python/ray/data/dataset.py, lines 4178 to 4182 (at dd1038c):

             datasink.on_write_complete(write_result)
         except Exception as e:
             datasink.on_write_failed(e)
             raise

@raulchen (Contributor)

Thanks for your contribution. This is a nice feature.

  • Regarding the API, I'd prefer just adding a lazy flag to the existing write_xxx APIs; otherwise we'll have to make an async copy of each write API.
  • The current write_datasink_lazy implementation implies that the input data will be preserved in the outputs of the write. This definitely makes sense. But I'm thinking maybe we can decouple this behavior from sync/async, i.e., have two flags: 1) async, to determine whether execution is triggered immediately; 2) preserve_data, to determine whether data is preserved in the outputs of write_xxx.
  • on_write_complete still needs to be supported, because some data sinks (e.g., Lance) depend on this API to perform a commit operation. We can update the implementation to propagate the WriteResults via BlockMetadata instead of via Blocks. The WriteResults will be accessible to the data sinks, and we can still print the stats by default.
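
A rough sketch of the two-flag API described above (purely illustrative, not part of this PR; note that async is a reserved word in Python, so that flag would likely need a different name, e.g. lazy):

    def write_datasink(
        self,
        datasink: Datasink,
        *,
        lazy: bool = False,            # don't trigger execution immediately
        preserve_data: bool = False,   # pass input blocks through to the output
        ray_remote_args: Optional[Dict[str, Any]] = None,
        concurrency: Optional[int] = None,
    ) -> Optional["Dataset"]:
        ...  # returns a Dataset when lazy=True, otherwise writes eagerly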

@richardliaw added the @external-author-action-required label · Apr 18, 2025