GH-26818: [C++][Python] Preserve order when writing dataset multi-threaded #44470

EnricoMi · 2024-10-18T11:07:54Z

Rationale for this change

The order of rows in a dataset might be important for users and should be preserved when writing to a filesystem. With multi-threaded writes, the order is currently not guaranteed.

What changes are included in this PR?

Preserving the row order of a dataset requires three things: the SourceNode must sequence the fragment output (this keeps exec batches in the order of the fragments), the SourceNode must provide an ImplicitOrdering (this gives each exec batch an index), and the ConsumingSinkNode must sequence the exec batches (this finally restores the order of the batches according to their index).
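
The last step, sequencing exec batches by their index, can be illustrated with a small reorder buffer. This is a pure-Python sketch of the general idea, not Acero's implementation; the class name `ReorderBuffer` and its `insert` method are hypothetical and exist only for this illustration.

```python
class ReorderBuffer:
    """Releases batches in index order, whatever order they arrive in.

    Hypothetical illustration only; this is not an Arrow or Acero class.
    """

    def __init__(self):
        self._pending = {}      # batch index -> batch, for batches that arrived early
        self._next_index = 0    # index of the next batch that may be released

    def insert(self, index, batch):
        """Store one batch and return every batch that is now ready, in order."""
        self._pending[index] = batch
        ready = []
        # Drain the buffer for as long as the next expected index is present.
        while self._next_index in self._pending:
            ready.append(self._pending.pop(self._next_index))
            self._next_index += 1
        return ready
```

A batch that arrives ahead of its turn is held back until all of its predecessors have been released, which is why a sink that sequences its input can write rows in the original order even when upstream work finishes out of order.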

User-facing changes:

Add option preserve_order to FileSystemDatasetWriteOptions (C++) and pyarrow.dataset.write_dataset (Python).

The default keeps the current behaviour.

Are these changes tested?

Unit tests have been added.

Are there any user-facing changes?

Users can set FileSystemDatasetWriteOptions.preserve_order = true (C++) or call pyarrow.dataset.write_dataset(..., preserve_order=True) (Python).
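
The Python option could be exercised roughly as follows. This is a sketch that assumes a pyarrow release which already contains this change; the helper name `write_preserving_order` and its `chunk_rows` parameter are made up for illustration.

```python
def write_preserving_order(table, base_dir, chunk_rows=2):
    """Write `table` as many small batches, then read the dataset back.

    Hypothetical helper for illustration; `table` is a pyarrow.Table.
    """
    # Imported here so that merely defining the helper does not require pyarrow.
    import pyarrow.dataset as ds

    # Many small batches give the writer threads a chance to reorder them.
    batches = table.to_batches(max_chunksize=chunk_rows)
    ds.write_dataset(
        batches,
        base_dir,
        format="parquet",
        schema=table.schema,
        use_threads=True,
        preserve_order=True,  # the option added by this PR
    )
    # Read back single-threaded to keep the read side out of the picture.
    return ds.dataset(base_dir, format="parquet").to_table(use_threads=False)
```

Called with a table such as `pa.table({"a": list(range(1024))})` and an empty directory, column `a` of the result would be expected to come back in ascending order. Leaving `preserve_order` at its default gives no such guarantee when `use_threads=True`.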

GitHub Issue: [C++][Dataset] Preserve order when writing dataset #26818

github-actions · 2024-10-18T11:08:27Z

⚠️ GitHub issue #26818 has been automatically assigned in GitHub to PR creator.

gitmodimo · 2024-10-29T20:05:33Z

This pull request seems to functionally overlap with this one. Some changes are almost exactly the same. The ordering of data is kept in threaded execution by using the batch index. Can you check whether it also fixes your use case?

EnricoMi · 2024-10-31T09:57:53Z

cpp/src/arrow/acero/options.h

@@ -103,8 +103,8 @@ class ARROW_ACERO_EXPORT SourceNodeOptions : public ExecNodeOptions {
  std::shared_ptr<Schema> output_schema;
  /// \brief an asynchronous stream of batches ending with std::nullopt
  std::function<Future<std::optional<ExecBatch>>()> generator;
-
-  Ordering ordering = Ordering::Unordered();


The constructor already has a default value for ordering and initializes the member with the value passed to it, so I see no point in a second default value here.

gitmodimo · 2024-10-31T15:45:52Z

cpp/src/arrow/dataset/file_base.cc

+                          acero::ConsumingSinkNodeOptions{
+                              std::move(consumer),
+                              {},
+                              /*sequence_output=*/write_options.preserve_order}));


TeeNode needs the same treatment

Right, it looks like this requires some refactoring. Since TeeNode is not used for writing datasets, I'd leave this to a separate PR.

gitmodimo · 2024-10-31T16:13:10Z

Since you are fixing dataset write ordering, I think this check never fires. It should be moved to InsertBatch.
Also, AccumulationQueue, SequencingQueue and SerialSequencingQueue should probably be exported for developers of Acero nodes.

EnricoMi · 2024-12-04T08:54:34Z

@gitmodimo I think that refactoring should be done in a separate PR, keeping this PR focused on fixing the issue.

EnricoMi · 2025-02-11T08:49:51Z

@zanmato1984 this touches a code area related to #44616. I hope you can take a look when you find time.

zanmato1984 · 2025-02-11T11:35:03Z

Hi @EnricoMi, I can take a look. Just at first glance: do you think the PR description could be updated accordingly?

github-actions · 2025-02-11T14:47:49Z

⚠️ GitHub issue #26818 has been automatically assigned in GitHub to PR creator.

cpp/src/arrow/dataset/file_test.cc

cpp/src/arrow/dataset/write_node_test.cc

python/pyarrow/dataset.py

zanmato1984 · 2025-02-19T09:48:15Z

Also, you might want to update the PR description to reflect its latest purpose.

EnricoMi · 2025-03-07T10:02:02Z

> Also, you might want to update the PR description to reflect its latest purpose.

I think the PR description is up-to-date, do you see any discrepancy?

EnricoMi · 2025-05-09T10:18:18Z

python/pyarrow/tests/test_dataset.py

@@ -4591,6 +4591,22 @@ def file_visitor(written_file):
    assert result1.to_table().equals(result2.to_table())


+@pytest.mark.parquet
+@pytest.mark.pandas
+def test_write_dataset_use_threads_preserve_order(tempdir):


Added test code from issue description. The test fails with preserve_order=False.

zanmato1984

LGTM

zanmato1984 · 2025-05-13T21:29:18Z

Let's wait a while for @rok before I merge this. Thanks.

zanmato1984 · 2025-05-13T21:30:11Z

@github-actions crossbow submit -g cpp -g python

github-actions · 2025-05-13T21:33:18Z

Revision: dfd958d

Submitted crossbow builds: ursacomputing/crossbow @ actions-58bcfa66f9

Task	Status
example-cpp-minimal-build-static
example-cpp-minimal-build-static-system-dependency
example-cpp-tutorial
example-python-minimal-build-fedora-conda
example-python-minimal-build-ubuntu-venv
test-alpine-linux-cpp
test-build-cpp-fuzz
test-conda-cpp
test-conda-cpp-meson
test-conda-cpp-valgrind
test-conda-python-3.10
test-conda-python-3.10-hdfs-2.9.2
test-conda-python-3.10-hdfs-3.2.1
test-conda-python-3.10-pandas-latest-numpy-latest
test-conda-python-3.11
test-conda-python-3.11-dask-latest
test-conda-python-3.11-dask-upstream_devel
test-conda-python-3.11-hypothesis
test-conda-python-3.11-pandas-latest-numpy-1.26
test-conda-python-3.11-pandas-latest-numpy-latest
test-conda-python-3.11-pandas-nightly-numpy-nightly
test-conda-python-3.11-pandas-upstream_devel-numpy-nightly
test-conda-python-3.11-spark-master
test-conda-python-3.12
test-conda-python-3.12-cpython-debug
test-conda-python-3.13
test-conda-python-3.9
test-conda-python-3.9-pandas-1.1.3-numpy-1.19.5
test-conda-python-emscripten
test-cuda-cpp-ubuntu-22.04-cuda-11.7.1
test-cuda-python-ubuntu-22.04-cuda-11.7.1
test-debian-12-cpp-amd64
test-debian-12-cpp-i386
test-debian-12-python-3-amd64
test-debian-12-python-3-i386
test-fedora-39-cpp
test-fedora-39-python-3
test-ubuntu-22.04-cpp
test-ubuntu-22.04-cpp-20
test-ubuntu-22.04-cpp-bundled
test-ubuntu-22.04-cpp-emscripten
test-ubuntu-22.04-cpp-no-threading
test-ubuntu-22.04-python-3
test-ubuntu-22.04-python-313-freethreading
test-ubuntu-24.04-cpp
test-ubuntu-24.04-cpp-bundled-offline
test-ubuntu-24.04-cpp-gcc-13-bundled
test-ubuntu-24.04-cpp-gcc-14
test-ubuntu-24.04-cpp-minimal-with-formats
test-ubuntu-24.04-cpp-thread-sanitizer
test-ubuntu-24.04-python-3

rok · 2025-05-14T07:35:34Z

Thanks for doing this, @EnricoMi! This was a long-standing issue.

EnricoMi · 2025-05-14T07:38:26Z

Thanks everyone for the thorough review!

conbench-apache-arrow · 2025-05-14T15:06:37Z

After merging your PR, Conbench analyzed the 3 benchmarking runs that have been run so far on merge-commit 021d8ab.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 6 possible false positives for unstable benchmarks that are known to sometimes produce them.

rok · 2025-05-14T15:55:36Z

Perhaps it'd be worth adding a benchmark for preserve_order == true to ensure there are no future regressions. Or are the nodes this uses already benchmarked?

EnricoMi · 2025-05-15T06:26:29Z

I'll look into this!

…ti-threaded (apache#44470)

### Rationale for this change

The order of rows in a dataset might be important for users and should be preserved when writing to a filesystem. With multi-threaded write, the order is currently not guaranteed,

### What changes are included in this PR?

Preserving the dataset order of rows requires the `SourceNode` to sequence the fragments output (this keeps exec batches in the order of fragments), to provide an `ImplicitOrdering` (this gives exec batches an index), and the `ConsumingSinkNode` to sequence exec batches (finally preserve order of batches according to their index).

User-facing changes:

- Add option `preserve_order` to `FileSystemDatasetWriteOptions` (C++) and `arrow.dataset.write_dataset` (Python).

Default behaviour is current behaviour.

### Are these changes tested?

Unit tests have been added,

### Are there any user-facing changes?

Users can set `FileSystemDatasetWriteOptions.preserve_order = true` (C++) / `arrow.dataset.write_dataset(..., preserve_order=True)` (Python).

* GitHub Issue: apache#26818

Lead-authored-by: Enrico Minack <[email protected]>
Co-authored-by: Rok Mihevc <[email protected]>
Signed-off-by: Rok Mihevc <[email protected]>

EnricoMi requested a review from westonpace as a code owner October 18, 2024 11:07

github-actions bot added the Component: C++, Component: Python, and awaiting review labels Oct 18, 2024

EnricoMi mentioned this pull request Oct 18, 2024

[C++][Dataset] Preserve order when writing dataset #26818

Closed

EnricoMi force-pushed the preserve-order-2 branch 2 times, most recently from 9ca8c76 to 28cb588 October 18, 2024 17:02

gitmodimo mentioned this pull request Oct 29, 2024

GH-41706: [C++][Acero] Enhance asof_join to work in multi-threaded execution by sequencing input #44083

Merged

EnricoMi force-pushed the preserve-order-2 branch from 28cb588 to 0377ef9 October 31, 2024 09:56

EnricoMi commented Oct 31, 2024

github-actions bot added the awaiting committer review label and removed the awaiting review label Oct 31, 2024

EnricoMi force-pushed the preserve-order-2 branch from 2776b34 to 6024066 October 31, 2024 12:39

gitmodimo reviewed Oct 31, 2024

rok requested a review from pitrou November 7, 2024 12:18

adamreeve mentioned this pull request Nov 20, 2024

pyarrow.dataset.write_dataset do not preserve order #39030

Open

EnricoMi force-pushed the preserve-order-2 branch from 6024066 to 697c378 November 22, 2024 09:42

EnricoMi force-pushed the preserve-order-2 branch from 697c378 to a9a78ce January 21, 2025 09:12

EnricoMi force-pushed the preserve-order-2 branch from a9a78ce to f56b166 January 29, 2025 18:22

EnricoMi force-pushed the preserve-order-2 branch from f56b166 to b750bb3 February 11, 2025 08:46

zanmato1984 reviewed Feb 19, 2025

EnricoMi force-pushed the preserve-order-2 branch from f5a18b2 to 3726f87 May 9, 2025 08:44

github-actions bot added the awaiting changes label and removed the awaiting change review label May 9, 2025

Add comment to probabilistic write test

da909aa

github-actions bot added the awaiting change review label and removed the awaiting changes label May 9, 2025

Add Python test from issue description

dfd958d

EnricoMi force-pushed the preserve-order-2 branch from f9b3549 to dfd958d May 9, 2025 10:17

EnricoMi commented May 9, 2025

github-actions bot added the awaiting changes label and removed the awaiting change review label May 9, 2025

zanmato1984 approved these changes May 9, 2025

rok approved these changes May 14, 2025

github-actions bot added the awaiting merge label and removed the awaiting changes label May 14, 2025

rok merged commit 021d8ab into apache:main May 14, 2025
44 of 45 checks passed

rok removed the awaiting merge label May 14, 2025

EnricoMi deleted the preserve-order-2 branch May 14, 2025 07:38

gitmodimo mentioned this pull request May 15, 2025

[C++][Dataset] Preserve order when writing dataset using TeeNode #46454

Open

EnricoMi mentioned this pull request May 23, 2025

GH-26818: [C++][Python] Preserve order when writing dataset multi-threaded (#44470) G-Research/arrow#5

Merged
