Flink: Hold back equality delete converter watermark until completion #17038

Merged
huaxingao merged 1 commit into apache:main from mxm:resolve-equality-deletes.test-flakiness on Jul 3, 2026
@mxm mxm commented Jul 2, 2026

This PR fixes the CI test flakiness seen in:

TestConvertEqualityDeletesE2E > testConvertEqualityDeletesE2E(String) > [1] staging FAILED
    org.opentest4j.AssertionFailedError:
    expected: 2L
     but was: 1L
        at app//org.apache.iceberg.flink.maintenance.api.TestConvertEqualityDeletesE2E.lambda$testConvertEqualityDeletesE2E$1(TestConvertEqualityDeletesE2E.java:127)

e.g.: https://github.com/apache/iceberg/actions/runs/28499724055/job/84473933155?pr=16293

Flink's table maintenance framework maintains a lock to prevent concurrent execution of maintenance tasks. The component responsible for removing the lock (LockRemover) releases the task lock once it receives a watermark later than the task's start timestamp.

EqualityConvertPlanner emits phase watermarks in the middle of its execution, and the committer forwarded them, so the lock was released before the run completed. The maintenance framework then started the next cycle concurrently, and both cycles reprocessed the same uncommitted staging snapshot. The overlapping commits conflicted, and one advanced its commit marker while dropping its deletion vector, causing the test flakiness.

The solution is to forward watermarks from the committer only after it finishes the conversion cycle, to ensure mutually exclusive execution of the maintenance tasks.
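
The hold-back idea can be illustrated with a minimal sketch in plain Java, without any Flink dependencies. The class names `HoldBackCommitter` and `SimpleLockRemover`, their methods, and the strictly-greater-than comparison against the start timestamp are illustrative assumptions, not the actual operators changed in this PR.

```java
import java.util.function.LongConsumer;

/** Hypothetical stand-in for a committer that holds back watermarks while a cycle runs. */
final class HoldBackCommitter {
  private final LongConsumer downstream;
  private long heldWatermark = Long.MIN_VALUE;
  private boolean cycleInProgress = false;

  HoldBackCommitter(LongConsumer downstream) {
    this.downstream = downstream;
  }

  void startCycle() {
    cycleInProgress = true;
  }

  /** Buffers the watermark during a cycle; forwards it immediately otherwise. */
  void processWatermark(long timestamp) {
    if (cycleInProgress) {
      heldWatermark = Math.max(heldWatermark, timestamp);
    } else {
      downstream.accept(timestamp);
    }
  }

  /** Marks the cycle as done and forwards the highest watermark seen during it, if any. */
  void finishCycle() {
    cycleInProgress = false;
    if (heldWatermark != Long.MIN_VALUE) {
      downstream.accept(heldWatermark);
      heldWatermark = Long.MIN_VALUE;
    }
  }
}

/** Hypothetical stand-in for the lock remover: unlocks on a watermark past the task start. */
final class SimpleLockRemover {
  private final long taskStartTimestamp;
  private boolean locked = true;

  SimpleLockRemover(long taskStartTimestamp) {
    this.taskStartTimestamp = taskStartTimestamp;
  }

  void processWatermark(long timestamp) {
    // Assumption: "past" means strictly greater than the start timestamp.
    if (timestamp > taskStartTimestamp) {
      locked = false;
    }
  }

  boolean isLocked() {
    return locked;
  }
}
```

In this sketch, wiring a `HoldBackCommitter` to `remover::processWatermark` keeps the lock held for any watermark that arrives between `startCycle()` and `finishCycle()`; the lock is released only when `finishCycle()` forwards the buffered watermark.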

@github-actions github-actions Bot added the flink label Jul 2, 2026

@huaxingao huaxingao left a comment

LGTM

@huaxingao huaxingao merged commit 49b89a8 into apache:main Jul 3, 2026
31 checks passed
@huaxingao

Thanks @mxm for the PR! Thanks @wombatu-kun for the review!

mxm commented Jul 3, 2026

Thanks for reviewing @wombatu-kun @huaxingao! Thanks for merging @huaxingao!

Here is the backport: #17067

nastra pushed a commit that referenced this pull request Jul 3, 2026