fix: clear CommitState buffer on new commit to prevent duplicate rows #15651

t3hw wants to merge 2 commits into apache:main from
Conversation
Stale DataWritten events accumulate in commitBuffer when Coordinator.doCommit() throws, causing duplicate rows when committed alongside new events in a single RowDelta. Clear the buffer at the top of startNewCommit() so each cycle starts clean.
@bryanck hey, I was researching a duplication issue and it led me to this proposed solution.
Hi, thanks for looking into this issue. I've been working on the same problem in our fork (based on https://github.com/databricks/iceberg-kafka-connect) and wanted to share some observations about the partial commit (timeout) scenario that this fix may not fully address. When commitTimeoutMs is exceeded, the Coordinator performs a partial commit — it commits only the `DataWritten` events received so far and moves on. Workers that were still processing at the time will send their `DataWritten` events after the partial commit has already completed. Here's the timeline:
With this PR's fix, `commitBuffer.clear()` in `startNewCommit()` discards Worker 0's `DataWritten(A)`. Since the partial commit already advanced the control topic consumer offsets (via `commitConsumerOffsets()`), this event won't be re-consumed. The data files exist on storage but are never committed to the Iceberg table, resulting in data loss.

The root cause is broader

The underlying issue isn't just stale events lingering in the buffer; it's that `tableCommitMap()` doesn't distinguish events by commitId. Even if stale events survive in the buffer (which they should, since they correspond to data files already written to storage), committing them together with current events in a single `RowDelta` gives every file the same sequence number.

Alternative approach: commit-id-based separation

Instead of discarding stale events, a safer fix would be to separate them by commitId and commit each group in its own `RowDelta`. This ensures:

- stale data is still committed, so late events cause no data loss
- each group receives its own Iceberg sequence number, so equality deletes from newer groups still apply to older data
This approach also covers the failed-commit and recovery scenarios your PR addresses, since stale events from any prior cycle are naturally separated by their commitId. We've already implemented this approach in our fork and will be deploying it to production soon. We'll also be opening a PR here shortly.
prevent duplicate rows without data loss

The previous fix (edef7f9) cleared `commitBuffer` in `startNewCommit()` to prevent stale `DataWritten` events from producing same-sequence equality deletes. However, this discards late events from workers that finished after a partial commit (timeout) — those workers already committed source offsets via `sendOffsetsToTransaction()`, so the data is unrecoverable.

Instead, group events by commitId and commit each group as a separate RowDelta with its own Iceberg sequence number. Stale groups are committed first (lower seq), so equality deletes from newer groups can apply to older data (`data.seq < delete.seq`).

Error handling uses a three-stage escalation:

1. Blocking retries (configurable, default 3) preserve ordering
2. Failure policy (`fail` stops the connector, `non-blocking` proceeds with ordering inversion risk)
3. TTL eviction (default 1 hour) with ERROR log of orphaned file paths

Other changes:

- Per-group offset computation prevents stale snapshots from filtering out current envelopes (offset poisoning fix)
- Min-offset consumer advancement on partial success bounds re-consumption on restart
- JMX MBean (`CommitStateMXBean`) exposes buffer metrics
- Three new configs: `stale-ttl-ms`, `stale-max-blocking-retries`, `stale-failure-policy`
I'd like to share some context on how this fix was developed and what alternatives were considered. Disclaimer: I used Claude extensively for the investigation, design, and implementation of this fix. I validated every conclusion against the code, but want to be transparent about the tooling.

Approaches rejected

After @koodin9's comment identified the data loss with the original buffer-clearing approach, several alternatives were considered before arriving at per-commitId group separation.
Future enhancement: Worker-side rollforward

The current fix handles stale events at the Coordinator. A cleaner long-term approach would eliminate them at the source. Today, Workers fire-and-forget their `DataWritten` events with no awareness of whether the commit cycle they belong to is still open; a Worker-side rollforward would instead carry late events into the next cycle. This eliminates stale events entirely, removing the need for per-group ordering, retry escalation, and TTL eviction. It's deferred because it touches Worker, Channel, SinkWriter, and the control topic protocol — a scope that warrants its own PR.
@jerryzhujing @t3hw |
Fix: Per-commitId group separation prevents duplicate rows and data loss after failed commits
Background
The previous version of this fix cleared `commitBuffer` unconditionally in `startNewCommit()`. While simple, @koodin9 identified that this causes data loss in the partial commit (timeout) scenario: workers that finish after the timeout have already committed source offsets atomically via `sendOffsetsToTransaction()`, so their `DataWritten` events cannot be re-produced. Clearing the buffer discards them, orphaning data files on storage with no way to recover.

This prompted a redesign. Rather than discarding stale events, we commit them separately: each commitId group in its own `RowDelta` with a distinct Iceberg sequence number.

Problem
When `Coordinator.doCommit()` fails, `clearResponses()` is never called. Stale `DataWritten` events persist in `commitBuffer`. On the next successful commit, `tableCommitMap()` returns all events, stale and current, in a single `RowDelta`. All files receive the same sequence number, making equality deletes powerless (`data.seq < delete.seq` required, strictly less than). Duplicate rows appear.

This is workload-agnostic: it affects CDC, upsert, and append-only modes.
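The strict inequality is the crux, and can be stated as a one-line predicate. This is a sketch of the rule only, not Iceberg's implementation; `deleteApplies` is a hypothetical name:

```java
public class SequenceRule {
  // Iceberg applies an equality delete to a data file only when the data
  // file's sequence number is strictly lower than the delete's sequence
  // number (data.seq < delete.seq).
  static boolean deleteApplies(long dataSeq, long deleteSeq) {
    return dataSeq < deleteSeq;
  }
}
```

When stale and current events share one `RowDelta`, the stale data file and the current equality delete both receive sequence number n, `deleteApplies(n, n)` is false, and the stale row survives as a duplicate. Committing the stale group first gives it n and the current group n + 1, so `deleteApplies(n, n + 1)` holds.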
Fix
CommitState.java

`startNewCommit()` no longer clears the buffer. Instead, `tableCommitGroups()` groups events by commitId (preserving insertion order via `LinkedHashMap`) and returns an ordered list of `CommitGroup` objects per table. Stale groups sort before current groups because they were consumed from the control topic earlier.
Additional changes:

- `removeEnvelopes(Collection)` — selectively removes committed groups' envelopes
- `recordGroupFailure()`, `isGroupBlocking()` — per-commitId retry counter
- Groups older than `stale-ttl-ms` (default 1 hour) are evicted with an `ERROR` log containing orphaned file paths for manual recovery
- `CommitStateMXBean` — JMX interface exposing `EvictedStaleEventCount`, `StaleGroupCount`, `BufferSize`

Coordinator.java
`doCommit()` rewritten to iterate commitId groups per table sequentially (oldest first). Each group is committed via the existing `commitToTable()` method, which now accepts a `commitId` parameter.

Error handling follows a three-stage escalation:

1. Blocking retries (default 3): a failed stale group is retried while blocking subsequent groups for the same table to preserve sequence number ordering.
2. Failure policy — `fail` (default): throws `ConnectException`, moving the connector to `FAILED` state; `non-blocking`: proceeds with the current group, accepting ordering inversion risk.
3. TTL eviction: groups older than `stale-ttl-ms` (default 1 hour) are evicted with an `ERROR` log of orphaned file paths.

Stale groups write only their own envelope offsets to the snapshot (not the global consumer position), preventing an offset poisoning bug where subsequent groups' envelopes would be incorrectly filtered out.
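The escalation can be sketched as a small decision function. This is a hand-written illustration under assumed names (`FailurePolicy`, `StaleAction` are hypothetical); the real Coordinator logic differs in detail, notably in how failures and ages are tracked:

```java
public class StaleEscalation {
  enum FailurePolicy { FAIL, NON_BLOCKING }
  enum StaleAction { RETRY_BLOCKING, FAIL_CONNECTOR, SKIP_GROUP, EVICT }

  // Decide what to do with a stale group whose commit just failed.
  // Stage 1: retry while blocking later groups (preserves seq ordering).
  // Stage 2: once retries are exhausted, apply the configured policy.
  // Stage 3: once the group's age exceeds the TTL, evict it and log
  //          the orphaned file paths for manual recovery.
  static StaleAction onFailure(int failures, int maxRetries,
                               long ageMs, long ttlMs, FailurePolicy policy) {
    if (ageMs >= ttlMs) {
      return StaleAction.EVICT;
    }
    if (failures <= maxRetries) {
      return StaleAction.RETRY_BLOCKING;
    }
    return policy == FailurePolicy.FAIL
        ? StaleAction.FAIL_CONNECTOR
        : StaleAction.SKIP_GROUP;
  }
}
```

The ordering of the checks (TTL first, then retries, then policy) is an assumption of this sketch; the point is that each stage only engages after the previous one is exhausted.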
When the buffer is not fully drained after a commit, consumer offsets are advanced to the
minimum uncommitted envelope offset per partition, bounding re-consumption on restart.
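The min-offset computation can be sketched as follows; `EnvelopePosition` and `safeAdvanceOffsets` are hypothetical names for illustration, not the PR's actual API:

```java
import java.util.*;

public class MinOffsetAdvance {
  // A buffered envelope's position on the control topic.
  record EnvelopePosition(int partition, long offset) {}

  // For each control-topic partition, the consumer may only be advanced
  // to the minimum offset among envelopes that remain uncommitted:
  // advancing past any of them would lose those events on restart,
  // while stopping at the minimum merely bounds re-consumption.
  static Map<Integer, Long> safeAdvanceOffsets(List<EnvelopePosition> uncommitted) {
    Map<Integer, Long> min = new HashMap<>();
    for (EnvelopePosition e : uncommitted) {
      min.merge(e.partition(), e.offset(), Math::min);
    }
    return min;
  }
}
```

On restart the consumer re-reads from these offsets, so anything at or after the minimum uncommitted envelope is seen again, and nothing still in the buffer is skipped.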
Channel.java

Added `commitConsumerOffsetsTo(Map)` for partial consumer offset advancement.

IcebergSinkConfig.java
Three new configuration properties:
- `iceberg.control.commit.stale-ttl-ms` — how long a stale group may remain in the buffer before eviction (default 1 hour)
- `iceberg.control.commit.stale-max-blocking-retries` — blocking retries before the failure policy applies (default 3)
- `iceberg.control.commit.stale-failure-policy` — `fail` or `non-blocking` (default `fail`)

Test coverage
CommitState unit tests (`TestCommitState.java` — 8 tests):

- `testTableCommitGroupsSingleCommitId` — single group, correct envelope count
- `testTableCommitGroupsMultipleCommitIds` — stale + current groups, ordered correctly
- `testTableCommitGroupsPreservesInsertionOrder` — interleaved commitIds, first-seen order
- `testStartNewCommitDoesNotClearBuffer` — stale events survive across cycles
- `testRemoveEnvelopesSelectiveRemoval` — only specified envelopes removed
- `testIsBufferEmpty` — empty after clear, not empty after add
- `testIsCommitReady` — partition count filtering by commitId (existing)
- `testGetValidThroughTs` — timestamp calculation (existing)

Coordinator integration tests (`TestCoordinator.java` — 3 new):

- `testMultiGroupCommitProducesSequentialSnapshots` — stale + current groups produce 2 snapshots with strictly increasing sequence numbers and distinct commitIds
- `testStaleGroupFailureSkipsCurrentGroup` — bad stale data blocks current group, 0 snapshots, buffer retained, no `CommitComplete` sent
- `testStaleSucceedsCurrentFails` — stale group commits (1 snapshot), current group's envelopes retained, no `CommitComplete` sent

Recovery scenario tests (`TestRecoveryScenario.java` — 1 new):

- `testSequentialCommitsProduceSeparateSnapshots` — sequential commits produce 2 snapshots with increasing sequence numbers (prerequisite for equality deletes to work across batches; full CDC proof is in `TestCDCDeltaWriterTableLevel`)

Verification
All tests pass:
Red-green: removing the per-group offset computation in `doCommit()` (reverting to the global `ctlOffsets` for all groups) causes `testMultiGroupCommitProducesSequentialSnapshots` to fail — the current group's envelopes are filtered out by the stale group's snapshot offsets. Restoring per-group offsets fixes the test.