WIP: librz/diff: speedup distance and bytes diff with FastCDC chunking #6485
Draft
notxvilka wants to merge 2 commits into
…edit distance

`rz-diff -d myers` runs an exact O(ND) edit distance over the whole file, so its cost grows with the square of the edit distance and becomes impractical on large, substantially different inputs.

Add `rz_diff_cdc_distance()`: for inputs above 1 MiB it splits both buffers into content-defined chunks with FastCDC (Gear rolling hash, normalized chunking), treats byte-identical chunks (hash match reconfirmed with `memcmp`) as in-order anchors, and runs Myers only inside the gaps between anchors, summing the per-gap distances. The anchors are verified-identical, so the result is the exact sum of per-gap distances and a tight upper bound on the global distance.

Below the threshold the call forwards to `rz_diff_myers_distance()`, so small-file results are unchanged. The Gear table is derived locally, keeping the API thread-safe.
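The gap-summing step described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the PR's code: the `sk_anchor` struct and every `sk_*` name are hypothetical, the anchors are assumed to be already verified, in order and non-overlapping, and a plain LCS-based insert/delete distance stands in for `rz_diff_myers_distance()`.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical anchor: `len` bytes known to be identical in A and B. */
typedef struct {
	size_t a_off;
	size_t b_off;
	size_t len;
} sk_anchor;

/* Stand-in for Myers: insert/delete-only distance via a rolling LCS row, O(n*m). */
static size_t sk_indel_distance(const uint8_t *a, size_t n, const uint8_t *b, size_t m) {
	size_t *row = calloc(m + 1, sizeof(size_t));
	if (!row) {
		return SIZE_MAX;
	}
	for (size_t i = 1; i <= n; i++) {
		size_t diag = 0; /* LCS(i-1, j-1) */
		for (size_t j = 1; j <= m; j++) {
			size_t up = row[j]; /* LCS(i-1, j) */
			if (a[i - 1] == b[j - 1]) {
				row[j] = diag + 1;
			} else if (row[j - 1] > up) {
				row[j] = row[j - 1];
			}
			diag = up;
		}
	}
	size_t lcs = row[m];
	free(row);
	return n + m - 2 * lcs;
}

/* Sum the distances of the gaps before, between and after the anchors. */
static size_t sk_anchored_distance(const uint8_t *a, size_t a_len, const uint8_t *b, size_t b_len,
	const sk_anchor *anchors, size_t n_anchors) {
	size_t total = 0, a_pos = 0, b_pos = 0;
	for (size_t k = 0; k <= n_anchors; k++) {
		/* A gap ends at the next anchor, or at the end of both buffers. */
		size_t a_end = k < n_anchors ? anchors[k].a_off : a_len;
		size_t b_end = k < n_anchors ? anchors[k].b_off : b_len;
		size_t d = sk_indel_distance(a + a_pos, a_end - a_pos, b + b_pos, b_end - b_pos);
		if (d == SIZE_MAX) {
			return SIZE_MAX;
		}
		total += d;
		if (k < n_anchors) {
			/* Skip the anchor itself: it contributes zero distance. */
			a_pos = a_end + anchors[k].len;
			b_pos = b_end + anchors[k].len;
		}
	}
	return total;
}
```

For `abcXdefgh` versus `abcdefYgh` with anchors on `abc`, `def` and `gh`, the two non-empty gaps contribute 1 each, so the sum is 2, which in this case equals the unanchored distance. An anchor is not guaranteed to lie on an optimal alignment, which is why the sum is an upper bound in general.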
Add a `-t cdc-bytes` mode, the unified-diff counterpart to `-d cdc-myers`. For inputs above 1 MiB the byte matcher is anchored on byte-identical FastCDC chunks and the exact difflib matcher runs only inside the gaps between anchors; the anchors themselves are emitted as equal spans. The resulting matches feed the existing opcodes/grouping/rendering pipeline unchanged, so the unified text/JSON output format is identical to `bytes`.

Because content-defined chunk boundaries re-sync after insertions and deletions, cdc-bytes stays aligned across shifts that make plain difflib degenerate into a huge noisy diff, while running markedly faster. Below the threshold `rz_diff_bytes_cdc_new()` falls back to the exact matcher, so small-file output is byte-for-byte identical to `rz_diff_bytes_new()`.

The FastCDC anchoring is factored out of the cdc-myers distance code and shared, so both paths reuse the same thread-safe chunker.

Adds unit tests (below-threshold equivalence, large-identical, and a representation-independent reconstruction check across shifting edits) and rz-diff db tests for the new mode.
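The re-sync behaviour comes from cut points that depend only on nearby content rather than on absolute offsets. A minimal sketch of Gear-based chunking with two masks (the normalized-chunking idea) might look like this. All `sk_*` names, the size constants, the mask widths and the seed are assumptions chosen for illustration; the PR's actual parameters are not shown in this conversation.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sizes; real values would be far larger for 1 MiB+ inputs. */
#define SK_MIN ((size_t)64)
#define SK_AVG ((size_t)256)
#define SK_MAX ((size_t)1024)

/* SplitMix64 step, used here only to fill the Gear table deterministically. */
static uint64_t sk_splitmix64(uint64_t *state) {
	uint64_t z = (*state += 0x9e3779b97f4a7c15ULL);
	z = (z ^ (z >> 30)) * 0xbf58476d1ce4e5b9ULL;
	z = (z ^ (z >> 27)) * 0x94d049bb133111ebULL;
	return z ^ (z >> 31);
}

static void sk_gear_init(uint64_t gear[256], uint64_t seed) {
	for (int i = 0; i < 256; i++) {
		gear[i] = sk_splitmix64(&seed);
	}
}

/* Length of the next chunk starting at buf (always >= 1 when len >= 1). */
static size_t sk_next_chunk(const uint8_t *buf, size_t len, const uint64_t gear[256]) {
	if (len <= SK_MIN) {
		return len;
	}
	size_t limit = len < SK_MAX ? len : SK_MAX;
	size_t normal = limit < SK_AVG ? limit : SK_AVG;
	const uint64_t mask_hard = 0xffc0000000000000ULL; /* 10 bits: cuts are rare before SK_AVG */
	const uint64_t mask_easy = 0xfc00000000000000ULL; /* 6 bits: cuts are likely after SK_AVG */
	uint64_t h = 0;
	size_t i = SK_MIN; /* bytes before SK_MIN are skipped, not hashed */
	for (; i < normal; i++) {
		h = (h << 1) + gear[buf[i]];
		if (!(h & mask_hard)) {
			return i + 1;
		}
	}
	for (; i < limit; i++) {
		h = (h << 1) + gear[buf[i]];
		if (!(h & mask_easy)) {
			return i + 1;
		}
	}
	return limit; /* forced cut at SK_MAX or at end of input */
}

/* Record chunk end offsets into cuts[]; returns the number of chunks found. */
static size_t sk_split(const uint8_t *buf, size_t len, const uint64_t gear[256],
	size_t *cuts, size_t max_cuts) {
	size_t pos = 0, n = 0;
	while (pos < len && n < max_cuts) {
		pos += sk_next_chunk(buf + pos, len - pos, gear);
		cuts[n++] = pos;
	}
	return n;
}
```

Because the hash shifts left by one bit per byte, bytes older than 64 positions no longer influence it, so after an insertion the cut decisions tend to line up again once the chunker is past the edited region. How quickly that happens on real binaries is something benchmarks would have to show.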
wargio (Member) reviewed Jun 8, 2026 and left a comment:
I was kinda writing this. Also, gear is not random in your implementation. Actually, I don't even see it initialized.
wargio reviewed Jun 8, 2026
Comment on lines +109 to +112:

    ut64 state = 0xdeadbeefcafebabeULL;
    for (int i = 0; i < 256; i++) {
    	cfg->gear[i] = cdc_splitmix64(&state);
    }

this should be random and you can pre-calculate this directly in a static table.
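For context, a fixed-seed fill like the quoted lines always produces the same 256 values, so it could in principle be generated once and pasted in as a static table. The sketch below shows one way to do that. It assumes `cdc_splitmix64()` is the standard public-domain SplitMix64 step, which the conversation does not confirm; the `gt_*` names and the `gear_table` identifier are hypothetical.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Standard SplitMix64 step (assumed to match the PR's cdc_splitmix64). */
static uint64_t gt_splitmix64(uint64_t *state) {
	uint64_t z = (*state += 0x9e3779b97f4a7c15ULL);
	z = (z ^ (z >> 30)) * 0xbf58476d1ce4e5b9ULL;
	z = (z ^ (z >> 27)) * 0x94d049bb133111ebULL;
	return z ^ (z >> 31);
}

/* Fill the table from a seed; the same seed always yields the same table. */
static void gt_fill(uint64_t gear[256], uint64_t seed) {
	for (int i = 0; i < 256; i++) {
		gear[i] = gt_splitmix64(&seed);
	}
}

/* Emit the table as a C initializer, four entries per line. */
static void gt_print_static(FILE *out, const uint64_t gear[256]) {
	fprintf(out, "static const uint64_t gear_table[256] = {\n");
	for (int i = 0; i < 256; i++) {
		fprintf(out, "\t0x%016" PRIx64 "ULL,%s", gear[i], (i % 4 == 3) ? "\n" : "");
	}
	fprintf(out, "};\n");
}
```

Calling `gt_fill()` with the seed from the quoted lines and then `gt_print_static(stdout, gear)` would print an initializer that can be committed as-is, which removes the per-call setup while keeping the values reproducible.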
notxvilka (Contributor, Author):
It's WIP, and I'm already experimenting with a different implementation. This code doesn't handle some cases of big gaps and reshuffling well. You can continue work on your implementation; I will focus on creating realistic test cases and benchmarks then.
Your checklist for this pull request
… RZ_API function and struct this PR changes.
… RZ_API).
Detailed description
A clean follow-up of #6458
Test plan
`rz-diff -t bytes` and `rz-diff -t cdc-bytes` on various big files (bigger than 1 MiB)