WIP: librz/diff: speedup distance and bytes diff with FastCDC chunking#6485

Draft
notxvilka wants to merge 2 commits into dev from asan-diff-cdc-myers

@notxvilka
Contributor

Your checklist for this pull request

  • I've read the guidelines for contributing to this repository.
  • I made sure to follow the project's coding style.
  • I've documented every RZ_API function and struct this PR changes.
  • I've added tests that prove my changes are effective (required for changes to RZ_API).
  • I've updated the Rizin book with the relevant information (if needed).
  • I've used AI tools to generate fully or partially these code changes and I'm sure the changes are not copyrighted by somebody else.

Detailed description

A clean follow-up to #6458.

Test plan

  • Compare rz-diff -t bytes and rz-diff -t cdc-bytes on various large files (larger than 1 MiB)

XVilka added 2 commits June 8, 2026 00:22
…edit distance

rz-diff -d myers runs an exact O(ND) edit distance over the whole file, so
its cost grows with the product of the input size N and the edit distance
D and becomes impractical on large, substantially different inputs.
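
For reference, the O(ND) behaviour can be seen in a minimal sketch of the greedy Myers distance loop. This is not the rizin implementation: `myers_distance` is a hypothetical name, it counts only insertions and deletions, and it uses a plain heap array for the furthest-reaching points. The outer loop runs once per edit and the inner loop visits up to 2D+1 diagonals, which is where the cost on very different inputs comes from.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper, not part of librz: insert/delete edit distance
 * between a[0..n) and b[0..m) using the greedy Myers O(ND) search.
 * Returns -1 if the working array cannot be allocated. */
static long myers_distance(const uint8_t *a, size_t n, const uint8_t *b, size_t m) {
	assert((a || !n) && (b || !m));
	const long N = (long)n, M = (long)m, max = N + M;
	/* v[max + k] is the furthest x reached so far on diagonal k = x - y. */
	long *v = calloc((size_t)(2 * max + 2), sizeof(long));
	if (!v) {
		return -1;
	}
	long result = -1;
	for (long d = 0; d <= max && result < 0; d++) {
		for (long k = -d; k <= d; k += 2) {
			long x;
			if (k == -d || (k != d && v[max + k - 1] < v[max + k + 1])) {
				x = v[max + k + 1]; /* move down: take a byte from b */
			} else {
				x = v[max + k - 1] + 1; /* move right: drop a byte from a */
			}
			long y = x - k;
			/* Follow the diagonal while the bytes match (the "snake"). */
			while (x < N && y < M && a[x] == b[y]) {
				x++;
				y++;
			}
			v[max + k] = x;
			if (x >= N && y >= M) {
				result = d; /* both inputs consumed after d edits */
				break;
			}
		}
	}
	free(v);
	return result;
}
```

With the example pair from Myers' paper, `myers_distance((const uint8_t *)"ABCABBA", 7, (const uint8_t *)"CBABAC", 6)` returns 5.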

Add rz_diff_cdc_distance(): for inputs above 1 MiB it splits both buffers
into content-defined chunks with FastCDC (Gear rolling hash, normalized
chunking), treats byte-identical chunks (hash match reconfirmed with memcmp)
as in-order anchors, and runs Myers only inside the gaps between anchors,
summing the per-gap distances. The anchors are verified-identical, so the
result is the exact sum of per-gap distances and a tight upper bound on the
global distance. Below the threshold the call forwards to
rz_diff_myers_distance(), so small-file results are unchanged. The Gear
table is derived locally, keeping the API thread-safe.
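
The upper-bound property can be illustrated with a small sketch. It assumes a single anchor and uses a simple O(n*m) LCS table as a stand-in for Myers; `indel_distance` and `anchored_distance` are hypothetical names, not functions from this PR. When the anchor lies on an optimal alignment the anchored sum equals the global distance, and when content is reordered across the anchor the sum is larger.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Insert/delete edit distance computed as n + m - 2 * LCS(a, b).
 * A quadratic stand-in for Myers, kept short for illustration. */
static size_t indel_distance(const char *a, size_t n, const char *b, size_t m) {
	size_t *row = calloc(m + 1, sizeof(size_t)); /* LCS lengths of the previous row */
	assert(row);
	for (size_t i = 1; i <= n; i++) {
		size_t diag = 0; /* LCS value at (i - 1, j - 1) */
		for (size_t j = 1; j <= m; j++) {
			size_t up = row[j]; /* LCS value at (i - 1, j) */
			if (a[i - 1] == b[j - 1]) {
				row[j] = diag + 1;
			} else if (row[j - 1] > row[j]) {
				row[j] = row[j - 1];
			}
			diag = up;
		}
	}
	size_t lcs = row[m];
	free(row);
	return n + m - 2 * lcs;
}

/* Sums the distances of the two gaps around one verified-identical anchor:
 * a[a_off .. a_off + len) must equal b[b_off .. b_off + len). */
static size_t anchored_distance(const char *a, size_t n, size_t a_off,
	const char *b, size_t m, size_t b_off, size_t len) {
	assert(a_off + len <= n && b_off + len <= m);
	assert(!memcmp(a + a_off, b + b_off, len));
	size_t before = indel_distance(a, a_off, b, b_off);
	size_t after = indel_distance(a + a_off + len, n - a_off - len,
		b + b_off + len, m - b_off - len);
	return before + after;
}
```

For "abXcd" against "abXce" with the anchor "X", both the anchored sum and the global distance are 2. For "abXcd" against "cdXab" the anchored sum is 8 while the global distance is 6, because the best alignment matches "ab" or "cd" across the anchor.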

Add a `-t cdc-bytes` mode, the unified-diff counterpart to `-d cdc-myers`.
For inputs above 1 MiB the byte matcher is anchored on byte-identical
FastCDC chunks and the exact difflib matcher runs only inside the gaps
between anchors; the anchors themselves are emitted as equal spans. The
resulting matches feed the existing opcodes/grouping/rendering pipeline
unchanged, so the unified text/JSON output format is identical to `bytes`.

Because content-defined chunk boundaries re-sync after insertions and
deletions, cdc-bytes stays aligned across shifts that make plain difflib
degenerate into a huge noisy diff, while running markedly faster. Below
the threshold rz_diff_bytes_cdc_new() falls back to the exact matcher, so
small-file output is byte-for-byte identical to rz_diff_bytes_new().
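
The re-sync behaviour can be sketched with a stripped-down Gear chunker. This is a simplification and not FastCDC itself: it uses a single mask, no minimum or maximum chunk size, no normalized chunking, and never resets the hash, so each cut decision depends only on the previous 64 bytes. All names (`splitmix64`, `gear_init`, `gear_cuts`) and the seed are assumptions for illustration, not code from this PR.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Standard SplitMix64 step, used here to fill the Gear table. */
static uint64_t splitmix64(uint64_t *state) {
	uint64_t z = (*state += 0x9E3779B97F4A7C15ULL);
	z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
	z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
	return z ^ (z >> 31);
}

/* Fills the 256-entry Gear table from a caller-supplied seed. */
static void gear_init(uint64_t gear[256], uint64_t seed) {
	for (int i = 0; i < 256; i++) {
		gear[i] = splitmix64(&seed);
	}
}

/* Stores the end offset of every chunk boundary found in buf into cuts
 * (at most max_cuts entries) and returns how many were stored.
 * A boundary is declared wherever the masked rolling hash is zero. */
static size_t gear_cuts(const uint8_t *buf, size_t len, const uint64_t gear[256],
	uint64_t mask, size_t *cuts, size_t max_cuts) {
	assert(buf || !len);
	uint64_t h = 0;
	size_t n = 0;
	for (size_t i = 0; i < len && n < max_cuts; i++) {
		/* Each shift pushes older bytes out, so h covers a 64-byte window. */
		h = (h << 1) + gear[buf[i]];
		if (!(h & mask)) {
			cuts[n++] = i + 1;
		}
	}
	return n;
}
```

Because the hash forgets everything older than 64 bytes, inserting one byte into a buffer moves every boundary that lies more than 64 bytes after the insertion by exactly one position, so the chunks behind it stay byte-identical. A chunker with size limits, such as the one described above, re-syncs only probabilistically rather than by construction.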

The FastCDC anchoring is factored out of the cdc-myers distance code and
shared, so both paths reuse the same thread-safe chunker. Adds unit tests
(below-threshold equivalence, large-identical, and a representation-
independent reconstruction check across shifting edits) and rz-diff db
tests for the new mode.
Member

@wargio left a comment

I was kinda writing this. Also, gear is not random in your implementation. Actually, I don't even see it initialized.

Comment thread librz/diff/cdc.c
Comment on lines +109 to +112
ut64 state = 0xdeadbeefcafebabeULL;
for (int i = 0; i < 256; i++) {
	cfg->gear[i] = cdc_splitmix64(&state);
}
Member

this should be random and you can pre-calculate this directly in a static table.
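
One way to read this suggestion, sketched under assumptions: since a fixed seed always produces the same 256 values, the table can be generated once offline and pasted into the source as a `static const` array. The sketch below assumes `cdc_splitmix64` is the standard SplitMix64 step, which the PR excerpt does not show; `gear_fill` and `gear_emit` are hypothetical names.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Standard SplitMix64 step (assumed to match the PR's cdc_splitmix64). */
static uint64_t splitmix64(uint64_t *state) {
	uint64_t z = (*state += 0x9E3779B97F4A7C15ULL);
	z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
	z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
	return z ^ (z >> 31);
}

/* Fills table with 256 pseudo-random values derived from seed.
 * The same seed always yields the same table. */
static void gear_fill(uint64_t table[256], uint64_t seed) {
	for (int i = 0; i < 256; i++) {
		table[i] = splitmix64(&seed);
	}
}

/* Writes the table as C source, four entries per line, so the output
 * can be pasted into a file as a precomputed static table. */
static void gear_emit(FILE *out, const uint64_t table[256]) {
	fprintf(out, "static const uint64_t gear_table[256] = {\n");
	for (int i = 0; i < 256; i++) {
		fprintf(out, "%s0x%016" PRIx64 "ULL,%s",
			(i % 4 == 0) ? "\t" : " ", table[i], (i % 4 == 3) ? "\n" : "");
	}
	fprintf(out, "};\n");
}
```

Calling `gear_fill(table, 0xdeadbeefcafebabeULL)` followed by `gear_emit(stdout, table)` prints an initializer that removes the per-call derivation while keeping the values stable across runs and threads.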

@notxvilka
Contributor Author

It's WIP, and I'm already experimenting with a different implementation. This code doesn't handle some cases of big gaps and reshuffling well. You can continue working on your implementation; I will focus on creating realistic test cases and benchmarks then.

@notxvilka changed the title from "librz/diff: speedup distance and bytes diff with FastCDC chunking" to "WIP: librz/diff: speedup distance and bytes diff with FastCDC chunking" on Jun 8, 2026