@Tuanlinh12312

Resolves INT-5965.

Maillew and others added 27 commits February 9, 2026 21:18
After merging, I will rebase `develop-v1.6.0` onto
`develop-new-hintstore`
closes INT-5318, INT-5320, INT-5392

---------

Co-authored-by: Stephen Hwang <stephenh@intrinsictech.xyz>
Co-authored-by: stephenh-axiom-xyz <stephenh@intrinsictech.xyz>
- add a `generate_gpu_memory_chart` function that generates a GPU memory
usage chart and table
- [sample
markdown](https://hackmd.io/yugFLvPqSm2HraUqDvcFHQ?view#GPU-Memory-Usage)
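
As a rough illustration of the table half of that function, here is a minimal sketch (the signature, phase names, and layout below are assumptions for illustration, not the actual `openvm-prof` code, which also emits the SVG chart):

```rust
/// Sketch only: render (phase, peak GiB) samples as a markdown table.
/// The real `generate_gpu_memory_chart` has a different signature and
/// also produces the chart itself.
fn gpu_memory_table(samples: &[(String, f64)]) -> String {
    let mut md = String::from("| phase | peak GPU memory (GiB) |\n|---|---|\n");
    for (phase, gib) in samples {
        md.push_str(&format!("| {phase} | {gib:.2} |\n"));
    }
    md
}

fn main() {
    // Hypothetical data purely for illustration.
    let samples = vec![
        ("execute".to_string(), 1.3),
        ("tracegen".to_string(), 9.8),
        ("prove".to_string(), 14.2),
    ];
    println!("{}", gpu_memory_table(&samples));
}
```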
Updating `openvm-prof` to handle the new metrics from
openvm-org/stark-backend#184

I have tested it locally.
- replace `SegmentationLimits` with `SegmentationConfig` in `SystemConfig`
- add an `interaction_cell_weight` parameter to `SegmentationConfig` that
specifies how many cells an interaction contributes to each row
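
A minimal sketch of how a weighted cell estimate like this could look (the struct layout, field names, and formula are illustrative assumptions, not the actual `SegmentationConfig` definition):

```rust
/// Illustrative sketch only: field names and the estimate formula are
/// assumptions, not the actual openvm `SegmentationConfig`.
pub struct SegmentationConfig {
    /// Maximum number of (weighted) cells allowed in a segment.
    pub max_cells: usize,
    /// How many cells a single interaction is counted as on each row.
    pub interaction_cell_weight: usize,
}

impl SegmentationConfig {
    /// Estimate a chip's cell cost, counting each interaction as
    /// `interaction_cell_weight` extra cells per row.
    pub fn weighted_cells(&self, height: usize, width: usize, interactions: usize) -> usize {
        height * (width + interactions * self.interaction_cell_weight)
    }

    /// Whether the estimated cost exceeds the segment limit.
    pub fn should_segment(&self, weighted_cells: usize) -> bool {
        weighted_cells > self.max_cells
    }
}
```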
`main_cells_used` is inaccurate because trace height calculations are not
accurate when CUDA tracegen is enabled. To avoid confusion, don't emit
these metrics.
shuklaayush and others added 23 commits February 9, 2026 21:18
For large metrics, the Mermaid output is too much text.

Also switched to outputting detailed metrics in a separate markdown
file.
In the benchmark CI, we still cat the detailed metrics back into the
main markdown file.
The SVG chart is uploaded to public S3, similar to the flamegraphs, so it
can be viewed from the markdown.

For later:
- I think we could switch to storing the detailed metrics in a SQLite file
so they can be downloaded and processed more easily for complex metrics.
For now I just split them into a separate markdown file for simplicity.
segment_ctx.rs:
- DEFAULT_MAX_CELLS → DEFAULT_MAX_MEMORY = 15 GB
- max_cells → max_memory in SegmentationLimits
- set_max_cells → set_max_memory

ctx.rs:
- with_max_cells → with_max_memory

metered_cost.rs:
- Updated import to use DEFAULT_MAX_MEMORY

cli/src/commands/prove.rs:
- Updated import to DEFAULT_MAX_MEMORY
- segment_max_cells → segment_max_memory
- with_max_cells → with_max_memory

benchmarks/prove/src/bin/async_regex.rs:
- segment_max_cells → segment_max_memory
- set_max_cells → set_max_memory

benchmarks/prove/src/util.rs:
- segment_max_cells → segment_max_memory
- set_max_cells → set_max_memory

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Due to the difference in error types, switched to using `should_panic`
for now. We should go through later and switch everything back to the
precise error types.

closes INT-5904
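
For illustration, a minimal sketch of the temporary `should_panic` style (the function under test is a stand-in, not openvm code):

```rust
/// Stand-in for a verification call that currently panics instead of
/// returning a typed error; purely illustrative.
fn verify_with_bad_trace() {
    panic!("constraints not satisfied");
}

// Temporary approach: assert only that verification panics, rather than
// matching a precise error type such as a specific VerificationError variant.
#[test]
#[should_panic(expected = "constraints not satisfied")]
fn bad_trace_is_rejected() {
    verify_with_bad_trace();
}
```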
Also shows the delta for the "Parallel Proof Time (N provers)" column.
…and ECC (#2372)

- **Replace expensive BigUint computation during preflight with fast
native field arithmetic** (halo2curves/blstrs) for all known field types
(K256, P256, BN254, BLS12-381) and ECC curve operations. The trace
filler already re-executes with BigUint for constraint generation, so
preflight only needs to compute outputs for memory writes.
- **Cache modulus constants** with `once_cell::Lazy<BigUint>` to
eliminate repeated hex string parsing in
`get_field_type()`/`get_fp2_field_type()` and `get_curve_type()`
(previously called on every instruction); see the sketch after this list.
- **Cache `FieldType`/`CurveType` on executor structs** at construction
time, eliminating per-instruction BigUint comparisons in preflight.
- **Remove `DynArray` heap allocations** in preflight by using
stack-allocated typed arrays directly from adapter read/write, with
`as_flattened()` for zero-cost conversions.
- **Add `adapter()` accessor** to `FieldExpressionExecutor` for use by
custom `PreflightExecutor` implementations. SETUP operations and unknown
field types fall back to `run_field_expression_precomputed`.
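
For the modulus-constant caching, a minimal sketch of the pattern (the constant name, the BN254-only check, and the surrounding logic are illustrative; the actual field-type detection code differs):

```rust
use num_bigint::BigUint;
use num_traits::Num;
use once_cell::sync::Lazy;

// Parse the modulus hex string once and reuse the BigUint on every call,
// instead of re-parsing it per instruction. The constant name is illustrative.
static BN254_MODULUS: Lazy<BigUint> = Lazy::new(|| {
    BigUint::from_str_radix(
        "30644e72e131a029b85045b68181585d97816a916871ca8d3c208c16d87cfd47",
        16,
    )
    .expect("valid hex modulus")
});

fn is_bn254_modulus(modulus: &BigUint) -> bool {
    // Comparison against the cached constant; no string parsing on the hot path.
    modulus == &*BN254_MODULUS
}
```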

- [x] `cargo nextest run -p openvm-algebra-circuit` — all 18
non-pre-existing-failure tests pass (8 modular addsub/muldiv, 2 is_equal
positive, 8 fp2_chip)
- [x] `cargo nextest run -p openvm-ecc-circuit` — all 8 tests pass (3
add_ne, 5 double including nonzero_a)
- [x] `cargo clippy -p openvm-algebra-circuit -p openvm-ecc-circuit
--all-targets` — no new warnings

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
- keccak (`p3_inner_tracegen`): force `__noinline__`, -92% stack size
- `BigUintGPU::mod_div`: force `__noinline__`, -85% stack size
- sha256 (first and second pass): -76% and -90%:
  - `generate_block_trace`, `generate_missing_cells` → `__noinline__`
  - `generate_carry_ae`, `generate_intermed_4`, `generate_intermed_12` → compute on the fly

The goal is to reduce the memory peak (and get closer to the mem tracker report).
It was +2 GB; now reduced to +0.9 GB.
Should be tested on various blocks (was tested on 21M).
@stephenh-axiom-xyz

@Tuanlinh12312 please clean up this PR

Tuanlinh12312 and others added 2 commits February 10, 2026 15:32
Remove selector_inverse and is_not_wrap columns from VariableRangeCheckerAir
by using a "monotonic sum" constraint approach instead of selector-based
branching. This reduces trace width from 6 to 4 columns while maintaining
max constraint degree of 2.

Key insight: (value + two_to_max_bits) equals (row_index + 1), forming a
strictly increasing sequence. By constraining this sum to increase by
exactly 1 each row, combined with constraints that value can only reset to 0
or increment and max_bits can only stay the same or increment, the trace is
forced into the correct enumeration. The last-row constraint acts as a
checksum (sketched after this commit message).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
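
Spelling out the constraints described above as a sketch (v is value, b is max_bits, rows indexed from 0, N is the number of enumerated rows; the exact last-row and padding handling in the real AIR may differ):

$$
\begin{aligned}
&\text{first row:} && v_0 = 0, \qquad 2^{b_0} = 1,\\
&\text{transition:} && \bigl(v_{i+1} + 2^{b_{i+1}}\bigr) - \bigl(v_i + 2^{b_i}\bigr) = 1,\\
&\text{row shape:} && v_{i+1} \in \{0,\ v_i + 1\}, \qquad b_{i+1} \in \{b_i,\ b_i + 1\},\\
&\text{last row (checksum):} && v_{N-1} + 2^{b_{N-1}} = N.
\end{aligned}
$$

Under the transition constraint the sum tracks row_index + 1, so a reset of value to 0 is only consistent when max_bits steps up in the same row; each row-shape constraint can be written as a degree-2 product, keeping the max constraint degree at 2.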
Add missing imports (VerificationError, BabyBearBlake3Engine, StarkFriEngine)
and remove unused dead code to allow circuit-primitives tests to compile.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@Tuanlinh12312 force-pushed the perf/revisit-varrangechecker branch from efe9ccd to 1cabea1 on February 10, 2026 at 15:32

claude bot commented Feb 10, 2026

Code review

No issues found. Checked for bugs and CLAUDE.md compliance.
