
Conversation

@shreyas-londhe
Member

@shreyas-londhe shreyas-londhe commented Mar 28, 2025

Note

Replaces the old Rust compiler and extensive Circom tests with a new Bun/TypeScript scripts suite (gen-regex, gen-inputs), shared utilities, and Jest-based tests.

  • Tooling/Scripts (new):
    • Add Bun/TypeScript scripts for generating regex artifacts: circom/scripts/gen-regex.ts, noir/scripts/gen-regex.ts, noir/scripts/gen-inputs.ts.
    • Introduce shared utilities (utils/*) for logging, file ops, subprocess, and types.
    • Add project config: tsconfig.json, jest.config.ts, jest.setup.ts, bun.lock, and package.json.
  • Tests (new):
    • Add Jest integration and unit tests under scripts/__tests__ for script workflows and utilities.
  • Compiler (removed):
    • Remove legacy Rust compiler package (packages/compiler/*).
  • Circom Tests (removed):
    • Remove numerous legacy Circom test circuits and test cases under packages/circom/tests/*.

Written by Cursor Bugbot for commit 3d37c31. This will update automatically on new commits.

Summary by CodeRabbit

  • New Features
    • Adds many new regex verification circuits for email headers and addresses and a compiler to generate them; Noir support added alongside Circom.
  • Infrastructure & Build
    • New CI/workflow and pre-push hook to keep generated templates in sync; new package/tsconfig/bundler configs for tooling.
  • Documentation
    • Major README overhaul and new docs for compiler, Circom, and Noir subprojects with usage and troubleshooting.
  • Tests
    • End-to-end circuit tests and sample inputs added for common email regexes.
  • Chores
    • Simplified lint/format configs and updated workspace/manifest and .gitignore rules.

@BornPsych

@cursor please review this

@cursor

cursor bot commented Sep 30, 2025

I've thoroughly reviewed the "Feat/new compiler" PR. I started by exploring the codebase, examining package.json files for dependency changes, and then delved into the compiler's structure, error handling, and test files. I also paid close attention to the Socket security report and the Cargo.toml files for Rust dependencies.

Here's a summary of my findings:

This PR introduces a substantial new ZK-Regex compiler with a well-structured architecture, robust error handling, and support for multiple proving frameworks (Circom and Noir). Key strengths include:

  • Architecture & Design: Modular structure, clear API, and support for multiple frameworks.
  • Error Handling: Comprehensive error taxonomy with user-friendly messages.
  • Developer Experience: Excellent CLI tool, automatic template regeneration, and good documentation.
  • Code Quality: Follows Rust best practices with clean abstractions.

However, I identified several areas for improvement and potential issues:

  • Security Concerns: The Socket report highlighted the addition of @zk-email/[email protected] and an upgrade of circom_tester. Using alpha versions in production is a risk.
  • Test Coverage: I found no unit tests in the Rust compiler code, and integration tests appear limited.
  • Dependencies: There's a dependency on a specific GitHub branch (comptime from feat/comptime-codegen), which could lead to instability.
  • Documentation: Some inline documentation is missing, and there's no architecture diagram.

My main recommendations are to:

  • Add comprehensive unit tests for core compiler functionality.
  • Replace alpha dependencies with stable versions.
  • Address the external GitHub branch dependency.
  • Add performance benchmarks.

Overall, this is a high-quality PR, but the critical issues regarding tests and dependencies should be addressed before merging.


5 tasks completed

  • Review PR changes for new compiler feature
  • Analyze new dependencies added (@types/bun, @zk-email/circuits)
  • Review Socket security report findings
  • Verify new compiler implementation and integration
  • Provide comprehensive review feedback

…oups inside of public patterns into non capture
…group conversion

The convert_capturing_to_non_capturing function was naively converting
all '(' to '(?:' without considering:
- Escaped parentheses like \( which match literal parens
- Parentheses inside character classes like [()] which are literals

This caused regex semantics to break. For example:
- \(foo\) would become \(?:foo\) (wrong)
- [()] would become [(?:)] (matches extra chars)

Fixed by tracking two states during string scan:
- in_char_class: are we inside [...]?
- escaped: was previous char a backslash?

Only convert ( to (?: when it's not escaped, not in a char class,
and not already followed by ?.

Added 5 new tests covering escaped parens, character classes, and
mixed cases. All 23 tests pass.
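
A minimal sketch of that scan, reconstructed from this description (the actual function body in compiler/src/utils.rs may differ in details):

```rust
/// Sketch only: rewrite bare capturing '(' as non-capturing '(?:',
/// leaving escaped parens, character classes, and existing '(?...' groups alone.
fn convert_capturing_to_non_capturing(pattern: &str) -> String {
    let chars: Vec<char> = pattern.chars().collect();
    let mut result = String::new();
    let mut in_char_class = false; // are we inside [...]?
    let mut escaped = false;       // was the previous char a backslash?

    for (i, &ch) in chars.iter().enumerate() {
        if escaped {
            // A character following a backslash is a literal; copy it through.
            result.push(ch);
            escaped = false;
            continue;
        }
        match ch {
            '\\' => {
                result.push(ch);
                escaped = true;
            }
            '[' => {
                result.push(ch);
                in_char_class = true;
            }
            ']' => {
                result.push(ch);
                in_char_class = false;
            }
            '(' if !in_char_class && chars.get(i + 1) != Some(&'?') => {
                // Bare capturing group: rewrite as non-capturing.
                result.push_str("(?:");
            }
            _ => result.push(ch),
        }
    }
    result
}
```

With this scan, \(foo\) and [()] pass through unchanged while (foo) becomes (?:foo).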

Addresses PR review feedback: #109 (comment)
Extended convert_capturing_to_non_capturing to handle both PCRE-style
(?<name>...) and Rust-style (?P<name>...) named capture groups, not
just bare capturing groups. This prevents named captures in private
patterns from interfering with public capture group numbering.

Key changes:
- Detect and convert both named capture syntaxes to (?:...)
- Distinguish lookbehind assertions (?<=, ?<!) from named captures
- Updated test_preserve_named_groups to test_convert_pcre_named_groups
  with correct expected behavior
- Added 7 new tests covering both named capture styles and edge cases

Fixes code review finding that named captures were being preserved
instead of converted, which would cause capture group numbering
mismatches in generated circuits.
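
Illustrative input/output pairs for the extended behavior described here (assuming the same function name and &str -> String signature as the sketch above, which by itself only covers bare groups):

```rust
#[test]
fn named_captures_are_converted_but_lookbehinds_are_kept() {
    // PCRE-style and Rust-style named captures become non-capturing groups...
    assert_eq!(convert_capturing_to_non_capturing("(?<name>a+)b"), "(?:a+)b");
    assert_eq!(convert_capturing_to_non_capturing("(?P<name>a+)b"), "(?:a+)b");
    // ...while lookbehind assertions are preserved as-is.
    assert_eq!(convert_capturing_to_non_capturing("(?<=x)y"), "(?<=x)y");
    assert_eq!(convert_capturing_to_non_capturing("(?<!x)y"), "(?<!x)y");
}
```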
…ure-groups-inside-of-private-patterns

Rutefig/reg 574 fix prevent capture groups inside of private patterns
@coderabbitai

coderabbitai bot commented Oct 31, 2025

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

Adds a Rust-based zk-regex compiler, backend generators for Circom and Noir, IR and pass pipelines, wasm bindings, Circom/Noir artifacts and tests, developer tooling (scripts, hooks, CI), many generated Circom circuits/graphs/regexes, and project configuration/file removals and reorganizations.

Changes

| Cohort / File(s) | Summary |
|---|---|
| **Removed configs**<br>`.cargo/config`, `.eslintrc.json`, `.eslintignore`, `babel.config.js`, `.prettierrc` | Deleted legacy tooling/config files (cargo, ESLint, Babel, Prettier) |
| **Repo config & ignores**<br>`Cargo.toml`, `.gitignore`, `bunfig.toml` | Reworked workspace Cargo manifest, updated .gitignore entries, added Bun test timeout config |
| **CI workflows**<br>`.github/workflows/test.yaml`, `.github/workflows/test.yml` | Removed old workflow and added new Rust/Bun-based GitHub Actions workflow (Circom install, Bun runner, tests) |
| **Root docs & metadata**<br>`README.md`, `LICENSE` | Large README rewrite and minor license text tweak |
| **Compiler crate & packaging**<br>`compiler/Cargo.toml`, `compiler/package.json`, `compiler/README.md` | New compiler crate manifest, npm package metadata, and compiler README |
| **Compiler library & API**<br>`compiler/src/lib.rs`, `compiler/src/driver.rs`, `compiler/src/error.rs`, `compiler/src/types.rs`, `compiler/src/wasm.rs`, `compiler/src/utils.rs` | New public compiler API, structured errors, wasm bindings, utilities for composing/generating outputs |
| **IR: NFA & intermediate**<br>`compiler/src/ir/*` (graph.rs, intermediate.rs, nfa.rs, mod.rs) | New IR: NFANode/NFAGraph, IntermediateNFA, epsilon-elimination, path-finding, verification and serialization |
| **Passes & errors**<br>`compiler/src/passes/*` (builder.rs, error.rs, mod.rs) | New builder pass converting Thompson NFA → Intermediate → NFAGraph and NFA error types mapped to CompilerError |
| **Backend generators**<br>`compiler/src/backend/*` (mod.rs, circom.rs, noir.rs, shared.rs) | Circom and Noir code generators and shared extraction logic for transitions/captures; Circom code emitter API |
| **CLI binary**<br>`compiler/src/bin/zk-regex.rs` | New CLI with subcommands: Decomposed, Raw, GenerateCircuitInput |
| **Gen / scripting**<br>`circom/scripts/gen-regex.ts`, `git-hooks/pre-push` | Script to generate Circom circuits from JSON using the compiler; pre-push hook enforcing template regeneration |
| **Circom package & configs**<br>`circom/package.json`, `circom/tsconfig.json`, `circom/bunfig.toml`, `circom/index.ts`, `circom/.gitignore` | New Circom package manifest, TS config, Bun config, entry log, and Circom-specific .gitignore |
| **Circom regex helpers & templates**<br>`circom/circuits/regex_helpers.circom`, `circom/circuits/common/*.circom` | Added MultiOR, transition checks, CaptureSubstring, and many generated Circom templates implementing DFA-based regexes (email, subject, body_hash, message_id, to/from, simple, reversed_bracket, etc.) |
| **Circom DFA graphs & regexes**<br>`circom/circuits/common/*_graph.json`, `circom/regexes/*.json` | Added generated DFA JSON graphs and source regex JSON descriptors for each pattern |
| **Circom tests**<br>`circom/circuits/tests/*`, `circom/circuits/tests/*.test.ts` | Added Circom test circuits and TypeScript test suites exercising generated circuits (body_hash, email_addr, subject_all, etc.) |
| **Noir integration & samples**<br>`noir/Nargo.toml`, `noir/README.md`, `noir/common/*.json`, `noir/common/sample_haystacks/**` | Added Noir project manifest, docs, pattern JSONs, and sample haystacks / circuit input traces |
| **Repo tooling & docs**<br>`CLAUDE.md`, `circom/README.md` | Added developer documentation and Circom-specific README |

Sequence Diagram(s)

%%{init: { "themeVariables": { "actorBkg": "#E8F6FF", "actorBorder": "#8CCBEA", "noteBkg": "#F6F8FA" }}}%%
sequenceDiagram
    participant User
    participant CLI as zk-regex CLI
    participant Driver
    participant Builder as NFABuilder
    participant Backend as CodeGen
    participant Filesystem

    User->>CLI: gen-from-raw(pattern, template, framework)
    CLI->>Driver: Driver::compile(pattern, config)
    Driver->>Builder: NFAGraph::build(pattern)
    Builder->>Builder: Thompson NFA → IntermediateNFA → NFAGraph
    Builder->>Driver: NFAGraph (graph JSON)
    Driver->>Backend: generate_code(nfa, template, pattern, max_bytes)
    Backend->>Filesystem: emit Circom/Noir code & graph
    Filesystem-->>User: (graph.json, template.circom / template.nr)

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Areas requiring extra attention:

  • compiler/src/ir/* (IntermediateNFA, epsilon elimination, path-finding)
  • compiler/src/backend/* (Circom/Noir emission correctness, capture handling)
  • compiler/src/wasm.rs (error translation & JS bindings)
  • circom/circuits/common/*_regex.circom (per-byte transition logic and captures)
  • git-hooks/pre-push (stashing, build-and-regenerate flow)

Possibly related PRs

Suggested reviewers

  • BornPsych

Poem

🐇 In burrows of bytes I nimbly dig,

Patterns to circuits, neat and big.
States hop, captures snug and warm,
Proofs bloom from regex into storm —
Hooray! A compiled trail, a rabbit's jig. 🥕

Pre-merge checks and finishing touches

❌ Failed checks (1 inconclusive)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Title check | ❓ Inconclusive | The title 'Feat/new compiler' is vague and does not clearly convey the main changes in the PR. While it mentions a compiler feature, it fails to specify that this replaces the old compiler, introduces Bun/TypeScript scripts, and significantly restructures the tooling. | Consider a more descriptive title such as 'Rewrite compiler in Rust with Bun/TypeScript tooling' or 'Replace legacy compiler with new Rust-based architecture' to better communicate the scope and nature of the changes. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%. |



@cursor cursor bot left a comment


This PR is being reviewed by Cursor Bugbot


eprintln!("Supported frameworks: circom, noir");
std::process::exit(1);
}
};


Bug

The proving_framework argument is parsed inconsistently across commands. While Decomposed and Raw use ProvingFramework::from_str, GenerateCircuitInput manually parses it with a match statement. This duplicates logic and creates maintenance challenges, potentially leading to behavioral inconsistencies.
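
One way to remove the duplication is to funnel every subcommand through ProvingFramework::from_str; a sketch, assuming from_str returns a Result as defined in compiler/src/types.rs:

```rust
// Hypothetical helper for the CLI (compiler/src/bin/zk-regex.rs); the exact
// error type returned by ProvingFramework::from_str is an assumption here.
fn parse_framework_or_exit(raw: &str) -> ProvingFramework {
    match ProvingFramework::from_str(raw) {
        Ok(framework) => framework,
        Err(_) => {
            eprintln!("Unsupported proving framework: {}", raw);
            eprintln!("Supported frameworks: circom, noir");
            std::process::exit(1);
        }
    }
}
```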


}

// Run the script if this file is executed directly
if (import.meta.url === `file://${process.argv[1]}`) {


Bug

The direct execution check for the main function in gen-inputs.ts and gen-regex.ts uses an incorrect comparison between import.meta.url (a URL string) and process.argv[1] (a file path). This mismatch means the scripts' main functions never run when executed directly.

Additional Locations (1)



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 25

🧹 Nitpick comments (12)
circom/regexes/reversed_bracket.json (1)

5-5: Consider whether greedy matching is intentional.

The pattern <.* uses greedy matching (.*) which will match everything after the < character. If the intent is to match content only up to the next delimiter or end of string, you might want to use a more constrained pattern.

However, if the greedy behavior is intentional for your use case, this is fine as-is.
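
For comparison, a small regex-crate example (illustration only; `<[^>]*` is just one hypothetical way to constrain the pattern, and the input string is made up):

```rust
use regex::Regex;

fn main() {
    let haystack = "<a@b.c> and then <d>";

    // Greedy: '.*' runs to the end of the haystack.
    let greedy = Regex::new("<.*").unwrap();
    assert_eq!(greedy.find(haystack).unwrap().as_str(), haystack);

    // Constrained: stop before the closing delimiter.
    let constrained = Regex::new("<[^>]*").unwrap();
    assert_eq!(constrained.find(haystack).unwrap().as_str(), "<a@b.c");
}
```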

circom/circuits/common/to_all_graph.json (1)

1-50: Clarify generation and maintenance process for DFA graph file.

This 2256-line JSON file encodes a deterministic finite automaton (DFA) for the regex pattern (?:\r\n|^)to:([^\r\n]+)\r\n. The file structure suggests automated generation rather than manual authoring.

Key concerns:

  1. Is this auto-generated? If yes, add build artifacts to .gitignore or document the generation process.
  2. Maintenance model: If manually maintained, this is unmaintainable. If generated, provide:
    • Documentation on the generator tool
    • Instructions for regeneration when regex patterns change
    • Validation that generated states are correct and complete
  3. Reachability verification: Confirm all states are reachable and accept_states are terminal.

Without clarity on ownership, future maintainers won't know how to update this when regex patterns change.

I can help document the generation process or create a verification script. Would you like me to generate a script to validate DFA reachability and accept state correctness?
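
As a starting point, such a check could look roughly like this; the Graph field names below are assumptions about the JSON layout, not taken from to_all_graph.json:

```rust
use std::collections::{HashMap, HashSet, VecDeque};

// Assumed schema: adjust field names/types to the real graph JSON.
#[derive(serde::Deserialize)]
struct Graph {
    start_states: Vec<usize>,
    accept_states: Vec<usize>,
    transitions: Vec<(usize, u8, usize)>, // (from, byte, to)
}

/// Returns every state that appears in the graph (including accept states)
/// but is not reachable from any start state.
fn unreachable_states(g: &Graph) -> Vec<usize> {
    let mut adj: HashMap<usize, Vec<usize>> = HashMap::new();
    let mut all_states: HashSet<usize> = g.accept_states.iter().copied().collect();
    for &(from, _byte, to) in &g.transitions {
        adj.entry(from).or_default().push(to);
        all_states.insert(from);
        all_states.insert(to);
    }

    // Breadth-first search from the start states.
    let mut seen: HashSet<usize> = g.start_states.iter().copied().collect();
    let mut queue: VecDeque<usize> = g.start_states.iter().copied().collect();
    while let Some(state) = queue.pop_front() {
        for &next in adj.get(&state).into_iter().flatten() {
            if seen.insert(next) {
                queue.push_back(next);
            }
        }
    }

    all_states.iter().copied().filter(|s| !seen.contains(s)).collect()
}
```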

circom/regexes/message_id.json (1)

1-7: Consider expanding the message-id character class to align with RFC 5322.

The PublicPattern on line 4 uses <[A-Za-z0-9=@\\.\\+_-]+>, which may be too restrictive. RFC 5322 allows additional characters in message-id local and domain parts, such as colons (:) and slashes (/). Modern message-ids may include these characters.

If stricter validation is intended, this is acceptable. Otherwise, consider expanding the pattern:

-        { "PublicPattern": ["<[A-Za-z0-9=@\\.\\+_-]+>", 128] },
+        { "PublicPattern": ["<[A-Za-z0-9=@\\.\\+_\\-:/]+>", 128] },
circom/README.md (1)

1-62: LGTM - Comprehensive documentation with minor style improvements available.

The README provides clear installation instructions, prerequisites, and troubleshooting guidance. The content is accurate and aligns well with the Circom integration introduced in this PR.

For improved markdown consistency, consider fixing the indentation issues flagged by markdownlint. Nested list items should use 2-space indentation instead of 4-space:

  • Lines 13-14, 19-23, 27-31, 35-40 have 4-space indentation
  • Line 51 uses emphasis instead of a proper heading

These are purely stylistic and can be addressed in a future cleanup if desired.

noir/common/message_id.json (1)

1-7: Pattern duplication with circom/regexes/message_id.json.

This file is identical to circom/regexes/message_id.json. While having separate regex definitions for Circom and Noir toolchains is reasonable, consider whether these could be maintained in a single source to prevent drift.

Additionally, the same character class limitation applies here: the pattern may be too restrictive for modern message-ids per RFC 5322. See the review comment on circom/regexes/message_id.json for details.

compiler/README.md (2)

13-40: Fix markdown list indentation to comply with MD007.

Lines use 4-space indentation for nested list items; markdown style guide requires 2 spaces. Update the list indentation throughout (lines 13–40, 51, 65, 81, 97) for consistency.

-    -   **`compile(pattern: &str) -> Result<NFAGraph, CompilerError>`**
-        Parses the input regex `pattern` string.
+  -   **`compile(pattern: &str) -> Result<NFAGraph, CompilerError>`**
+      Parses the input regex `pattern` string.

51-51: Replace bold text with heading syntax for "Examples" sections (MD036).

Lines 51, 65, 81, and 97 use bold text (e.g., **Example 1:**) instead of proper heading syntax. Convert to heading level (e.g., ### Example 1:) for better accessibility and document structure.

-**Example 1: Compile a simple regex to NFA**
+### Example 1: Compile a simple regex to NFA

Also applies to: 65-65, 81-81, 97-97

compiler/src/utils.rs (2)

168-196: Consider validating the file_extension parameter.

The function accepts file_extension without validation. Invalid extensions could create confusion (e.g., extensions with path separators, empty strings, or non-standard characters).

Add basic validation:

 pub fn save_outputs(
     nfa: &NFAGraph,
     circom_code: String,
     output_dir: &PathBuf,
     template_name: &str,
     file_extension: &str,
 ) -> Result<(), Box<dyn std::error::Error>> {
     validate_cli_template_name(template_name)?;
+    
+    // Validate file extension
+    if file_extension.is_empty() || file_extension.contains('/') || file_extension.contains('\\') {
+        return Err("Invalid file extension".into());
+    }

     // Create output directory if it doesn't exist

198-464: Strong test coverage! Consider adding edge case tests.

The test suite is comprehensive and well-organized. To complement the issues raised above, consider adding tests for the following cases (a rough sketch follows the list):

  • Escaped brackets in character classes: r"[\[]" and r"[\]]"
  • Character class starting with ]: r"[]a]"
  • Malformed named captures: r"(?<name...)" (missing >)
  • Empty template name validation
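
A hedged sketch of two of those cases; the exact expected strings depend on how the real helpers behave, and the empty-name assertion assumes the empty-string check suggested later in this review is added:

```rust
#[cfg(test)]
mod suggested_edge_cases {
    use super::*;

    #[test]
    fn escaped_brackets_inside_character_classes() {
        // Expected: the class is left intact and only the bare group is rewritten.
        assert_eq!(convert_capturing_to_non_capturing(r"[\[](a)"), r"[\[](?:a)");
        assert_eq!(convert_capturing_to_non_capturing(r"[\]](a)"), r"[\]](?:a)");
    }

    #[test]
    fn empty_template_name_is_rejected() {
        // Assumes the suggested empty-name validation has been applied.
        assert!(validate_cli_template_name("").is_err());
    }
}
```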
compiler/src/wasm.rs (1)

155-189: Minor optimization: avoid cloning max_substring_bytes twice.

The code clones max_substring_bytes on line 161 (and 180 for Noir), but then the original is also referenced in the closure. Since it's only Option<Vec<usize>>, the clone is cheap, but the pattern is slightly redundant.

Consider restructuring to avoid the clone:

+    let num_capture_groups = nfa.num_capture_groups;
+    let max_bytes_len = max_substring_bytes.as_ref().map(|v| v.len()).unwrap_or(0);
+
     let code = match proving_framework {
         ProvingFramework::Circom => {
-            let max_bytes_clone = config.max_bytes.clone();
             crate::backend::generate_circom_code(
                 &nfa,
                 &template_name.0,
                 &raw_regex.0,
-                max_substring_bytes.clone(),
+                max_substring_bytes,
             )
             .map_err(|nfa_err| {
                 match nfa_err {
                     crate::passes::NFAError::InvalidCapture(_) => {
                         CompilerError::invalid_capture_config(
-                            nfa.num_capture_groups,
-                            max_substring_bytes.as_ref().map(|v| v.len()).unwrap_or(0),
+                            num_capture_groups,
+                            max_bytes_len,
                         )
                     }
compiler/src/lib.rs (1)

73-88: Consider exposing the optimize parameter.

The optimize flag is hardcoded to true in the CompilationConfig. Depending on use cases, users might want to disable optimization for debugging or to compare output.

Consider adding an optional parameter:

pub fn gen_from_raw(
    pattern: &str,
    max_bytes: Option<Vec<usize>>,
    template_name: &str,
    proving_framework: ProvingFramework,
    optimize: Option<bool>,
) -> CompilerResult<(NFAGraph, String)> {
    let config = CompilationConfig {
        template_name: template_name.to_string(),
        proving_framework,
        max_bytes,
        optimize: optimize.unwrap_or(true),
    };
    // ...
}
compiler/src/ir/intermediate.rs (1)

340-396: State compaction logic is correct but clones nodes.

The compact_states method correctly remaps all state indices and transitions. The node cloning on line 353 is safe but could be optimized if this becomes a performance bottleneck.

If performance is critical, consider using std::mem::take:

let mut node = std::mem::take(&mut self.nodes[old_idx]);

Though this would require making self.nodes temporarily inconsistent, so the current clone approach is safer.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4fb1bda and 3d37c31.

⛔ Files ignored due to path filters (5)
  • .DS_Store is excluded by !**/.DS_Store
  • Cargo.lock is excluded by !**/*.lock
  • bun.lock is excluded by !**/*.lock
  • circom/bun.lock is excluded by !**/*.lock
  • packages/circom/pnpm-lock.yaml is excluded by !**/pnpm-lock.yaml
📒 Files selected for processing (78)
  • .cargo/config (0 hunks)
  • .eslintignore (0 hunks)
  • .eslintrc.json (0 hunks)
  • .github/workflows/test.yaml (0 hunks)
  • .github/workflows/test.yml (1 hunks)
  • .gitignore (1 hunks)
  • .prettierrc (0 hunks)
  • Cargo.toml (1 hunks)
  • LICENSE (1 hunks)
  • README.md (2 hunks)
  • babel.config.js (0 hunks)
  • bunfig.toml (1 hunks)
  • circom/.gitignore (1 hunks)
  • circom/README.md (1 hunks)
  • circom/bunfig.toml (1 hunks)
  • circom/circuits/common/body_hash_regex.circom (1 hunks)
  • circom/circuits/common/email_addr_graph.json (1 hunks)
  • circom/circuits/common/email_addr_regex.circom (1 hunks)
  • circom/circuits/common/email_domain_graph.json (1 hunks)
  • circom/circuits/common/email_domain_regex.circom (1 hunks)
  • circom/circuits/common/from_all_graph.json (1 hunks)
  • circom/circuits/common/from_all_regex.circom (1 hunks)
  • circom/circuits/common/message_id_graph.json (1 hunks)
  • circom/circuits/common/message_id_regex.circom (1 hunks)
  • circom/circuits/common/reversed_bracket_regex.circom (1 hunks)
  • circom/circuits/common/simple_graph.json (1 hunks)
  • circom/circuits/common/simple_regex.circom (1 hunks)
  • circom/circuits/common/subject_all_graph.json (1 hunks)
  • circom/circuits/common/subject_all_regex.circom (1 hunks)
  • circom/circuits/common/to_all_graph.json (1 hunks)
  • circom/circuits/common/to_all_regex.circom (1 hunks)
  • circom/circuits/regex_helpers.circom (1 hunks)
  • circom/circuits/tests/body_hash.test.ts (1 hunks)
  • circom/circuits/tests/circuits/test_body_hash_regex.circom (1 hunks)
  • circom/index.ts (1 hunks)
  • circom/package.json (1 hunks)
  • circom/regexes/body_hash.json (1 hunks)
  • circom/regexes/email_addr.json (1 hunks)
  • circom/regexes/email_domain.json (1 hunks)
  • circom/regexes/from_all.json (1 hunks)
  • circom/regexes/message_id.json (1 hunks)
  • circom/regexes/reversed_bracket.json (1 hunks)
  • circom/regexes/simple.json (1 hunks)
  • circom/regexes/subject_all.json (1 hunks)
  • circom/regexes/to_all.json (1 hunks)
  • circom/scripts/gen-regex.ts (1 hunks)
  • circom/tsconfig.json (1 hunks)
  • compiler/Cargo.toml (1 hunks)
  • compiler/README.md (1 hunks)
  • compiler/package.json (1 hunks)
  • compiler/src/backend/circom.rs (1 hunks)
  • compiler/src/backend/mod.rs (1 hunks)
  • compiler/src/backend/noir.rs (1 hunks)
  • compiler/src/backend/shared.rs (1 hunks)
  • compiler/src/bin/zk-regex.rs (1 hunks)
  • compiler/src/driver.rs (1 hunks)
  • compiler/src/error.rs (1 hunks)
  • compiler/src/ir/graph.rs (1 hunks)
  • compiler/src/ir/intermediate.rs (1 hunks)
  • compiler/src/ir/mod.rs (1 hunks)
  • compiler/src/ir/nfa.rs (1 hunks)
  • compiler/src/lib.rs (1 hunks)
  • compiler/src/passes/builder.rs (1 hunks)
  • compiler/src/passes/error.rs (1 hunks)
  • compiler/src/passes/mod.rs (1 hunks)
  • compiler/src/types.rs (1 hunks)
  • compiler/src/utils.rs (1 hunks)
  • compiler/src/wasm.rs (1 hunks)
  • git-hooks/pre-push (1 hunks)
  • noir/Nargo.toml (1 hunks)
  • noir/README.md (1 hunks)
  • noir/common/body_hash.json (1 hunks)
  • noir/common/email_addr.json (1 hunks)
  • noir/common/email_domain.json (1 hunks)
  • noir/common/from_all.json (1 hunks)
  • noir/common/message_id.json (1 hunks)
  • noir/common/sample_haystacks/body_hash.json (1 hunks)
  • noir/common/sample_haystacks/circuit_inputs/body_hash_pass_0.json (1 hunks)
💤 Files with no reviewable changes (6)
  • .github/workflows/test.yaml
  • .prettierrc
  • .eslintignore
  • .eslintrc.json
  • babel.config.js
  • .cargo/config
🧰 Additional context used
🧬 Code graph analysis (16)
compiler/src/passes/builder.rs (3)
compiler/src/backend/circom.rs (2)
  • nfa (229-231)
  • nfa (233-235)
compiler/src/backend/noir.rs (2)
  • nfa (191-193)
  • nfa (195-197)
compiler/src/ir/intermediate.rs (1)
  • new (53-61)
compiler/src/ir/nfa.rs (1)
compiler/src/ir/intermediate.rs (1)
  • new (53-61)
compiler/src/ir/mod.rs (2)
compiler/src/backend/circom.rs (2)
  • nfa (229-231)
  • nfa (233-235)
compiler/src/backend/noir.rs (2)
  • nfa (191-193)
  • nfa (195-197)
compiler/src/bin/zk-regex.rs (6)
compiler/src/lib.rs (3)
  • gen_circuit_inputs (160-297)
  • gen_from_decomposed (114-126)
  • gen_from_raw (73-88)
compiler/src/utils.rs (2)
  • save_outputs (168-196)
  • validate_cli_template_name (156-166)
compiler/src/types.rs (1)
  • from_str (51-59)
compiler/src/error.rs (2)
  • code (139-148)
  • user_message (151-224)
compiler/src/ir/nfa.rs (1)
  • from_json (80-84)
compiler/src/passes/error.rs (1)
  • from (53-166)
compiler/src/backend/shared.rs (5)
compiler/src/passes/error.rs (1)
  • from (53-166)
compiler/src/backend/noir.rs (8)
  • inputs (272-276)
  • inputs (283-287)
  • inputs (291-295)
  • nfa (191-193)
  • nfa (195-197)
  • start_states (350-353)
  • accept_states (379-382)
  • start (430-430)
compiler/src/backend/circom.rs (2)
  • nfa (229-231)
  • nfa (233-235)
compiler/src/ir/intermediate.rs (1)
  • new (53-61)
compiler/src/ir/nfa.rs (1)
  • new (63-71)
circom/circuits/tests/body_hash.test.ts (1)
compiler/src/wasm.rs (1)
  • genCircuitInputs (198-217)
compiler/src/driver.rs (6)
compiler/src/lib.rs (1)
  • compile (39-41)
compiler/src/backend/circom.rs (3)
  • nfa (229-231)
  • nfa (233-235)
  • generate_circom_code (62-432)
compiler/src/backend/noir.rs (3)
  • nfa (191-193)
  • nfa (195-197)
  • generate_noir_code (30-265)
compiler/src/passes/builder.rs (1)
  • build (21-37)
compiler/src/error.rs (3)
  • code (139-148)
  • invalid_capture_config (278-290)
  • circuit_generation_failed (268-276)
compiler/src/passes/error.rs (1)
  • from (53-166)
circom/scripts/gen-regex.ts (6)
compiler/src/bin/zk-regex.rs (1)
  • main (83-225)
scripts/utils/logger.ts (2)
  • logger (41-41)
  • error (36-38)
scripts/utils/file-operations.ts (3)
  • fileExists (75-82)
  • ensureDirectory (24-30)
  • listFilesWithExtension (99-104)
scripts/utils/types.ts (1)
  • ScriptError (69-74)
scripts/utils/string-utils.ts (1)
  • toPascalCase (15-19)
scripts/utils/subprocess.ts (1)
  • executeCargo (196-267)
compiler/src/lib.rs (6)
compiler/src/driver.rs (1)
  • compile (59-130)
compiler/src/passes/builder.rs (1)
  • build (21-37)
compiler/src/backend/shared.rs (3)
  • from (33-44)
  • from (48-59)
  • generate_circuit_inputs (127-228)
compiler/src/passes/error.rs (1)
  • from (53-166)
compiler/src/utils.rs (1)
  • decomposed_to_composed_regex (131-154)
compiler/src/error.rs (2)
  • input_too_long (292-300)
  • no_match_found (302-310)
compiler/src/utils.rs (3)
compiler/src/ir/intermediate.rs (1)
  • new (53-61)
compiler/src/ir/nfa.rs (1)
  • new (63-71)
compiler/src/lib.rs (1)
  • gen_from_decomposed (114-126)
compiler/src/ir/graph.rs (5)
compiler/src/backend/circom.rs (2)
  • nfa (229-231)
  • nfa (233-235)
compiler/src/backend/noir.rs (3)
  • nfa (191-193)
  • nfa (195-197)
  • start (430-430)
compiler/src/ir/intermediate.rs (1)
  • new (53-61)
compiler/src/ir/nfa.rs (2)
  • new (63-71)
  • node (96-99)
compiler/src/backend/shared.rs (2)
  • path (163-163)
  • path (164-164)
compiler/src/passes/error.rs (2)
compiler/src/wasm.rs (5)
  • from (21-26)
  • from (30-42)
  • from (54-56)
  • from (60-62)
  • from (66-68)
compiler/src/error.rs (2)
  • nfa_construction_failed (259-266)
  • serialization_error (312-318)
compiler/src/backend/noir.rs (1)
compiler/src/backend/shared.rs (4)
  • escape_regex_for_display (230-243)
  • generate_circuit_data (62-125)
  • from (33-44)
  • from (48-59)
compiler/src/backend/circom.rs (1)
compiler/src/backend/shared.rs (2)
  • escape_regex_for_display (230-243)
  • generate_circuit_data (62-125)
compiler/src/ir/intermediate.rs (3)
compiler/src/backend/circom.rs (2)
  • nfa (229-231)
  • nfa (233-235)
compiler/src/backend/noir.rs (3)
  • nfa (191-193)
  • nfa (195-197)
  • start (430-430)
compiler/src/ir/nfa.rs (2)
  • new (63-71)
  • node (96-99)
compiler/src/wasm.rs (6)
compiler/src/backend/circom.rs (3)
  • generate_circom_code (62-432)
  • nfa (229-231)
  • nfa (233-235)
compiler/src/backend/noir.rs (6)
  • generate_noir_code (30-265)
  • nfa (191-193)
  • nfa (195-197)
  • inputs (272-276)
  • inputs (283-287)
  • inputs (291-295)
compiler/src/lib.rs (2)
  • compile (39-41)
  • gen_circuit_inputs (160-297)
compiler/src/utils.rs (1)
  • decomposed_to_composed_regex (131-154)
compiler/src/passes/error.rs (1)
  • from (53-166)
compiler/src/error.rs (3)
  • code (139-148)
  • serialization_error (312-318)
  • invalid_capture_config (278-290)
🪛 actionlint (1.7.8)
.github/workflows/test.yml

23-23: the runner of "actions/cache@v3" action is too old to run on GitHub Actions. update the action's version to fix this issue

(action)

🪛 markdownlint-cli2 (0.18.1)
README.md

  • 14-15: Unordered list indentation. Expected: 2; Actual: 4 (MD007, ul-indent)

compiler/README.md

  • 13-15, 19-23, 27-31, 35-40: Unordered list indentation. Expected: 2; Actual: 4 (MD007, ul-indent)
  • 51, 65, 81, 97: Emphasis used instead of a heading (MD036, no-emphasis-as-heading)

circom/README.md

  • 13-15, 19-23, 27-31, 35-40: Unordered list indentation. Expected: 2; Actual: 4 (MD007, ul-indent)
  • 51: Emphasis used instead of a heading (MD036, no-emphasis-as-heading)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: Socket Security: Pull Request Alerts
  • GitHub Check: Cursor Bugbot
  • GitHub Check: test
  • GitHub Check: test
🔇 Additional comments (34)
LICENSE (1)

674-674: URL standardization looks correct.

The reference to https://www.gnu.org/licenses/why-not-lgpl.html is the canonical FSF link for GPL v3 licensing guidance. This is a standard maintenance update with no license compliance implications.

noir/Nargo.toml (2)

1-5: Populate authors and review version constraints.

The authors field (line 4) is empty, which is a metadata quality issue. For published or team projects, author information should be populated. Additionally, the compiler_version constraint uses only a lower bound (>=1.0.0) with no upper limit, which could lead to compatibility issues if future Noir releases introduce breaking changes.

  • Please populate the authors field with actual author name(s), e.g., authors = ["Your Name"]
  • Consider adding an upper bound to compiler_version to maintain compatibility, e.g., compiler_version = ">=1.0.0, <2.0.0" (adjust based on stability guarantees)

7-8: Verify the dependency specification and git tag availability.

The sort dependency is sourced from a GitHub repository via a git tag (v0.2.3) rather than a version-pinned crates.io release. While this approach is valid in Noir, please confirm:

  1. The git repository URL and tag v0.2.3 are accessible and stable.
  2. This dependency sourcing approach aligns with the project's practices for managing external dependencies.
  3. The tag points to a compatible and stable version of the library.

If the dependency is well-maintained and the tag is stable, consider adding a comment explaining why a git-sourced dependency is preferred over a registry release (if available).

git-hooks/pre-push (1)

17-30: comment

circom/regexes/simple.json (1)

1-3: LGTM!

Valid JSON structure defining a simple regex pattern for testing the compiler pipeline.

noir/common/email_addr.json (1)

1-10: LGTM!

Email validation pattern with appropriate maximum length (320 characters per RFC 3696). The pattern includes standard email characters and properly escaped special characters.

noir/README.md (1)

1-20: LGTM!

Clear and helpful documentation with an important warning about test placement. The guidance prevents potential issues with automated test generation scripts.

circom/regexes/to_all.json (1)

1-7: LGTM!

Well-structured multi-part pattern for parsing email "To:" headers. Correctly handles CRLF line endings and constrains the captured content to a reasonable maximum length.

circom/circuits/common/simple_graph.json (1)

1-71: LGTM with observation.

The NFA graph correctly represents the regex "a*b" with proper state transitions. The multiple start states (0, 2, 3) are unusual but may be intentional for the compiler's graph representation or optimization strategy.

Note: Ensure the graph consumer correctly handles multiple start states if this is a generated artifact from the compiler pipeline.

circom/regexes/from_all.json (1)

1-7: LGTM — new regex pattern data asset.

This JSON file adds a regex definition for parsing email "from:" headers. The structure and patterns are consistent with the regex data assets introduced elsewhere in the PR.

noir/common/from_all.json (1)

1-7: LGTM — matching Noir regex pattern.

This mirrors the Circom version for cross-framework consistency. Both frameworks now share the same "from:" header parsing patterns.

compiler/Cargo.toml (1)

13-23: Verify workspace dependency versions in root Cargo.toml.

All dependencies are correctly delegated to the workspace. However, ensure the versions specified in the root workspace manifest are appropriate and not using unstable sources (except where necessary).

circom/regexes/body_hash.json (1)

1-8: LGTM — DKIM body hash regex pattern.

Well-structured regex for extracting DKIM body hash values. The base64 pattern and 44-character limit align with SHA256-based DKIM signatures.

circom/tsconfig.json (1)

1-27: LGTM — well-configured TypeScript settings.

The configuration enforces strict type checking while maintaining development flexibility. Modern settings for bundler-based workflows are appropriate.

noir/common/sample_haystacks/body_hash.json (1)

1-9: LGTM - Well-structured test fixture.

The test cases appropriately cover:

  • Pass case 1: body hash pattern without leading CRLF (line start anchor)
  • Pass case 2: body hash pattern with leading CRLF
  • Fail case: pattern preceded by another header field ("to:"), ensuring the regex doesn't match mid-line
circom/circuits/tests/circuits/test_body_hash_regex.circom (1)

5-5: No issues identified—parameter values are correct.

The BodyHashRegex template is instantiated with maxHaystackBytes=300 and maxMatchBytes=299. These parameters are appropriately configured: the haystack (input data to search) is 300 bytes, and the maximum match length is 299 bytes. Having the match length slightly less than the haystack size is a standard and intentional configuration.

compiler/src/passes/mod.rs (1)

6-9: Dismiss this review comment — the builder module is correctly kept private.

The builder module contains no public type definitions, only impl blocks that add methods to public types (NFAGraph, IntermediateNFA) defined in the ir/ module. External code already has full access to builder functionality through public methods like NFAGraph::build(). Keeping the builder module private is appropriate; it's an internal implementation detail, not a public API surface requiring exposure.

Likely an incorrect or invalid review comment.

circom/circuits/common/subject_all_regex.circom (1)

43-148: Solid DFA wiring

The transition matrix, capture propagation, and path-length/linkage checks line up cleanly with (?:\r\n|^)subject:([^\r\n]+)\r\n; I don’t see correctness gaps here.

circom/circuits/common/message_id_regex.circom (1)

43-153: Regex automaton looks sound

State transitions and capture-group handling match the message-id pattern, and the accept-state accounting mirrors the other proven templates—looks good.

circom/circuits/common/subject_all_graph.json (1)

1-2321: Graph data matches the circuit

The DFA states, transitions, and capture annotations mirror the Circom template’s expectations, so the data asset is consistent.

circom/circuits/common/from_all_regex.circom (1)

43-167: From-header automaton looks consistent

The transitions and capture plumbing reflect the intended (from: …) header pattern, with the same path-length safeguards as the other templates. No issues spotted.

compiler/src/wasm.rs (2)

45-69: Good use of newtype pattern for type safety at WASM boundaries.

The wrapper types provide clear type distinctions at the WASM interface, which helps prevent parameter ordering mistakes.


194-242: Clean error handling in circuit input generation.

The function properly deserializes the NFA graph, generates inputs, and serializes the result with appropriate error handling at each step.

compiler/src/lib.rs (3)

24-41: Clean and well-documented API entry point.

The compile function provides a clear, simple interface for NFA construction with proper error type conversion.


90-126: Clean decomposition and delegation pattern.

The function appropriately delegates to gen_from_raw after composing the regex, avoiding code duplication.


267-294: Thorough error context enrichment for debugging.

The code carefully adds input context to errors that lack it, which will significantly help users debug issues. The pattern of augmenting base errors with additional context is a good practice.

compiler/src/ir/intermediate.rs (3)

13-61: Well-designed intermediate representation for epsilon elimination.

The structures cleanly separate epsilon and byte transitions, and the new method properly initializes all fields.


111-145: Clean conversion pipeline from intermediate to final NFA.

The finalize method correctly orchestrates epsilon elimination and final validation, with clear separation of concerns.


184-201: Verify capture group merging logic during epsilon elimination—no automated tests found.

The concern in the original review is valid: no tests were found that specifically cover epsilon transitions combined with capture groups. The capture merging logic (adding start captures from source closure and end captures from target closure) appears algorithmically sound, but is complex and lacks dedicated test coverage.

Tests that should exist but were not found:

  • Epsilon transitions with capture groups
  • Nested capture groups across epsilon transitions
  • Multiple chained epsilon transitions with captures
  • Start/end captures at different positions
compiler/src/error.rs (4)

9-74: Well-organized error code system with clear categorization.

The error codes are logically grouped into ranges, making it easy to identify error categories at a glance. The Display implementation is complete.


76-135: Excellent error design with rich context and actionable suggestions.

The error variants provide comprehensive context for debugging while keeping related information grouped logically. The consistent inclusion of suggestion fields is particularly user-friendly.


137-235: Clean implementation of error utility methods.

The methods provide clear access to error codes and well-formatted user messages. The is_recoverable() classification is useful for error handling strategies.


237-319: Comprehensive and consistent convenience constructors.

The factory methods provide easy-to-use interfaces for creating common errors with appropriate codes and helpful suggestions. All constructors properly populate their respective error variants.

compiler/src/utils.rs (1)

131-154: No validation needed in decomposed_to_composed_regex—downstream backend code already validates max_len values.

The backend generators validate that max_substring_bytes contains no zero values and return InvalidCapture errors if validation fails. The decomposed_to_composed_regex function correctly collects max_len values; validation happens in the appropriate downstream layer.

Likely an incorrect or invalid review comment.

toolchain: stable

- name: Cache Cargo
uses: actions/cache@v3


⚠️ Potential issue | 🟠 Major

Update actions/cache to v4 for GitHub Actions compatibility.

The actions/cache@v3 action is outdated per actionlint. Update to v4 to ensure compatibility with current GitHub Actions standards and to receive security updates.

-      uses: actions/cache@v3
+      uses: actions/cache@v4

🤖 Prompt for AI Agents
In .github/workflows/test.yml around line 23, the workflow references
actions/cache@v3 which is outdated; update the action reference to
actions/cache@v4 to satisfy actionlint and gain compatibility/security updates.
Modify the uses entry to point to actions/cache@v4 (no other changes required),
commit the workflow, and verify the workflow lints and runs successfully in CI.

# mail-parser = { version = "0.8", git = "https://github.com/stalwartlabs//mail-parser", tag = "0.8.0" }
[workspace.dependencies]
clap = "4.2.1"
comptime = { git = "https://github.com/jp4g/sparse_array", branch = "feat/comptime-codegen" }


⚠️ Potential issue | 🔴 Critical

CRITICAL: Replace feature-branch git dependency with stable crate version.

The comptime dependency uses an unstable feature branch from a fork, which creates a production risk. The official comptime crate version 1.0.0 is available on crates.io. Unless the fork contains essential unreleased features, migrate to the stable version:

-comptime = { git = "https://github.com/jp4g/sparse_array", branch = "feat/comptime-codegen" }
+comptime = "1.0.0"

If fork-specific functionality is required, open an issue with the upstream maintainers or document the dependency override rationale in comments.

Also applies to: 9-9

🤖 Prompt for AI Agents
In Cargo.toml around lines 9 to 9, the comptime dependency is pointing at a
feature branch of a forked repo; replace that git dependency with the stable
crates.io release by changing the dependency to use version = "1.0.0" (remove
the git/branch fields), run cargo update / cargo build to refresh Cargo.lock,
and search the repo for any other occurrences of the git override to update or
document; if fork-specific behavior is truly required, add a short comment
explaining the rationale and open an upstream issue instead of keeping the git
dependency.


# logs
logs
_.log


⚠️ Potential issue | 🟡 Minor

Fix typo in log file pattern.

Line 15 has _.log which will only match a file literally named "_.log". This should be *.log to match all log files with the .log extension.

Apply this diff:

-_.log
+*.log
🤖 Prompt for AI Agents
In circom/.gitignore around line 15, the entry currently reads "_.log" which
only matches a literal file named "_.log"; change this to "*.log" so the
gitignore will match all files ending with the .log extension; update the line
to use an asterisk prefix (no other changes needed).

@@ -0,0 +1,943 @@
{
"regex": "([A-Za-z0-9!#$%&'*+=?\\-\\^_`{|}~./@]+@[A-Za-z0-9.\\-]+)",


⚠️ Potential issue | 🟠 Major

Remove extra ‘@’ from the local-part class.

The generated regex on Line 2 still includes @ in the local-part character class, so this graph will accept inputs like foo@[email protected]: the first @ keeps you in the local-part loop, and the second @ transitions to the domain. That violates the single-@ requirement for mailboxes and will yield false positives. Please drop @ from the first character class (and regen this graph) so only the delimiter @ switches to the domain portion.
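
To make the failure mode concrete (regex crate used purely for illustration; the patterns are abridged from the generated one, and the double-@ input is hypothetical):

```rust
use regex::Regex;

fn main() {
    // Abridged local-part class that still contains '@'.
    let with_at = Regex::new(r"([A-Za-z0-9._@]+@[A-Za-z0-9.\-]+)").unwrap();
    // Same class with '@' removed.
    let without_at = Regex::new(r"([A-Za-z0-9._]+@[A-Za-z0-9.\-]+)").unwrap();

    let input = "a@b@c.com"; // hypothetical input with two '@'
    // The '@'-in-local-part version swallows both '@' characters...
    assert_eq!(with_at.find(input).unwrap().as_str(), "a@b@c.com");
    // ...while the corrected class stops at the first delimiter.
    assert_eq!(without_at.find(input).unwrap().as_str(), "a@b");
}
```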

🤖 Prompt for AI Agents
In circom/circuits/common/email_addr_graph.json around line 2 the local-part
character class incorrectly includes '@', allowing multiple '@' characters
(e.g., foo@[email protected]); remove the '@' from the first character class so
only the single delimiter '@' separates local-part and domain, update the regex
accordingly, and regenerate the graph artifact so the change is reflected in the
generated file.

Comment on lines 34 to 57
    signal isWithinPathLengthMinusOne[maxMatchBytes-2];
    signal isTransitionLinked[maxMatchBytes];

    signal isValidStartState;

    signal reachedAcceptState[maxMatchBytes];

    component isValidTraversal[maxMatchBytes];

    // Select the haystack from the input
    signal haystack[maxMatchBytes] <== SelectSubArray(maxHaystackBytes, maxMatchBytes)(inHaystack, matchStart, matchLength);

    // Check if the first state in the haystack is a valid start state
    isValidStartState <== IsEqual()([startStates[0], currStates[0]]);

    for (var i = 0; i < maxMatchBytes; i++) {
        isWithinPathLength[i] <== LessThan(log2Ceil(maxMatchBytes))([i, matchLength]);

        // Check if the traversal is a valid path
        if (i < maxMatchBytes-2) {
            isWithinPathLengthMinusOne[i] <== LessThan(log2Ceil(maxMatchBytes))([i, matchLength-1]);
            isTransitionLinked[i] <== IsEqual()([nextStates[i], currStates[i+1]]);
            isTransitionLinked[i] * isWithinPathLengthMinusOne[i] === isWithinPathLengthMinusOne[i];
        }


⚠️ Potential issue | 🔴 Critical

Enforce transition linkage for the final step

When matchLength == maxMatchBytes, the constraint block at Line 53 never executes for i = maxMatchBytes - 2, so nextStates[maxMatchBytes-2] is never forced to equal currStates[maxMatchBytes-1]. An attacker can supply an inconsistent final transition and still satisfy all constraints, breaking soundness for full-length matches. Please extend the guard/value array so the last link is checked as well. Example fix:

-    signal isWithinPathLengthMinusOne[maxMatchBytes-2];
+    signal isWithinPathLengthMinusOne[maxMatchBytes-1];
@@
-        if (i < maxMatchBytes-2) {
+        if (i < maxMatchBytes-1) {
             isWithinPathLengthMinusOne[i] <== LessThan(log2Ceil(maxMatchBytes))([i, matchLength-1]);
             isTransitionLinked[i] <== IsEqual()([nextStates[i], currStates[i+1]]);
             isTransitionLinked[i] * isWithinPathLengthMinusOne[i] === isWithinPathLengthMinusOne[i];
         }
🤖 Prompt for AI Agents
In circom/circuits/common/email_addr_regex.circom around lines 34 to 57, the
arrays and loop guard for checking transition linkage stop at maxMatchBytes-2 so
the final link (between nextStates[maxMatchBytes-2] and
currStates[maxMatchBytes-1]) is not enforced when matchLength == maxMatchBytes;
extend the arrays and adjust the loop condition so the last link is validated:
declare isWithinPathLengthMinusOne and isTransitionLinked with size
maxMatchBytes-1 (instead of maxMatchBytes-2 and maxMatchBytes), change the
in-loop guard to if (i < maxMatchBytes-1) and compute
isWithinPathLengthMinusOne[i] with LessThan(log2Ceil(maxMatchBytes))([i,
matchLength-1]) then set isTransitionLinked[i] <== IsEqual()([nextStates[i],
currStates[i+1]]) and constrain isTransitionLinked[i] *
isWithinPathLengthMinusOne[i] === isWithinPathLengthMinusOne[i] so the final
transition is enforced for full-length matches.

Comment on lines +68 to +98
        } else if i + 2 < chars.len() && chars[i + 2] == '<' {
            // Could be: (?<=...) positive lookbehind, or (?<!...) negative lookbehind, or (?<name>...) PCRE named capture
            if i + 3 < chars.len() && (chars[i + 3] == '=' || chars[i + 3] == '!') {
                // Lookbehind assertion: (?<=...) or (?<!...)
                // These are special groups, preserve as-is
                result.push(ch);
            } else {
                // PCRE named capture: (?<name>...)
                // Convert to non-capturing and skip the name
                result.push_str("(?:");
                i += 2; // Skip '?' and '<'
                // Skip until we find the closing '>'
                while i + 1 < chars.len() && chars[i + 1] != '>' {
                    i += 1;
                }
                if i + 1 < chars.len() && chars[i + 1] == '>' {
                    i += 1; // Skip the '>'
                }
            }
        } else if i + 3 < chars.len() && chars[i + 2] == 'P' && chars[i + 3] == '<' {
            // Rust named capture: (?P<name>...)
            // Convert to non-capturing and skip the name
            result.push_str("(?:");
            i += 3; // Skip '?', 'P', and '<'
            // Skip until we find the closing '>'
            while i + 1 < chars.len() && chars[i + 1] != '>' {
                i += 1;
            }
            if i + 1 < chars.len() && chars[i + 1] == '>' {
                i += 1; // Skip the '>'
            }


⚠️ Potential issue | 🟠 Major

Validate named capture group syntax to prevent silent pattern corruption.

The code skips to the closing > for named captures but doesn't validate that it exists. Malformed patterns like (?<name... without a closing > will scan to the end of the pattern, potentially corrupting the output.

Consider adding validation:

                     // Skip until we find the closing '>'
                     while i + 1 < chars.len() && chars[i + 1] != '>' {
                         i += 1;
                     }
                     if i + 1 < chars.len() && chars[i + 1] == '>' {
                         i += 1; // Skip the '>'
+                    } else {
+                        // Malformed named capture - no closing '>'
+                        return format!("(?:MALFORMED_CAPTURE)");
                     }

Or better yet, document that this function expects well-formed regex patterns and should be called after regex validation.

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In compiler/src/utils.rs around lines 68 to 98, the named-capture skipping logic
does not verify a closing '>' and will run to the end of the input on malformed
patterns; update the logic to detect whether the loop found a '>' and handle the
failure explicitly: after the loop check if the terminating '>' was not found
and either (a) return an error/Result indicating a malformed regex so callers
can validate before calling, or (b) restore/emit the original characters (do not
drop them) and stop trying to skip the name (to avoid silently corrupting
output); implement one of these behaviors consistently for both PCRE (?<name>)
and Rust (?P<name>) branches and add a brief comment explaining the chosen
failure mode.

Comment on lines +156 to +166
pub fn validate_cli_template_name(name: &str) -> Result<String, String> {
    // Convert to PascalCase to normalize
    let pascal_name = name.to_pascal_case();

    // Verify it's valid PascalCase
    if pascal_name != name {
        return Err("Template name must be in PascalCase (e.g., ThisIsATemplate)".into());
    }

    Ok(name.to_string())
}


⚠️ Potential issue | 🟡 Minor

Validate against empty template names.

The function doesn't check for empty strings. An empty template name would pass validation since "".to_pascal_case() == "".

Add an early check:

 pub fn validate_cli_template_name(name: &str) -> Result<String, String> {
+    if name.is_empty() {
+        return Err("Template name cannot be empty".into());
+    }
+
     // Convert to PascalCase to normalize
     let pascal_name = name.to_pascal_case();
📝 Committable suggestion


Suggested change
-pub fn validate_cli_template_name(name: &str) -> Result<String, String> {
-    // Convert to PascalCase to normalize
-    let pascal_name = name.to_pascal_case();
-    // Verify it's valid PascalCase
-    if pascal_name != name {
-        return Err("Template name must be in PascalCase (e.g., ThisIsATemplate)".into());
-    }
-    Ok(name.to_string())
-}
+pub fn validate_cli_template_name(name: &str) -> Result<String, String> {
+    if name.is_empty() {
+        return Err("Template name cannot be empty".into());
+    }
+    // Convert to PascalCase to normalize
+    let pascal_name = name.to_pascal_case();
+    // Verify it's valid PascalCase
+    if pascal_name != name {
+        return Err("Template name must be in PascalCase (e.g., ThisIsATemplate)".into());
+    }
+    Ok(name.to_string())
+}
🤖 Prompt for AI Agents
In compiler/src/utils.rs around lines 156 to 166, the validate_cli_template_name
function does not reject empty strings (""), which currently pass since
"".to_pascal_case() == ""; add an early check at the top to return Err("Template
name cannot be empty" or similar) if name.trim().is_empty(), then proceed with
the existing PascalCase normalization and comparison; ensure the error message
is consistent with other validation errors.

Comment on lines +29 to +42
impl From<WasmError> for JsValue {
    fn from(err: WasmError) -> Self {
        // Return a structured error message that JavaScript can parse
        match err {
            WasmError::CompilerError { code, message } => {
                let error_json = format!(
                    r#"{{"type": "CompilerError", "code": "{}", "message": "{}"}}"#,
                    code,
                    message.replace('"', "\\\"").replace('\n', "\\n")
                );
                JsValue::from_str(&error_json)
            }
        }
    }
⚠️ Potential issue | 🟠 Major

Use proper JSON serialization instead of manual string escaping.

The manual JSON construction with string replacement is fragile and doesn't handle all special characters (e.g., \r, \t, backslashes, and control characters). This could lead to invalid JSON or injection issues.

Use serde_json for safe serialization:

+use serde_json::json;
+
 impl From<WasmError> for JsValue {
     fn from(err: WasmError) -> Self {
-        // Return a structured error message that JavaScript can parse
         match err {
             WasmError::CompilerError { code, message } => {
-                let error_json = format!(
-                    r#"{{"type": "CompilerError", "code": "{}", "message": "{}"}}"#,
-                    code,
-                    message.replace('"', "\\\"").replace('\n', "\\n")
-                );
-                JsValue::from_str(&error_json)
+                let error_json = json!({
+                    "type": "CompilerError",
+                    "code": code,
+                    "message": message
+                });
+                JsValue::from_str(&error_json.to_string())
             }
         }
     }

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In compiler/src/wasm.rs around lines 29 to 42, the From<WasmError> impl manually
constructs JSON with string replacements which is unsafe for many characters;
replace the manual formatting with proper serde_json serialization: derive or
build a small serializable struct or use serde_json::json! to produce a valid
JSON string, call serde_json::to_string (or to_string_pretty) and pass the
result to JsValue::from_str, and handle serialization errors (e.g., log and
return a generic error JSON) and add/import serde_json in Cargo.toml and at the
top of the file.
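A minimal standalone sketch of the serde_json approach follows. compiler_error_to_json is a hypothetical helper used only for illustration; in the wasm module the resulting string would still be passed to JsValue::from_str as in the diff above.

```rust
use serde_json::json;

fn compiler_error_to_json(code: &str, message: &str) -> String {
    // json! builds a serde_json::Value; serializing it escapes quotes,
    // newlines, tabs, backslashes and control characters correctly.
    json!({
        "type": "CompilerError",
        "code": code,
        "message": message
    })
    .to_string()
}

fn main() {
    let s = compiler_error_to_json("E1001", "bad \"regex\"\nline 2\tend");
    println!("{s}");
    // {"code":"E1001","message":"bad \"regex\"\nline 2\tend","type":"CompilerError"}
    assert!(serde_json::from_str::<serde_json::Value>(&s).is_ok()); // always valid JSON
}
```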

{
    "parts": [
        { "Pattern": "[A-Za-z0-9!#$%&'*+=?\\-\\^_`{|}~./]+@" },
        { "PublicPattern": ["[A-Za-z0-9.\\-@]+", 128] }
⚠️ Potential issue | 🟠 Major

Remove '@' from domain character class.

The PublicPattern includes @ in the character class [A-Za-z0-9.\\-@]+, but email domains should not contain @ symbols. Since Part 1 already ends with @, allowing it in the domain part would permit malformed emails like user@[email protected].

Apply this fix:

-        { "PublicPattern": ["[A-Za-z0-9.\\-@]+", 128] }
+        { "PublicPattern": ["[A-Za-z0-9.\\-]+", 128] }
📝 Committable suggestion


Suggested change
-        { "PublicPattern": ["[A-Za-z0-9.\\-@]+", 128] }
+        { "PublicPattern": ["[A-Za-z0-9.\\-]+", 128] }
🤖 Prompt for AI Agents
In noir/common/email_domain.json around line 4, the PublicPattern character
class wrongly includes '@' allowing malformed domains; remove the '@' from the
class so the domain regex does not accept '@' characters, and ensure the hyphen
is either escaped or placed last in the class (e.g. [A-Za-z0-9.\-]+ or
[A-Za-z0-9.-]+) while keeping the same length limit (128).

Comment on lines 181 to +232

- [ ] support character class regex grammar (so as to simplify the regular expressions)
- [ ] better compatibility with regex grammar (will need a set of different tests to assert the compatibility between circuit and regex in languages) -->
## License

## Assumptions
Some email providers put not only the sender's email address but also their username to the From field.
ALthough its concrete formats differ among email providers, our FromAddrRegex template assumes that the email address appears at the end of the From field.
If this assumption does not hold, i.e., the username appears after the email address, an adversary can output an arbitrary email address from that template by including a dummy email address in the username.
This project is licensed under the [Specify License Here - e.g., MIT License or Apache 2.0].
⚠️ Potential issue | 🟡 Minor

Fix license placeholder.

The README now ends with “Specify License Here…”, which contradicts the actual MIT license declared elsewhere. Please replace the placeholder with the real license text or link so downstream users get the correct compliance information.

🤖 Prompt for AI Agents
In README.md around lines 229 to 232, the license section still contains a
placeholder; replace "Specify License Here - e.g., MIT License or Apache 2.0"
with the actual MIT license reference by updating the heading to "License" and
either inserting the short MIT license blurb plus year/owner or linking to the
full text (e.g., "This project is licensed under the MIT License - see LICENSE
file or https://opensource.org/licenses/MIT"), ensuring it matches the project's
LICENSE file.

…roup and rerun regex, input and test generation
… and tests - fails input that enters in conditional path
The epsilon closure computation was incorrectly tracking capture groups globally,
causing captures from unrelated conditional branches to leak into other paths.
This resulted in incorrect capture behavior when conditionals preceded capture
groups in regex patterns.

Changes:
- Modified EpsilonClosure to use per_state_captures (BTreeMap) instead of global captures set
- Updated capture tracking to maintain separate capture sets for each epsilon path
- Fixed start state logic to check for start captures across all epsilon paths
- Regenerated Circom and Noir templates with corrected logic
- Updated test inputs to reflect correct capture behavior
- Added email_addr test case demonstrating conditional before capture group
…eneration

Rutefig/reg 579 fix noir proof generation
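The per-path capture tracking described in the epsilon-closure commit above can be pictured roughly as below. This is a minimal sketch; all type and field names besides per_state_captures are assumptions, not the actual EpsilonClosure in compiler/src/ir/intermediate.rs.

```rust
use std::collections::{BTreeMap, BTreeSet};

type StateId = usize;
/// (capture_group_index, is_start_marker)
type Capture = (usize, bool);

/// Before the fix (per the commit message), a single global BTreeSet<Capture>
/// meant a capture seen on *any* epsilon path leaked into every path. After
/// the fix, captures are keyed by the state each epsilon path reaches, so
/// they stay local to that path.
struct PerPathClosure {
    per_state_captures: BTreeMap<StateId, BTreeSet<Capture>>,
}

impl PerPathClosure {
    fn record(&mut self, target: StateId, capture: Capture) {
        self.per_state_captures
            .entry(target)
            .or_default()
            .insert(capture);
    }
}

fn main() {
    let mut closure = PerPathClosure { per_state_captures: BTreeMap::new() };
    // The path through the optional branch reaches state 3 and opens capture group 1...
    closure.record(3, (1, true));
    // ...while the path that skips the branch reaches state 5 and records nothing,
    // so the capture no longer leaks into it.
    assert!(closure.per_state_captures.get(&5).is_none());
    assert!(closure.per_state_captures[&3].contains(&(1, true)));
}
```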
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (2)
circom/circuits/common/email_addr_regex.circom (1)

34-61: Enforce the final transition link for full-length matches

When matchLength == maxMatchBytes, the loop never enters the guard with i = maxMatchBytes - 2, so nextStates[maxMatchBytes-2] is never constrained to equal currStates[maxMatchBytes-1]. That lets an inconsistent final hop satisfy all constraints, breaking soundness. Please extend the helper arrays and guard so the last link is also enforced.

-    signal isWithinPathLengthMinusOne[maxMatchBytes-2];
-    signal isTransitionLinked[maxMatchBytes];
+    signal isWithinPathLengthMinusOne[maxMatchBytes-1];
+    signal isTransitionLinked[maxMatchBytes-1];
@@
-        if (i < maxMatchBytes-2) {
+        if (i < maxMatchBytes-1) {
             isWithinPathLengthMinusOne[i] <== LessThan(log2Ceil(maxMatchBytes))([i, matchLength-1]);
             isTransitionLinked[i] <== IsEqual()([nextStates[i], currStates[i+1]]);
             isTransitionLinked[i] * isWithinPathLengthMinusOne[i] === isWithinPathLengthMinusOne[i];
         }
compiler/src/ir/intermediate.rs (1)

63-109: Accept states validation is still missing.

The past review comment correctly identified that accept_states indices are not validated to be within bounds. While start_states, byte transition targets, and epsilon transition targets are validated, an out-of-bounds accept state could cause issues later.

Apply the suggested fix from the previous review:

         // Validate transition targets are within bounds
         for (state_idx, node) in self.nodes.iter().enumerate() {
             // Check byte transitions
             for destinations in node.byte_transitions.values() {
                 for &dest in destinations {
                     if dest >= self.nodes.len() {
                         return Err(NFAError::InvalidTransition(format!(
                             "Byte transition from state {state_idx} to invalid state {dest}"
                         )));
                     }
                 }
             }

             // Check epsilon transitions
             for &dest in &node.epsilon_transitions {
                 if dest >= self.nodes.len() {
                     return Err(NFAError::InvalidTransition(format!(
                         "Epsilon transition from state {state_idx} to invalid state {dest}"
                     )));
                 }
             }
         }
+
+        // Validate accept states are within bounds
+        for &accept in &self.accept_states {
+            if accept >= self.nodes.len() {
+                return Err(NFAError::InvalidStateId(format!(
+                    "Accept state {accept} is out of bounds"
+                )));
+            }
+        }

         Ok(())
🧹 Nitpick comments (1)
compiler/src/ir/intermediate.rs (1)

432-438: Remove unused PathInfo struct (lines 432-438).

Verification confirms this struct is never referenced elsewhere in the codebase. Remove the definition to eliminate dead code:

-/// Information about captures encountered on a specific epsilon path
-#[derive(Debug, Clone)]
-struct PathInfo {
-    /// The destination state of this path
-    target_state: usize,
-    /// Captures encountered on epsilon transitions along this path
-    captures: BTreeSet<(usize, bool)>,
-}
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3d37c31 and 8f9f916.

📒 Files selected for processing (11)
  • .gitignore (1 hunks)
  • CLAUDE.md (1 hunks)
  • circom/circuits/common/email_addr_regex.circom (1 hunks)
  • circom/circuits/tests/circuits/test_email_addr_regex.circom (1 hunks)
  • circom/circuits/tests/circuits/test_subject_all_regex.circom (1 hunks)
  • circom/circuits/tests/email_addr.test.ts (1 hunks)
  • circom/circuits/tests/subject_all.test.ts (1 hunks)
  • circom/regexes/email_addr.json (1 hunks)
  • compiler/src/ir/intermediate.rs (1 hunks)
  • noir/common/email_addr.json (1 hunks)
  • noir/common/sample_haystacks/circuit_inputs/body_hash_pass_0.json (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • circom/circuits/tests/circuits/test_subject_all_regex.circom
🚧 Files skipped from review as they are similar to previous changes (3)
  • noir/common/email_addr.json
  • noir/common/sample_haystacks/circuit_inputs/body_hash_pass_0.json
  • circom/regexes/email_addr.json
🧰 Additional context used
🧬 Code graph analysis (3)
circom/circuits/tests/subject_all.test.ts (1)
compiler/src/wasm.rs (1)
  • genCircuitInputs (198-217)
circom/circuits/tests/email_addr.test.ts (1)
compiler/src/wasm.rs (1)
  • genCircuitInputs (198-217)
compiler/src/ir/intermediate.rs (5)
compiler/src/backend/circom.rs (2)
  • nfa (229-231)
  • nfa (233-235)
compiler/src/backend/noir.rs (3)
  • nfa (191-193)
  • nfa (195-197)
  • start (430-430)
compiler/src/ir/nfa.rs (2)
  • new (63-71)
  • node (96-99)
compiler/src/lib.rs (1)
  • compile (39-41)
compiler/src/backend/shared.rs (2)
  • path (163-163)
  • path (164-164)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: test
  • GitHub Check: test
🔇 Additional comments (11)
.gitignore (2)

1-19: Verify that inputs.txt and nfa.json are always generated artifacts and never committed source files.

Search of the codebase found no references to either inputs.txt or nfa.json in code, scripts, or documentation, and neither file exists in the repository. While this is consistent with both being generated artifacts, it is inconclusive—these entries may be generated by external tools, planned for future use, or no longer needed. Confirm these files are artifacts generated during normal development/build workflows before finalizing the gitignore changes.


11-16: Verify nested directory handling; the pattern scope may be too narrow.

The patterns **/scripts/*.js and **/scripts/*.d.ts only match files directly in scripts/ directories, not in nested subdirectories like scripts/utils/ or scripts/__tests__/. When there is a separator at the beginning or middle of the pattern, then the pattern is relative to the directory level of the .gitignore file itself. Since compiled output from .ts files typically goes to dist/ or similar separate directories rather than inline with source, the current patterns are appropriate for the legacy/temporary compiled artifacts. However, if the build process ever emits .js files within nested scripts/ subdirectories, those would not be caught. Confirm that your TypeScript build configuration outputs to a directory outside the scripts/ tree (e.g., dist/), or extend the patterns to **/scripts/**/*.js if needed.

CLAUDE.md (6)

210-239: API signatures verified—no issues found.

All four documented Rust API entry points (compile, gen_from_raw, gen_from_decomposed, gen_circuit_inputs) have been verified against compiler/src/lib.rs. Each function signature, including parameter types and return types, matches the documentation in CLAUDE.md exactly. The CompilerResult<T> type alias is also correctly defined. Documentation is accurate.


39-51: All documented build commands align with configuration—no issues found.

Verification confirms that all build commands referenced in CLAUDE.md are properly configured:

  • bun run build and bun run build-release exist in package.json and correctly delegate to the compiler directory
  • cargo build and cargo build --release are executed within the build scripts
  • wasm-pack build --target nodejs is configured in compiler/package.json
  • wasm-pack build --target web is also configured as an additional build target
  • Cargo workspace properly includes compiler as a member

The documentation accurately reflects the build system configuration.


127-151: Pre-push hook implementation verified—no issues found.

The documentation in CLAUDE.md is accurate. Verification confirms:

  • Hook file exists at git-hooks/pre-push with full implementation
  • Postinstall script correctly installs via install-hooks command
  • Template regeneration commands exist and are properly integrated
  • Hook scans commits for compiler/src/ changes, regenerates templates, validates synchronization, and blocks push when needed
  • All described error recovery steps and emergency bypass are correct

153-161: The original review comment is incorrect—circom implementation of this workflow does not exist.

Verification confirms:

  • gen-inputs:noir script exists in scripts/package.json ✓
  • gen-inputs:circom script does not exist (only gen-regex:circom is defined)
  • noir/common/sample_haystacks/ directory exists with JSON source files ✓
  • circom/common/ directory does not exist
  • circom/scripts/ contains only gen-regex.ts, not gen-inputs.ts

The documentation references a test generation workflow for circom that is not implemented. Only the noir implementation has the complete input generation pipeline. Either remove the circom references from the documentation or implement the missing circom workflow.

Likely an incorrect or invalid review comment.


56-67: Disregard this review comment—the premise is incorrect.

The PR does not remove legacy Circom tests. The documentation in CLAUDE.md accurately reflects the current state: test:circom runs the existing Circom circuit tests (body_hash, email_addr, subject_all) that continue to exist in circom/circuits/tests/. No circom test files were modified or removed in this PR.

Likely an incorrect or invalid review comment.


72-80: No issues found; the original review comment's premise is incorrect.

The documentation is accurate. gen-inputs:circom does not exist in package.json because input generation for Circom is not part of the new TypeScript/Bun architecture. The Noir framework has separate template generation (gen-regex:noir) and input generation (gen-inputs:noir) scripts, while Circom only has template generation (gen-regex:circom). CLAUDE.md correctly reflects this architectural choice by documenting only gen-inputs:noir. Listing a non-existent command would be inaccurate.

Likely an incorrect or invalid review comment.

compiler/src/ir/intermediate.rs (3)

148-264: Well-designed epsilon elimination algorithm.

The epsilon elimination implementation is sophisticated and carefully preserves capture group semantics. Notable strengths:

  • Lines 189-196: The "CRITICAL FIX" correctly handles per-path captures for epsilon transitions
  • Lines 221-243: Thoughtful handling of start states to avoid bypassing start captures
  • Lines 199-207: Proper accumulation of end captures from epsilon closures

The algorithm correctly handles the complex interaction between epsilon transitions and capture groups, which is essential for regex patterns with optional groups.


440-496: Excellent regression test coverage.

The test comprehensively validates the epsilon elimination fix for optional patterns before capture groups. It correctly:

  • Tests both paths through the regex (with and without the optional prefix)
  • Validates rejection of non-matching inputs
  • Specifically checks for exactly one capture start event, preventing the duplicate marker bug

This is a high-quality regression test that will prevent future breakage of the epsilon elimination logic.


298-303: No issues found. The review comment is incorrect.

The capture_groups map only contains entries for epsilon transition targets by design. This is evident from add_capture_transition (builder.rs lines 209-230), which is the only place where capture_groups entries are created. In that method, a target is inserted into both epsilon_transitions and capture_groups together—there is no separate storage of byte transition captures in capture_groups.

Therefore, the DFS function at lines 298-303 correctly iterates over all capture_groups entries because all such entries correspond to epsilon targets. The comment "Add captures from this state's epsilon transitions" is accurate.

Comment on lines +245 to +252
## Error Handling

Structured error system with error codes:
- **E1001**: Invalid regex syntax
- **E1002**: Unsupported regex features
- **E3002**: Invalid capture group configuration
- **E5001**: Invalid configuration parameters

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Verify error codes and CompilerError enum structure.

The documented error codes (E1001, E1002, E3002, E5001) should be verified against the actual CompilerError enum implementation. The gap between E3002 and E5001 suggests either:

  • Additional error codes are missing from the documentation
  • There is intentional error code grouping/categorization

Ensure documentation is complete and matches implementation.


🏁 Script executed:

#!/bin/bash
# Description: Extract error codes from CompilerError implementation

echo "=== CompilerError enum and error codes ==="
rg 'enum CompilerError|pub enum CompilerError' compiler/src/ -A 30 -U

echo -e "\n=== Error code constants or variants ==="
rg 'E\d{4}|"E[0-9]{4}"' compiler/src/error.rs 2>/dev/null || \
rg 'E\d{4}|"E[0-9]{4}"' compiler/src/ --type rust

Length of output: 3767


Update documentation to reflect all implemented error codes and their categories.

The documentation lists only 4 error codes (E1001, E1002, E3002, E5001), but the actual ErrorCode enum in compiler/src/error.rs contains 23 codes organized into 6 categories:

  • E1xxx (Regex validation): E1001–E1004
  • E2xxx (NFA construction): E2001–E2004
  • E3xxx (Circuit generation): E3001–E3004
  • E4xxx (Input processing): E4001–E4004
  • E5xxx (Configuration): E5001–E5003
  • E9xxx (Serialization/Internal): E9001–E9003

Update the CLAUDE.md documentation to include all codes and their intended purposes, matching the implementation.

🤖 Prompt for AI Agents
In CLAUDE.md around lines 245 to 252, the Error Handling section currently lists
only four error codes; update this section to enumerate all 23 implemented codes
and their categories to match compiler/src/error.rs. Replace the short list with
the six categories (E1xxx Regex validation: E1001–E1004 with brief purpose for
each, E2xxx NFA construction: E2001–E2004, E3xxx Circuit generation:
E3001–E3004, E4xxx Input processing: E4001–E4004, E5xxx Configuration:
E5001–E5003, E9xxx Serialization/Internal: E9001–E9003) and add a one-line
description for each code or consolidated description per category reflecting
the enum comments/intent.
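For orientation, the grouping could be expressed along these lines. Variant names below are invented for the sketch; the real enum with all 23 codes across the six E-ranges lives in compiler/src/error.rs.

```rust
// Illustrative only: maps the four codes documented in CLAUDE.md onto variants.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy)]
enum ErrorCode {
    InvalidRegexSyntax,        // E1001: regex validation
    UnsupportedRegexFeature,   // E1002: regex validation
    InvalidCaptureGroupConfig, // E3002: circuit generation
    InvalidConfigParameter,    // E5001: configuration
}

impl ErrorCode {
    fn as_str(self) -> &'static str {
        match self {
            ErrorCode::InvalidRegexSyntax => "E1001",
            ErrorCode::UnsupportedRegexFeature => "E1002",
            ErrorCode::InvalidCaptureGroupConfig => "E3002",
            ErrorCode::InvalidConfigParameter => "E5001",
        }
    }
}

fn main() {
    assert_eq!(ErrorCode::InvalidRegexSyntax.as_str(), "E1001");
}
```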
