⚡️ Speed up function clean_prefix by 104% #257
Open
codeflash-ai[bot] wants to merge 1 commit into main
The optimized code achieves a **104% speedup** (about 2x faster) by avoiding regex lookup and matching in the common case where the pattern is a simple literal string rather than a true regular expression.
## Key Optimization Strategy
**What changed:**
1. **Fast-path for empty patterns**: Immediately returns the text (possibly lstripped) without any regex processing.
2. **Literal string detection**: Checks if the pattern contains regex metacharacters (`.^$*+?{}[]\\|()`). If not, treats it as a literal string.
3. **Direct string operations for literals**: Uses `str.startswith()` (case-sensitive) or `str.casefold()` comparison (case-insensitive) with simple slicing instead of regex substitution.
4. **Compiled regex fallback**: Only for true regex patterns, compiles the pattern once and uses `compiled.sub()` instead of `re.sub()`.
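The four steps above can be sketched as follows. This is a hypothetical reconstruction from the description, not the merged code: the name `clean_prefix_sketch`, the `ignore_case` and `strip` parameters, and the `^` anchoring in the fallback are assumptions made for illustration.

```python
import re

# Metacharacter list taken from the PR description.
_REGEX_METACHARS = frozenset(".^$*+?{}[]\\|()")


def clean_prefix_sketch(
    text: str, pattern: str, ignore_case: bool = False, strip: bool = True
) -> str:
    """Hypothetical sketch of the optimized prefix removal (assumed signature)."""
    # 1. Fast path: an empty pattern removes nothing.
    if not pattern:
        return text.lstrip() if strip else text

    # 2. Literal detection: no metacharacters means the regex engine is not needed.
    if not any(ch in _REGEX_METACHARS for ch in pattern):
        plen = len(pattern)
        # 3. Direct string operations for literals.
        if ignore_case:
            # Note: casefold() and re.IGNORECASE can disagree on some
            # non-ASCII text; this sketch is only checked on ASCII input.
            matched = text[:plen].casefold() == pattern.casefold()
        else:
            matched = text.startswith(pattern)
        cleaned = text[plen:] if matched else text
    else:
        # 4. Regex fallback: compile once, then substitute at the start of the text.
        flags = re.IGNORECASE if ignore_case else 0
        compiled = re.compile(rf"^{pattern}", flags)
        cleaned = compiled.sub("", text)

    return cleaned.lstrip() if strip else cleaned


print(clean_prefix_sketch("SUMMARY: A great day", "SUMMARY:"))  # A great day
```

Only patterns that contain at least one metacharacter reach the `re.compile()` branch; everything else is handled by `startswith()` or a case-folded comparison plus a slice.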
## Why This Is Faster
**Line profiler reveals the bottleneck**: In the original code, 99.7% of time (80ms out of 80.2ms) was spent in the single `re.sub()` call. This is because:
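- `re.sub()` receives the pattern as a string, so every call must first resolve it to a compiled pattern (a compile on first use, a cache lookup afterwards) before any matching happens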
- Even simple literal patterns like `"PREFIX"` go through full regex matching machinery
- The overhead is significant for short strings and simple patterns
**The optimization eliminates this overhead** by:
- **Literal prefix removal**: `text.startswith(pattern)` followed by `text[plen:]` bypasses the regex engine entirely (pure string operations vs. state machine execution); the measured gain ranges from 40-60% on short inputs to 500-1900% on large texts
- **Empty pattern handling**: Immediately returns without any processing (400-500% faster in tests)
- **Single regex compilation**: When regex is needed, compiling once and reusing avoids repeated compilation overhead
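To reproduce the comparison locally, a micro-benchmark along these lines may help. It is a minimal sketch: the helper names `strip_with_regex` and `strip_with_startswith` and the sample inputs are made up for illustration, and the timings will vary by machine and Python version.

```python
import re
import timeit

TEXT = "PREFIX some body text"
PATTERN = "PREFIX"  # a literal: no regex metacharacters


def strip_with_regex(text: str, pattern: str) -> str:
    # Each call goes through re.sub's pattern lookup and the matching engine.
    return re.sub(rf"^{pattern}", "", text)


def strip_with_startswith(text: str, pattern: str) -> str:
    # Pure string operations: one prefix comparison and one slice.
    return text[len(pattern):] if text.startswith(pattern) else text


if __name__ == "__main__":
    for fn in (strip_with_regex, strip_with_startswith):
        seconds = timeit.timeit(lambda: fn(TEXT, PATTERN), number=100_000)
        print(f"{fn.__name__}: {seconds:.4f} s per 100,000 calls")
```

Both helpers return the same string for this input, so any difference in the printed timings comes from the machinery used rather than from the result.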
## Performance by Test Case Type
**Improvements (40-1900% faster):**
- Empty patterns: 400-500% speedup (`test_basic_empty_pattern`, `test_edge_empty_and_empty`)
- Large texts with literal patterns: 500-1900% speedup (`test_large_scale_very_long_text`, `test_large_scale_long_text_no_match`)
- Simple literal prefixes: 40-60% speedup (most basic tests)
- Case-insensitive literal matching: 50-55% speedup using `casefold()`
**Slower cases (14-85% slower):**
- True regex patterns with metacharacters: 14-30% slower due to added literal-detection overhead and regex compilation
- Very long patterns (500+ chars): 53-85% slower due to character-by-character metacharacter checking
The optimization trades a penalty on true regex patterns and very long patterns for large gains on literal strings, which make up the majority of patterns in the test suite.
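The cost of literal detection on long patterns can be illustrated with a small sketch. The `is_literal` helper below is an assumed form of the check (the description only says the pattern is scanned character by character for metacharacters), and the sample patterns are invented.

```python
import timeit

# Metacharacter list taken from the PR description.
REGEX_METACHARS = frozenset(".^$*+?{}[]\\|()")


def is_literal(pattern: str) -> bool:
    """Return True when no character of `pattern` is a regex metacharacter."""
    # A literal pattern forces a scan of every character; a regex pattern
    # stops at its first metacharacter.
    return not any(ch in REGEX_METACHARS for ch in pattern)


short_literal = "PREFIX"
long_literal = "A" * 500          # all 500 characters are scanned
long_regex = "A" * 500 + r"\d+"   # scanned up to the backslash, then sent to the regex path anyway

if __name__ == "__main__":
    samples = [
        ("short literal", short_literal),
        ("long literal", long_literal),
        ("long regex", long_regex),
    ]
    for name, pattern in samples:
        seconds = timeit.timeit(lambda: is_literal(pattern), number=20_000)
        print(f"{name:>13}: literal={is_literal(pattern)}, {seconds:.4f} s per 20,000 checks")
```

The scan is linear in the pattern length, which is consistent with the slowdown reported for patterns of 500+ characters: the check is paid in full before the regex fallback even starts.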
## Impact Assessment
Since `function_references` is not available, the optimization's value depends on how often `clean_prefix()` is called:
- **High-frequency hot paths**: The 2x average speedup will significantly reduce cumulative overhead, especially if most calls use literal patterns (common for prefix removal in text processing pipelines)
- **Large text processing**: Tests show 500-1900% speedups (roughly 6-20x) for texts >10KB with literal patterns
- **Batch operations**: The optimization compounds when processing many documents
The optimization is **safe and behavior-preserving** on the test suite: all test cases pass with identical output, and the regex fallback ensures backward compatibility for complex patterns.
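A behavior-preservation check of this kind can be sketched as a differential test that runs a regex-based reference and a fast-path candidate on the same inputs. Both functions below are hypothetical stand-ins (the candidate covers literal patterns only, and both are assumed to left-strip the result); they are not the code under review.

```python
import re
from typing import Callable, List, Sequence, Tuple

Case = Tuple[str, str]  # (text, pattern)


def reference_clean_prefix(text: str, pattern: str) -> str:
    # Assumed original behavior: substitute at the start of the text, then lstrip.
    return re.sub(rf"^{pattern}", "", text).lstrip()


def candidate_clean_prefix(text: str, pattern: str) -> str:
    # Literal-only stand-in for the fast path (hypothetical).
    cleaned = text[len(pattern):] if text.startswith(pattern) else text
    return cleaned.lstrip()


def find_mismatches(
    reference: Callable[[str, str], str],
    candidate: Callable[[str, str], str],
    cases: Sequence[Case],
) -> List[Case]:
    """Return every case on which the two implementations disagree."""
    return [(t, p) for t, p in cases if reference(t, p) != candidate(t, p)]


LITERAL_CASES: List[Case] = [
    ("SUMMARY: A great day", "SUMMARY:"),
    ("summary: lowercase does not match", "SUMMARY:"),
    ("no prefix here", "PREFIX"),
    ("   leading spaces", ""),
    ("", ""),
    ("PREFIXPREFIX twice", "PREFIX"),
]

print(find_mismatches(reference_clean_prefix, candidate_clean_prefix, LITERAL_CASES))  # []
```

An empty list means the two implementations agree on every listed case; it says nothing about inputs outside the list, so the case set should mirror the patterns actually used by callers.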
📄 104% (1.04x) speedup for `clean_prefix` in `unstructured/cleaners/core.py`
⏱️ Runtime: 1.50 milliseconds → 735 microseconds (best of 32 runs)
✅ Correctness verification report:
⚙️ Existing Unit Tests: `cleaners/test_core.py::test_clean_prefix`
🌀 Generated Regression Tests
To edit these changes, run `git checkout codeflash/optimize-clean_prefix-mkrvyyun` and push.