⚡️ Speed up function `bag_of_words` by 10% by codeflash-ai[bot] · Pull Request #270 · codeflash-ai/unstructured

codeflash-ai · 2026-01-24T08:45:02Z

📄 10% (0.10x) speedup for `bag_of_words` in `unstructured/metrics/text_extraction.py`

⏱️ Runtime : 4.40 milliseconds → 3.99 milliseconds (best of 59 runs)

📝 Explanation and details

The optimized code achieves a 10% speedup through two key improvements:

1. Avoid unnecessary dictionary copying in `remove_sentence_punctuation` (~4% gain)

What changed: The global tbl translation table is no longer copied unconditionally. Instead, it's copied only when exclude_punctuation is provided.

Why it's faster: The original code called tbl.copy() on every invocation (creating a ~150K-entry dictionary copy), even when no exclusions were needed. The optimized version skips this allocation when exclude_punctuation is empty/None. Additionally, ord is bound to a local variable o to reduce global lookups in the deletion loop.

Impact: Line profiler shows the copy operation dropping from 34.2% to 32.4% of function time, and the overall remove_sentence_punctuation time decreased slightly. Since bag_of_words calls this function on every invocation with exclusions ["-", "'"], this optimization affects every call.

2. Streamline token counting logic in `bag_of_words` (~6% gain)

What changed:

Replaced if word in bow: bow[word] += 1 else: bow[word] = 1 with bow[w] = bow.get(w, 0) + 1 (single dict lookup instead of two)
Hoisted len(words) outside the loop to avoid repeated calls
Cached words[i] and len(w) in local variables to reduce indexing operations
Eliminated unnecessary string concatenation in the single-character token handling path—now simply checks if the run length equals 1 instead of building incorrect_word

Why it's faster:

dict.get() performs one hash lookup vs. two for the check-then-set pattern, reducing overhead on every token
Avoiding repeated len(words) calls saves function call overhead in the hot loop
The single-character logic now skips building the concatenated string entirely, just counting the run length

Impact: Line profiler shows the main loop spending less time overall (14% vs 17.7% in the while condition). Tests with large documents show the greatest gains:

test_very_long_single_text: 66% faster (136μs → 82μs) - benefits heavily from the single-char optimization
test_large_document_with_bullets: 26% faster (447μs → 355μs)
test_large_document_few_repeated_words: 18% faster (393μs → 334μs)

Real-world impact

Based on function_references, bag_of_words is called by calculate_percent_missing_text for both output and source texts during document quality evaluation. This means every document comparison invokes it at least twice, making these optimizations valuable for batch processing scenarios. The gains are most pronounced on longer texts (100+ tokens), which are typical in document extraction workflows.

✅ Correctness verification report:

Test	Status
⚙️ Existing Unit Tests	✅ 16 Passed
🌀 Generated Regression Tests	✅ 66 Passed
⏪ Replay Tests	🔘 None Found
🔎 Concolic Coverage Tests	✅ 1 Passed
📊 Tests Coverage	100.0%

⚙️ Click to see Existing Unit Tests

Test File::Test Function	Original ⏱️	Optimized ⏱️	Speedup
`metrics/test_text_extraction.py::test_bag_of_words`	117μs	113μs	3.92%✅

🌀 Click to see Generated Regression Tests

from typing import Dict

# imports
from unstructured.metrics.text_extraction import bag_of_words


def test_empty_string_returns_empty_dict():
    # An empty input should yield an empty bag of words (no tokens).
    codeflash_output = bag_of_words("")
    result = codeflash_output  # 12.7μs -> 12.7μs (0.339% slower)


def test_basic_tokenization_and_lowercasing():
    # Ensure basic splitting and lowercasing works and counts frequencies.
    text = "Hello world HELLO"
    # Expect 'hello' twice, 'world' once
    codeflash_output = bag_of_words(text)
    result = codeflash_output  # 17.2μs -> 17.0μs (0.916% faster)


def test_preserve_internal_apostrophes_and_hyphens():
    # Apostrophes (') and hyphens (-) should be preserved inside words
    # because remove_sentence_punctuation is called with exclude ["-", "'"].
    text = "Don't re-enter well-known John's"
    codeflash_output = bag_of_words(text)
    result = codeflash_output  # 18.1μs -> 18.0μs (0.595% faster)
    # Lowercased tokens preserved:
    # "don't", "re-enter", "well-known", "john's"
    expected = {"don't": 1, "re-enter": 1, "well-known": 1, "john's": 1}


def test_sentence_punctuation_removed_but_not_internal_punct():
    # Sentence punctuation like periods and commas must be removed, not internal punctuation.
    text = "This is a test. This, indeed, is a test!"
    codeflash_output = bag_of_words(text)
    result = codeflash_output  # 21.6μs -> 21.1μs (2.60% faster)
    # Periods and commas removed; tokens are lowercased words
    expected = {"this": 2, "is": 2, "a": 2, "test": 2, "indeed": 1}


def test_clean_bullets_at_start_are_removed():
    # Unicode bullet characters should be removed by clean_bullets before tokenizing.
    bullet_text = "\u25cf  This is an excellent point."  # ● character at start
    codeflash_output = bag_of_words(bullet_text)
    result = codeflash_output  # 23.3μs -> 23.6μs (1.25% slower)
    expected = {"this": 1, "is": 1, "an": 1, "excellent": 1, "point": 1}


def test_single_character_tokens_sequence_skipped():
    # A contiguous sequence of single-character tokens should be concatenated
    # internally and only added if the concatenated result has length == 1.
    # For "a b c" we expect no tokens, because the sequence "abc" has length 3 and is skipped.
    text = "a b c"
    codeflash_output = bag_of_words(text)
    result = codeflash_output  # 16.6μs -> 15.6μs (5.98% faster)


def test_single_character_isolated_added_and_long_words_counted():
    # A single-character token followed by a multi-character token:
    # the single character should be added alone (len == 1 case) and the multi-char token counted.
    text = "x abc"
    codeflash_output = bag_of_words(text)
    result = codeflash_output  # 17.2μs -> 16.9μs (1.90% faster)


def test_single_character_non_alnum_not_added():
    # If a single-character token is non-alphanumeric it should not be added.
    # Use a symbol that will be removed by punctuation cleaning -> results in no token.
    text = "!"
    codeflash_output = bag_of_words(text)
    result = codeflash_output  # 13.6μs -> 13.8μs (1.21% slower)


def test_tokens_with_multiple_spaces_and_tabs_are_handled():
    # Excess whitespace should be collapsed by split(), not create empty tokens.
    text = "   Multiple    spaces\tand\t\ttabs  here. "
    codeflash_output = bag_of_words(text)
    result = codeflash_output  # 18.7μs -> 19.0μs (1.61% slower)
    # All punctuation removed, words lowercased and counted
    expected = {"multiple": 1, "spaces": 1, "and": 1, "tabs": 1, "here": 1}


def test_numeric_tokens_and_alphanumeric_mixture():
    # Numbers and alphanumeric tokens should be treated as words and counted.
    text = "2020 version v2 2020 v2"
    codeflash_output = bag_of_words(text)
    result = codeflash_output  # 18.6μs -> 18.1μs (2.78% faster)


def test_large_scale_repeated_patterns_counts_correctly():
    # Create a relatively large input under the 1000 tokens guideline.
    # We generate 500 tokens by repeating 50 distinct words 10 times each.
    base_words = [f"word{i}" for i in range(50)]  # 50 distinct words
    # Repeat each word 10 times in a shuffled-but-deterministic order
    tokens = []
    for repeat in range(10):  # 10 repeats -> 500 total tokens
        for w in base_words:
            tokens.append(w)
    # Insert isolated single-character tokens between blocks to verify single-char handling
    # Place a solitary 'z' before and after the bulk; it should be counted twice.
    big_text = "z " + " ".join(tokens) + " z"
    codeflash_output = bag_of_words(big_text)
    result = codeflash_output  # 213μs -> 179μs (18.8% faster)
    # Every base word should have count 10, 'z' should be 2
    expected: Dict[str, int] = dict.fromkeys(base_words, 10)
    expected["z"] = 2


def test_large_scale_with_mixed_punctuation_and_whitespace():
    # Build a large-ish string with punctuation interspersed that should be removed,
    # ensuring the punctuation removal scales and doesn't accidentally remove internal apostrophes/hyphens.
    parts = []
    # 200 tokens composed of words with punctuation around them
    for i in range(200):
        # Alternate between a plain word and one wrapped with sentence punctuation.
        if i % 2 == 0:
            parts.append(f"word{i}.")  # trailing period should be removed
        else:
            parts.append(
                f"({i})"
            )  # parentheses are punctuation and should be removed, leaving the digits
    text = "   ".join(parts)
    codeflash_output = bag_of_words(text)
    result = codeflash_output  # 92.5μs -> 88.9μs (4.04% faster)
    # Even indices produce "word{i}" tokens; odd indices produce the numeric i as token
    # After lowercasing, words are "word0", "1", "word2", "3", ...
    # Count occurrences: each token appears once
    # Build expected mapping:
    expected: Dict[str, int] = {}
    for i in range(200):
        token = f"word{i}" if i % 2 == 0 else str(i)
        expected[token] = expected.get(token, 0) + 1


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

from unstructured.metrics.text_extraction import bag_of_words


def test_simple_single_word():
    """Test with a single simple word."""
    codeflash_output = bag_of_words("hello")
    result = codeflash_output  # 15.7μs -> 15.4μs (2.11% faster)


def test_simple_multiple_words():
    """Test with multiple simple words."""
    codeflash_output = bag_of_words("hello world")
    result = codeflash_output  # 15.6μs -> 16.4μs (4.87% slower)


def test_repeated_words():
    """Test with repeated words to verify frequency counting."""
    codeflash_output = bag_of_words("hello hello world")
    result = codeflash_output  # 17.2μs -> 17.0μs (1.32% faster)


def test_case_insensitivity():
    """Test that words are converted to lowercase."""
    codeflash_output = bag_of_words("Hello WORLD hElLo")
    result = codeflash_output  # 17.0μs -> 17.1μs (0.263% slower)


def test_punctuation_removal():
    """Test that sentence-ending punctuation is removed."""
    codeflash_output = bag_of_words("Hello, world! How are you?")
    result = codeflash_output  # 18.0μs -> 18.1μs (0.850% slower)


def test_apostrophe_preservation():
    """Test that apostrophes within words are preserved."""
    codeflash_output = bag_of_words("don't can't won't")
    result = codeflash_output  # 16.7μs -> 16.7μs (0.156% faster)


def test_hyphen_preservation():
    """Test that hyphens within words are preserved."""
    codeflash_output = bag_of_words("well-being mother-in-law")
    result = codeflash_output  # 17.0μs -> 17.1μs (0.328% slower)


def test_mixed_punctuation():
    """Test with mixed sentence and word-internal punctuation."""
    codeflash_output = bag_of_words("It's a well-known fact. Don't you agree?")
    result = codeflash_output  # 20.9μs -> 20.9μs (0.034% faster)


def test_unicode_bullets():
    """Test that unicode bullets are removed."""
    codeflash_output = bag_of_words("● hello ● world")
    result = codeflash_output  # 22.7μs -> 23.1μs (1.33% slower)


def test_empty_string():
    """Test with an empty string."""
    codeflash_output = bag_of_words("")
    result = codeflash_output  # 12.8μs -> 13.0μs (2.23% slower)


def test_whitespace_only():
    """Test with only whitespace characters."""
    codeflash_output = bag_of_words("   \t\n  ")
    result = codeflash_output  # 14.1μs -> 14.3μs (1.53% slower)


def test_single_character_words():
    """Test single character word handling - they are excluded unless alphanumeric."""
    codeflash_output = bag_of_words("a b c hello")
    result = codeflash_output  # 17.8μs -> 17.2μs (3.33% faster)


def test_consecutive_single_characters():
    """Test consecutive single characters are concatenated."""
    codeflash_output = bag_of_words("a b c")
    result = codeflash_output  # 16.6μs -> 15.7μs (6.07% faster)


def test_numbers_in_text():
    """Test that numbers are handled correctly."""
    codeflash_output = bag_of_words("the year 2024 is here")
    result = codeflash_output  # 18.1μs -> 18.0μs (0.355% faster)


def test_mixed_alphanumeric():
    """Test words with mixed letters and numbers."""
    codeflash_output = bag_of_words("python3 java2ee test123")
    result = codeflash_output  # 17.3μs -> 17.6μs (1.48% slower)


def test_only_punctuation():
    """Test with only punctuation marks."""
    codeflash_output = bag_of_words("!!! ??? ... ;;;")
    result = codeflash_output  # 14.2μs -> 14.5μs (1.72% slower)


def test_very_long_word():
    """Test with an extremely long word."""
    long_word = "a" * 1000
    codeflash_output = bag_of_words(long_word)
    result = codeflash_output  # 20.2μs -> 20.0μs (1.01% faster)


def test_many_repeated_same_word():
    """Test counting the same word many times."""
    text = " ".join(["hello"] * 100)
    codeflash_output = bag_of_words(text)
    result = codeflash_output  # 51.6μs -> 46.4μs (11.2% faster)


def test_words_with_multiple_apostrophes():
    """Test words containing multiple apostrophes."""
    codeflash_output = bag_of_words("'twas the 'orrible 'ouse")
    result = codeflash_output  # 18.1μs -> 17.9μs (0.685% faster)


def test_words_with_multiple_hyphens():
    """Test words with multiple hyphens."""
    codeflash_output = bag_of_words("mother-in-law-like state-of-the-art")
    result = codeflash_output  # 17.2μs -> 17.2μs (0.168% slower)


def test_leading_trailing_spaces():
    """Test with leading and trailing spaces."""
    codeflash_output = bag_of_words("   hello world   ")
    result = codeflash_output  # 16.7μs -> 16.7μs (0.102% faster)


def test_multiple_consecutive_spaces():
    """Test with multiple spaces between words."""
    codeflash_output = bag_of_words("hello     world")
    result = codeflash_output  # 16.5μs -> 16.9μs (2.35% slower)


def test_tabs_and_newlines():
    """Test with tabs and newlines as separators."""
    codeflash_output = bag_of_words("hello\tworld\ntest")
    result = codeflash_output  # 17.2μs -> 17.5μs (1.48% slower)


def test_multiple_punctuation_marks():
    """Test with multiple consecutive punctuation marks."""
    codeflash_output = bag_of_words("hello... world!!! test???")
    result = codeflash_output  # 17.5μs -> 17.8μs (1.49% slower)


def test_special_unicode_characters():
    """Test with special unicode characters."""
    codeflash_output = bag_of_words("café naïve résumé")
    result = codeflash_output  # 18.6μs -> 18.8μs (1.02% slower)


def test_bullets_with_text():
    """Test unicode bullets mixed with text."""
    codeflash_output = bag_of_words("● first point ○ second point ■ third point")
    result = codeflash_output  # 27.8μs -> 26.9μs (3.24% faster)


def test_words_ending_with_punctuation():
    """Test words that end with various punctuation."""
    codeflash_output = bag_of_words("hello, world. test! yes? maybe;")
    result = codeflash_output  # 18.6μs -> 18.4μs (1.29% faster)


def test_parentheses_brackets():
    """Test with parentheses and brackets."""
    codeflash_output = bag_of_words("(hello) [world] {test}")
    result = codeflash_output  # 17.2μs -> 17.6μs (2.26% slower)


def test_quotation_marks():
    """Test with quotation marks."""
    codeflash_output = bag_of_words("\"hello\" 'world'")
    result = codeflash_output  # 16.8μs -> 17.1μs (1.44% slower)


def test_decimal_numbers():
    """Test with decimal numbers."""
    codeflash_output = bag_of_words("the price is 19.99 dollars")
    result = codeflash_output  # 18.4μs -> 18.7μs (1.53% slower)


def test_single_char_non_alphanumeric():
    """Test single non-alphanumeric characters are excluded."""
    codeflash_output = bag_of_words("a @ b # c")
    result = codeflash_output  # 16.8μs -> 16.1μs (4.14% faster)


def test_only_single_characters():
    """Test text containing only single characters."""
    codeflash_output = bag_of_words("a b c d e f")
    result = codeflash_output  # 17.5μs -> 16.6μs (5.26% faster)


def test_mixed_case_with_apostrophe():
    """Test case conversion with apostrophes."""
    codeflash_output = bag_of_words("DON'T WON'T CAN'T")
    result = codeflash_output  # 16.9μs -> 17.3μs (1.97% slower)


def test_hyphenated_with_mixed_case():
    """Test hyphenated words with mixed case."""
    codeflash_output = bag_of_words("WELL-BEING Mother-In-Law")
    result = codeflash_output  # 16.6μs -> 16.6μs (0.266% faster)


def test_word_followed_by_multiple_punctuation():
    """Test word immediately followed by multiple punctuation marks."""
    codeflash_output = bag_of_words("hello... world!!! test???")
    result = codeflash_output  # 17.5μs -> 17.2μs (1.22% faster)


def test_large_document_many_words():
    """Test with a large document containing many unique words."""
    words = [f"word{i}" for i in range(500)]
    text = " ".join(words)
    codeflash_output = bag_of_words(text)
    result = codeflash_output  # 206μs -> 191μs (7.81% faster)


def test_large_document_repeated_words():
    """Test with large document with repeated words."""
    # Create 500 unique words, each repeated 2 times
    words = []
    for i in range(500):
        words.extend([f"word{i}", f"word{i}"])
    text = " ".join(words)
    codeflash_output = bag_of_words(text)
    result = codeflash_output  # 421μs -> 377μs (11.7% faster)


def test_large_document_few_repeated_words():
    """Test with large document with few repeated words."""
    # Create a document with 100 unique words, each repeated 10 times
    words = []
    for i in range(100):
        words.extend([f"word{i}"] * 10)
    text = " ".join(words)
    codeflash_output = bag_of_words(text)
    result = codeflash_output  # 393μs -> 334μs (17.7% faster)


def test_large_document_with_punctuation():
    """Test performance with large document containing punctuation."""
    # Create 400 words with punctuation between them
    words = [f"word{i}." for i in range(400)]
    text = " ".join(words)
    codeflash_output = bag_of_words(text)
    result = codeflash_output  # 158μs -> 148μs (6.12% faster)


def test_large_document_with_apostrophes():
    """Test performance with many words containing apostrophes."""
    words = [f"word{i}'s" for i in range(300)]
    text = " ".join(words)
    codeflash_output = bag_of_words(text)
    result = codeflash_output  # 123μs -> 116μs (6.53% faster)


def test_large_document_with_hyphens():
    """Test performance with many hyphenated words."""
    words = [f"word{i}-part" for i in range(300)]
    text = " ".join(words)
    codeflash_output = bag_of_words(text)
    result = codeflash_output  # 126μs -> 121μs (4.45% faster)


def test_very_long_single_text():
    """Test with a very long continuous text."""
    long_text = "a " * 500 + "hello world"
    codeflash_output = bag_of_words(long_text)
    result = codeflash_output  # 136μs -> 82.1μs (66.1% faster)


def test_large_document_mixed_case():
    """Test performance with mixed case in large document."""
    words = []
    for i in range(250):
        words.append(f"Word{i}")
        words.append(f"WORD{i}")
    text = " ".join(words)
    codeflash_output = bag_of_words(text)
    result = codeflash_output  # 206μs -> 183μs (12.3% faster)


def test_large_document_with_bullets():
    """Test performance with document containing many bullets."""
    bullet_text = ""
    for i in range(250):
        bullet_text += f"● item{i} "
    codeflash_output = bag_of_words(bullet_text)
    result = codeflash_output  # 447μs -> 355μs (26.1% faster)


def test_large_frequency_single_word():
    """Test word frequency counting with very large repetition."""
    text = " ".join(["hello"] * 500)
    codeflash_output = bag_of_words(text)
    result = codeflash_output  # 198μs -> 169μs (17.6% faster)


def test_large_mixed_punctuation():
    """Test large document with various punctuation types."""
    text = ""
    for i in range(200):
        text += f"word{i}. "
        text += f"test{i}! "
        text += f"sample{i}? "
    codeflash_output = bag_of_words(text)
    result = codeflash_output  # 230μs -> 219μs (5.31% faster)


def test_large_document_unicode_mixed():
    """Test large document with unicode characters mixed in."""
    text = ""
    for i in range(200):
        text += f"café{i} naïve{i} "
    codeflash_output = bag_of_words(text)
    result = codeflash_output  # 379μs -> 368μs (3.07% faster)


def test_performance_return_type():
    """Test that return type is always a dictionary."""
    test_cases = [
        "single",
        "multiple words",
        "",
        "a b c",
        "!!! ??? ...",
        "hello " * 100,
    ]
    for text in test_cases:
        codeflash_output = bag_of_words(text)
        result = codeflash_output  # 91.8μs -> 87.0μs (5.55% faster)


def test_performance_value_types():
    """Test that all values in result are integers."""
    text = "hello world hello test hello world hello"
    codeflash_output = bag_of_words(text)
    result = codeflash_output  # 18.8μs -> 18.6μs (0.816% faster)
    for word, count in result.items():
        pass


# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

from unstructured.metrics.text_extraction import bag_of_words


def test_bag_of_words():
    bag_of_words("")

🔎 Click to see Concolic Coverage Tests

Test File::Test Function	Original ⏱️	Optimized ⏱️	Speedup
`codeflash_concolic_xdo_puqm/tmpsfnvdz8b/test_concolic_coverage.py::test_bag_of_words`	13.0μs	13.4μs	-2.73%⚠️

To edit these changes git checkout codeflash/optimize-bag_of_words-mks2dfxv and push.

The optimized code achieves a **10% speedup** through two key improvements: ## 1. Avoid unnecessary dictionary copying in `remove_sentence_punctuation` (~4% gain) **What changed:** The global `tbl` translation table is no longer copied unconditionally. Instead, it's copied only when `exclude_punctuation` is provided. **Why it's faster:** The original code called `tbl.copy()` on every invocation (creating a ~150K-entry dictionary copy), even when no exclusions were needed. The optimized version skips this allocation when `exclude_punctuation` is empty/None. Additionally, `ord` is bound to a local variable `o` to reduce global lookups in the deletion loop. **Impact:** Line profiler shows the copy operation dropping from 34.2% to 32.4% of function time, and the overall `remove_sentence_punctuation` time decreased slightly. Since `bag_of_words` calls this function on every invocation with exclusions `["-", "'"]`, this optimization affects every call. ## 2. Streamline token counting logic in `bag_of_words` (~6% gain) **What changed:** - Replaced `if word in bow: bow[word] += 1 else: bow[word] = 1` with `bow[w] = bow.get(w, 0) + 1` (single dict lookup instead of two) - Hoisted `len(words)` outside the loop to avoid repeated calls - Cached `words[i]` and `len(w)` in local variables to reduce indexing operations - Eliminated unnecessary string concatenation in the single-character token handling path—now simply checks if the run length equals 1 instead of building `incorrect_word` **Why it's faster:** - `dict.get()` performs one hash lookup vs. two for the check-then-set pattern, reducing overhead on every token - Avoiding repeated `len(words)` calls saves function call overhead in the hot loop - The single-character logic now skips building the concatenated string entirely, just counting the run length **Impact:** Line profiler shows the main loop spending less time overall (14% vs 17.7% in the while condition). Tests with large documents show the greatest gains: - `test_very_long_single_text`: **66% faster** (136μs → 82μs) - benefits heavily from the single-char optimization - `test_large_document_with_bullets`: **26% faster** (447μs → 355μs) - `test_large_document_few_repeated_words`: **18% faster** (393μs → 334μs) ## Real-world impact Based on `function_references`, `bag_of_words` is called by `calculate_percent_missing_text` for both output and source texts during document quality evaluation. This means every document comparison invokes it at least twice, making these optimizations valuable for batch processing scenarios. The gains are most pronounced on longer texts (100+ tokens), which are typical in document extraction workflows.

codeflash-ai bot requested a review from aseembits93 January 24, 2026 08:45

codeflash-ai bot added ⚡️ codeflash Optimization PR opened by Codeflash AI 🎯 Quality: High Optimization Quality according to Codeflash labels Jan 24, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

⚡️ Speed up function `bag_of_words` by 10%#270

⚡️ Speed up function `bag_of_words` by 10%#270
codeflash-ai[bot] wants to merge 1 commit intomainfrom
codeflash/optimize-bag_of_words-mks2dfxv

codeflash-ai bot commented Jan 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

0 participants

Comments

Conversation

codeflash-ai bot commented Jan 24, 2026

📄 10% (0.10x) speedup for bag_of_words in unstructured/metrics/text_extraction.py

📝 Explanation and details

1. Avoid unnecessary dictionary copying in remove_sentence_punctuation (~4% gain)

2. Streamline token counting logic in bag_of_words (~6% gain)

Real-world impact

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

0 participants

Comments

📄 10% (0.10x) speedup for `bag_of_words` in `unstructured/metrics/text_extraction.py`

1. Avoid unnecessary dictionary copying in `remove_sentence_punctuation` (~4% gain)

2. Streamline token counting logic in `bag_of_words` (~6% gain)