⚡️ Speed up function bytes_string_to_string by 2,121%#259
Open · codeflash-ai[bot] wants to merge 1 commit into main

## Conversation
The optimized code achieves a **21x speedup** (2121%) by replacing an inefficient character-by-character byte construction with Python's native `encode()` method.
## Key Optimization
**Original approach:**
```python
text_bytes = bytes([ord(char) for char in text])
```
- Creates a list comprehension iterating over every character
- Calls `ord()` for each character individually
- Constructs an intermediate list in memory
- Converts the list to bytes
- Line profiler shows: **28.7ms** (92.4% of total time)
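For context, the original function likely followed this pattern (the name and behavior are taken from this PR; the exact body is a reconstruction, not the literal source):

```python
def bytes_string_to_string(text: str, encoding: str = "utf-8") -> str:
    """Convert a string whose characters encode raw byte values (ord 0-255)
    into a decoded string -- the original, slow pattern."""
    # Build the byte sequence one character at a time (Python-level loop)
    text_bytes = bytes([ord(char) for char in text])
    return text_bytes.decode(encoding)
```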
**Optimized approach:**
```python
text_bytes = text.encode("latin-1")
```
- Uses Python's built-in string encoding directly
- Latin-1 encoding maps characters 0-255 to bytes 1:1 (identical to the original behavior)
- No intermediate list creation
- Line profiler shows: **106μs** (4.1% of total time)
- **~270x faster** on the critical line
## Why This Works
The original function's purpose is to interpret a string where each character represents a byte value (ord 0-255), then decode those bytes using a specified encoding. Latin-1 encoding has the unique property that it directly maps Unicode codepoints 0-255 to bytes 0-255, making `text.encode("latin-1")` functionally equivalent to `bytes([ord(char) for char in text])` but implemented in optimized C code.
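The 1:1 mapping can be checked directly (a quick verification, not part of the PR diff):

```python
# Latin-1 maps every codepoint 0-255 to the byte of the same value,
# so encode("latin-1") reproduces bytes([ord(c) for c in s]) exactly.
for i in range(256):
    assert chr(i).encode("latin-1") == bytes([i])

# The two constructions agree on any string of characters in range(256)
s = "".join(chr(i) for i in range(256))
assert s.encode("latin-1") == bytes([ord(c) for c in s])
```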
## Error Handling
Added a try-except block to preserve original behavior:
```python
try:
    text_bytes = text.encode("latin-1")
except UnicodeEncodeError:
    raise ValueError("bytes must be in range(0, 256)") from None
```
The original would raise `ValueError` if any character had `ord > 255`; the optimized version catches `UnicodeEncodeError` from `encode()` and converts it to the same `ValueError`.
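Putting the pieces together, the optimized function plausibly looks like this (a sketch consistent with the PR description, not the literal diff):

```python
def bytes_string_to_string(text: str, encoding: str = "utf-8") -> str:
    """Convert a string whose characters encode raw byte values into a
    decoded string, using C-level encoding instead of a Python loop."""
    try:
        text_bytes = text.encode("latin-1")
    except UnicodeEncodeError:
        # Match the original error for characters with ord > 255
        raise ValueError("bytes must be in range(0, 256)") from None
    return text_bytes.decode(encoding)
```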
## Performance Impact by Test Category
- **Small strings (< 20 chars)**: 30-100% faster (microseconds saved)
- **Large strings (> 1000 chars)**: **5000-13000% faster** (hundreds of microseconds saved)
- Example: 8000-char string goes from 428μs to 5.19μs (8141% faster)
- The performance gap grows linearly with string length due to eliminating the Python-level loop
The optimization is particularly impactful for any workload processing moderate-to-large strings, as the speedup scales directly with input size.
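A rough way to reproduce the scaling claim locally (timings are illustrative and machine-dependent; the 8000-char size mirrors the example above):

```python
import timeit

text = "A" * 8000  # large input, where the Python-level loop dominates

slow = timeit.timeit(lambda: bytes([ord(c) for c in text]), number=200)
fast = timeit.timeit(lambda: text.encode("latin-1"), number=200)

# Both approaches must produce identical bytes
assert bytes([ord(c) for c in text]) == text.encode("latin-1")
print(f"loop: {slow:.4f}s  encode: {fast:.4f}s  speedup: {slow / fast:.0f}x")
```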
misrasaurabh1 approved these changes on Jan 26, 2026
📄 2,121% (21.21x) speedup for `bytes_string_to_string` in `unstructured/cleaners/core.py`

⏱️ Runtime: 5.04 milliseconds → 227 microseconds (best of 65 runs)
✅ Correctness verification report:
- ⚙️ Existing unit tests: `cleaners/test_core.py::test_bytes_string_to_string`
- 🌀 Generated regression tests
- 🔎 Concolic coverage tests: `codeflash_concolic_xdo_puqm/tmpl5i6ubkt/test_concolic_coverage.py::test_bytes_string_to_string`

To edit these changes, run `git checkout codeflash/optimize-bytes_string_to_string-mkrwky7e` and push.