⚡️ Speed up function string_to_uniform_number by 43% #43

Open
codeflash-ai[bot] wants to merge 1 commit into main from codeflash/optimize-string_to_uniform_number-mgzg3qpo

@codeflash-ai codeflash-ai bot commented Oct 20, 2025

📄 43% (0.43x) speedup for string_to_uniform_number in pr_agent/algo/utils.py

⏱️ Runtime : 4.96 milliseconds → 3.47 milliseconds (best of 208 runs)

📝 Explanation and details

The optimization achieves a 43% speedup through three key changes:

  1. Eliminated expensive hexdigest() conversion: The original code converted hash bytes to hexadecimal string (hexdigest()) then parsed it back to integer with int(hex_string, 16). The optimized version directly converts hash bytes to integer using int.from_bytes(hash_bytes, 'big'), avoiding the costly string conversion entirely.

  2. Precomputed division constant: Instead of computing 2**256 - 1 on every function call, the optimized version precomputes _MAX_HASH_INT_INV = 1.0 / ((1 << 256) - 1) as a module-level constant and uses multiplication (hash_int * _MAX_HASH_INT_INV) instead of division, which is faster.

  3. More efficient bit shift: Uses (1 << 256) - 1 instead of 2 ** 256 - 1, though this is a minor optimization since it's precomputed.
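The diff itself is not shown in this comment, so the following is only a minimal sketch of the three changes, reconstructed from the description above. The function names `string_to_uniform_number_original` and `string_to_uniform_number_optimized` are hypothetical, and the body of the original version is an assumption; `_MAX_HASH_INT_INV` is the constant named above.

```python
import hashlib

# Precomputed once at import time: reciprocal of the largest 256-bit value.
_MAX_HASH_INT_INV = 1.0 / ((1 << 256) - 1)


def string_to_uniform_number_original(val: str) -> float:
    """Sketch of the pre-optimization logic (assumed from the description)."""
    hash_object = hashlib.sha256(val.encode())
    # bytes -> hex string -> int
    hash_int = int(hash_object.hexdigest(), 16)
    max_hash_int = 2 ** 256 - 1  # recomputed on every call
    return float(hash_int) / max_hash_int


def string_to_uniform_number_optimized(val: str) -> float:
    """Sketch of the optimized logic (assumed from the description)."""
    hash_bytes = hashlib.sha256(val.encode()).digest()
    # bytes -> int directly; big-endian matches the order of the hex digits
    hash_int = int.from_bytes(hash_bytes, "big")
    # Multiply by the precomputed reciprocal instead of dividing
    return hash_int * _MAX_HASH_INT_INV


if __name__ == "__main__":
    for s in ["", "hello", "你好世界"]:
        print(repr(s), string_to_uniform_number_optimized(s))
```

In this sketch both versions scale the hash by the same power of two, so they return the same float for the same input.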

The line profiler shows the most significant improvement in the hash conversion step: the original's int(hash_object.hexdigest(), 16) took 28.6% of the original's execution time, while the optimized int.from_bytes(hash_bytes, 'big') takes 22.6% of the optimized version's shorter total. The precomputation entirely eliminates the 18.3% of time spent on max_hash_int = 2 ** 256 - 1.
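To compare the two conversion paths in isolation, a rough micro-benchmark along these lines could be used. The helper names `via_hexdigest`, `via_from_bytes`, and `measure` are hypothetical, and this is not the line profiler run reported above; absolute timings will vary by machine and Python version.

```python
import hashlib
import timeit

# Hash once up front so that only the digest-to-int conversion is timed.
digest_source = hashlib.sha256(b"hello")


def via_hexdigest() -> int:
    """bytes -> hex string -> int (the original path)."""
    return int(digest_source.hexdigest(), 16)


def via_from_bytes() -> int:
    """bytes -> int directly (the optimized path)."""
    return int.from_bytes(digest_source.digest(), "big")


def measure(func, number: int = 100_000) -> float:
    """Return the best-of-5 total time in seconds for `number` calls."""
    return min(timeit.repeat(func, number=number, repeat=5))


if __name__ == "__main__":
    print(f"hexdigest path:  {measure(via_hexdigest):.4f} s")
    print(f"from_bytes path: {measure(via_from_bytes):.4f} s")
```

Taking the minimum over several repeats reduces the influence of scheduling noise on the comparison.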

In the generated tests below, valid string inputs run roughly 11-49% faster. The largest gains (about 40-49%) appear mostly on the second and later calls within a test and in the 1000-iteration loops, while the first call in a test typically gains 11-33%. The error-path tests, which raise AttributeError at .encode(), are unchanged to within a few percent.

Correctness verification report:

| Test | Status |
|---|---|
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | ✅ 6058 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |
🌀 Generated Regression Tests and Runtime
from __future__ import annotations

import hashlib

# imports
import pytest  # used for our unit tests
from pr_agent.algo.utils import string_to_uniform_number

# unit tests

# ===========================
# Basic Test Cases
# ===========================

def test_basic_ascii_string():
    # Test with a simple ASCII string
    codeflash_output = string_to_uniform_number("hello"); result = codeflash_output # 2.71μs -> 2.28μs (18.6% faster)

def test_basic_different_strings():
    # Different strings should produce different numbers (high probability)
    codeflash_output = string_to_uniform_number("apple"); r1 = codeflash_output # 2.84μs -> 2.26μs (25.6% faster)
    codeflash_output = string_to_uniform_number("banana"); r2 = codeflash_output # 1.11μs -> 779ns (42.6% faster)

def test_basic_empty_string():
    # Empty string should be valid and deterministic
    codeflash_output = string_to_uniform_number(""); result = codeflash_output # 2.85μs -> 2.54μs (12.4% faster)

def test_basic_numeric_string():
    # Numeric string input
    codeflash_output = string_to_uniform_number("123456"); result = codeflash_output # 3.02μs -> 2.40μs (25.4% faster)

def test_basic_unicode_string():
    # Unicode string input
    codeflash_output = string_to_uniform_number("你好世界"); result = codeflash_output # 3.29μs -> 2.64μs (24.6% faster)

def test_basic_special_characters():
    # String with special characters
    codeflash_output = string_to_uniform_number("!@#$%^&*()_+-=[]{}|;':,.<>/?"); result = codeflash_output # 2.98μs -> 2.39μs (24.5% faster)

def test_basic_case_sensitivity():
    # Case sensitivity: "hello" vs "HELLO"
    codeflash_output = string_to_uniform_number("hello"); r1 = codeflash_output # 3.02μs -> 2.26μs (33.3% faster)
    codeflash_output = string_to_uniform_number("HELLO"); r2 = codeflash_output # 1.17μs -> 883ns (31.9% faster)

# ===========================
# Edge Test Cases
# ===========================

def test_edge_long_string():
    # Very long string (1000 chars)
    long_str = "a" * 1000
    codeflash_output = string_to_uniform_number(long_str); result = codeflash_output # 3.54μs -> 3.08μs (15.1% faster)

def test_edge_single_character():
    # Single character string
    codeflash_output = string_to_uniform_number("x"); result = codeflash_output # 2.91μs -> 2.25μs (29.4% faster)

def test_edge_whitespace():
    # String with only whitespace
    codeflash_output = string_to_uniform_number("   "); result = codeflash_output # 3.07μs -> 2.39μs (28.3% faster)
    # Different whitespace counts produce different numbers
    codeflash_output = string_to_uniform_number("    "); r2 = codeflash_output # 1.22μs -> 822ns (47.9% faster)

def test_edge_null_byte():
    # String with null byte
    codeflash_output = string_to_uniform_number("\x00"); result = codeflash_output # 2.82μs -> 2.25μs (25.5% faster)

def test_edge_max_unicode():
    # String with max unicode character
    codeflash_output = string_to_uniform_number(chr(0x10FFFF)); result = codeflash_output # 2.93μs -> 2.35μs (24.7% faster)

def test_edge_repeated_pattern():
    # Repeated pattern string
    s1 = "abc" * 100
    s2 = "abc" * 101
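    codeflash_output = string_to_uniform_number(s1) # 3.14μs -> 2.44μs (28.5% faster)
    assert codeflash_output != string_to_uniform_number(s2)  # one extra repetition should change the result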

def test_edge_non_string_input():
    # Should raise AttributeError if input is not a string (e.g., int)
    with pytest.raises(AttributeError):
        string_to_uniform_number(123) # 1.31μs -> 1.29μs (0.927% faster)

def test_edge_bytes_input():
    # Should raise AttributeError if input is bytes (not str)
    with pytest.raises(AttributeError):
        string_to_uniform_number(b"hello") # 1.20μs -> 1.17μs (1.79% faster)

def test_edge_empty_vs_space():
    # Empty string vs single space should be different
    codeflash_output = string_to_uniform_number("") # 3.82μs -> 3.15μs (21.2% faster)

# ===========================
# Large Scale Test Cases
# ===========================

def test_large_scale_determinism():
    # Test that the function is deterministic for many distinct strings
    base_str = "test_string_"
    results = []
    for i in range(1000):
        s = base_str + str(i)
        codeflash_output = string_to_uniform_number(s); val1 = codeflash_output # 810μs -> 564μs (43.7% faster)
        codeflash_output = string_to_uniform_number(s); val2 = codeflash_output # 805μs -> 559μs (44.0% faster)
        results.append(val1)


def test_large_scale_performance():
    # Test that function completes within reasonable time for 1000 elements
    import time
    base_str = "perf_test_"
    start = time.time()
    for i in range(1000):
        string_to_uniform_number(base_str + str(i)) # 804μs -> 561μs (43.5% faster)
    end = time.time()


def test_unicode_combining_characters():
    # Test with combining unicode characters
    s1 = "e\u0301"  # é as 'e' + combining acute
    s2 = "\u00e9"   # é as single codepoint
    # Should produce different results
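    codeflash_output = string_to_uniform_number(s1) # 7.63μs -> 6.85μs (11.5% faster)
    assert codeflash_output != string_to_uniform_number(s2)  # different code point sequences hash differently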

def test_string_with_emoji():
    # String with emoji
    codeflash_output = string_to_uniform_number("hello 😊"); result = codeflash_output # 3.86μs -> 3.24μs (19.3% faster)

def test_string_with_newline_and_tabs():
    # String with newline and tab characters
    codeflash_output = string_to_uniform_number("line1\nline2\tend"); result = codeflash_output # 3.38μs -> 2.67μs (26.4% faster)

def test_string_with_surrogate_pairs():
    # String with a non-BMP character (a surrogate pair only in UTF-16)
    s = "\U0001F600"  # 😀
    codeflash_output = string_to_uniform_number(s); result = codeflash_output # 3.30μs -> 2.75μs (20.0% faster)

def test_extremely_similar_strings():
    # Two strings differing by one character
    s1 = "abcdefg"
    s2 = "abcdefh"
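    codeflash_output = string_to_uniform_number(s1) # 3.26μs -> 2.61μs (24.6% faster)
    assert codeflash_output != string_to_uniform_number(s2)  # a one-character change should change the result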

def test_extremely_large_string():
    # String at upper limit of reasonable size (1000 chars)
    s = "x" * 1000
    codeflash_output = string_to_uniform_number(s); result = codeflash_output # 4.04μs -> 3.36μs (20.4% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
#------------------------------------------------
from __future__ import annotations

import hashlib

# imports
import pytest  # used for our unit tests
from pr_agent.algo.utils import string_to_uniform_number

# unit tests

# 1. Basic Test Cases

def test_basic_different_strings_produce_different_numbers():
    # Different strings should yield different numbers
    codeflash_output = string_to_uniform_number("apple") # 3.73μs -> 2.96μs (26.0% faster)
    codeflash_output = string_to_uniform_number("123") # 1.19μs -> 874ns (36.6% faster)
    codeflash_output = string_to_uniform_number("hello") # 865ns -> 603ns (43.4% faster)

def test_basic_same_string_same_number():
    # The same string should always yield the same number
    s = "consistent"
    codeflash_output = string_to_uniform_number(s); n1 = codeflash_output # 2.73μs -> 2.15μs (27.1% faster)
    codeflash_output = string_to_uniform_number(s); n2 = codeflash_output # 1.00μs -> 717ns (39.7% faster)

def test_basic_output_in_range():
    # The output should always be in [0, 1], inclusive
    for s in ["test", "another", "123", ""]:
        codeflash_output = string_to_uniform_number(s); n = codeflash_output # 5.97μs -> 4.57μs (30.7% faster)

def test_basic_known_values():
    # Known input/output pairs (SHA-256 is deterministic)
    # These values are hardcoded for regression testing
    codeflash_output = string_to_uniform_number("") # 2.71μs -> 2.24μs (20.7% faster)
    codeflash_output = string_to_uniform_number("a") # 1.21μs -> 818ns (47.3% faster)
    codeflash_output = string_to_uniform_number("abc") # 1.03μs -> 739ns (39.6% faster)

# 2. Edge Test Cases

def test_edge_empty_string():
    # Empty string should still return a valid float in [0,1]
    codeflash_output = string_to_uniform_number(""); n = codeflash_output # 2.64μs -> 2.16μs (22.1% faster)

def test_edge_long_string():
    # Very long string should not cause errors and should be in [0,1]
    long_str = "a" * 1000
    codeflash_output = string_to_uniform_number(long_str); n = codeflash_output # 3.50μs -> 2.94μs (19.1% faster)

def test_edge_unicode_string():
    # Unicode characters should be handled correctly
    s = "测试🌟"
    codeflash_output = string_to_uniform_number(s); n = codeflash_output # 2.87μs -> 2.36μs (21.6% faster)

def test_edge_case_sensitivity():
    # Case changes should affect the output
    codeflash_output = string_to_uniform_number("Case") # 2.79μs -> 2.19μs (27.6% faster)

def test_edge_special_characters():
    # Special characters should be handled and affect output
    codeflash_output = string_to_uniform_number("!@#$%^&*()") # 2.71μs -> 2.30μs (18.1% faster)
    codeflash_output = string_to_uniform_number("\n\t\r") # 1.25μs -> 882ns (41.5% faster)

def test_edge_maximum_and_minimum_possible_output():
    # It's extremely unlikely to hit exactly 0.0 or 1.0, but output should never exceed [0,1]
    for s in ["", "a", "Z"*100, " " * 50, "0"*100]:
        codeflash_output = string_to_uniform_number(s); n = codeflash_output # 6.86μs -> 5.19μs (32.2% faster)
    # Check that the output is never exactly 1.0 or 0.0 for these cases
    for s in ["", "abc", "def"]:
        codeflash_output = string_to_uniform_number(s); n = codeflash_output # 2.50μs -> 1.78μs (40.9% faster)

def test_edge_non_ascii_bytes():
    # Non-ASCII but valid unicode should not raise
    s = "ñöñ-ÅŠÇÍÍ"
    codeflash_output = string_to_uniform_number(s); n = codeflash_output # 2.79μs -> 2.27μs (22.9% faster)

def test_edge_hash_collision_impossible():
    # For practical purposes, two different strings should never yield the same number
    s1 = "collision_test_1"
    s2 = "collision_test_2"
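    codeflash_output = string_to_uniform_number(s1) # 2.54μs -> 2.03μs (24.9% faster)
    assert codeflash_output != string_to_uniform_number(s2)  # distinct inputs should not collide in practice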

# 3. Large Scale Test Cases

def test_large_scale_many_unique_inputs():
    # Generate 1000 unique strings and ensure all outputs are unique
    numbers = set()
    for i in range(1000):
        s = f"item_{i}"
        codeflash_output = string_to_uniform_number(s); n = codeflash_output # 803μs -> 558μs (43.7% faster)
        numbers.add(n)

def test_large_scale_uniformity_distribution():
    # Test that outputs are roughly uniformly distributed
    # (not a statistical test, but checks for clustering)
    buckets = [0] * 10
    for i in range(1000):
        s = f"uniform_{i}"
        codeflash_output = string_to_uniform_number(s); n = codeflash_output # 804μs -> 563μs (42.8% faster)
        bucket = int(n * 10)
        if bucket == 10:
            bucket = 9  # edge case for n==1.0
        buckets[bucket] += 1

def test_large_scale_performance():
    # Ensure function does not take excessive time for many calls
    import time
    start = time.time()
    for i in range(1000):
        s = f"perf_{i}"
        codeflash_output = string_to_uniform_number(s); _ = codeflash_output # 800μs -> 556μs (43.9% faster)
    elapsed = time.time() - start

# Additional edge: test for non-string input raises (type safety)
def test_non_string_input_raises():
    # Should raise AttributeError because .encode() is not available
    with pytest.raises(AttributeError):
        string_to_uniform_number(123) # 1.50μs -> 1.43μs (5.19% faster)
    with pytest.raises(AttributeError):
        string_to_uniform_number(None) # 839ns -> 841ns (0.238% slower)
    with pytest.raises(AttributeError):
        string_to_uniform_number(["list"]) # 737ns -> 760ns (3.03% slower)

# Additional edge: test for very similar strings
def test_very_similar_strings():
    # Changing just one character should change the result
    codeflash_output = string_to_uniform_number("abcdefg"); n1 = codeflash_output # 5.35μs -> 4.83μs (10.7% faster)
    codeflash_output = string_to_uniform_number("abcdefh"); n2 = codeflash_output # 1.12μs -> 750ns (48.8% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
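The tests as displayed above compute values but omit their assertions. As a rough sketch of what the range and bucket checks described in the comments could look like, the following uses `uniform_number`, a hypothetical stand-in for the function under test, and a hypothetical helper `bucket_counts`; neither name comes from the PR.

```python
import hashlib


def uniform_number(val: str) -> float:
    """Hypothetical stand-in for string_to_uniform_number."""
    hash_int = int.from_bytes(hashlib.sha256(val.encode()).digest(), "big")
    return hash_int / ((1 << 256) - 1)


def bucket_counts(n_inputs: int = 1000, n_buckets: int = 10) -> list[int]:
    """Count how many outputs fall into each of n_buckets equal-width bins."""
    buckets = [0] * n_buckets
    for i in range(n_inputs):
        n = uniform_number(f"uniform_{i}")
        assert 0.0 <= n <= 1.0  # output must stay in [0, 1]
        # Clamp so that n == 1.0 lands in the last bucket
        buckets[min(int(n * n_buckets), n_buckets - 1)] += 1
    return buckets


if __name__ == "__main__":
    print(bucket_counts())
```

This is a coarse clustering check rather than a statistical test of uniformity, in keeping with the comment in test_large_scale_uniformity_distribution.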

To edit these changes, git checkout codeflash/optimize-string_to_uniform_number-mgzg3qpo and push.

@codeflash-ai codeflash-ai bot requested a review from mashraf-222 October 20, 2025 18:05
@codeflash-ai codeflash-ai bot added the ⚡️ codeflash Optimization PR opened by Codeflash AI label Oct 20, 2025