⚡️ Speed up function numpy_matmul by 5,996% #239
Closed
codeflash-ai[bot] wants to merge 1 commit into `main` from `codeflash/optimize-numpy_matmul-mjsd18s0`
Conversation
📄 **5,996% (59.96x) speedup** for `numpy_matmul` in `src/numerical/linear_algebra.py`

⏱️ Runtime: 38.9 seconds → 638 milliseconds (best of 12 runs)

📝 Explanation and details
The optimized code achieves a **~60x speedup** by replacing the innermost loop with NumPy's `np.dot()` function. This is a critical optimization because:

**What Changed:**

- **Original**: Triple nested loop with element-wise operations: `result[i, j] += A[i, k] * B[k, j]`
- **Optimized**: Two nested loops with vectorized dot product: `result[i, j] = np.dot(A[i, :], B[:, j])` (both versions are sketched after this list)
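The diff itself isn't expanded in this excerpt; based on the description, the before/after look roughly like this. Function names and signatures are assumptions, since the actual source in `src/numerical/linear_algebra.py` isn't shown:

```python
import numpy as np

def numpy_matmul_original(A, B):
    """Naive version: triple nested loop with element-wise accumulation."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2, "inner dimensions must match"
    result = np.zeros((n, m))
    for i in range(n):
        for j in range(m):
            for p in range(k):
                result[i, j] += A[i, p] * B[p, j]
    return result

def numpy_matmul_optimized(A, B):
    """Optimized version: the innermost loop becomes one BLAS-backed np.dot call."""
    n, m = A.shape[0], B.shape[1]
    result = np.zeros((n, m))
    for i in range(n):
        for j in range(m):
            result[i, j] = np.dot(A[i, :], B[:, j])
    return result
```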
**Why It's Faster:**

1. **Eliminates ~50 million Python loop iterations**: The innermost loop (accounting for 30% of runtime in profiling) is removed entirely, replaced by a single vectorized operation
2. **Leverages optimized BLAS libraries**: `np.dot()` calls highly optimized low-level linear algebra routines (BLAS) written in C/Fortran, which use SIMD instructions and cache-efficient algorithms
3. **Reduces array indexing overhead**: Instead of ~50 million individual array accesses (`A[i, k] * B[k, j]`), the code performs ~700K dot products on array slices, dramatically reducing Python interpreter overhead
4. **Better memory access patterns**: Vectorized operations have better cache locality compared to scattered element-wise access

**Performance Characteristics from Tests:**

- **Small matrices (2x2, 3x3)**: Mixed results (some 10-30% slower) due to function call overhead dominating for tiny workloads
- **Medium matrices (100x100)**: **4411% faster** - sweet spot where vectorization overhead is amortized (see the timing sketch after this list)
- **Large matrices (500x200 * 200x300)**: **8683% faster** - massive gains as BLAS optimizations fully activate
- **Sparse matrices**: **12497% faster** - vectorized operations handle zeros efficiently without branching
- **Vector operations (1x500 * 500x1)**: **5904% faster** - dot products are optimal for this pattern
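To sanity-check the medium-matrix claim, here is a minimal timing sketch, assuming the two functions from the previous block; exact numbers will vary by machine and BLAS backend:

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((100, 100))  # mirrors the "medium matrices" case above
B = rng.random((100, 100))

t_naive = timeit.timeit(lambda: numpy_matmul_original(A, B), number=3)
t_opt = timeit.timeit(lambda: numpy_matmul_optimized(A, B), number=3)
print(f"naive: {t_naive:.2f}s  optimized: {t_opt:.3f}s  ~{t_naive / t_opt:.0f}x faster")
```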
**Trade-offs:**

- Slightly slower for very small matrices (1x1, small 2x2) where function call overhead exceeds loop savings
- Minor slowdown for outer product patterns (column × row vectors) where the original loop structure was more natural (a hypothetical mitigation is sketched after this list)
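The small-matrix overhead could in principle be mitigated with a size-based dispatch. This is a hypothetical sketch, not part of this PR, and the threshold is illustrative rather than measured:

```python
def numpy_matmul_adaptive(A, B):
    """Hypothetical: plain loops for tiny inputs, vectorized np.dot otherwise."""
    n, k = A.shape
    m = B.shape[1]
    if n * m * k <= 8:  # threshold is illustrative, not taken from the PR's benchmarks
        result = np.zeros((n, m))
        for i in range(n):
            for j in range(m):
                for p in range(k):
                    result[i, j] += A[i, p] * B[p, j]
        return result
    return numpy_matmul_optimized(A, B)
```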
The optimization is highly effective for real-world matrix operations (typically involving matrices >10x10), making it suitable for numerical computing, machine learning, and scientific applications where matrix multiplication is in performance-critical paths.
✅ Correctness verification report:
🌀 Generated Regression Tests
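The generated tests are collapsed in this excerpt; a representative correctness check would compare the function against NumPy's built-in operator. The import path, test name, and tolerance here are assumptions:

```python
import numpy as np
from src.numerical.linear_algebra import numpy_matmul  # assumed package layout

def test_numpy_matmul_matches_builtin():
    rng = np.random.default_rng(42)
    # Shapes mirror the cases called out in the performance summary above
    for n, k, m in [(1, 1, 1), (2, 3, 4), (100, 100, 100), (500, 200, 300)]:
        A = rng.random((n, k))
        B = rng.random((k, m))
        np.testing.assert_allclose(numpy_matmul(A, B), A @ B, rtol=1e-10)
```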
To edit these changes, run `git checkout codeflash/optimize-numpy_matmul-mjsd18s0` and push.