
Conversation

@jiqing-feng
Contributor

The fused kernel speeds up 4-bit model inference by about 4x on TPOT compared to the dequant + matmul path. For the next optimization, targeting TTFT, we need to bring in libxsmm.
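For reference, here is a minimal sketch of how the TPOT / TTFT numbers can be measured on CPU with transformers (the model id, prompt, and token counts are just examples, not part of this PR):

import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Any bnb 4-bit checkpoint works; this one is used elsewhere in this thread.
model_id = "hugging-quants/Meta-Llama-3.1-8B-Instruct-BNB-NF4"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="cpu", torch_dtype=torch.bfloat16
)
inputs = tokenizer("What is bitsandbytes?", return_tensors="pt")

def timed_generate(max_new_tokens):
    start = time.perf_counter()
    with torch.no_grad():
        model.generate(**inputs, max_new_tokens=max_new_tokens,
                       do_sample=False, pad_token_id=tokenizer.eos_token_id)
    return time.perf_counter() - start

t1 = timed_generate(1)     # roughly TTFT (prefill + first token)
t33 = timed_generate(33)   # prefill + 32 extra decode steps
tpot = (t33 - t1) / 32     # roughly time per output token during decode
print(f"TTFT ~ {t1:.2f}s, TPOT ~ {tpot * 1000:.1f} ms/token")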

Signed-off-by: jiqing-feng <[email protected]>
Contributor

@SunMarc left a comment

Left a comment!

@jiqing-feng
Contributor Author

jiqing-feng commented Nov 20, 2025

Hi @matthewdouglas. BNB only loads one native lib at a time (one of cpu/cuda/xpu), which means we can only build one .so file for bnb. But we cannot build CPU and XPU together, because the CPU kernel relies on Intel OpenMP (libiomp5.so) while XPU relies on GNU OpenMP (libgomp.so); building them together raises an error like `libbitsandbytes_xpu.so: undefined symbol: __kmpc_for_static_init_8`. I suppose it's the same for CUDA. Without OpenMP, though, the CPU kernel might be even slower than the Python op, and there might be other incompatible build flags across backends.
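(As an illustration only, not part of this PR: one way to check which OpenMP runtime a built library links against is to inspect its dynamic dependencies; the .so path below is a placeholder.)

import subprocess

# Placeholder path to a built native library.
lib = "bitsandbytes/libbitsandbytes_cpu.so"

# `ldd` lists the shared objects the library links against;
# look for the OpenMP runtime (Intel libiomp5 vs GNU libgomp).
deps = subprocess.run(["ldd", lib], capture_output=True, text=True).stdout
for line in deps.splitlines():
    if "libiomp5" in line or "libgomp" in line or "libomp" in line:
        print(line.strip())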

At the current stage we can only build one backend, so the cpu weight format will not be triggered on other backends. Even so, I added the reverse logic in case we want to support multiple backends in the future.

cc @SunMarc

Signed-off-by: jiqing-feng <[email protected]>
@matthewdouglas
Member

Hi @jiqing-feng

If you rebase it should trigger CI to run the tests on the PR now.

Regarding building the optimized CPU code when you're using an accelerator: you're right, I suppose has_avx512() would return false regardless of your CPU, since we only build that code into the CPU-only lib. In the future I might want to change things so we always ship a separate CPU library build instead of including CPU code in the accelerator libs. For now it's a good point: we shouldn't be rearranging weights at all if you're using an accelerator.
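A rough sketch of that guard (the helper names below are placeholders, not the actual bitsandbytes API; the CPU-capability check assumes a recent PyTorch):

import torch

def has_avx512() -> bool:
    # Sketch only: recent PyTorch exposes the detected CPU ISA as a string.
    try:
        return "AVX512" in torch.backends.cpu.get_cpu_capability()
    except AttributeError:
        return False

def should_rearrange_weights(device: torch.device) -> bool:
    # Only repack into the CPU-optimized layout when we actually run on CPU
    # and the optimized kernel is usable; never when the target device is an
    # accelerator (CUDA/XPU).
    return device.type == "cpu" and has_avx512()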

I'll look in more detail next week, but another question I have is the naming of the gemv op. It seems like this is actually a full GEMM implementation? I realize I never made a custom op for full GEMM, so that's something I can maybe do in a follow-up PR.
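For context on the naming, GEMV is just the M = 1 (single-token decode) case of GEMM, so a kernel that handles arbitrary M covers both:

import torch

W = torch.randn(4096, 4096)  # weight, already dequantized here for illustration
x = torch.randn(1, 4096)     # decode step: one token's hidden state (M = 1)
X = torch.randn(32, 4096)    # prefill: 32 tokens at once (M = 32)

y = x @ W.T   # (1, 4096)  -- the "gemv" case
Y = X @ W.T   # (32, 4096) -- the same kernel with M > 1 is a full GEMM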

@jiqing-feng
Contributor Author

Yes, exactly! We should never rearrange weights if we are using an accelerator; the rearranged layout is for CPU only.
Yes, the CPU gemv kernel does contain a full GEMM implementation, because it's a fused kernel: we do dequant and GEMM in one kernel, and that's where the speed-up comes from.
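Conceptually the fused kernel computes the same thing as this unfused reference, but without ever materializing the dequantized weight (a sketch with made-up shapes; NF4_CODEBOOK stands in for the real 16-value NF4 lookup table):

import torch

blocksize = 64
out_features, in_features = 4096, 4096

# Stand-ins for the stored 4-bit state: one code index per weight element,
# one absmax scale per block of `blocksize` elements.
codes = torch.randint(0, 16, (out_features, in_features))
absmax = torch.rand(out_features, in_features // blocksize)
NF4_CODEBOOK = torch.linspace(-1.0, 1.0, 16)  # placeholder values

x = torch.randn(1, in_features, dtype=torch.bfloat16)

# Unfused path: dequantize the whole weight, then matmul (two passes over memory).
w = NF4_CODEBOOK[codes] * absmax.repeat_interleave(blocksize, dim=1)
y = x @ w.to(x.dtype).T

# The fused kernel performs the codebook lookup and scaling inside the GEMM loop,
# so the full bf16 weight matrix is never written out.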

OK, I'm looking forward to your next round of review. It would also help to know roughly when we can merge this PR, so I can plan the next feature. Thanks!

@jiqing-feng
Contributor Author

Hi @matthewdouglas . Please trigger the CI. Thanks!

Signed-off-by: jiqing-feng <[email protected]>
@matthewdouglas added this to the v0.49.0 milestone Nov 25, 2025
@jiqing-feng
Contributor Author

Hi @matthewdouglas. I've fixed and verified save and reload. You can check it with the following script.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import shutil
import os

model_id = "hugging-quants/Meta-Llama-3.1-8B-Instruct-BNB-NF4"
save_path = "./local_test_model"
input_text = "What is bitsandbytes?"

if os.path.exists(save_path):
    shutil.rmtree(save_path)

def run_generation(model, tokenizer, prompt, tag):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    
    print(f"\n[{tag}] Start Generating...")
    with torch.no_grad():
        outputs = model.generate(
            **inputs, 
            max_new_tokens=10, 
            do_sample=False,
            pad_token_id=tokenizer.eos_token_id
        )
    
    output_str = tokenizer.decode(outputs[0], skip_special_tokens=True)
    print(f"[{tag}] Output: {output_str}")
print(">>> Loading original model on CPU...")
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="cpu",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True
)
run_generation(model, tokenizer, input_text, "Run-1")
print(f"\n>>> Saving model to {save_path}...")
model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)
del model
del tokenizer
import gc
gc.collect()
print(f"\n>>> Reloading model from {save_path}...")
loaded_model = AutoModelForCausalLM.from_pretrained(
    save_path,
    device_map="cpu",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True
)
loaded_tokenizer = AutoTokenizer.from_pretrained(save_path)

run_generation(loaded_model, loaded_tokenizer, input_text, "Run-2")

Output:

>>> Loading original model on CPU...
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 35.76it/s]

[Run-1] Start Generating...
The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
[Run-1] Output: What is bitsandbytes? Bitsandbytes is a free online tool that allows

>>> Saving model to ./local_test_model...

>>> Reloading model from ./local_test_model...
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 35.76it/s]

[Run-2] Start Generating...
The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
[Run-2] Output: What is bitsandbytes? Bitsandbytes is a free online tool that allows

Signed-off-by: jiqing-feng <[email protected]>
@matthewdouglas
Member

Looks good, thanks!

@matthewdouglas merged commit 6aa9619 into bitsandbytes-foundation:main Nov 26, 2025
129 of 132 checks passed