Releases: bitsandbytes-foundation/bitsandbytes
0.46.1
What's Changed
- Fix params4bit passing bnb quantized by @mklabunde in #1665
- Improvement for torch.compile support on Params4bit by @matthewdouglas in #1673
- doc fix signature for 8-bit optim by @ved1beta in #1660
- Fix AdamW documentation by @agupta2304 in #1686
- Make minor improvements to optimizer.py by @agupta2304 in #1687
- Add CUDA 12.9 build by @matthewdouglas in #1689
- Automatically call CMake as part of PEP 517 build by @mgorny in #1512
New Contributors
- @mklabunde made their first contribution in #1665
- @agupta2304 made their first contribution in #1686
- @mgorny made their first contribution in #1512
Full Changelog: 0.46.0...0.46.1
Latest `main` pre-release wheel
This pre-release contains the latest development wheels for all supported platforms, rebuilt automatically on every commit to the `main` branch.
How to install:
Pick the correct command for your platform and run it in your terminal:
Linux (ARM/aarch64)
pip install --force-reinstall https://github.com/bitsandbytes-foundation/bitsandbytes/releases/download/continuous-release_main/bitsandbytes-1.33.7.preview-py3-none-manylinux_2_24_aarch64.whl
Linux (x86_64)
pip install --force-reinstall https://github.com/bitsandbytes-foundation/bitsandbytes/releases/download/continuous-release_main/bitsandbytes-1.33.7.preview-py3-none-manylinux_2_24_x86_64.whl
Windows (x86_64)
pip install --force-reinstall https://github.com/bitsandbytes-foundation/bitsandbytes/releases/download/continuous-release_main/bitsandbytes-1.33.7.preview-py3-none-win_amd64.whl
Note: These wheels are updated automatically with every commit to `main` and become available as soon as the python-package.yml workflow finishes.
The version number is replaced with `1.33.7-preview` in order to keep the link stable; however, this does not affect the installed version at all:
> pip install https://.../bitsandbytes-1.33.7-preview-py3-none-manylinux_2_24_x86_64.whl
Collecting bitsandbytes==1.33.7rc0
...
Successfully installed bitsandbytes-0.46.0.dev0
0.46.0: torch.compile() support; custom ops refactor; Linux aarch64 wheels
Highlights
- Support for `torch.compile` without graph breaks for LLM.int8() (see the sketch after this list).
  - Compatible with PyTorch 2.4+, but PyTorch 2.6+ is recommended.
  - Experimental CPU support is included.
- Support for `torch.compile` without graph breaks for 4bit.
  - Compatible with PyTorch 2.4+ for `fullgraph=False`.
  - Requires PyTorch 2.8 nightly for `fullgraph=True`.
- We are now publishing wheels for CUDA Linux aarch64 (sbsa)!
  - Targets are Turing generation and newer: sm75, sm80, sm90, and sm100.
- PyTorch Custom Operators refactoring and integration:
  - We have refactored most of the library code to integrate better with PyTorch via the `torch.library` and custom ops APIs. This helps enable our `torch.compile` and additional hardware compatibility efforts.
  - End-users do not need to change the way they are using `bitsandbytes`.
- Unit tests have been cleaned up for increased determinism and most are now device-agnostic.
- A new nightly CI runs unit tests for CPU (Windows x86-64, Linux x86-64/aarch64) and CUDA (Linux/Windows x86-64).
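The following is a minimal, hedged sketch (not taken from the release itself) of compiling an LLM.int8() linear layer; the layer size, dtype, and threshold value are illustrative.

```python
import torch
import bitsandbytes as bnb

# LLM.int8() linear layer; weights are quantized when the module is moved to the GPU.
layer = bnb.nn.Linear8bitLt(
    4096, 4096, bias=False,
    has_fp16_weights=False,  # keep int8 weights (inference-style)
    threshold=6.0,           # standard LLM.int8() outlier threshold
).cuda()

# PyTorch 2.4+ works; 2.6+ is recommended (see the notes above).
compiled = torch.compile(layer)

x = torch.randn(8, 4096, dtype=torch.float16, device="cuda")
with torch.no_grad():
    y = compiled(x)
```

The same pattern applies to 4-bit layers such as `bnb.nn.Linear4bit`; `fullgraph=True` additionally requires a PyTorch 2.8 nightly as noted above.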
Compatibility Changes
- Support for Python 3.8 is dropped.
- Support for PyTorch < 2.2.0 is dropped.
- CUDA 12.6 and 12.8 builds are now compatible with `manylinux_2_24` (previously `manylinux_2_34`).
- Many APIs that were previously marked as deprecated have now been removed.
- New deprecations:
- bnb.autograd.get_inverse_transform_indices()
- bnb.autograd.undo_layout()
- bnb.functional.create_quantile_map()
- bnb.functional.estimate_quantiles()
- bnb.functional.get_colrow_absmax()
- bnb.functional.get_row_absmax()
- bnb.functional.histogram_scatter_add_2d()
What's Changed
- PyTorch Custom Operator Integration by @matthewdouglas in #1544
- Bump CUDA 12.8.0 build to CUDA 12.8.1 by @matthewdouglas in #1575
- Drop Python 3.8 support. by @matthewdouglas in #1574
- Test cleanup by @matthewdouglas in #1576
- Fix: Return tuple in get_cuda_version_tuple by @DevKimbob in #1580
- Fix torch.compile issue for LLM.int8() with threshold=0 by @matthewdouglas in #1581
- fix for missing cpu lib by @Titus-von-Koeller in #1585
- Fix #1588 - torch compatability for <=2.4 by @matthewdouglas in #1590
- Add autoloading for backend packages by @matthewdouglas in #1593
- Specify blocksize by @cyr0930 in #1586
- fix typo getitem by @ved1beta in #1597
- fix: Improve CUDA version detection and error handling by @ved1beta in #1599
- Support LLM.int8() inference with torch.compile by @matthewdouglas in #1594
- Updates for device agnosticism by @matthewdouglas in #1601
- Stop building for CUDA toolkit < 11.8 by @matthewdouglas in #1605
- fix intel cpu/xpu installation by @jiqing-feng in #1613
- Support 4bit torch.compile fullgraph with PyTorch nightly by @matthewdouglas in #1616
- Improve torch.compile support for int8 with torch>=2.8 nightly by @matthewdouglas in #1617
- Add simple op implementations for CPU by @matthewdouglas in #1602
- Set up nightly CI for unit tests by @matthewdouglas in #1619
- point to correct latest continuous release main by @winglian in #1621
- ARM runners (faster than cross compilation qemu) by @johnnynunez in #1539
- Linux aarch64 CI updates by @matthewdouglas in #1622
- Moved int8_mm_dequant from CPU to default backend by @Egor-Krivov in #1626
- Refresh content for README.md by @matthewdouglas in #1620
- C lib loading: add fallback with sensible error msg by @Titus-von-Koeller in #1615
- Switch CUDA builds to use Rocky Linux 8 container by @matthewdouglas in #1638
- Improvements to test suite by @matthewdouglas in #1636
- Additional CI runners by @matthewdouglas in #1639
- CI runner updates by @matthewdouglas in #1643
- Optimizer backwards compatibility fix by @matthewdouglas in #1647
- General cleanup & test improvements by @matthewdouglas in #1646
- Add torch.compile tests by @matthewdouglas in #1648
- Documentation Cleanup by @matthewdouglas in #1644
- simplified non_sign_bits by @ved1beta in #1649
New Contributors
- @DevKimbob made their first contribution in #1580
- @cyr0930 made their first contribution in #1586
- @ved1beta made their first contribution in #1597
- @winglian made their first contribution in #1621
- @Egor-Krivov made their first contribution in #1626
Full Changelog: 0.45.4...0.46.0
Multi-Backend Preview
Tag: `continuous-release_multi-backend-refactor`. Latest change: update compute_type_is_set attr (#1623)
0.45.5
This is a minor release that affects CPU-only usage of bitsandbytes. The CPU build of the library was inadvertently omitted from the v0.45.4 wheels.
Full Changelog: 0.45.4...0.45.5
0.45.4
This is a minor release that affects CPU-only usage of bitsandbytes. There is one bugfix and improved system compatibility on Linux.
What's Changed
- Build: use ubuntu-22.04 instead of 24.04 for CPU build (glibc compat) by @matthewdouglas in #1538
- Fix CPU dequantization to use nested dequantized scaling constant by @zyklotomic in #1549
New Contributors
- @zyklotomic made their first contribution in #1549
Full Changelog: 0.45.3...0.45.4
0.45.3
Overview
This is a small patch release containing a few bug fixes.
Additionally, this release contains a CUDA 12.8 build which adds the sm100 and sm120 targets for NVIDIA Blackwell GPUs.
What's Changed
- Fix #1490 by @matthewdouglas in #1496
- Blackwell binaries! by @johnnynunez in #1491
- Bug fix: Update create_dynamic_map to always return a float32 tensor by @mitchellgoffpc in #1521
- Update cuda versions in error messages by @FxMorin in #1520
- QuantState.to(): move code tensor with others to correct device by @matthewdouglas in #1528
- Installation doc updates by @matthewdouglas in #1529
New Contributors
- @mitchellgoffpc made their first contribution in #1521
- @FxMorin made their first contribution in #1520
Full Changelog: 0.45.2...0.45.3
0.45.2
This patch release fixes a compatibility issue with Triton 3.2 in PyTorch 2.6. When importing `bitsandbytes` without any GPUs visible in an environment with Triton installed, a RuntimeError may be raised:
RuntimeError: 0 active drivers ([]). There should only be one.
Full Changelog: 0.45.1...0.45.2
0.45.1
Overview
This is a patch release containing compatibility fixes.
Highlights
- Compatibility for `triton>=3.2.0`
- Moved package configuration to `pyproject.toml`
- Build system: initial support for NVIDIA Blackwell B100 GPUs, RTX 50 Blackwell series GPUs and Jetson Thor Blackwell.
- Note: Binaries built for these platforms are not included in this release. They will be included in future releases upon the availability of the upcoming CUDA Toolkit 12.7 and 12.8.
- Packaging: wheels will no longer include unit tests. (#1478)
- Sets the minimum PyTorch version to 2.0.0.
What's Changed
- Add installation doc for bnb on Ascend NPU by @ji-huazhong in #1442
- (chore) Remove unused dotfiles by @matthewdouglas in #1445
- Remove triton.ops, copy necessary bits here by @bertmaher in #1413
- chore: migrate config files to `pyproject.toml` by @SauravMaheshkar in #1373
- cleanup: remove unused kernels/C++ code by @matthewdouglas in #1458
- (Deps) Require torch 2.x and minor updates by @matthewdouglas in #1459
- FSDP-QLoRA doc updates for TRL integration by @blbadger in #1471
- Initial support blackwell by @johnnynunez in #1481
- (build) include Ada/Hopper targets in cu118 build by @matthewdouglas in #1487
- Exclude tests from distribution by @akx in #1486
New Contributors
- @bertmaher made their first contribution in #1413
- @SauravMaheshkar made their first contribution in #1373
- @blbadger made their first contribution in #1471
- @johnnynunez made their first contribution in #1481
Full Changelog: 0.45.0...0.45.1
0.45.0: LLM.int8() support for H100; faster 4-bit/8-bit inference
Highlights
H100 Support for LLM.int8()
PR #1401 brings full LLM.int8() support for NVIDIA Hopper GPUs such as the H100, H200, and H800!
As part of the compatibility enhancements, we've rebuilt much of the LLM.int8() code in order to simplify it for future compatibility and maintenance. We no longer use the `col32` or other architecture-specific tensor layout formats, while maintaining backwards compatibility. We additionally bring performance improvements targeted at inference scenarios.
Performance Improvements
This release includes broad performance improvements for a wide variety of inference scenarios. See this X thread for a detailed explanation.
The improvements were measured using the 🤗optimum-benchmark tool.
For more benchmark results, see benchmarking/README.md.
LLM.int8()
- Turing/Ampere/Ada: The observed per-token throughput is improved by 60-85%, while latency is decreased by 40-45%.
- H100: With our benchmarking of Llama 3.1 70B, we observed the new LLM.int8() to consistently outperform NF4 at batch size >= 8.
Example throughput improvement for Qwen 2.5 14B Instruct on RTX 4090:
- Batch size = 1: 9.05 tokens/s => 15.44 tokens/s
- Batch size = 8: 66.62 tokens/s => 110.95 tokens/s
Example throughput improvement for Qwen 2.5 3B Instruct on T4:
- Batch size = 1: 3.34 tokens/s => 5.98 tokens/s
- Batch size = 8: 24.28 tokens/s => 44.15 tokens/s
NF4/FP4
- Turing/Ampere/Ada: With batch size of 1, per-token throughput is improved by 10-25% and per-token latency is decreased by 10-20%.
- H100: Across all batch sizes, per-token throughput is improved by up to 28% and per-token latency is decreased by up to 22%.
Example throughput improvement for Qwen 2.5 14B Instruct on RTX 4090:
- Batch size = 1: 31.46 tokens/s => 39.03 tokens/s
- Batch size = 8: 110.70 tokens/s => 111.29 tokens/s
Example throughput improvement for Qwen 2.5 3B Instruct on T4:
- Batch size = 1: 11.05 tokens/s => 13.58 tokens/s
- Batch size = 8: 69.8 tokens/s => 76.80 tokens/s
Changes
Packaging Changes
The size of our wheel has been reduced by ~43.5% from 122.4 MB to 69.1 MB! This results in an on-disk size decrease from ~396MB to ~224MB.
CUDA Toolkit Versions
- Binaries built with CUDA Toolkit 12.6.2 are now included in the PyPI distribution.
- The CUDA 12.5.0 build has been updated to CUDA Toolkit 12.5.1.
Breaking
🤗PEFT users wishing to merge adapters with 8-bit weights will need to upgrade to `peft>=0.14.0`.
New
- A new public API for int8 dequantization has been added: `bitsandbytes.functional.int8_vectorwise_dequant()`. This functionality is being integrated into 🤗PEFT and 🤗transformers. A brief usage sketch follows this list.
- We've continued to make documentation updates. The `bitsandbytes.functional` module now has an API documentation page.
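As a hedged illustration (tensor shapes and the tolerance are arbitrary, not from the release notes), the new dequantization API pairs with the existing `int8_vectorwise_quant()`:

```python
import torch
import bitsandbytes.functional as F

A = torch.randn(16, 64, dtype=torch.float16, device="cuda")

# Row-wise absmax int8 quantization: returns the int8 tensor, per-row scales,
# and (when a threshold is used) outlier column indices.
A_int8, row_stats, _ = F.int8_vectorwise_quant(A)

# New public API in this release: dequantize back to float32.
A_dequant = F.int8_vectorwise_dequant(A_int8, row_stats)

print(torch.allclose(A.float(), A_dequant, atol=1e-1))
```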
Deprecations
A number of public API functions have been marked for deprecation and will emit a `FutureWarning` when used. These functions will become unavailable in future releases. This should have minimal impact on most end-users.
k-bit quantization
The k-bit quantization features are deprecated in favor of blockwise quantization. For all optimizers, using `block_wise=False` is not recommended and support will be removed in a future release.
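For reference, a minimal sketch of the recommended blockwise path (the model and hyperparameters are placeholders); `block_wise=True` is already the default for the 8-bit optimizers:

```python
import torch
import bitsandbytes as bnb

model = torch.nn.Linear(1024, 1024).cuda()
# Blockwise (default) 8-bit Adam; block_wise=False is the deprecated path.
optimizer = bnb.optim.Adam8bit(model.parameters(), lr=1e-4, block_wise=True)

loss = model(torch.randn(4, 1024, device="cuda")).sum()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```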
LLM.int8() deprecations:
As part of the refactoring process, we've implemented many new 8bit operations. These operations no longer use specialized data layouts.
The following relevant functions from `bitsandbytes.functional` are now deprecated (a sketch of the replacement flow follows the list):
- dequant_min_max
- dequantize_no_absmax
- extract_outliers
- get_special_format_str
- get_transform_buffer
- get_transform_func
- mm_dequant (replacement: int8_mm_dequant)
- igemmlt (replacement: int8_linear_matmul)
- nvidia_transform
- transform
- quantize_no_absmax
- vectorwise_dequant
- vectorwise_quant (~replacement: int8_vectorwise_quant)
- vectorwise_mm_dequant (~replacement: int8_mm_dequant)
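As a rough sketch of the replacement flow (tensor shapes are illustrative, and this is not an excerpt from the library docs), the deprecated `igemmlt`/`mm_dequant` pair maps onto `int8_linear_matmul()` and `int8_mm_dequant()`:

```python
import torch
import bitsandbytes.functional as F

x = torch.randn(32, 128, dtype=torch.float16, device="cuda")  # activations (m, k)
w = torch.randn(64, 128, dtype=torch.float16, device="cuda")  # weight (n, k)

# Row-wise int8 quantization of both operands.
x_int8, x_stats, _ = F.int8_vectorwise_quant(x)
w_int8, w_stats, _ = F.int8_vectorwise_quant(w)

# int8 matmul with an int32 accumulator: (m, k) x (n, k)^T -> (m, n).
y_int32 = F.int8_linear_matmul(x_int8, w_int8)

# Dequantize the int32 result using the per-row scales of both operands.
y = F.int8_mm_dequant(y_int32, x_stats, w_stats)
```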
General Deprecations
Additionally, the following functions from `bitsandbytes.functional` are deprecated:
- _mul
- arange
- post_call
- pre_call
What's Changed
- refine docs for multi-backend alpha release by @Titus-von-Koeller in #1380
- README: Replace special Unicode text symbols with regular characters by @akx in #1385
- Update CI tools & fix typos by @akx in #1386
- Fix invalid escape sequence warning in Python 3.12 by @oshiteku in #1420
- [Build] Add CUDA 12.6.2 build; update 12.5.0 to 12.5.1 by @matthewdouglas in #1431
- LLM.int8() Refactoring: Part 1 by @matthewdouglas in #1401
New Contributors
Full Changelog: 0.44.1...0.45.0