Releases: bitsandbytes-foundation/bitsandbytes

0.46.1

02 Jul 19:45

What's Changed

New Contributors

Full Changelog: 0.46.0...0.46.1

Latest `main` wheel

01 Jul 19:17
ed9c8fc
Pre-release

Latest main pre-release wheel

This pre-release contains the latest development wheels for all supported platforms, rebuilt automatically on every commit to the main branch.

How to install:
Pick the correct command for your platform and run it in your terminal:

Linux (ARM/aarch64)

pip install --force-reinstall https://github.com/bitsandbytes-foundation/bitsandbytes/releases/download/continuous-release_main/bitsandbytes-1.33.7.preview-py3-none-manylinux_2_24_aarch64.whl

Linux (x86_64)

pip install --force-reinstall https://github.com/bitsandbytes-foundation/bitsandbytes/releases/download/continuous-release_main/bitsandbytes-1.33.7.preview-py3-none-manylinux_2_24_x86_64.whl

Windows (x86_64)

pip install --force-reinstall https://github.com/bitsandbytes-foundation/bitsandbytes/releases/download/continuous-release_main/bitsandbytes-1.33.7.preview-py3-none-win_amd64.whl

Note:
These wheels are updated automatically with every commit to main and become available as soon as the python-package.yml workflow finishes.

The version number is replaced with 1.33.7-preview in order to keep the download link stable; this does not affect the version that actually gets installed:

> pip install https://.../bitsandbytes-1.33.7-preview-py3-none-manylinux_2_24_x86_64.whl
Collecting bitsandbytes==1.33.7rc0
...
Successfully installed bitsandbytes-0.46.0.dev0
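
To confirm which version actually got installed, independent of the 1.33.7-preview filename, you can check it from Python:

    # Print the installed bitsandbytes version; the "1.33.7-preview"
    # placeholder in the wheel filename has no effect on this value.
    import bitsandbytes
    print(bitsandbytes.__version__)  # e.g. 0.46.0.dev0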

0.46.0: torch.compile() support; custom ops refactor; Linux aarch64 wheels

27 May 21:27

Highlights

  • Support for torch.compile without graph breaks for LLM.int8() (see the usage sketch after this list).
    • Compatible with PyTorch 2.4+, but PyTorch 2.6+ is recommended.
    • Experimental CPU support is included.
  • Support for torch.compile without graph breaks for 4-bit.
    • Compatible with PyTorch 2.4+ for fullgraph=False.
    • Requires a PyTorch 2.8 nightly for fullgraph=True.
  • We are now publishing wheels for CUDA Linux aarch64 (sbsa)!
    • Targets are Turing generation and newer: sm75, sm80, sm90, and sm100.
  • PyTorch Custom Operators refactoring and integration:
    • We have refactored most of the library code to integrate better with PyTorch via the torch.library and custom ops APIs. This helps enable our torch.compile and additional hardware compatibility efforts.
    • End-users do not need to change the way they are using bitsandbytes.
  • Unit tests have been cleaned up for increased determinism and most are now device-agnostic.
    • A new nightly CI runs unit tests for CPU (Windows x86-64, Linux x86-64/aarch64) and CUDA (Linux/Windows x86-64).
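
As a rough usage sketch (not taken from the release notes; the layer size, input shapes, and flags are illustrative assumptions), a bitsandbytes 8-bit layer can be combined with torch.compile like this:

    # Minimal sketch: torch.compile over an LLM.int8() linear layer.
    # Sizes, dtypes, and the inference-style config are illustrative.
    import torch
    import bitsandbytes as bnb

    layer = bnb.nn.Linear8bitLt(4096, 4096, has_fp16_weights=False).cuda()

    # fullgraph=False works on PyTorch 2.4+; for 4-bit layers,
    # fullgraph=True needs a PyTorch 2.8 nightly (see the list above).
    compiled = torch.compile(layer, fullgraph=False)

    x = torch.randn(8, 4096, dtype=torch.float16, device="cuda")
    out = compiled(x)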

Compatibility Changes

  • Support for Python 3.8 is dropped.
  • Support for PyTorch < 2.2.0 is dropped.
  • CUDA 12.6 and 12.8 builds are now compatible with manylinux_2_24 (previously manylinux_2_34).
  • Many APIs that were previously marked as deprecated have now been removed.
  • New deprecations:
    • bnb.autograd.get_inverse_transform_indices()
    • bnb.autograd.undo_layout()
    • bnb.functional.create_quantile_map()
    • bnb.functional.estimate_quantiles()
    • bnb.functional.get_colrow_absmax()
    • bnb.functional.get_row_absmax()
    • bnb.functional.histogram_scatter_add_2d()

What's Changed

New Contributors

Full Changelog: 0.45.4...0.46.0

Multi-Backend Preview

19 May 13:24
5e267f5
Pre-release
continuous-release_multi-backend-refactor

update compute_type_is_set attr (#1623)

0.45.5

07 Apr 13:37

This is a minor release that affects CPU-only usage of bitsandbytes. The CPU build of the library was inadvertently omitted from the v0.45.4 wheels.

Full Changelog: 0.45.4...0.45.5

0.45.4

25 Mar 16:41

This is a minor release that affects CPU-only usage of bitsandbytes. There is one bugfix and improved system compatibility on Linux.

What's Changed

  • Build: use ubuntu-22.04 instead of 24.04 for CPU build (glibc compat) by @matthewdouglas in #1538
  • Fix CPU dequantization to use nested dequantized scaling constant by @zyklotomic in #1549

New Contributors

Full Changelog: 0.45.3...0.45.4

0.45.3

24 Feb 19:50

Overview

This is a small patch release containing a few bug fixes.

Additionally, this release contains a CUDA 12.8 build which adds the sm100 and sm120 targets for NVIDIA Blackwell GPUs.

What's Changed

New Contributors

Full Changelog: 0.45.2...0.45.3

0.45.2

06 Feb 19:03

This patch release fixes a compatibility issue with Triton 3.2 in PyTorch 2.6. Previously, importing bitsandbytes with Triton installed but no GPUs visible could raise a RuntimeError:

RuntimeError: 0 active drivers ([]). There should only be one.

Full Changelog: 0.45.1...0.45.2

0.45.1

23 Jan 16:22

Overview

This is a patch release containing compatibility fixes.

Highlights

  • Compatibility for triton>=3.2.0
  • Moved package configuration to pyproject.toml
  • Build system: initial support for NVIDIA Blackwell B100 GPUs, RTX 50-series Blackwell GPUs, and Jetson Thor Blackwell.
    • Note: Binaries built for these platforms are not included in this release. They will be included in future releases upon the availability of the upcoming CUDA Toolkit 12.7 and 12.8.
  • Packaging: wheels will no longer include unit tests. (#1478)
  • The minimum supported PyTorch version is now 2.0.0.

What's Changed

New Contributors

Full Changelog: 0.45.0...0.45.1

0.45.0: LLM.int8() support for H100; faster 4-bit/8-bit inference

05 Dec 16:07

Highlights

H100 Support for LLM.int8()

PR #1401 brings full LLM.int8() support for NVIDIA Hopper GPUs such as the H100, H200, and H800!

As part of the compatibility enhancements, we've rebuilt much of the LLM.int8() code to simplify future compatibility and maintenance. We no longer use the col32 or other architecture-specific tensor layout formats, while maintaining backwards compatibility. We additionally bring performance improvements targeted at inference scenarios.
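
End-users typically reach LLM.int8() through 🤗Transformers rather than these internals; as a minimal sketch (the checkpoint name is only an example, borrowed from the benchmarks below), loading a model in 8-bit on an H100 looks like:

    # Sketch: LLM.int8() quantization via 🤗Transformers on a Hopper GPU.
    # The checkpoint name is illustrative; any supported model works the same way.
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-3.1-70B",
        quantization_config=BitsAndBytesConfig(load_in_8bit=True),
        torch_dtype=torch.float16,
        device_map="auto",
    )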

Performance Improvements

This release includes broad performance improvements for a wide variety of inference scenarios. See this X thread for a detailed explanation.

The improvements were measured using the 🤗optimum-benchmark tool.

For more benchmark results, see benchmarking/README.md.

LLM.int8()

  • Turing/Ampere/Ada: The observed per-token throughput is improved by 60-85%, while latency is decreased by 40-45%.
  • H100: With our benchmarking of Llama 3.1 70B, we observed the new LLM.int8() to consistently outperform NF4 at batch size >= 8.

Example throughput improvement for Qwen 2.5 14B Instruct on RTX 4090:

  • Batch size = 1: 9.05 tokens/s => 15.44 tokens/s
  • Batch size = 8: 66.62 tokens/s => 110.95 tokens/s

Example throughput improvement for Qwen 2.5 3B Instruct on T4:

  • Batch size = 1: 3.34 tokens/s => 5.98 tokens/s
  • Batch size = 8: 24.28 tokens/s => 44.15 tokens/s

NF4/FP4

  • Turing/Ampere/Ada: With batch size of 1, per-token throughput is improved by 10-25% and per-token latency is decreased by 10-20%.
  • H100: Across all batch sizes, per-token throughput is improved by up to 28% and per-token latency is decreased by up to 22%.

Example throughput improvement for Qwen 2.5 14B Instruct on RTX 4090:

  • Batch size = 1: 31.46 tokens/s => 39.03 tokens/s
  • Batch size = 8: 110.70 tokens/s => 111.29 tokens/s

Example throughput improvement for Qwen 2.5 3B Instruct on T4:

  • Batch size = 1: 11.05 tokens/s => 13.58 tokens/s
  • Batch size = 8: 69.8 tokens/s => 76.80 tokens/s

Changes

Packaging Changes

The size of our wheel has been reduced by ~43.5% from 122.4 MB to 69.1 MB! This results in an on-disk size decrease from ~396 MB to ~224 MB.

CUDA Toolkit Versions

  • Binaries built with CUDA Toolkit 12.6.2 are now included in the PyPI distribution.
  • The CUDA 12.5.0 build has been updated to CUDA Toolkit 12.5.1.

Breaking

🤗PEFT users wishing to merge adapters with 8-bit weights will need to upgrade to peft>=0.14.0.

New

Deprecations

A number of public API functions have been marked for deprecation and will emit a FutureWarning when used. These functions will become unavailable in future releases. This should have minimal impact on most end-users.
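
If you want deprecated calls to fail loudly before they are removed, one plain-Python option (not a bitsandbytes API) is to escalate FutureWarning to an error in your test environment:

    # Turn FutureWarning into an error so deprecated bitsandbytes calls
    # surface immediately in CI instead of breaking after removal.
    import warnings
    warnings.simplefilter("error", FutureWarning)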

k-bit quantization

The k-bit quantization features are deprecated in favor of blockwise quantization. For all optimizers, using block_wise=False is not recommended and support will be removed in a future release.
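
In practice this just means not passing block_wise=False when constructing the 8-bit optimizers; a minimal sketch, with the model and hyperparameters as placeholder assumptions:

    # Sketch: 8-bit Adam with blockwise quantization (the recommended path).
    # Passing block_wise=False is the deprecated configuration described above.
    import torch
    import bitsandbytes as bnb

    model = torch.nn.Linear(1024, 1024).cuda()
    optimizer = bnb.optim.Adam8bit(model.parameters(), lr=1e-3, block_wise=True)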

LLM.int8() deprecations:

As part of the refactoring process, we've implemented many new 8bit operations. These operations no longer use specialized data layouts.

The following relevant functions from bitsandbytes.functional are now deprecated:

  • dequant_min_max
  • dequantize_no_absmax
  • extract_outliers
  • get_special_format_str
  • get_transform_buffer
  • get_transform_func
  • mm_dequant (replacement: int8_mm_dequant)
  • igemmlt (replacement: int8_linear_matmul)
  • nvidia_transform
  • transform
  • quantize_no_absmax
  • vectorwise_dequant
  • vectorwise_quant (~replacement: int8_vectorwise_quant)
  • vectorwise_mm_dequant (~replacement: int8_mm_dequant)

General Deprecations

Additionally, the following functions from bitsandbytes.functional are deprecated:

  • _mul
  • arange
  • post_call
  • pre_call

What's Changed

New Contributors

Full Changelog: 0.44.1...0.45.0