
Commit 408dc6c

4-bit beta initial.
0 parents  commit 408dc6c


86 files changed, +20924 −0 lines changed

.buckconfig

Whitespace-only changes.

.gitignore

+135
@@ -0,0 +1,135 @@
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
pip-wheel-metadata/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
.python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock

# PEP 582; used by e.g. github.com/David-OConnor/pyflow
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# vim
*.swp

dependencies
cuda_build

CHANGELOG.md

+230
@@ -0,0 +1,230 @@
### 0.0.21

- Ampere, RTX 30 series GPUs now compatible with the library.


### 0.0.22:

- Fixed an error where a `reset_parameters()` call on the `StableEmbedding` would lead to an error in older PyTorch versions (from 1.7.0).

### 0.0.23:

Bugs:
- Unified quantization API: each quantization function now returns `Q, S`, where `Q` is the quantized tensor and `S` the quantization state, which may hold absolute max values, a quantization map, or more. For dequantization, all functions now accept the inputs `Q, S`, so that `Q` is dequantized with the quantization state `S` (see the sketch at the end of this section).
- Fixed an issue where the CUDA 11.1 binary was not compiled with the right headers.

API changes:
- Block-wise quantization for optimizers now enabled by default.

Features:
- Block-wise quantization routines now support CPU tensors.
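
A minimal sketch of the unified `Q, S` round trip, assuming the block-wise routines in `bnb.functional` (exact signatures have shifted across releases):

```python
import torch
import bitsandbytes.functional as F

A = torch.randn(1024, 1024)

# Quantize: Q is the quantized tensor, S the quantization state
# (per-block absolute max values plus the quantization map).
Q, S = F.quantize_blockwise(A)

# Dequantize: the same Q, S pair goes back in.
A_restored = F.dequantize_blockwise(Q, S)

print((A - A_restored).abs().max())  # small block-wise quantization error
```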

### 0.0.24:

- Fixed a bug where a float/half conversion led to a compilation error for CUDA 11.1 on Turing GPUs.
- Removed the Apex dependency for bnb LAMB.

### 0.0.25:

Features:
- Added `skip_zeros` for block-wise and 32-bit optimizers. This ensures correct updates for sparse gradients and sparse models (see the sketch after this list).
- Added support for Kepler GPUs. (#4)
- Added Analysis Adam to track 8-bit vs 32-bit quantization errors over time.
- Made compilation more user friendly.
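
A minimal sketch, assuming `skip_zeros` is accepted as an optimizer keyword argument and forwarded to the underlying optimizer state machinery (the exact constructor surface varies by version; the model here is illustrative):

```python
import torch
import bitsandbytes as bnb

model = torch.nn.Linear(128, 128)

# Assumed usage: skip_zeros=True skips zero-valued gradient entries
# so that state updates stay correct for sparse gradients.
optimizer = bnb.optim.Adam(model.parameters(), lr=1e-3, skip_zeros=True)
```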

Bug fixes:
- Fixed an "undefined symbol: \_\_fatbinwrap_38" error for P100 GPUs on CUDA 10.1. (#5)

Docs:
- Added docs with instructions to compile from source.


### 0.26.0:

Features:
- Added Adagrad (without grad clipping) as a 32-bit and 8-bit block-wise optimizer.
- Added AdamW (copy of Adam with weight decay init 1e-2). #10
- Introduced ModuleConfig overrides which can seamlessly be used at initialization time of a module (see the sketch at the end of this section).
- Added a `bnb.nn.Embedding` layer which runs at 32-bit but without the layernorm. This works well if you need to fine-tune pretrained models that do not have an embedding layer norm. #19

Bug fixes:
- Fixed a bug where weight decay was incorrectly applied to 32-bit Adam. #13
- Fixed an unsafe use of eval. #8
- Fixed a bug where the StableEmbedding layer 32-bit optimizer override would not work without registering the whole model first (`bnb.optim.GlobalOptimManager.get_instance().register_parameters(model.parameters())`). #13 #15

Docs:
- Added instructions on how to solve "\_\_fatbinwrap\_" errors.
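
A minimal sketch of the override mechanism, combining the `register_parameters` call quoted above with a per-parameter `override_config` call (the two-layer model is illustrative):

```python
import torch
import bitsandbytes as bnb

model = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.Linear(64, 2))

# Register parameters with the manager before the optimizer is created...
manager = bnb.optim.GlobalOptimManager.get_instance()
manager.register_parameters(model.parameters())

# ...then override the config for a single parameter: keep its optimizer
# state in 32-bit while the rest of the model is optimized in 8-bit.
manager.override_config(model[0].weight, "optim_bits", 32)

optimizer = bnb.optim.Adam(model.parameters(), lr=1e-3, optim_bits=8)
```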

### 0.30.0

#### 8-bit Inference Update

Features:
- Added 8-bit matrix multiplication from cuBLAS and cuBLASLt, as well as multiple GEMM kernels (GEMM, GEMMEx, GEMMLt).
- Added 8-bit Linear layers with 8-bit Params that perform memory-efficient inference, with an option for 8-bit mixed-precision matrix decomposition for inference without performance degradation (see the sketch after this list).
- Added quantization methods for "fake" quantization, as well as optimized kernels for vector-wise quantization and equalization, and optimized cuBLASLt transformations.
- CPU-only build now available. (Thank you, @mryab)
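
A minimal sketch of the inference path, assuming the `Linear8bitLt` constructor arguments shown here (argument names have varied across releases); `threshold` controls the mixed-precision decomposition:

```python
import torch
import bitsandbytes as bnb

fp16_linear = torch.nn.Linear(1024, 1024).half()

int8_linear = bnb.nn.Linear8bitLt(
    1024, 1024,
    has_fp16_weights=False,  # keep weights in 8-bit for inference
    threshold=6.0,           # outlier features above this magnitude stay in fp16
)
int8_linear.load_state_dict(fp16_linear.state_dict())
int8_linear = int8_linear.cuda()  # weights are quantized to int8 on this call

x = torch.randn(8, 1024, dtype=torch.float16, device="cuda")
y = int8_linear(x)
```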

Deprecated:
- Pre-compiled releases for CUDA 9.2, 10.0, and 10.2 are no longer available.

### 0.31.0

#### 8-bit Inference and Packaging Update

Features:
- Added direct outlier extraction. This enables outlier extraction without fp16 weights and without performance degradation.
- Added an automatic CUDA SETUP procedure and packaged all binaries into a single bitsandbytes package.

### 0.32.0

#### 8-bit Inference Performance Enhancements

We added performance enhancements for small models. This makes small models about 2x faster for LLM.int8() inference.

Features:
- Int32 dequantization now supports fused biases.
- Linear8bitLt now uses a fused bias implementation.
- Changed `.data.storage().data_ptr()` to `.data.data_ptr()` to enhance inference performance.

Bug fixes:
- Now throws an error if LLM.int8() is used on a GPU that is not supported.
- Enhanced error messaging if CUDA SETUP fails.


### 0.33.0

#### Various bug fixes

Features:
- CPU quantization now supports a variable `blocksize` to trade off quantization speed against precision (see the sketch below).
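
A minimal sketch, assuming `blocksize` is exposed as a keyword argument on the block-wise quantization functions (as in later releases):

```python
import torch
import bitsandbytes.functional as F

A = torch.randn(4096, 4096)  # CPU tensor

# Smaller blocks -> more absmax values -> higher precision but slower;
# larger blocks  -> fewer absmax values -> faster but less precise.
Q, S = F.quantize_blockwise(A, blocksize=2048)
A_restored = F.dequantize_blockwise(Q, S, blocksize=2048)
```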

Bug fixes:
- Fixed an issue in CPU quantization where tensors with more than 2^31 elements would fail. 19a7adca7a6c9bf7061a384d7e9d9b13676a1a88
- Fixed a bug where CPU binaries would fail if no GPU was detected. eab4d8232d558f2e6bd7f7cc3d00e2e6e94f4e80
- Fixed an issue where CPU binaries caused additional stdout messages. 92a3363096e10ad6a5c4e944af898bd1186d806a
- Fixed an import of bnb.utils. 2e630b55f51d454f3bd723dffda68a07ef93190c

We thank @mryab, @mbrukman, @chessgecko, and @dbaranchuk for pull requests with bug fixes and new features.


### 0.34.0

#### Bug fixes and memory-efficient backprop

Features:
- The Linear8bitLt layer now supports `memory_efficient_backward=True`, which enables backprop of gradients through frozen weights (see the sketch below).
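
A minimal sketch, assuming `memory_efficient_backward` is a `Linear8bitLt` constructor flag as named above:

```python
import torch
import bitsandbytes as bnb

layer = bnb.nn.Linear8bitLt(
    1024, 1024,
    has_fp16_weights=False,
    memory_efficient_backward=True,  # let gradients flow *through* the frozen int8 weights
).cuda()

x = torch.randn(8, 1024, dtype=torch.float16, device="cuda", requires_grad=True)
layer(x).sum().backward()  # x.grad is populated; the frozen weights get no grad
```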

Bug fixes:
- Fixed an issue where too many threads were created in block-wise quantization on the CPU for large tensors.


### 0.35.0

#### CUDA 11.8 support and bug fixes

Features:
- CUDA 11.8 support added and binaries added to the PyPI release.

Bug fixes:
- Fixed a bug where overly long directory names would crash the CUDA SETUP. #35 (thank you @tomaarsen)
- Fixed a bug where CPU installations on Colab would run into an error. #34 (thank you @tomaarsen)
- Fixed an issue where the default CUDA version with fast-DreamBooth was not supported. #52

### 0.35.1

Features:
- Added a CUDA instruction generator to fix some installations.

Bug fixes:
- Fixed a problem where warning messages would be displayed even though everything worked correctly.

### 0.35.2

Bug fixes:
- Fixed a bug where the CUDA setup failed due to a wrong function call.

### 0.35.3

Bug fixes:
- Fixed a bug in the CUDA setup which led to an incomprehensible error if no GPU was detected.

### 0.35.4

Bug fixes:
- Fixed a bug where the CUDA setup failed when the CUDA runtime was found, but not the CUDA library.
- Fixed a bug where not finding the CUDA runtime led to an incomprehensible error.


### 0.36.0

#### Improvements, Ada/Hopper support, fake k-bit quantization

Features:
- CUDA 11.8 and 12.0 support added.
- Support for Ada and Hopper GPUs added (compute capability 8.9 and 9.0).
- Support for fake k-bit block-wise quantization for Int, Float, quantile quantization, and dynamic exponent data types added.
- Added a CUDA instruction generator to fix some installations.
- Added additional block sizes for quantization: {64, 128, 256, 512, 1024}.
- Added the SRAM Quantile algorithm to quickly estimate fewer than 256 quantiles.
- Added an option to suppress the bitsandbytes welcome message. (@Cyberes)

Regression:
- Compute capability 3.0 removed: the GTX 600 and 700 series are no longer supported (except GTX 780 and GTX 780 Ti).

Bug fixes:
- Fixed a bug where overly long directory names would crash the CUDA SETUP. #35 (@tomaarsen)
- Fixed a bug where CPU installations on Colab would run into an error. #34 (@tomaarsen)
- Fixed an issue where the default CUDA version with fast-DreamBooth was not supported. #52
- Fixed a bug where the CUDA setup failed due to a wrong function call.
- Fixed a bug in the CUDA setup which led to an incomprehensible error if no GPU was detected.
- Fixed a bug where the CUDA setup failed when the CUDA runtime was found, but not the CUDA library.
- Fixed a bug where not finding the CUDA runtime led to an incomprehensible error.
- Fixed a bug where missing CUDA resulted in an error instead of falling back to loading the CPU library.
- Fixed a bug where the CC version of the GPU was not detected appropriately. (@BlackHC)
- Fixed a bug in CPU quantization which led to errors when the input buffer exceeded 2^31 elements.

Improvements:
- Multiple improvements in formatting, removal of unused imports, and slight performance improvements. (@tomaarsen)
- The StableEmbedding layer now has device and dtype parameters to make it a 1:1 replacement for regular Embedding layers. (@lostmsu)
- Runtime performance of block-wise quantization slightly improved.
- Added an error message for the case where multiple libcudart.so libraries are installed and bitsandbytes picks the wrong one.


### 0.37.0

#### Int8 Matmul + backward support for all GPUs

Features:
- Int8 MatmulLt now supports backward through inversion of the ColTuring/ColAmpere format. Slow, but memory efficient. Big thanks to @borzunov.
- Int8 now supported on all GPUs. On devices with compute capability < 7.5, the Int8 weights are cast to 16/32-bit for the matrix multiplication. Contributed by @borzunov.

Improvements:
- Improved logging for the CUDA detection mechanism.

### 0.38.0

#### 8-bit Lion, Load/Store 8-bit Models directly from/to HF Hub

Features:
- Support for 32-bit and 8-bit Lion has been added. Thank you @lucidrains.
- Support for serialization of Linear8bitLt layers (LLM.int8()). This allows storing and loading 8-bit weights directly from the HuggingFace Hub (see the sketch after this list). Thank you @myrab.
- New bug report feature: `python -m bitsandbytes` now gives extensive debugging details to debug CUDA setup failures.
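
A minimal sketch of the serialization round trip with standard `torch.save`/`load_state_dict` (pushing to the Hub itself goes through the usual HuggingFace tooling; the layer shapes are illustrative):

```python
import torch
import bitsandbytes as bnb

layer = bnb.nn.Linear8bitLt(1024, 1024, has_fp16_weights=False, threshold=6.0).cuda()

# The 8-bit weights (and their quantization statistics) now survive a
# state-dict round trip instead of requiring re-quantization from fp16.
torch.save(layer.state_dict(), "linear_int8.pt")

restored = bnb.nn.Linear8bitLt(1024, 1024, has_fp16_weights=False, threshold=6.0)
restored.load_state_dict(torch.load("linear_int8.pt"))
restored = restored.cuda()
```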

Bug fixes:
- Fixed a bug where some bitsandbytes methods failed in a model-parallel setup on multiple GPUs. Thank you @tonylins.
- Fixed a bug where cudart.so libraries could not be found in newer PyTorch releases.

Improvements:
- Improved the CUDA setup procedure by doing a more extensive search for CUDA libraries.

Deprecated:
- Devices with compute capability 3.0 (GTX 700 series, K10) and 3.2 (Tegra K1, Jetson TK1) are now deprecated, and support will be removed in 0.39.0.
- Support for CUDA 10.0 and 10.2 will be removed in bitsandbytes 0.39.0.


### 0.38.1

Features:
- Added Int8 SwitchBack layers.
- Added Fake FP8 layers for research purposes (available under `bnb.research.nn. ...`).
