v0.1.3
- Support for PaliGemma 2 and Gemma 3.
- Major update to MatMul and MatMul-using operations; significant performance increases in multiple parts of the codebase (the register-tiling idea behind these kernels is sketched after this list).
- Codebase simplifications and refactors in many areas.
- Bugfixes.
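Most of the MatMul speedups below come from register-tiled kernels (e.g. the unrolled/tiled 4x4 kernels in #199 and #241). As a rough illustration of the tiling idea only — a minimal scalar sketch with hypothetical names, not gemma.cpp's Highway-vectorized `MatMul_4x4` — each 4x4 output tile is accumulated in local variables so every loaded value of A and B is reused four times:

```cpp
#include <cstddef>

// Hypothetical scalar sketch of 4x4 register tiling: each (4-row, 4-col)
// output tile is accumulated in 16 local sums, so the A/B values loaded per
// k-step are each reused four times. Assumes row-major storage and
// dimensions that are multiples of 4.
void MatMul4x4Sketch(const float* A, const float* B, float* C,
                     size_t M, size_t K, size_t N) {
  for (size_t i = 0; i < M; i += 4) {
    for (size_t j = 0; j < N; j += 4) {
      float acc[4][4] = {};
      for (size_t k = 0; k < K; ++k) {
        const float a0 = A[(i + 0) * K + k], a1 = A[(i + 1) * K + k];
        const float a2 = A[(i + 2) * K + k], a3 = A[(i + 3) * K + k];
        for (size_t jj = 0; jj < 4; ++jj) {
          const float b = B[k * N + j + jj];
          acc[0][jj] += a0 * b;
          acc[1][jj] += a1 * b;
          acc[2][jj] += a2 * b;
          acc[3][jj] += a3 * b;
        }
      }
      for (size_t ii = 0; ii < 4; ++ii) {
        for (size_t jj = 0; jj < 4; ++jj) {
          C[(i + ii) * N + j + jj] = acc[ii][jj];
        }
      }
    }
  }
}
```

The actual kernels additionally vectorize the inner loop, decompress SFP/NUQ weights on the fly, and (after the rewrite in #488) add cache blocking, hierarchical parallelization, and autotuning.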
What's Changed
- Add more ops: Sigmoid, (Two)MatVecAdd. Faster TwoMatVec. by @veluca93 in #129
- Improve weight handling. by @veluca93 in #130
- Remove unused includes by @copybara-service in #132
- Add a benchmark and additional tests. by @veluca93 in #131
- Adding Griffin implementation. by @pculliton in #136
- Change `NumGemmaLayers` and `NumGriffinLayers` to constants in configs by @ufownl in #139
- Mention Makefile contributed by @jart by @copybara-service in #141
- Refactor data structures to reduce memory usage by @ufownl in #142
- Added functionality of storing layers activations output. by @atorero in #145
- Further improve IO, enable multiple backends without -D. by @copybara-service in #148
- Use lambda to split function and allow stream_token to break prefill by @zeerd in #156
- Simplify prefill early-exit (originally Merge #156) by @copybara-service in #158
- Fix underflow in NUQ ClusterCost() by @copybara-service in #162
- Add error-checking for py binding, add missing include+hwasan check by @copybara-service in #163
- Simplify threading: remove the use of inner_pool. by @szabadka in #167
- Use more parallelism in the QKV projections in MQA mode. by @szabadka in #170
- Fix kv offset computation for MHA config. by @szabadka in #172
- Use more parallelism in the final output of the attention block. by @szabadka in #175
- Use more parallelism in the QKV projections of the MHA block. by @szabadka in #176
- Factor out deinterleaving of bf16 vectors for MatVecs. by @samkaufman in #166
- Use more parallelism in attention block in prefill mode. by @szabadka in #177
- work with cmake install by @xinpingwang in #169
- 2x speedup of SFP decode (1.4x overall) on AVX3_DL+. by @copybara-service in #178
- Support additional scaling by @copybara-service in #181
- Store tokens/sec in auxiliary struct TimingInfo. by @copybara-service in #183
- Add TTFT to TimingInfo by @copybara-service in #186
- Make BlobWriter::Add() accept const void* by @copybara-service in #188
- Adds Kaggle testing to CI workflow by @pculliton in #189
- Fix normalization in Softmax function. by @szabadka in #194
- Clarified README by @zond in #137
- Unrolled / tiled 4x4 MatMul by @copybara-service in #199
- Refactor GemmaImpl dispatch to use Highway 1.2's HWY_DYNAMIC_DISPATCH_T by @copybara-service in #202
- Add first version of backpropagation support. by @szabadka in #203
- Fix for GenerateZeroMat call in TestTiledMatMul by @copybara-service in #206
- Remove no longer required stats.h - use Highway version instead by @copybara-service in #208
- Simplifications: remove GemmaInterface and GemmaImpl by @copybara-service in #209
- Implement mixed mode matmul: f32 * bf16 by @copybara-service in #210
- Fix Softmax on SVE by @copybara-service in #213
- Fix fix for weight type define, refs #198 by @copybara-service in #216
- Add Adam optimizer. by @szabadka in #212
- Add support for custom sampling function to runtime config. by @szabadka in #217
- Shifting large matrix init to heap in ops_test.cc by @copybara-service in #220
- Add CPU output, error if not C++17, simplify tokenizer ctor by @copybara-service in #222
- Use CompressedWeights<TConfig> in backpropagation. by @szabadka in #224
- Update benchmark with internal init by @copybara-service in #225
- Use Loader/AppArgs to construct gemma_test model, simplify AcceptFunc by @copybara-service in #227
- Implement float * SfpStream matmul by decompressing 4 * kColsA_RowsB-sized chunks of the second matrix. by @copybara-service in #231
- Add benchmark dependency to cmake build. by @szabadka in #234
- Fix numerical issue in Softcap by subtracting max. by @copybara-service in #236
- Extends Transformer() to prepare for batched processing. by @copybara-service in #238
- Tiny cleanup: distinguish between "ids" and "pieces" in argument names when encoding. by @copybara-service in #239
- Support mixed (bf16, sfp) tiled MatMul. Same sfp-decompress strategy as in (f32, sfp). by @copybara-service in #237
- Increase parallelism in ops_test by @copybara-service in #233
- Added MatMul_4x4_Batch which is MatMul_4x4, but with the first template arg moved to the first function arg, so the batch size (num A rows) can be variable at run-time. by @copybara-service in #241
- Reduce duplication in Config* by inheriting no-SSM by @copybara-service in #242
- Major duplicated code reduction in test/benchmarks by @copybara-service in #240
- Implement a missing (bf16, f32) tiled MatMul kernel. by @copybara-service in #245
- Removed now redundant non-batch matmul by @copybara-service in #246
- Integrate matmul into FFW: 4.3x prefill speedup by @copybara-service in #243
- Internal change. by @copybara-service in #244
- Added bias vector addition to MatMul by @copybara-service in #247
- Refactor CompressedWeights. by @copybara-service in #248
- Fix DASSERT - TiledBatch requires at least 2 vectors. by @copybara-service in #253
- Move raw_weights into separate header, used mainly by compress_weights. by @copybara-service in #249
- Further simplification to ForEachTensor, thanks I.K. by @copybara-service in #254
- Update developer docs and mention asan/msan by @copybara-service in #255
- 1.15x 7b sfp prefill speedup: Matmul in attention by @copybara-service in #256
- Fix Py binding/run_example: use GemmaEnv by @copybara-service in #257
- Simplify Attention. by @copybara-service in #258
- Fix debug_prompt and other binaries (internal init) by @copybara-service in #259
- Move kGriffinLayers into ConfigNoSSM, set kGemmaLayers directly by @copybara-service in #260
- Split out common parts (embedder and transformer block) from Prefill() and Transformer() into separate functions. by @copybara-service in #261
- Move test placeholder to a later pos. by @copybara-service in #263
- Code cleanup by @copybara-service in #264
- Refactor kCachePosSize and kCacheLayerSize into separate functors. by @copybara-service in #262
- Fixing two typos. by @copybara-service in #265
- Fix compilation errors in clang by @ufownl in #267
- Fix KV cache size calculation error by @ufownl in #266
- Skip the last RMSNormInplaceBatched in the Prefill phase. by @copybara-service in #268
- Improve logging when running Gemma examples: fix the issue where max_tokens, max_generated_tokens and temperature were logged without any trailing space/newline. by @copybara-service in #270
- Use hwy::ThreadPool::MaxThreads() to determine the number of threads to use. by @copybara-service in #251
- Fix a clang tidy warning by @copybara-service in #271
- Remove unused BUILD dependency by @copybara-service in #272
- Refactor model type / training tables, simplify reverse mapping by @copybara-service in #273
- Introduce new Gemma 9B and 27B configs by @copybara-service in #274
- Add prompt batching to Gemma.cpp. by @copybara-service in #269
- Add config for att/final cap, skip max-subtract. Fixes #278 by @copybara-service in #279
- Declutter gemma/ directory, move binaries to evals/ and util/. by @copybara-service in #277
- Remove unused kSystemPrompt by @copybara-service in #275
- Use benchmark_helper in py bindings (adds BOS) by @copybara-service in #282
- Cleanup: add ModelInfo struct, remove gcpp:: by @copybara-service in #281
- Prep for sharding gemma.cc: split into kv_cache, tokenizer. by @copybara-service in #284
- Add sliding window attention for Gemma 2. by @copybara-service in #280
- Small cleanups. Fixes gemma_test build. by @copybara-service in #286
- 7x compile time speedup: shard gemma.cc by @copybara-service in #288
- Fix gemma_test - moved to evals/. by @copybara-service in #289
- Add Py bindings for weight compression by @copybara-service in #290
- Cleanup: move util/compress and convert_weights to compression/ by @copybara-service in #291
- Fix handling of %c and %q if eot_string. Fixes #283, thanks @ljcucc by @copybara-service in #292
- Update gemma_test with the expected entropy values for the IT models of size 2B/7B/9B/27B. by @copybara-service in #294
- Lint fix - string append, remove stale TODO by @copybara-service in #295
- Update gemma_test to also pass for the v1.1. models. by @copybara-service in #296
- Add more comments to attention computation (and some small restructuring). by @copybara-service in #298
- Fix windows build: min conflict, unused VF by @copybara-service in #299
- Refactor configurables. by @copybara-service in #297
- Remove allocation from GEMM_4x4_Tile when decoding compressed weights by implementing by @copybara-service in #303
- Simplify matmul: only 2 overloads by @copybara-service in #304
- SVE build fix: avoid capturing vectors directly. by @copybara-service in #305
- Improve readability with RepeatedAttentionWindowSizes by @copybara-service in #302
- Increase the prefill batch size to 64. by @copybara-service in #306
- Fix gemma_cpp/examples/hello_world build. by @copybara-service in #307
- Further 1.02x prefill speedup from batch 64->512 by @copybara-service in #308
- Fix examples/hello_world for real. by @copybara-service in #309
- Simplify FFW by using MatMul_4x4_Batch_Add. by @copybara-service in #311
- De-templatize Activations, add RowVectorBatch class by @copybara-service in #310
- Update gemma-27b to the correct query scaling. by @copybara-service in #312
- Add scale parameter to MatMul. by @copybara-service in #313
- Fix msan uninitialized scale by @copybara-service in #314
- Major Prefill/Generate cleanup, 1.3x Prefill speedup by @copybara-service in #315
- Cleanup: add wrapper functions and rename vars to interleaved by @copybara-service in #316
- Split up ops.h into ops/ops-inl and matmul-inl by @copybara-service in #317
- Use all CPU sockets when pinning threads to cores by @copybara-service in #319
- Fix msan uninitialized scale in optimize_test by @copybara-service in #320
- Minor polishing: adding comments, renaming variables. by @copybara-service in #321
- Fix setting scales in Py binding by @copybara-service in #322
- Add offset arg to MatMul, rename, Matmul for logits = ~1.1x decode speedup by @copybara-service in #325
- 1.05x prefill speedup: matvec -> matmul for !MHA by @copybara-service in #327
- Add Python code for converting Griffin Orbax weights. Refs #301 by @copybara-service in #329
- MatMul cleanup: Mat struct, simplify args. by @copybara-service in #330
- Fix Windows build - macro conflict with param name by @copybara-service in #331
- Extend LayersOutputFunc to take query index and auxiliary int by @copybara-service in #328
- Split matmul into matvec; add large matrix benchmark by @copybara-service in #333
- Internal change by @copybara-service in #326
- SFP speedup: 1.14x f32, 1.19x bf16 dot = 1.02x prefill by @copybara-service in #335
- 1.1x prefill speedup, revamp threading in preparation for hierarchical parallelism. by @copybara-service in #334
- Improve performance logging by @copybara-service in #336
- 1.03-1.08x decode speedup: precompute Rope theta, fuse by @copybara-service in #339
- Rename Gemma9B and Gemma27B to Gemma2_9B and Gemma2_27B. by @copybara-service in #342
- Add pin flag to disable pinning. Refs #338 by @copybara-service in #343
- 1.3x prefill, 0.95x decode: matmul replacing last matvec by @copybara-service in #345
- Fix gemma_test GeographyBatched for 2b-it and add entropy expectations for gemma2-2b-it. by @copybara-service in #346
- 0.98x prefill: refactor in prep for cache blocking. by @copybara-service in #347
- Implement `start_pos` per query for batch interface (reopen) by @ufownl in #348
- Simplify pos handling, auto-increment output arg by @copybara-service in #350
- Support directly observing activations, partially replacing LayersOutputFunc by @copybara-service in #351
- Major MatMul update, 1.9-2.3x speedup on Zen4 via bf16 mul by @copybara-service in #352
- Expose underlying model configuration: number of layers, heads, etc. by @copybara-service in #354
- VectorizedRopeAndMulBy. by @copybara-service in #355
- Fix prefill for batched queries. by @copybara-service in #353
- Vectorize Rope for qkv dim not evenly divisible by number of lanes. by @copybara-service in #356
- Fix test for 2b - update prompt by @copybara-service in #358
- Minor followup: remainder handling is a single iteration by @copybara-service in #359
- Experiment with compensated dot product. by @copybara-service in #357
- Avoid duplication of RMSNorm, support all activation/weight types by @copybara-service in #360
- Demonstrate constrained decoding in gemma_cpp's hello world example by @copybara-service in #363
- Add an additional QueryModel() overload to GemmaEnv. by @copybara-service in #362
- Internal change. Slight restructuring of gemma_test. by @copybara-service in #367
- 1.22x NUQ compress speedup, fix out of bounds access, improve numerics by @copybara-service in #366
- Fix NUQ for SVE - incorrect nibble packing by @copybara-service in #368
- Further nuq_test speedups to prevent timeout by @copybara-service in #371
- Refactor/cleanup, remove even_odd by @copybara-service in #372
- Minor cleanup/fixes: by @copybara-service in #375
- Major compression update, arbitrary-len unpack + new Dot by @copybara-service in #374
- Fix mismatch between blob_store and compress interfaces (bytes) by @copybara-service in #376
- Adds insert_float() to SbsWriter() to store a float array directly. by @copybara-service in #378
- Implement scalar version of LayerNorm by @copybara-service in #379
- Add const batch accessor to RowVectorBatch. by @copybara-service in #381
- Add entropy expectations for Griffin-2b model in gemma_test and make sure it passes. by @copybara-service in #382
- Add tests for SampleTopK that highlight existing problems and fix those: by @copybara-service in #383
- Add pairwise sum dot products for testing by @copybara-service in #386
- Fix the warnings reported by Clang by @ufownl in #380
- Cascaded summation for Softmax by @copybara-service in #388
- Fix compress-inl bf16->f32 overrun by @copybara-service in #390
- Fix topology display for platforms where it fails (Apple) by @copybara-service in #391
- Update expected entropy values for GRIFFIN_2B model. by @copybara-service in #392
- Add forward and backward error by @copybara-service in #389
- Fix prefix-LM mode assertion by @ufownl in #394
- Reduce flakiness of dot_test. by @copybara-service in #396
- 1.6x speedup of MatMulSlow using compensated Dot by @copybara-service in #397 (the compensation idea is sketched after this list)
- Add download location of Pali Gemma weights to README.md. by @copybara-service in #398
- Tiny update of the README formatting. by @copybara-service in #399
- Add double-precision dot variant by @copybara-service in #393
- Use f64 Dot and sum in softmax - faster than Cascaded by @copybara-service in #400
- 1.09x decode speedup for topk=1/temp0: fuse softmax and sample by @copybara-service in #402
- Rename one variable in SampleTopK and update TestSampleTopK. by @copybara-service in #404
- Minor fix to profiler zone and add comment by @copybara-service in #407
- Internal change. by @copybara-service in #408
- Internal change. by @copybara-service in #377
- Fix MSAN issue for multiturn. Rewind the prior EOS token. by @copybara-service in #412
- Reduce number of operations in Gelu() by one Mul. by @copybara-service in #414
- Added MatPtr/MatPtrT/MatStorageT/MatStorage as a dynamically-sized replacement for CompressedArray. by @copybara-service in #417
- Update expected ranges in dot_test. by @copybara-service in #420
- Remove unused "two-sizes" version of MulByConstAndAdd. by @copybara-service in #421
- Benchmark gemma.cpp with different length inputs. by @copybara-service in #416
- Fix PaliGemma model loading. by @copybara-service in #425
- Fix compilation error of the weights compression tool by @ufownl in #422
- Introduce QueryResult in GemmaEnv and add a shortcut for WrapAndTokenize. by @copybara-service in #419
- Eliminated TConfig. by @copybara-service in #428
- Fix PaliGemma's GenerateImageTokensT(). by @copybara-service in #430
- Use NestedPools, add NUMA infra by @copybara-service in #427
- Fix compilation errors of "compress_weights" target by @ufownl in #432
- Add overloads of `Image::ReadPPM` method by @ufownl in #426
- New blob_store_test, ensure ReadOne checks actual size against requested size by @copybara-service in #433
- Add a compilation option to disable topology by @ufownl in #435
- Serialization for class members for use with ModelConfig by @copybara-service in #436
- Warning fixes (casts) and fix Windows build for aligned_alloc by @copybara-service in #437
- Factor out addition of ViTConfig to a ModelConfig. by @copybara-service in #438
- Simpler MatMul interface, vocab types, Tristate for use_spinning by @copybara-service in #442
- Expose BlobReader::Keys() by @copybara-service in #443
- Fix Griffin model: by @copybara-service in #444
- Replace CLIF SbsWriter with pybind-based gcpp extension by @copybara-service in #445
- Added a blob_compare tool that compares two sbs files that may have the blobs in a different order by @copybara-service in #448
- Internal change. by @copybara-service in #450
- Added pybind for configs. by @copybara-service in #449
- Improved consistency of compressor API, and added a universal method with a target type arg. by @copybara-service in #452
- Add a simple benchmark for batching. by @copybara-service in #453
- Threading/infra improvements. by @copybara-service in #455
- Print cache info and update Highway version for that by @copybara-service in #456
- Internal change by @copybara-service in #457
- Add support for 448px resolution to PaliGemma and PaliGemma2. by @copybara-service in #459
- Tiny cleanup. by @copybara-service in #461
- Refactor `gemma/common.cc` to improve readability and safety by @ericcurtin in #460
- Internal change by @copybara-service in #462
- Fix unhandled switch warning/error by @copybara-service in #463
- Added the TensorInfo arg to the compressor so the shape and scale can be output correctly to the file in future. by @copybara-service in #454
- Make prompt wrapping more consistent and fix duplicated tokens for multi-turn. by @copybara-service in #464
- Removed duplicated tensor sizes from weights.h by changing the constructor used for MatPtrT by @copybara-service in #465
- Rename ModelTraining to PromptWrapping which is a more accurate name. by @copybara-service in #466
- Small updates to the README file. by @copybara-service in #467
- Internal change by @copybara-service in #468
- Added ability to load/save a complete model file, including tokenizer. by @copybara-service in #469
- Moved the vit config fields to their own config struct by @copybara-service in #471
- Allow interactive use with new single-file weight format. by @copybara-service in #472
- Add the missing `migrate_weights` target for CMake by @ufownl in #473
- Tiny fix: align template parameter order with parameter order. by @copybara-service in #476
- Add parameter for base_frequency to CreateInvTimeScale(). by @copybara-service in #477
- Infra improvements (2) by @copybara-service in #474
- internal change by @copybara-service in #478
- Allow overriding num threads despite detecting topology by @copybara-service in #480
- Assorted small cleanups. by @copybara-service in #482
- Add python wrappers for configs and inference. by @copybara-service in #481
- Simplified interface class and example for Gemma.cpp usage. by @copybara-service in #483
- Base interleaved handling for 4.5-bit NUQ, specifically Enc, DecompressAndZeroPad, and Dec2. Includes tests. by @copybara-service in #484
- Allow conversion, loading and inference with NUQ. by @copybara-service in #485
- Improved blob diff: parallel, tolerance for float by @copybara-service in #489
- Remove `srcs_version` and `python_version` attributes, as they already default to `"PY3"` by @copybara-service in #487
- Windows build fixes: struct vs class, unused arg/var, avoid VLA, Deleter arg, casts by @copybara-service in #492
- Add fork/join latency benchmark by @copybara-service in #496
- Fix nuq Enc() to handle groups < kGroupSize. by @copybara-service in #497
- Using TimingInfo methods and cleaning up args to DecodeStepT by @copybara-service in #499
- Fix the link error when building `compress_weights` with Clang on macOS by @ufownl in #493
- Add conversion tool for HF safetensors to gemma.cpp for PaliGemma. by @copybara-service in #498
- Less verbose threading_test output, improve formatting. by @copybara-service in #500
- Only temporarily enable spinning in threading benchmark by @copybara-service in #503
- Implements FusedSoftmaxAndSampleTopK. by @copybara-service in #502
- Use vectorized TopK using highway VQSelect by @copybara-service in #505
- Matmul rewrite: fp64 sums, hierarchical parallelization, cache-blocking, autotuning by @copybara-service in #488
- Support bf16 output of Matmul by @copybara-service in #511
- Internal change. by @copybara-service in #514
- Internal change. by @copybara-service in #515
- Update github actions/cache version by @copybara-service in #517
- Fix PaliGemma models. by @copybara-service in #519
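Several of the numerics items above (#357, #388, #397, #400) rely on compensated summation. As a minimal sketch of that general technique only — plain scalar Neumaier compensation of the additions, using a hypothetical free function rather than gemma.cpp's vectorized Dot, and without the FMA-based compensation of the products that a full compensated dot product would also apply — it looks like this:

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical scalar sketch of a compensated (Neumaier) dot product: the
// rounding error of each addition is tracked in a separate compensation term
// and added back at the end, keeping the result close to what a wider
// accumulator would produce.
float CompensatedDotSketch(const float* a, const float* b, size_t n) {
  float sum = 0.0f;
  float comp = 0.0f;  // running compensation for lost low-order bits
  for (size_t i = 0; i < n; ++i) {
    const float prod = a[i] * b[i];
    const float t = sum + prod;
    if (std::fabs(sum) >= std::fabs(prod)) {
      comp += (sum - t) + prod;  // low-order bits of prod were lost
    } else {
      comp += (prod - t) + sum;  // low-order bits of sum were lost
    }
    sum = t;
  }
  return sum + comp;
}
```

The same idea underlies the cascaded-summation Softmax in #388; #400 instead switches that path to f64 accumulation, which the changelog notes is faster than the cascaded variant there.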
New Contributors
- @veluca93 made their first contribution in #129
- @atorero made their first contribution in #145
- @samkaufman made their first contribution in #166
- @xinpingwang made their first contribution in #169
- @zond made their first contribution in #137
- @ericcurtin made their first contribution in #460
Full Changelog: v0.1.2...v0.1.3