v0.1.3
- Support for PaliGemma 2 and Gemma 3.
- Major update to MatMul and MatMul-using operations; significant performance increases in multiple parts of the codebase (the register-tiling idea behind these kernels is sketched after this list).
- Codebase simplifications and refactors in many areas.
- Bugfixes.
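Most of the MatMul speedups below come from register-tiled kernels (e.g. the unrolled/tiled 4x4 kernels in #199 and #241). As a rough illustration of the tiling idea only — a minimal scalar sketch with hypothetical names, not gemma.cpp's Highway-vectorized `MatMul_4x4` — each 4x4 output tile is accumulated in local variables so every loaded value of A and B is reused four times:

```cpp
#include <cstddef>

// Hypothetical scalar sketch of 4x4 register tiling: each (4-row, 4-col)
// output tile is accumulated in 16 local sums, so the A/B values loaded per
// k-step are each reused four times. Assumes row-major storage and
// dimensions that are multiples of 4.
void MatMul4x4Sketch(const float* A, const float* B, float* C,
                     size_t M, size_t K, size_t N) {
  for (size_t i = 0; i < M; i += 4) {
    for (size_t j = 0; j < N; j += 4) {
      float acc[4][4] = {};
      for (size_t k = 0; k < K; ++k) {
        const float a0 = A[(i + 0) * K + k], a1 = A[(i + 1) * K + k];
        const float a2 = A[(i + 2) * K + k], a3 = A[(i + 3) * K + k];
        for (size_t jj = 0; jj < 4; ++jj) {
          const float b = B[k * N + j + jj];
          acc[0][jj] += a0 * b;
          acc[1][jj] += a1 * b;
          acc[2][jj] += a2 * b;
          acc[3][jj] += a3 * b;
        }
      }
      for (size_t ii = 0; ii < 4; ++ii) {
        for (size_t jj = 0; jj < 4; ++jj) {
          C[(i + ii) * N + j + jj] = acc[ii][jj];
        }
      }
    }
  }
}
```

The actual kernels additionally vectorize the inner loop, decompress SFP/NUQ weights on the fly, and (after the rewrite in #488) add cache blocking, hierarchical parallelization, and autotuning.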
What's Changed
- Add more ops: Sigmoid, (Two)MatVecAdd. Faster TwoMatVec. by @veluca93 in #129
- Improve weight handling. by @veluca93 in #130
- Remove unused includes by @copybara-service in #132
- Add a benchmark and additional tests. by @veluca93 in #131
- Adding Griffin implementation. by @pculliton in #136
- Change `NumGemmaLayers` and `NumGriffinLayers` to constants in configs by @ufownl in #139
- Mention Makefile contributed by @jart by @copybara-service in #141
- Refactor data structures to reduce memory usage by @ufownl in #142
- Added functionality of storing layers activations output. by @atorero in #145
- Further improve IO, enable multiple backends without -D. by @copybara-service in #148
- Use lambda to split function and allow stream_token to break prefill by @zeerd in #156
- Simplify prefill early-exit (originally Merge #156) by @copybara-service in #158
- Fix underflow in NUQ ClusterCost() by @copybara-service in #162
- Add error-checking for py binding, add missing include+hwasan check by @copybara-service in #163
- Simplify threading: remove the use of inner_pool. by @szabadka in #167
- Use more parallelism in the QKV projections in MQA mode. by @szabadka in #170
- Fix kv offset computation for MHA config. by @szabadka in #172
- Use more parallelism in the final output of the attention block. by @szabadka in #175
- Use more parallelism in the QKV projections of the MHA block. by @szabadka in #176
- Factor out deinterleaving of bf16 vectors for MatVecs. by @samkaufman in #166
- Use more parallelism in attention block in prefill mode. by @szabadka in #177
- work with cmake install by @xinpingwang in #169
- 2x speedup of SFP decode (1.4x overall) on AVX3_DL+. by @copybara-service in #178
- Support additional scaling by @copybara-service in #181
- Store tokens/sec in auxiliary struct TimingInfo. by @copybara-service in #183
- Add TTFT to TimingInfo by @copybara-service in #186
- Make BlobWriter::Add() accept const void* by @copybara-service in #188
- Adds Kaggle testing to CI workflow by @pculliton in #189
- Fix normalization in Softmax function. by @szabadka in #194
- Clarified README by @zond in #137
- Unrolled / tiled 4x4 MatMul by @copybara-service in #199
- Refactor GemmaImpl dispatch to use Highway 1.2's HWY_DYNAMIC_DISPATCH_T by @copybara-service in #202
- Add first version of backpropagation support. by @szabadka in #203
- Fix for GenerateZeroMat call in TestTiledMatMul by @copybara-service in #206
- Remove no longer required stats.h - use Highway version instead by @copybara-service in #208
- Simplifications: remove GemmaInterface and GemmaImpl by @copybara-service in #209
- Implement mixed mode matmul: f32 * bf16 by @copybara-service in #210
- Fix Softmax on SVE by @copybara-service in #213
- Fix fix for weight type define, refs #198 by @copybara-service in #216
- Add Adam optimizer. by @szabadka in #212
- Add support for custom sampling function to runtime config. by @szabadka in #217
- Shifting large matrix init to heap in ops_test.cc by @copybara-service in #220
- Add CPU output, error if not C++17, simplify tokenizer ctor by @copybara-service in #222
- Use CompressedWeights<TConfig> in backpropagation. by @szabadka in #224
- Update benchmark with internal init by @copybara-service in #225
- Use Loader/AppArgs to construct gemma_test model, simplify AcceptFunc by @copybara-service in #227
- Implement float * SfpStream matmul by decompressing 4 * kColsA_RowsB-sized chunks of the second matrix. by @copybara-service in #231
- Add benchmark dependency to cmake build. by @szabadka in #234
- Fix numerical issue in Softcap by subtracting max. by @copybara-service in #236
- Extends Transformer() to prepare for batched processing. by @copybara-service in #238
- Tiny cleanup: distinguish between "ids" and "pieces" in argument names when encoding. by @copybara-service in #239
- Support mixed (bf16, sfp) tiled MatMul. Same sfp-decompress strategy as in (f32, sfp). by @copybara-service in #237
- Increase parallelism in ops_test by @copybara-service in #233
- Added MatMul_4x4_Batch which is MatMul_4x4, but with the first template arg moved to the first function arg, so the batch size (num A rows) can be variable at run-time. by @copybara-service in #241
- Reduce duplication in Config* by inheriting no-SSM by @copybara-service in #242
- Major duplicated code reduction in test/benchmarks by @copybara-service in #240
- Implement a missing (bf16, f32) tiled MatMul kernel. by @copybara-service in #245
- Removed now redundant non-batch matmul by @copybara-service in #246
- Integrate matmul into FFW: 4.3x prefill speedup by @copybara-service in #243
- Internal change. by @copybara-service in #244
- Added bias vector addition to MatMul by @copybara-service in #247
- Refactor CompressedWeights. by @copybara-service in #248
- Fix DASSERT - TiledBatch requires at least 2 vectors. by @copybara-service in #253
- Move raw_weights into separate header, used mainly by compress_weights. by @copybara-service in #249
- Further simplification to ForEachTensor, thanks I.K. by @copybara-service in #254
- Update developer docs and mention asan/msan by @copybara-service in #255
- 1.15x 7b sfp prefill speedup: Matmul in attention by @copybara-service in #256
- Fix Py binding/run_example: use GemmaEnv by @copybara-service in #257
- Simplify Attention. by @copybara-service in #258
- Fix debug_prompt and other binaries (internal init) by @copybara-service in #259
- Move kGriffinLayers into ConfigNoSSM, set kGemmaLayers directly by @copybara-service in #260
- Split out common parts (embedder and transformer block) from Prefill() and Transformer() into separate functions. by @copybara-service in #261
- Move test placeholder to a later pos. by @copybara-service in #263
- Code cleanup by @copybara-service in #264
- Refactor kCachePosSize and kCacheLayerSize into separate functors. by @copybara-service in #262
- Fixing two typos. by @copybara-service in #265
- Fix compilation errors in clang by @ufownl in #267
- Fix KV cache size calculation error by @ufownl in #266
- Skip the last RMSNormInplaceBatched in the Prefill phase. by @copybara-service in #268
- Improve logging when running Gemma examples: fix the issue where max_tokens, max_generated_tokens and temperature were logged without any trailing space/newline. by @copybara-service in #270
- Use hwy::ThreadPool::MaxThreads() to determine the number of threads to use. by @copybara-service in #251
- Fix a clang tidy warning by @copybara-service in #271
- Remove unused BUILD dependency by @copybara-service in #272
- Refactor model type / training tables, simplify reverse mapping by @copybara-service in #273
- Introduce new Gemma 9B and 27B configs by @copybara-service in #274
- Add prompt batching to Gemma.cpp. by @copybara-service in #269
- Add config for att/final cap, skip max-subtract. Fixes #278 by @copybara-service in #279
- Declutter gemma/ directory, move binaries to evals/ and util/. by @copybara-service in #277
- Remove unused kSystemPrompt by @copybara-service in #275
- Use benchmark_helper in py bindings (adds BOS) by @copybara-service in #282
- Cleanup: add ModelInfo struct, remove gcpp:: by @copybara-service in #281
- Prep for sharding gemma.cc: split into kv_cache, tokenizer. by @copybara-service in #284
- Add sliding window attention for Gemma 2. by @copybara-service in #280
- Small cleanups. Fixes gemma_test build. by @copybara-service in #286
- 7x compile time speedup: shard gemma.cc by @copybara-service in #288
- Fix gemma_test - moved to evals/. by @copybara-service in #289
- Add Py bindings for weight compression by @copybara-service in #290
- Cleanup: move util/compress and convert_weights to compression/ by @copybara-service in #291
- Fix handling of %c and %q if eot_string. Fixes #283, thanks @ljcucc by @copybara-service in #292
- Update gemma_test with the expected entropy values for the IT models of size 2B/7B/9B/27B. by @copybara-service in #294
- Lint fix - string append, remove stale TODO by @copybara-service in #295
- Update gemma_test to also pass for the v1.1. models. by @copybara-service in #296
- Add more comments to attention computation (and some small restructuring). by @copybara-service in #298
- Fix windows build: min conflict, unused VF by @copybara-service in #299
- Refactor configurables. by @copybara-service in #297
- Remove allocation from GEMM_4x4_Tile when decoding compressed weights by implementing by @copybara-service in #303
- Simplify matmul: only 2 overloads by @copybara-service in #304
- SVE build fix: avoid capturing vectors directly. by @copybara-service in #305
- Improve readability with RepeatedAttentionWindowSizes by @copybara-service in #302
- Increase the prefill batch size to 64. by @copybara-service in #306
- Fix gemma_cpp/examples/hello_world build. by @copybara-service in #307
- Further 1.02x prefill speedup from batch 64->512 by @copybara-service in #308
- Fix examples/hello_world for real. by @copybara-service in #309
- Simplify FFW by using MatMul_4x4_Batch_Add. by @copybara-service in #311
- De-templatize Activations, add RowVectorBatch class by @copybara-service in #310
- Update gemma-27b to the correct query scaling. by @copybara-service in #312
- Add scale parameter to MatMul. by @copybara-service in #313
- Fix msan uninitialized scale by @copybara-service in #314
- Major Prefill/Generate cleanup, 1.3x Prefill speedup by @copybara-service in #315
- Cleanup: add wrapper functions and rename vars to interleaved by @copybara-service in #316
- Split up ops.h into ops/ops-inl and matmul-inl by @copybara-service in #317
- Use all CPU sockets when pinning threads to cores by @copybara-service in #319
- Fix msan uninitialized scale in optimize_test by @copybara-service in #320
- Minor polishing: adding comments, renaming variables. by @copybara-service in #321
- Fix setting scales in Py binding by @copybara-service in #322
- Add offset arg to MatMul, rename, Matmul for logits = ~1.1x decode speedup by @copybara-service in #325
- 1.05x prefill speedup: matvec -> matmul for !MHA by @copybara-service in #327
- Add Python code for converting Griffin Orbax weights. Refs #301 by @copybara-service in #329
- MatMul cleanup: Mat struct, simplify args. by @copybara-service in #330
- Fix Windows build - macro conflict with param name by @copybara-service in #331
- Extend LayersOutputFunc to take query index and auxiliary int by @copybara-service in #328
- Split matmul into matvec; add large matrix benchmark by @copybara-service in #333
- Internal change by @copybara-service in #326
- SFP speedup: 1.14x f32, 1.19x bf16 dot = 1.02x prefill by @copybara-service in #335
- 1.1x prefill speedup, revamp threading in preparation for hierarchical parallelism. by @copybara-service in #334
- Improve performance logging by @copybara-service in #336
- 1.03-1.08x decode speedup: precompute Rope theta, fuse by @copybara-service in #339
- Rename Gemma9B and Gemma27B to Gemma2_9B and Gemma2_27B. by @copybara-service in #342
- Add pin flag to disable pinning. Refs #338 by @copybara-service in #343
- 1.3x prefill, 0.95x decode: matmul replacing last matvec by @copybara-service in #345
- Fix gemma_test GeographyBatched for 2b-it and add entropy expectations for gemma2-2b-it. by @copybara-service in #346
- 0.98x prefill: refactor in prep for cache blocking. by @copybara-service in #347
- Implement `start_pos` per query for batch interface (reopen) by @ufownl in #348
- Simplify pos handling, auto-increment output arg by @copybara-service in #350
- Support directly observing activations, partially replacing LayersOutputFunc by @copybara-service in #351
- Major MatMul update, 1.9-2.3x speedup on Zen4 via bf16 mul by @copybara-service in #352
- Expose underlying model configuration: number of layers, heads, etc. by @copybara-service in #354
- VectorizedRopeAndMulBy. by @copybara-service in #355
- Fix prefill for batched queries. by @copybara-service in #353
- Vectorize Rope for qkv dim not evenly divisible by number of lanes. by @copybara-service in #356
- Fix test for 2b - update prompt by @copybara-service in #358
- Minor followup: remainder handling is a single iteration by @copybara-service in #359
- Experiment with compensated dot product. by @copybara-service in #357
- Avoid duplication of RMSNorm, support all activation/weight types by @copybara-service in #360
- Demonstrate constrained decoding in gemma_cpp's hello world example by @copybara-service in #363
- Add an additional QueryModel() overload to GemmaEnv. by @copybara-service in #362
- Internal change. Slight restructuring of gemma_test. by @copybara-service in #367
- 1.22x NUQ compress speedup, fix out of bounds access, improve numerics by @copybara-service in #366
- Fix NUQ for SVE - incorrect nibble packing by @copybara-service in #368
- Further nuq_test speedups to prevent timeout by @copybara-service in #371
- Refactor/cleanup, remove even_odd by @copybara-service in #372
- Minor cleanup/fixes: by @copybara-service in #375
- Major compression update, arbitrary-len unpack + new Dot by @copybara-service in #374
- Fix mismatch between blob_store and compress interfaces (bytes) by @copybara-service in #376
- Adds insert_float() to SbsWriter() to store a float array directly. by @copybara-service in #378
- Implement scalar version of LayerNorm by @copybara-service in #379
- Add const batch accessor to RowVectorBatch. by @copybara-service in #381
- Add entropy expectations for Griffin-2b model in gemma_test and make sure it passes. by @copybara-service in #382
- Add tests for SampleTopK that highlight existing problems and fix those: by @copybara-service in #383
- Add pairwise sum dot products for testing by @copybara-service in #386
- Fix the warnings reported by Clang by @ufownl in #380
- Cascaded summation for Softmax by @copybara-service in #388
- Fix compress-inl bf16->f32 overrun by @copybara-service in #390
- Fix topology display for platforms where it fails (Apple) by @copybara-service in #391
- Update expected entropy values for GRIFFIN_2B model. by @copybara-service in #392
- Add forward and backward error by @copybara-service in #389
- Fix prefix-LM mode assertion by @ufownl in #394
- Reduce flakiness of dot_test. by @copybara-service in #396
- 1.6x speedup of MatMulSlow using compensated Dot by @copybara-service in #397 (the compensation idea is sketched after this list)
- Add download location of Pali Gemma weights to README.md. by @copybara-service in #398
- Tiny update of the README formatting. by @copybara-service in #399
- Add double-precision dot variant by @copybara-service in #393
- Use f64 Dot and sum in softmax - faster than Cascaded by @copybara-service in #400
- 1.09x decode speedup for topk=1/temp0: fuse softmax and sample by @copybara-service in #402
- Rename one variable in SampleTopK and update TestSampleTopK. by @copybara-service in #404
- Minor fix to profiler zone and add comment by @copybara-service in #407
- Internal change. by @copybara-service in #408
- Internal change. by @copybara-service in #377
- Fix MSAN issue for multiturn. Rewind the prior EOS token. by @copybara-service in #412
- Reduce number of operations in Gelu() by one Mul. by @copybara-service in #414
- Added MatPtr/MatPtrT/MatStorageT/MatStorage as a dynamically-sized replacement for CompressedArray. by @copybara-service in #417
- Update expected ranges in dot_test. by @copybara-service in #420
- Remove unused "two-sizes" version of MulByConstAndAdd. by @copybara-service in #421
- Benchmark gemma.cpp with different length inputs. by @copybara-service in #416
- Fix PaliGemma model loading. by @copybara-service in #425
- Fix compilation error of the weights compression tool by @ufownl in #422
- Introduce QueryResult in GemmaEnv and add a shortcut for WrapAndTokenize. by @copybara-service in #419
- Eliminated TConfig. by @copybara-service in #428
- Fix PaliGemma's GenerateImageTokensT(). by @copybara-service in #430
- Use NestedPools, add NUMA infra by @copybara-service in #427
- Fix compilation errors of "compress_weights" target by @ufownl in #432
- Add overloads of `Image::ReadPPM` method by @ufownl in #426
- New blob_store_test, ensure ReadOne checks actual size against requested size by @copybara-service in #433
- Add a compilation option to disable topology by @ufownl in #435
- Serialization for class members for use with ModelConfig by @copybara-service in #436
- Warning fixes (casts) and fix Windows build for aligned_alloc by @copybara-service in #437
- Factor out addition of ViTConfig to a ModelConfig. by @copybara-service in #438
- Simpler MatMul interface, vocab types, Tristate for use_spinning by @copybara-service in #442
- Expose BlobReader::Keys() by @copybara-service in #443
- Fix Griffin model: by @copybara-service in #444
- Replace CLIF SbsWriter with pybind-based gcpp extension by @copybara-service in #445
- Added a blob_compare tool that compares two sbs files that may have the blobs in a different order by @copybara-service in #448
- Internal change. by @copybara-service in #450
- Added pybind for configs. by @copybara-service in #449
- Improved consistency of compressor API, and added a universal method with a target type arg. by @copybara-service in #452
- Add a simple benchmark for batching. by @copybara-service in #453
- Threading/infra improvements. by @copybara-service in #455
- Print cache info and update Highway version for that by @copybara-service in #456
- Internal change by @copybara-service in #457
- Add support for 448px resolution to PaliGemma and PaliGemma2. by @copybara-service in #459
- Tiny cleanup. by @copybara-service in #461
- Refactor `gemma/common.cc` to improve readability and safety by @ericcurtin in #460
- Internal change by @copybara-service in #462
- Fix unhandled switch warning/error by @copybara-service in #463
- Added the TensorInfo arg to the compressor so the shape and scale can be output correctly to the file in future. by @copybara-service in #454
- Make prompt wrapping more consistent and fix duplicated tokens for multi-turn. by @copybara-service in #464
- Removed duplicated tensor sizes from weights.h by changing the constructor used for MatPtrT by @copybara-service in #465
- Rename ModelTraining to PromptWrapping which is a more accurate name. by @copybara-service in #466
- Small updates to the README file. by @copybara-service in #467
- Internal change by @copybara-service in #468
- Added ability to load/save a complete model file, including tokenizer. by @copybara-service in #469
- Moved the vit config fields to their own config struct by @copybara-service in #471
- Allow interactive use with new single-file weight format. by @copybara-service in #472
- Add the missing `migrate_weights` target for CMake by @ufownl in #473
- Tiny fix: align template parameter order with parameter order. by @copybara-service in #476
- Add parameter for base_frequency to CreateInvTimeScale(). by @copybara-service in #477
- Infra improvements (2) by @copybara-service in #474
- internal change by @copybara-service in #478
- Allow overriding num threads despite detecting topology by @copybara-service in #480
- Assorted small cleanups. by @copybara-service in #482
- Add python wrappers for configs and inference. by @copybara-service in #481
- Simplified interface class and example for Gemma.cpp usage. by @copybara-service in #483
- Base interleaved handling for 4.5-bit NUQ, specifically Enc, DecompressAndZeroPad, and Dec2. Includes tests. by @copybara-service in #484
- Allow conversion, loading and inference with NUQ. by @copybara-service in #485
- Improved blob diff: parallel, tolerance for float by @copybara-service in #489
- Remove `srcs_version` and `python_version` attributes, as they already default to `"PY3"` by @copybara-service in #487
- Windows build fixes: struct vs class, unused arg/var, avoid VLA, Deleter arg, casts by @copybara-service in #492
- Add fork/join latency benchmark by @copybara-service in #496
- Fix nuq Enc() to handle groups < kGroupSize. by @copybara-service in #497
- Using TimingInfo methods and cleaning up args to DecodeStepT by @copybara-service in #499
- Fix the link error when building `compress_weights` with Clang on macOS by @ufownl in #493
- Add conversion tool for HF safetensors to gemma.cpp for PaliGemma. by @copybara-service in #498
- Less verbose threading_test output, improve formatting. by @copybara-service in #500
- Only temporarily enable spinning in threading benchmark by @copybara-service in #503
- Implements FusedSoftmaxAndSampleTopK. by @copybara-service in #502
- Use vectorized TopK using highway VQSelect by @copybara-service in #505
- Matmul rewrite: fp64 sums, hierarchical parallelization, cache-blocking, autotuning by @copybara-service in #488
- Support bf16 output of Matmul by @copybara-service in #511
- Internal change. by @copybara-service in #514
- Internal change. by @copybara-service in #515
- Update github actions/cache version by @copybara-service in #517
- Fix PaliGemma models. by @copybara-service in #519
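Several of the numerics items above (#357, #388, #397, #400) rely on compensated summation. As a minimal sketch of that general technique only — plain scalar Neumaier compensation of the additions, using a hypothetical free function rather than gemma.cpp's vectorized Dot, and without the FMA-based compensation of the products that a full compensated dot product would also apply — it looks like this:

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical scalar sketch of a compensated (Neumaier) dot product: the
// rounding error of each addition is tracked in a separate compensation term
// and added back at the end, keeping the result close to what a wider
// accumulator would produce.
float CompensatedDotSketch(const float* a, const float* b, size_t n) {
  float sum = 0.0f;
  float comp = 0.0f;  // running compensation for lost low-order bits
  for (size_t i = 0; i < n; ++i) {
    const float prod = a[i] * b[i];
    const float t = sum + prod;
    if (std::fabs(sum) >= std::fabs(prod)) {
      comp += (sum - t) + prod;  // low-order bits of prod were lost
    } else {
      comp += (prod - t) + sum;  // low-order bits of sum were lost
    }
    sum = t;
  }
  return sum + comp;
}
```

The same idea underlies the cascaded-summation Softmax in #388; #400 instead switches that path to f64 accumulation, which the changelog notes is faster than the cascaded variant there.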
New Contributors
- @veluca93 made their first contribution in #129
- @atorero made their first contribution in #145
- @samkaufman made their first contribution in #166
- @xinpingwang made their first contribution in #169
- @zond made their first contribution in #137
- @ericcurtin made their first contribution in #460
Full Changelog: v0.1.2...v0.1.3