Releases · finalfusion/finalfusion-rust
0.18.0
0.17.2
- Add `WriteEmbeddings::write_embeddings_len` (sketched below). This method returns the serialized length of embeddings in finalfusion format, without performing any serialization.
- Add `WriteChunk::chunk_len`. This method returns the serialized length of a finalfusion chunk, without performing any serialization.
- Switch the license to Apache License 2.0 or MIT.
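A minimal sketch of using the new length method to preallocate an in-memory buffer before serializing. The file name is a placeholder, and the `0` offset argument to `write_embeddings_len` is an assumption, not a documented signature:

```rust
use std::fs::File;
use std::io::{BufReader, Cursor};

use finalfusion::io::WriteEmbeddings;
use finalfusion::prelude::*;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut reader = BufReader::new(File::open("embeddings.fifu")?);
    let embeddings = Embeddings::<VocabWrap, StorageWrap>::read_embeddings(&mut reader)?;

    // Query the serialized size without serializing anything.
    let len = embeddings.write_embeddings_len(0);

    // Preallocate exactly the right amount of memory, then serialize.
    let mut buf = Vec::with_capacity(len as usize);
    embeddings.write_embeddings(&mut Cursor::new(&mut buf))?;
    assert_eq!(buf.len(), len as usize);

    Ok(())
}
```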
Add support for Floret embeddings
- Add support for reading, writing, and using Floret embeddings.
- Add a finalfusion chunk type for Floret-like vocabularies.
- Add support for batched embedding lookups (`embedding_batch` and `embedding_batch_into`); see the sketch after this list.
- Improve error handling:
  - Mark wrapped errors using `#[source]` to get better chains of error messages.
  - Split `Error::Io` into `Error::Read` and `Error::Write`.
  - Rename some `Error` variants.
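A sketch of a batched lookup. The assumption here is that `embedding_batch` returns the embedding matrix together with a per-word mask of which words could be embedded; the exact return type may differ:

```rust
use std::fs::File;
use std::io::BufReader;

use finalfusion::prelude::*;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut reader = BufReader::new(File::open("embeddings.fifu")?);
    let embeddings = Embeddings::<VocabWrap, StorageWrap>::read_embeddings(&mut reader)?;

    // Look up a whole batch of words in one call.
    let words = vec!["berlin", "amsterdam", "oslo"];
    let (matrix, found) = embeddings.embedding_batch(&words);

    println!("batch shape: {:?}", matrix.shape());
    for (word, found) in words.iter().zip(found) {
        println!("{}: embedded = {}", word, found);
    }

    Ok(())
}
```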
Subword vocabulary conversion
- Add conversion from bucketed subword to explicit subword embeddings.
- Hide the `WordSimilarityResult` fields. Use the `cosine_similarity` and `word` methods instead (example below).
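A minimal similarity query using the accessor methods. Reading into `StorageViewWrap` (rather than `StorageWrap`) is an assumption here, since similarity queries need a viewable storage:

```rust
use std::fs::File;
use std::io::BufReader;

use finalfusion::prelude::*;
use finalfusion::similarity::WordSimilarity;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut reader = BufReader::new(File::open("embeddings.fifu")?);
    let embeddings =
        Embeddings::<VocabWrap, StorageViewWrap>::read_embeddings(&mut reader)?;

    // The fields of WordSimilarityResult are hidden; use the
    // accessor methods to get the word and its similarity.
    if let Some(results) = embeddings.word_similarity("berlin", 10) {
        for similar in results {
            println!("{}\t{}", similar.word(), similar.cosine_similarity());
        }
    }

    Ok(())
}
```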
Faster lookup of OPQ-quantized embeddings
- Make lookups of unknown words in OPQ-quantized embedding matrices 2.6× faster, resulting in roughly 1.6× faster lookups overall.
- Add the `Reconstruct` trait as a counterpart to `Quantize`. This trait can be used to reconstruct quantized embedding matrices and is much faster than reconstructing individual embeddings (sketched below).
- Add more I/O checks to ensure that the embedding matrix can actually be represented in the native `usize`.
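A sketch of reconstructing a full matrix from quantized storage. The `Reconstruct` import path, the `reconstruct` method name, and its return type are assumptions based on the release notes:

```rust
use std::fs::File;
use std::io::BufReader;

use finalfusion::prelude::*;
use finalfusion::storage::{QuantizedArray, Reconstruct};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut reader = BufReader::new(File::open("quantized.fifu")?);
    let embeddings =
        Embeddings::<VocabWrap, QuantizedArray>::read_embeddings(&mut reader)?;

    // Reconstruct the whole embedding matrix in one pass; this is
    // much faster than reconstructing one embedding at a time.
    let matrix = embeddings.storage().reconstruct();
    println!("reconstructed shape: {:?}", matrix.shape());

    Ok(())
}
```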
Improved error handling
This release modernizes and improves error handling:
- Merge the `Error` and `ErrorKind` enums.
- Move the `Error` enum to the `error` module.
- Derive trait implementations using the `thiserror` crate (see the sketch below).
- Make the `Error` enum non-exhaustive.
- Replace the `ChunkIdentifier::try_from` method by an implementation of the `TryFrom` trait.
This release also feature-gates the `memmap` dependency (the `memmap` feature is enabled by default).
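Since the `thiserror` derive wires up `Display` and `Error::source`, a failure's full cause chain can be recovered with the standard library alone; a small sketch:

```rust
use std::error::Error as StdError;
use std::fs::File;
use std::io::BufReader;

use finalfusion::prelude::*;

fn main() {
    let mut reader = BufReader::new(File::open("embeddings.fifu").unwrap());
    if let Err(err) = Embeddings::<VocabWrap, StorageWrap>::read_embeddings(&mut reader) {
        // Walk the chain of causes, from the finalfusion error down
        // to the underlying I/O error.
        let mut source: Option<&(dyn StdError + 'static)> = Some(&err);
        while let Some(cause) = source {
            eprintln!("{}", cause);
            source = cause.source();
        }
    }
}
```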
Explicit n-gram vocabularies and first API-stable release
- Add `ExplicitVocab`, a subword vocabulary that stores n-grams explicitly.
- Add the `Embedding::into` method. This method realizes an embedding into a user-provided array.
- Support big-endian architectures.
- Add the `WordIndex::word` and `WordIndex::subword` methods. These will return an `Option` with the word index or subword indices, as applicable (sketched below).
- Expose the quantizer in `(Mmap)QuantizedArray` through the `quantizer` method.
- Add benchmarks for array and quantized embeddings.
- Split `WordSimilarity` into `WordSimilarity` and `WordSimilarityBy`; `EmbeddingSimilarity` into `EmbeddingSimilarity` and `EmbeddingSimilarityBy`.
- Rename `FinalfusionSubwordVocab` to `BucketSubwordVocab`.
- Expose fewer types through the prelude.
- Hide the `chunks` module. E.g. `chunks::storage` becomes `storage`.
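A sketch of inspecting a lookup with the new `WordIndex` accessors. It assumes both accessors borrow the index and that subword indices print with `Debug`; treat the exact types as assumptions:

```rust
use std::fs::File;
use std::io::BufReader;

use finalfusion::prelude::*;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut reader = BufReader::new(File::open("embeddings.fifu")?);
    let embeddings = Embeddings::<VocabWrap, StorageWrap>::read_embeddings(&mut reader)?;

    // An in-vocabulary word yields a word index; an out-of-vocabulary
    // word in a subword vocabulary yields subword indices.
    match embeddings.vocab().idx("unconditionally") {
        Some(index) => {
            if let Some(word) = index.word() {
                println!("word index: {}", word);
            } else if let Some(subwords) = index.subword() {
                println!("subword indices: {:?}", subwords);
            }
        }
        None => println!("cannot be represented"),
    }

    Ok(())
}
```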
Reductive 0.3
This is a small update that bumps the reductive dependency to 0.3, which contains a crucial bug fix for training product quantizers in multiple attempts. However, reductive 0.3 also requires rand 0.7, resulting in a changed API. Therefore, we have to bump the leading version number from 0.9 to 0.10.
Memory-mapped quantized arrays
- Add the `MmapQuantizedArray` storage type (sketched below).
- Rename `Vocab::len` to `Vocab::words_len`.
- Add `Vocab::vocab_len` to get the vocabulary size including subword indices.
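A sketch of memory-mapping embeddings instead of reading them into memory, via the `MmapEmbeddings` reading trait; the file name is a placeholder:

```rust
use std::fs::File;
use std::io::BufReader;

use finalfusion::io::MmapEmbeddings;
use finalfusion::prelude::*;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Map the storage chunk into memory instead of loading it; for a
    // quantized file, this yields memory-mapped quantized storage.
    let mut reader = BufReader::new(File::open("quantized.fifu")?);
    let embeddings = Embeddings::<VocabWrap, StorageWrap>::mmap_embeddings(&mut reader)?;

    if let Some(embedding) = embeddings.embedding("berlin") {
        println!("dims: {}", embedding.len());
    }

    Ok(())
}
```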
Token robustness
- Improve reading of embeddings that contain Unicode whitespace in tokens.
- Add lossy variants of the text/word2vec/fasttext reading methods. The lossy variants read tokens with invalid UTF-8 byte sequences (see the sketch below).
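A sketch of a lossy read for word2vec binary embeddings. The trait path, type parameters, and method signature here are assumptions patterned on the non-lossy readers:

```rust
use std::fs::File;
use std::io::BufReader;

use finalfusion::compat::word2vec::ReadWord2Vec;
use finalfusion::prelude::*;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut reader = BufReader::new(File::open("vectors.bin")?);

    // The lossy reader accepts tokens that contain invalid UTF-8
    // byte sequences instead of failing the whole read.
    let embeddings: Embeddings<SimpleVocab, NdArray> =
        Embeddings::read_word2vec_binary_lossy(&mut reader)?;

    println!("words read: {}", embeddings.vocab().words_len());

    Ok(())
}
```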