Releases: isaacus-dev/semchunk
v4.0.0
Added
- Added a new AI chunking mode to semchunk that leverages Isaacus enrichment models to hierarchically segment texts.
- Made it possible to chunk Isaacus Legal Graph Schema (ILGS) Documents instead of just strings.
- Added a new `tokenizer_kwargs` argument to `chunkerify()` allowing users to specify custom keyword arguments to their tokenizers and token counters. `tokenizer_kwargs` can be used to override the default behavior of treating any encountered special tokens as if they are normal text when using a `tiktoken` or `transformers` tokenizer.
- Where a `tiktoken` or `transformers` tokenizer is used, started treating special tokens as normal text instead of, in the case of `tiktoken`, raising an error and, in the case of `transformers`, treating them as special tokens.
- Added support for Python 3.14.
Changed
- Demoted asterisks in the hierarchy of splitters from sentence terminators to clause separators to better reflect their typical syntactic function.
- Dramatically improved performance when handling extremely long sequences of punctuation characters.
- All arguments to `chunkerify()` except for the first two, `tokenizer_or_token_counter` and `chunk_size`, are now keyword-only arguments.
- All arguments to `chunk()` except for the first three, `text`, `chunk_size`, and `token_counter`, are now keyword-only arguments.
- Significantly improved performance in cases where `merge_splits()` was the biggest bottleneck by switching from joining splits with splitters to indexing into the original text.
- Slightly sped up `merge_splits()` by switching to the standard library's `bisect_left()` function, which is faster than the previous implementation.
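The `bisect_left()` strategy mentioned above can be sketched as follows: precompute cumulative token counts over a run of splits, then binary-search for the largest prefix that fits in the chunk size. The function name `merge_prefix` and the whitespace token counting are illustrative, not semchunk's actual internals.

```python
from bisect import bisect_left
from itertools import accumulate

def merge_prefix(splits, chunk_size):
    # Cumulative token counts: cumulative[i] is the total token count of
    # splits[: i + 1], using whitespace tokenization for illustration.
    counts = [len(s.split()) for s in splits]
    cumulative = list(accumulate(counts))
    # Index of the first prefix that exceeds chunk_size; every earlier
    # prefix fits. bisect_left is O(log n) versus a linear scan.
    fit = bisect_left(cumulative, chunk_size + 1)
    return " ".join(splits[:fit]), splits[fit:]

merged, rest = merge_prefix(["a b", "c d e", "f"], 5)
print(merged)  # "a b c d e" — exactly 5 tokens fit
print(rest)    # ["f"]
```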
Removed
- Dropped support for Python 3.9.
v3.2.5
Changed
- Switched to more accurate monthly download counts from pypistats.org rather than the less accurate counts from pepy.tech.
v3.2.4
Fixed
- Fixed splitters being sorted lexicographically rather than by length, which should improve the meaningfulness of chunks.
v3.2.3
Fixed
- Fixed broken Python download count shield (crflynn/pypistats.org#82).
v3.2.2
v3.2.1
Fixed
- Fixed minor typos in the README and docstrings.
v3.2.0
Changed
- Significantly improved the quality of chunks produced when chunking with low chunk sizes or documents with minimally varying levels of whitespace by adding a new rule to the `semchunk` algorithm that prioritizes splitting at single whitespace characters preceded by hierarchically meaningful non-whitespace characters over splitting at single whitespace characters in general (#17).
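The rule above can be illustrated with a toy splitter. This is a hedged sketch only: the set of "hierarchically meaningful" characters semchunk actually uses is not shown here, and sentence-ending punctuation merely stands in for it.

```python
# Illustrative sketch of the v3.2.0 rule: prefer splitting at a single
# space that follows a meaningful character (here, sentence-ending
# punctuation) before falling back to any single space at all.
MEANINGFUL = ".?!"

def best_split(text):
    for i, ch in enumerate(text):
        if ch == " " and i > 0 and text[i - 1] in MEANINGFUL:
            return text[:i], text[i + 1:]
    # Fall back to the first single space anywhere in the text.
    i = text.find(" ")
    return (text[:i], text[i + 1:]) if i != -1 else (text, "")

# Splits after "there." rather than at the earlier space after "Hello".
print(best_split("Hello there. General Kenobi"))
```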
v3.1.3
v3.1.2
Changed
- Changed the test model from `isaacus/emubert` to `isaacus/kanon-tokenizer`.
Full Changelog: v3.1.1...v3.1.2
v3.1.1
Added
- Added a note to the quickstart section of the README advising users to deduct the number of special tokens automatically added by their tokenizer from their chunk size. This note had been removed in version 3.0.0 but has been readded as it is unlikely to be obvious to users.
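The advice in that note amounts to a small piece of arithmetic, sketched below with illustrative numbers (the context window and the count of two special tokens are assumptions, typical of a BERT-style tokenizer, not values from semchunk).

```python
# If a tokenizer automatically adds special tokens (e.g. a BERT-style
# tokenizer adds [CLS] and [SEP], i.e. 2 tokens), deduct them from the
# chunk size so the tokenized chunks still fit the model's context window.
model_context_window = 512
special_tokens_added = 2  # assumed: [CLS] + [SEP]

chunk_size = model_context_window - special_tokens_added
print(chunk_size)  # 510
```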