semchunk is a fast and lightweight Ruby gem for splitting text into semantically meaningful chunks.
This is a Ruby port of the Python semchunk library by Isaacus, maintaining the same efficient chunking algorithm and API design.
semchunk produces chunks that are more semantically meaningful than regular token and recursive character chunkers, while being fast and easy to use.
- Semantic chunking: Splits text at natural boundaries (sentences, paragraphs, etc.) rather than at arbitrary character positions
- Token-aware: Respects token limits from any tokenizer you provide
- Overlap support: Create overlapping chunks for better context preservation
- Offset tracking: Get the original positions of each chunk in the source text
- Flexible: Works with any token counter (word count, character count, or tokenizers)
- Memoization: Optional caching of token counts for improved performance
Add this line to your application's Gemfile:
```ruby
gem 'semchunk'
```

Or install it directly:
```sh
gem install semchunk
```

```ruby
require "semchunk"
# Define a simple token counter (or use a real tokenizer)
token_counter = ->(text) { text.split.length }
# Chunk some text
text = "This is the first sentence. This is the second sentence. And this is the third sentence."
chunks = Semchunk.chunk(text, chunk_size: 6, token_counter: token_counter)
puts chunks.inspect
# => ["This is the first sentence.", "This is the second sentence.", "And this is the third sentence."]
```

`Semchunk.chunk` splits a text into semantically meaningful chunks:

```ruby
Semchunk.chunk(
text,
chunk_size:,
token_counter:,
memoize: true,
offsets: false,
overlap: nil,
cache_maxsize: nil
)
```

Parameters:
- `text` (String): The text to be chunked
- `chunk_size` (Integer): The maximum number of tokens a chunk may contain
- `token_counter` (Proc, Lambda, Method): A callable that takes a string and returns the number of tokens in it
- `memoize` (Boolean, optional): Whether to memoize the token counter. Defaults to `true`
- `offsets` (Boolean, optional): Whether to return the start and end offsets of each chunk. Defaults to `false`
- `overlap` (Float, Integer, nil, optional): The proportion of the chunk size (if < 1), or the number of tokens (if >= 1), by which chunks should overlap. Defaults to `nil`
- `cache_maxsize` (Integer, nil, optional): The maximum number of text-token count pairs to cache. Defaults to `nil` (unbounded)
Returns:
- `Array<String>` if `offsets: false`: List of text chunks
- `[Array<String>, Array<Array<Integer>>]` if `offsets: true`: List of chunks and their `[start, end]` offsets
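For instance, to bound the memoization cache when chunking a large corpus — a minimal sketch, where the cache size of 1,000 is an arbitrary illustration rather than a recommended value:

```ruby
require "semchunk"

token_counter = ->(text) { text.split.length }

# memoize: true is already the default; cache_maxsize caps how many
# text => token-count pairs are retained (1_000 is an arbitrary choice).
chunks = Semchunk.chunk(
  "A long document. " * 500,
  chunk_size: 10,
  token_counter: token_counter,
  memoize: true,
  cache_maxsize: 1_000
)
```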
`Semchunk.chunkerify` creates a reusable chunker object:

```ruby
Semchunk.chunkerify(
tokenizer_or_token_counter,
chunk_size: nil,
max_token_chars: nil,
memoize: true,
cache_maxsize: nil
)
```

Parameters:
- `tokenizer_or_token_counter`: A tokenizer object with an `encode` method, or a callable token counter
- `chunk_size` (Integer, nil): Maximum tokens per chunk. If `nil`, will attempt to use the tokenizer's `model_max_length`
- `max_token_chars` (Integer, nil): Maximum characters per token (optimization parameter)
- `memoize` (Boolean): Whether to cache token counts. Defaults to `true`
- `cache_maxsize` (Integer, nil): Cache size limit. Defaults to `nil` (unbounded)
Returns:
- `Semchunk::Chunker`: A chunker instance
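A sketch of the two calling styles; the tokenizer variant assumes an object like the hypothetical `MyTokenizer` shown further below, which responds to `encode` and `model_max_length`:

```ruby
require "semchunk"

# With a plain callable, chunk_size must be given explicitly.
word_counter = ->(text) { text.split.length }
chunker = Semchunk.chunkerify(word_counter, chunk_size: 10)

# With a tokenizer object (see MyTokenizer below), chunk_size may be
# omitted, in which case the tokenizer's model_max_length is used.
# chunker = Semchunk.chunkerify(MyTokenizer.new)
```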
A chunker's `call` method processes one or more texts:

```ruby
chunker.call(
text_or_texts,
processes: 1,
progress: false,
offsets: false,
overlap: nil
)
```

Parameters:
- `text_or_texts` (String, Array): Single text or array of texts to chunk
- `processes` (Integer): Number of processes for parallel chunking (not yet implemented)
- `progress` (Boolean): Show progress bar for multiple texts (not yet implemented)
- `offsets` (Boolean): Return offset information
- `overlap` (Float, Integer, nil): Overlap configuration
Returns:
- For a single text: `Array<String>` or `[Array<String>, Array<Array<Integer>>]`
- For multiple texts: `Array<Array<String>>` or `[Array<Array<String>>, Array<Array<Array<Integer>>>]`
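Per-call options compose with a prebuilt chunker. A minimal sketch combining `offsets:` and `overlap:` on a single text:

```ruby
require "semchunk"

token_counter = ->(text) { text.split.length }
chunker = Semchunk.chunkerify(token_counter, chunk_size: 6)

text = "One two three four five six seven eight nine ten."
chunks, offsets = chunker.call(text, offsets: true, overlap: 2)

chunks.zip(offsets).each do |chunk, (start_pos, end_pos)|
  puts "#{start_pos}...#{end_pos}: #{chunk}"
end
```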
A complete end-to-end example:

```ruby
require "semchunk"
text = "Natural language processing is fascinating. It allows computers to understand human language. This enables many applications."
# Use word count as token counter
token_counter = ->(text) { text.split.length }
chunks = Semchunk.chunk(text, chunk_size: 8, token_counter: token_counter)
chunks.each_with_index do |chunk, i|
puts "Chunk #{i + 1}: #{chunk}"
end
# => Chunk 1: Natural language processing is fascinating.
# => Chunk 2: It allows computers to understand human language.
# => Chunk 3: This enables many applications.
```

Track where each chunk came from in the original text:

```ruby
text = "First paragraph here. Second paragraph here. Third paragraph here."
token_counter = ->(text) { text.split.length }
chunks, offsets = Semchunk.chunk(
text,
chunk_size: 5,
token_counter: token_counter,
offsets: true
)
chunks.zip(offsets).each do |chunk, (start_pos, end_pos)|
puts "Chunk: '#{chunk}'"
puts "Position: #{start_pos}...#{end_pos}"
puts "Verification: '#{text[start_pos...end_pos]}'"
puts
end
```

Create overlapping chunks to maintain context:

```ruby
text = "One two three four five six seven eight nine ten."
token_counter = ->(text) { text.split.length }
# 50% overlap
chunks = Semchunk.chunk(
text,
chunk_size: 4,
token_counter: token_counter,
overlap: 0.5
)
puts "Overlapping chunks:"
chunks.each { |chunk| puts "- #{chunk}" }
# Fixed overlap of 2 tokens
chunks = Semchunk.chunk(
text,
chunk_size: 6,
token_counter: token_counter,
overlap: 2
)
puts "\nWith 2-token overlap:"
chunks.each { |chunk| puts "- #{chunk}" }
```

```ruby
# Create a chunker once
token_counter = ->(text) { text.split.length }
chunker = Semchunk.chunkerify(token_counter, chunk_size: 10)
# Use it multiple times
texts = [
"First document to process.",
"Second document to process.",
"Third document to process."
]
all_chunks = chunker.call(texts)
all_chunks.each_with_index do |chunks, i|
puts "Document #{i + 1} chunks: #{chunks.inspect}"
end
```

```ruby
text = "abcdefghijklmnopqrstuvwxyz"
# Character count as token counter
token_counter = ->(text) { text.length }
chunks = Semchunk.chunk(text, chunk_size: 5, token_counter: token_counter)
puts chunks.inspect
# => ["abcde", "fghij", "klmno", "pqrst", "uvwxy", "z"]
```

```ruby
# Token counter that counts punctuation as separate tokens
def custom_token_counter(text)
text.scan(/\w+|[^\w\s]/).length
end
text = "Hello, world! How are you?"
chunks = Semchunk.chunk(
text,
chunk_size: 5,
token_counter: method(:custom_token_counter)
)
puts chunks.inspect
```

If you have a tokenizer that implements an `encode` method:

```ruby
# Example with a hypothetical tokenizer
class MyTokenizer
def encode(text, add_special_tokens: true)
# Your tokenization logic here
text.split.map { |word| word.hash }
end
def model_max_length
512
end
end
tokenizer = MyTokenizer.new
# chunkerify will automatically extract the token counter
chunker = Semchunk.chunkerify(tokenizer, chunk_size: 100)
text = "Your long text here..."
chunks = chunker.call(text)
```

semchunk works by recursively splitting texts until all resulting chunks are equal to or less than a specified chunk size. In particular, it:
- Splits text using the most semantically meaningful splitter possible;
- Recursively splits the resulting chunks until a set of chunks equal to or less than the specified chunk size is produced;
- Merges any chunks that are under the chunk size back together until the chunk size is reached;
- Reattaches any non-whitespace splitters back to the ends of chunks barring the final chunk if doing so does not bring chunks over the chunk size, otherwise adds non-whitespace splitters as their own chunks; and
- Excludes chunks consisting entirely of whitespace characters.
To ensure that chunks are as semantically meaningful as possible, semchunk uses the following splitters, in order of precedence:
- The largest sequence of newlines (`\n`) and/or carriage returns (`\r`);
- The largest sequence of tabs;
- The largest sequence of whitespace characters (as defined by regex's `\s` character class) or, if the largest sequence of whitespace characters is only a single character and there exist whitespace characters preceded by any of the semantically meaningful non-whitespace characters listed below (in the same order of precedence), then only those specific whitespace characters;
- Sentence terminators (`.`, `?`, `!` and `*`);
- Clause separators (`;`, `,`, `(`, `)`, `[`, `]`, `“`, `”`, `‘`, `’`, `'`, `"` and `` ` ``);
- Sentence interrupters (`:`, `—` and `…`);
- Word joiners (`/`, `\`, `–`, `&` and `-`); and
- All other characters.
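To see this precedence in action, note how a paragraph break outranks sentence terminators; a small sketch using the word-count token counter from the examples above:

```ruby
require "semchunk"

token_counter = ->(text) { text.split.length }

# The double newline outranks the sentence terminators, so the text
# should split between the paragraphs before any sentence is broken up.
text = "First paragraph, first sentence. Second sentence.\n\nSecond paragraph."
chunks = Semchunk.chunk(text, chunk_size: 7, token_counter: token_counter)
chunks.each { |chunk| puts chunk.inspect }
```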
If overlapping chunks have been requested, semchunk also:
- Internally reduces the chunk size to `min(overlap, chunk_size - overlap)` (`overlap` being computed as `floor(chunk_size * overlap)` for relative overlaps and `min(overlap, chunk_size - 1)` for absolute overlaps); and
- Merges every `floor(original_chunk_size / reduced_chunk_size)` chunks starting from the first chunk and then jumping by `floor((original_chunk_size - overlap) / reduced_chunk_size)` chunks until the last chunk is reached.
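Worked through for the 50% overlap example from earlier (`chunk_size: 4`, `overlap: 0.5`), the bookkeeping comes out as follows — a sketch of the arithmetic, not of the library's internals:

```ruby
chunk_size = 4
overlap    = (chunk_size * 0.5).floor             # relative overlap   => 2 tokens
reduced    = [overlap, chunk_size - overlap].min  # reduced chunk size => 2

merge_n = chunk_size / reduced                    # floor(4 / 2)       => merge 2 chunks
step    = (chunk_size - overlap) / reduced        # floor((4 - 2) / 2) => jump 1 chunk

# Text is chunked at 2 tokens apiece, pairs are merged into 4-token
# chunks, and each merged chunk starts one sub-chunk later, so
# consecutive chunks share 2 tokens.
```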
The algorithm uses binary search to efficiently find the optimal split points, making it fast even for large documents.
This gem includes example scripts that demonstrate various features:
```sh
# Basic usage examples
ruby examples/basic_usage.rb
# Advanced usage with longer documents
ruby examples/advanced_usage.rb
```

You can run the included benchmark to test performance:

```sh
ruby test/bench.rb
```

The Ruby implementation maintains similar performance characteristics to the Python version:
- Efficient binary search for optimal split points
- O(n log n) complexity for chunking
- Fast token count lookups with memoization
- Low memory overhead
The benchmark tests chunking multiple texts with various chunk sizes and provides detailed performance metrics.
This Ruby port maintains feature parity with the Python version, with a few notes:
- Multiprocessing support is not yet implemented (`processes` parameter)
- Progress bar support is not yet implemented (`progress` parameter)
- String tokenizer names (like `"gpt-4"`) are not yet supported
- Otherwise, the API and behavior match the Python version
See MIGRATION.md for a detailed guide on migrating from the Python version.
If you want to report a bug, or have ideas, feedback or questions about the gem, let me know via GitHub issues and I will do my best to provide a helpful answer. Happy hacking!
The gem is available as open source under the terms of the MIT License.
Everyone interacting in this project's codebases, issue trackers, chat rooms and mailing lists is expected to follow the code of conduct.
Pull requests are welcome!