semchunk 🧩

semchunk is a fast and lightweight Ruby gem for splitting text into semantically meaningful chunks.

This is a Ruby port of the Python semchunk library by Isaacus, maintaining the same efficient chunking algorithm and API design.

semchunk produces chunks that are more semantically meaningful than those produced by regular token and recursive character chunkers, while remaining fast and easy to use.

Features

  • Semantic chunking: Splits text at natural boundaries (sentences, paragraphs, etc.) rather than at arbitrary character positions
  • Token-aware: Respects token limits from any tokenizer you provide
  • Overlap support: Create overlapping chunks for better context preservation
  • Offset tracking: Get the original positions of each chunk in the source text
  • Flexible: Works with any token counter (word count, character count, or tokenizers)
  • Memoization: Optional caching of token counts for improved performance

Installation

Add this line to your application's Gemfile:

gem 'semchunk'

Or install it directly:

gem install semchunk

Quick start

require "semchunk"

# Define a simple token counter (or use a real tokenizer)
token_counter = ->(text) { text.split.length }

# Chunk some text
text = "This is the first sentence. This is the second sentence. And this is the third sentence."
chunks = Semchunk.chunk(text, chunk_size: 5, token_counter: token_counter)

puts chunks.inspect
# => ["This is the first sentence.", "This is the second sentence.", "And this is the third sentence."]

API Reference

Semchunk.chunk

Split a text into semantically meaningful chunks.

Semchunk.chunk(
  text,
  chunk_size:,
  token_counter:,
  memoize: true,
  offsets: false,
  overlap: nil,
  cache_maxsize: nil
)

Parameters:

  • text (String): The text to be chunked
  • chunk_size (Integer): The maximum number of tokens a chunk may contain
  • token_counter (Proc, Lambda, Method): A callable that takes a string and returns the number of tokens in it
  • memoize (Boolean, optional): Whether to memoize the token counter. Defaults to true
  • offsets (Boolean, optional): Whether to return the start and end offsets of each chunk. Defaults to false
  • overlap (Float, Integer, nil, optional): The proportion of the chunk size (if < 1), or the number of tokens (if >= 1), by which chunks should overlap. Defaults to nil
  • cache_maxsize (Integer, nil, optional): The maximum number of text-token count pairs to cache. Defaults to nil (unbounded)

Returns:

  • Array<String> if offsets: false: List of text chunks
  • [Array<String>, Array<Array<Integer>>] if offsets: true: List of chunks and their [start, end] offsets
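
For example, a minimal sketch of the two return shapes, using a simple word-count counter:

require "semchunk"

token_counter = ->(text) { text.split.length }
text = "Alpha beta gamma. Delta epsilon."

# offsets: false (the default) returns only the chunks
chunks = Semchunk.chunk(text, chunk_size: 3, token_counter: token_counter)

# offsets: true also returns each chunk's [start, end] offsets
chunks, offsets = Semchunk.chunk(
  text,
  chunk_size: 3,
  token_counter: token_counter,
  offsets: true
)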

Semchunk.chunkerify

Create a reusable chunker object.

Semchunk.chunkerify(
  tokenizer_or_token_counter,
  chunk_size: nil,
  max_token_chars: nil,
  memoize: true,
  cache_maxsize: nil
)

Parameters:

  • tokenizer_or_token_counter: A tokenizer object with an encode method, or a callable token counter
  • chunk_size (Integer, nil): Maximum tokens per chunk. If nil, will attempt to use tokenizer's model_max_length
  • max_token_chars (Integer, nil): Maximum characters per token (optimization parameter)
  • memoize (Boolean): Whether to cache token counts. Defaults to true
  • cache_maxsize (Integer, nil): Cache size limit. Defaults to nil (unbounded)

Returns:

  • Semchunk::Chunker: A chunker instance
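
A sketch of the chunk_size fallback, using the hypothetical MyTokenizer class from the "Working with Real Tokenizers" example further down:

tokenizer = MyTokenizer.new

# chunk_size omitted: chunkerify falls back to tokenizer.model_max_length (512 here)
chunker = Semchunk.chunkerify(tokenizer)

# A plain callable has no model_max_length, so pass chunk_size explicitly
chunker = Semchunk.chunkerify(->(text) { text.split.length }, chunk_size: 10)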

Chunker#call

Process text(s) with the chunker.

chunker.call(
  text_or_texts,
  processes: 1,
  progress: false,
  offsets: false,
  overlap: nil
)

Parameters:

  • text_or_texts (String, Array): Single text or array of texts to chunk
  • processes (Integer): Number of processes for parallel chunking (not yet implemented)
  • progress (Boolean): Show progress bar for multiple texts (not yet implemented)
  • offsets (Boolean): Return offset information
  • overlap (Float, Integer, nil): Overlap configuration

Returns:

  • For single text: Array<String> or [Array<String>, Array<Array<Integer>>]
  • For multiple texts: Array<Array<String>> or [Array<Array<String>>, Array<Array<Array<Integer>>>]
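
A short sketch of the two return shapes, assuming a word-count counter:

chunker = Semchunk.chunkerify(->(text) { text.split.length }, chunk_size: 5)

single = chunker.call("Just one text to chunk.")
# => Array<String>

batch = chunker.call(["First text.", "Second text."])
# => Array<Array<String>>, one inner array per input text

chunks, offsets = chunker.call("Just one text to chunk.", offsets: true)
# => chunks plus their [start, end] offsets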

Examples

Basic Chunking

require "semchunk"

text = "Natural language processing is fascinating. It allows computers to understand human language. This enables many applications."

# Use word count as token counter
token_counter = ->(text) { text.split.length }

chunks = Semchunk.chunk(text, chunk_size: 8, token_counter: token_counter)

chunks.each_with_index do |chunk, i|
  puts "Chunk #{i + 1}: #{chunk}"
end
# => Chunk 1: Natural language processing is fascinating. It allows computers
# => Chunk 2: to understand human language. This enables many applications.

With Offsets

Track where each chunk came from in the original text:

text = "First paragraph here. Second paragraph here. Third paragraph here."
token_counter = ->(text) { text.split.length }

chunks, offsets = Semchunk.chunk(
  text,
  chunk_size: 5,
  token_counter: token_counter,
  offsets: true
)

chunks.zip(offsets).each do |chunk, (start_pos, end_pos)|
  puts "Chunk: '#{chunk}'"
  puts "Position: #{start_pos}...#{end_pos}"
  puts "Verification: '#{text[start_pos...end_pos]}'"
  puts
end

With Overlap

Create overlapping chunks to maintain context:

text = "One two three four five six seven eight nine ten."
token_counter = ->(text) { text.split.length }

# 50% overlap
chunks = Semchunk.chunk(
  text,
  chunk_size: 4,
  token_counter: token_counter,
  overlap: 0.5
)

puts "Overlapping chunks:"
chunks.each { |chunk| puts "- #{chunk}" }

# Fixed overlap of 2 tokens
chunks = Semchunk.chunk(
  text,
  chunk_size: 6,
  token_counter: token_counter,
  overlap: 2
)

puts "\nWith 2-token overlap:"
chunks.each { |chunk| puts "- #{chunk}" }

Using Chunkerify for Reusable Chunkers

# Create a chunker once
token_counter = ->(text) { text.split.length }
chunker = Semchunk.chunkerify(token_counter, chunk_size: 10)

# Use it multiple times
texts = [
  "First document to process.",
  "Second document to process.",
  "Third document to process."
]

all_chunks = chunker.call(texts)

all_chunks.each_with_index do |chunks, i|
  puts "Document #{i + 1} chunks: #{chunks.inspect}"
end

Character-Level Chunking

text = "abcdefghijklmnopqrstuvwxyz"

# Character count as token counter
token_counter = ->(text) { text.length }

chunks = Semchunk.chunk(text, chunk_size: 5, token_counter: token_counter)

puts chunks.inspect
# => ["abcde", "fghij", "klmno", "pqrst", "uvwxy", "z"]

Custom Token Counter

# Token counter that counts punctuation as separate tokens
def custom_token_counter(text)
  text.scan(/\w+|[^\w\s]/).length
end

text = "Hello, world! How are you?"

chunks = Semchunk.chunk(
  text,
  chunk_size: 5,
  token_counter: method(:custom_token_counter)
)

puts chunks.inspect

Working with Real Tokenizers

If you have a tokenizer that implements an encode method:

# Example with a hypothetical tokenizer
class MyTokenizer
  def encode(text, add_special_tokens: true)
    # Your tokenization logic here
    text.split.map { |word| word.hash }
  end
  
  def model_max_length
    512
  end
end

tokenizer = MyTokenizer.new

# chunkerify will automatically extract the token counter
chunker = Semchunk.chunkerify(tokenizer, chunk_size: 100)

text = "Your long text here..."
chunks = chunker.call(text)

How It Works 🔍

semchunk works by recursively splitting text until every resulting chunk fits within a specified chunk size. In particular, it:

  1. Splits text using the most semantically meaningful splitter possible;
  2. Recursively splits the resulting chunks until all of them fit within the specified chunk size;
  3. Merges any chunks that are under the chunk size back together until the chunk size is reached;
  4. Reattaches any non-whitespace splitters back to the ends of chunks barring the final chunk if doing so does not bring chunks over the chunk size, otherwise adds non-whitespace splitters as their own chunks; and
  5. Excludes chunks consisting entirely of whitespace characters.

To ensure that chunks are as semantically meaningful as possible, semchunk uses the following splitters, in order of precedence (a short sketch follows the list):

  1. The largest sequence of newlines (\n) and/or carriage returns (\r);
  2. The largest sequence of tabs;
  3. The largest sequence of whitespace characters (as defined by regex's \s character class) or, if the largest sequence of whitespace characters is only a single character and there exist whitespace characters preceded by any of the semantically meaningful non-whitespace characters listed below (in the same order of precedence), then only those specific whitespace characters;
  4. Sentence terminators (., ?, ! and *);
  5. Clause separators (;, ,, (, ), [, ], “, ”, ‘, ’, ', " and `);
  6. Sentence interrupters (:, — and …);
  7. Word joiners (/, \, –, & and -); and
  8. All other characters.
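
A quick sketch of this precedence in action, using a word-count counter: when a text contains both a paragraph break and sentence terminators, the paragraph break wins.

require "semchunk"

token_counter = ->(text) { text.split.length }

text = "First paragraph, first sentence. First paragraph, second sentence.\n\nSecond paragraph."

# The text is 10 words, so chunk_size: 8 forces a split. The blank line
# (the largest newline sequence) outranks the periods, so the split
# should land at the paragraph break rather than mid-sentence.
chunks = Semchunk.chunk(text, chunk_size: 8, token_counter: token_counter)
# Likely => ["First paragraph, first sentence. First paragraph, second sentence.", "Second paragraph."]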

If overlapping chunks have been requested, semchunk also does the following (a worked example follows the list):

  1. Internally reduces the chunk size to min(overlap, chunk_size - overlap) (overlap being computed as floor(chunk_size * overlap) for relative overlaps and min(overlap, chunk_size - 1) for absolute overlaps); and
  2. Merges every floor(original_chunk_size / reduced_chunk_size) chunks starting from the first chunk and then jumping by floor((original_chunk_size - overlap) / reduced_chunk_size) chunks until the last chunk is reached.
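
A worked example of that arithmetic, in plain Ruby mirroring the formulas above:

chunk_size = 4
overlap    = 0.5 # relative, i.e. 50% overlap

overlap_tokens = (chunk_size * overlap).floor                      # floor(4 * 0.5) => 2
reduced_size   = [overlap_tokens, chunk_size - overlap_tokens].min # min(2, 2)      => 2

merge_every = chunk_size / reduced_size                    # floor(4 / 2) => 2
jump_by     = (chunk_size - overlap_tokens) / reduced_size # floor(2 / 2) => 1

# So every 2 reduced-size chunks are merged into a final chunk, and each
# final chunk starts 1 reduced-size chunk after the previous one, which
# yields the requested 50% overlap.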

The algorithm uses binary search to efficiently find the optimal split points, making it fast even for large documents.
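
The same idea in a hedged sketch (the general technique, not necessarily the gem's exact internals): binary-search for the largest number of consecutive splits whose concatenation still fits the chunk size, assuming token counts grow monotonically with prefix length.

# A sketch of the merging idea: find the largest prefix of `splits`
# whose joined text still fits within chunk_size.
def largest_fitting_prefix(splits, chunk_size, token_counter)
  low = 1
  high = splits.length
  while low < high
    mid = (low + high + 1) / 2 # bias upward so the loop terminates
    if token_counter.call(splits[0, mid].join(" ")) <= chunk_size
      low = mid # prefix fits; try a longer one
    else
      high = mid - 1 # too long; shrink the search range
    end
  end
  low
end

splits = ["One two", "three four", "five six"]
counter = ->(text) { text.split.length }
largest_fitting_prefix(splits, 4, counter) # => 2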

Running the Examples

This gem includes example scripts that demonstrate various features:

# Basic usage examples
ruby examples/basic_usage.rb

# Advanced usage with longer documents
ruby examples/advanced_usage.rb

Benchmarks 📊

You can run the included benchmark to test performance:

ruby test/bench.rb

The Ruby implementation maintains similar performance characteristics to the Python version:

  • Efficient binary search for optimal split points
  • O(n log n) complexity for chunking
  • Fast token count lookups with memoization
  • Low memory overhead

The benchmark tests chunking multiple texts with various chunk sizes and provides detailed performance metrics.

Differences from Python Version

This Ruby port maintains feature parity with the Python version, with a few notes:

  • Multiprocessing support is not yet implemented (processes parameter)
  • Progress bar support is not yet implemented (progress parameter)
  • String tokenizer names (like "gpt-4") are not yet supported
  • Otherwise, the API and behavior match the Python version

See MIGRATION.md for a detailed guide on migrating from the Python version.

Support

If you want to report a bug or have ideas, feedback or questions about the gem, let me know via GitHub issues and I will do my best to provide a helpful answer. Happy hacking!

License

The gem is available as open source under the terms of the MIT License.

Code of conduct

Everyone interacting in this project's codebases, issue trackers, chat rooms and mailing lists is expected to follow the code of conduct.

Contribution guide

Pull requests are welcome!