semchunk 🧩

semchunk is a fast and lightweight Ruby gem for splitting text into semantically meaningful chunks.

This is a Ruby port of the Python semchunk library by Isaacus, maintaining the same efficient chunking algorithm and API design.

semchunk produces chunks that are more semantically meaningful than those produced by regular token and recursive character chunkers, while remaining fast and easy to use.

Features

  • Semantic chunking: Splits text at natural boundaries (sentences, paragraphs, etc.) rather than at arbitrary character positions
  • Token-aware: Respects token limits from any tokenizer you provide
  • Overlap support: Create overlapping chunks for better context preservation
  • Offset tracking: Get the original positions of each chunk in the source text
  • Flexible: Works with any token counter (word count, character count, or tokenizers)
  • Memoization: Optional caching of token counts for improved performance

Installation

Add this line to your application's Gemfile:

gem 'semchunk'

Or install it directly:

gem install semchunk

Quick start

require "semchunk"

# Define a simple token counter (or use a real tokenizer)
token_counter = ->(text) { text.split.length }

# Chunk some text
text = "This is the first sentence. This is the second sentence. And this is the third sentence."
chunks = Semchunk.chunk(text, chunk_size: 5, token_counter: token_counter)

puts chunks.inspect
# => ["This is the first sentence.", "This is the second sentence.", "And this is the third sentence."]

API Reference

Semchunk.chunk

Split a text into semantically meaningful chunks.

Semchunk.chunk(
  text,
  chunk_size:,
  token_counter:,
  memoize: true,
  offsets: false,
  overlap: nil,
  cache_maxsize: nil
)

Parameters:

  • text (String): The text to be chunked
  • chunk_size (Integer): The maximum number of tokens a chunk may contain
  • token_counter (Proc, Lambda, Method): A callable that takes a string and returns the number of tokens in it
  • memoize (Boolean, optional): Whether to memoize the token counter. Defaults to true
  • offsets (Boolean, optional): Whether to return the start and end offsets of each chunk. Defaults to false
  • overlap (Float, Integer, nil, optional): The proportion of the chunk size (if < 1), or the number of tokens (if >= 1), by which chunks should overlap. Defaults to nil
  • cache_maxsize (Integer, nil, optional): The maximum number of text-token count pairs to cache. Defaults to nil (unbounded)

Returns:

  • Array<String> if offsets: false: List of text chunks
  • [Array<String>, Array<Array<Integer>>] if offsets: true: List of chunks and their [start, end] offsets
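
For example, a minimal sketch of the two return shapes, using a simple word-count counter:

require "semchunk"

token_counter = ->(text) { text.split.length }
text = "Alpha beta gamma. Delta epsilon."

# offsets: false (the default) returns only the chunks
chunks = Semchunk.chunk(text, chunk_size: 3, token_counter: token_counter)

# offsets: true also returns each chunk's [start, end] offsets
chunks, offsets = Semchunk.chunk(
  text,
  chunk_size: 3,
  token_counter: token_counter,
  offsets: true
)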

Semchunk.chunkerify

Create a reusable chunker object.

Semchunk.chunkerify(
  tokenizer_or_token_counter,
  chunk_size: nil,
  max_token_chars: nil,
  memoize: true,
  cache_maxsize: nil
)

Parameters:

  • tokenizer_or_token_counter: A tokenizer object with an encode method, or a callable token counter
  • chunk_size (Integer, nil): Maximum tokens per chunk. If nil, will attempt to use tokenizer's model_max_length
  • max_token_chars (Integer, nil): Maximum characters per token (optimization parameter)
  • memoize (Boolean): Whether to cache token counts. Defaults to true
  • cache_maxsize (Integer, nil): Cache size limit. Defaults to nil (unbounded)

Returns:

  • Semchunk::Chunker: A chunker instance
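
A sketch of the chunk_size fallback, using the hypothetical MyTokenizer class from the "Working with Real Tokenizers" example further down:

tokenizer = MyTokenizer.new

# chunk_size omitted: chunkerify falls back to tokenizer.model_max_length (512 here)
chunker = Semchunk.chunkerify(tokenizer)

# A plain callable has no model_max_length, so pass chunk_size explicitly
chunker = Semchunk.chunkerify(->(text) { text.split.length }, chunk_size: 10)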

Chunker#call

Process text(s) with the chunker.

chunker.call(
  text_or_texts,
  processes: 1,
  progress: false,
  offsets: false,
  overlap: nil
)

Parameters:

  • text_or_texts (String, Array): Single text or array of texts to chunk
  • processes (Integer): Number of processes for parallel chunking (not yet implemented)
  • progress (Boolean): Show progress bar for multiple texts (not yet implemented)
  • offsets (Boolean): Return offset information
  • overlap (Float, Integer, nil): Overlap configuration

Returns:

  • For single text: Array<String> or [Array<String>, Array<Array<Integer>>]
  • For multiple texts: Array<Array<String>> or [Array<Array<String>>, Array<Array<Array<Integer>>>]
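
A short sketch of the two return shapes, assuming a word-count counter:

chunker = Semchunk.chunkerify(->(text) { text.split.length }, chunk_size: 5)

single = chunker.call("Just one text to chunk.")
# => Array<String>

batch = chunker.call(["First text.", "Second text."])
# => Array<Array<String>>, one inner array per input text

chunks, offsets = chunker.call("Just one text to chunk.", offsets: true)
# => chunks plus their [start, end] offsets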

Examples

Basic Chunking

require "semchunk"

text = "Natural language processing is fascinating. It allows computers to understand human language. This enables many applications."

# Use word count as token counter
token_counter = ->(text) { text.split.length }

chunks = Semchunk.chunk(text, chunk_size: 8, token_counter: token_counter)

chunks.each_with_index do |chunk, i|
  puts "Chunk #{i + 1}: #{chunk}"
end
# => Chunk 1: Natural language processing is fascinating. It allows computers
# => Chunk 2: to understand human language. This enables many applications.

With Offsets

Track where each chunk came from in the original text:

text = "First paragraph here. Second paragraph here. Third paragraph here."
token_counter = ->(text) { text.split.length }

chunks, offsets = Semchunk.chunk(
  text,
  chunk_size: 5,
  token_counter: token_counter,
  offsets: true
)

chunks.zip(offsets).each do |chunk, (start_pos, end_pos)|
  puts "Chunk: '#{chunk}'"
  puts "Position: #{start_pos}...#{end_pos}"
  puts "Verification: '#{text[start_pos...end_pos]}'"
  puts
end

With Overlap

Create overlapping chunks to maintain context:

text = "One two three four five six seven eight nine ten."
token_counter = ->(text) { text.split.length }

# 50% overlap
chunks = Semchunk.chunk(
  text,
  chunk_size: 4,
  token_counter: token_counter,
  overlap: 0.5
)

puts "Overlapping chunks:"
chunks.each { |chunk| puts "- #{chunk}" }

# Fixed overlap of 2 tokens
chunks = Semchunk.chunk(
  text,
  chunk_size: 6,
  token_counter: token_counter,
  overlap: 2
)

puts "\nWith 2-token overlap:"
chunks.each { |chunk| puts "- #{chunk}" }

Using Chunkerify for Reusable Chunkers

# Create a chunker once
token_counter = ->(text) { text.split.length }
chunker = Semchunk.chunkerify(token_counter, chunk_size: 10)

# Use it multiple times
texts = [
  "First document to process.",
  "Second document to process.",
  "Third document to process."
]

all_chunks = chunker.call(texts)

all_chunks.each_with_index do |chunks, i|
  puts "Document #{i + 1} chunks: #{chunks.inspect}"
end

Character-Level Chunking

text = "abcdefghijklmnopqrstuvwxyz"

# Character count as token counter
token_counter = ->(text) { text.length }

chunks = Semchunk.chunk(text, chunk_size: 5, token_counter: token_counter)

puts chunks.inspect
# => ["abcde", "fghij", "klmno", "pqrst", "uvwxy", "z"]

Custom Token Counter

# Token counter that counts punctuation as separate tokens
def custom_token_counter(text)
  text.scan(/\w+|[^\w\s]/).length
end

text = "Hello, world! How are you?"

chunks = Semchunk.chunk(
  text,
  chunk_size: 5,
  token_counter: method(:custom_token_counter)
)

puts chunks.inspect

Working with Real Tokenizers

If you have a tokenizer that implements an encode method:

# Example with a hypothetical tokenizer
class MyTokenizer
  def encode(text, add_special_tokens: true)
    # Your tokenization logic here
    text.split.map { |word| word.hash }
  end
  
  def model_max_length
    512
  end
end

tokenizer = MyTokenizer.new

# chunkerify will automatically extract the token counter
chunker = Semchunk.chunkerify(tokenizer, chunk_size: 100)

text = "Your long text here..."
chunks = chunker.call(text)

How It Works 🔍

semchunk works by recursively splitting text until every resulting chunk fits within a specified chunk size. In particular, it:

  1. Splits text using the most semantically meaningful splitter possible;
  2. Recursively splits the resulting chunks until all of them fit within the specified chunk size;
  3. Merges any chunks that are under the chunk size back together until the chunk size is reached;
  4. Reattaches any non-whitespace splitters back to the ends of chunks barring the final chunk if doing so does not bring chunks over the chunk size, otherwise adds non-whitespace splitters as their own chunks; and
  5. Excludes chunks consisting entirely of whitespace characters.

To ensure that chunks are as semantically meaningful as possible, semchunk uses the following splitters, in order of precedence (a short sketch follows the list):

  1. The largest sequence of newlines (\n) and/or carriage returns (\r);
  2. The largest sequence of tabs;
  3. The largest sequence of whitespace characters (as defined by regex's \s character class) or, if the largest sequence of whitespace characters is only a single character and there exist whitespace characters preceded by any of the semantically meaningful non-whitespace characters listed below (in the same order of precedence), then only those specific whitespace characters;
  4. Sentence terminators (., ?, ! and *);
  5. Clause separators (;, ,, (, ), [, ], “, ”, ‘, ’, ', " and `);
  6. Sentence interrupters (:, — and …);
  7. Word joiners (/, \, –, & and -); and
  8. All other characters.
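
A quick sketch of this precedence in action, using a word-count counter: when a text contains both a paragraph break and sentence terminators, the paragraph break wins.

require "semchunk"

token_counter = ->(text) { text.split.length }

text = "First paragraph, first sentence. First paragraph, second sentence.\n\nSecond paragraph."

# The text is 10 words, so chunk_size: 8 forces a split. The blank line
# (the largest newline sequence) outranks the periods, so the split
# should land at the paragraph break rather than mid-sentence.
chunks = Semchunk.chunk(text, chunk_size: 8, token_counter: token_counter)
# Likely => ["First paragraph, first sentence. First paragraph, second sentence.", "Second paragraph."]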

If overlapping chunks have been requested, semchunk also does the following (a worked example follows the list):

  1. Internally reduces the chunk size to min(overlap, chunk_size - overlap) (overlap being computed as floor(chunk_size * overlap) for relative overlaps and min(overlap, chunk_size - 1) for absolute overlaps); and
  2. Merges every floor(original_chunk_size / reduced_chunk_size) chunks starting from the first chunk and then jumping by floor((original_chunk_size - overlap) / reduced_chunk_size) chunks until the last chunk is reached.
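
A worked example of that arithmetic, in plain Ruby mirroring the formulas above:

chunk_size = 4
overlap    = 0.5 # relative, i.e. 50% overlap

overlap_tokens = (chunk_size * overlap).floor                      # floor(4 * 0.5) => 2
reduced_size   = [overlap_tokens, chunk_size - overlap_tokens].min # min(2, 2)      => 2

merge_every = chunk_size / reduced_size                    # floor(4 / 2) => 2
jump_by     = (chunk_size - overlap_tokens) / reduced_size # floor(2 / 2) => 1

# So every 2 reduced-size chunks are merged into a final chunk, and each
# final chunk starts 1 reduced-size chunk after the previous one, which
# yields the requested 50% overlap.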

The algorithm uses binary search to efficiently find the optimal split points, making it fast even for large documents.
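
The same idea in a hedged sketch (the general technique, not necessarily the gem's exact internals): binary-search for the largest number of consecutive splits whose concatenation still fits the chunk size, assuming token counts grow monotonically with prefix length.

# A sketch of the merging idea: find the largest prefix of `splits`
# whose joined text still fits within chunk_size.
def largest_fitting_prefix(splits, chunk_size, token_counter)
  low = 1
  high = splits.length
  while low < high
    mid = (low + high + 1) / 2 # bias upward so the loop terminates
    if token_counter.call(splits[0, mid].join(" ")) <= chunk_size
      low = mid # prefix fits; try a longer one
    else
      high = mid - 1 # too long; shrink the search range
    end
  end
  low
end

splits = ["One two", "three four", "five six"]
counter = ->(text) { text.split.length }
largest_fitting_prefix(splits, 4, counter) # => 2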

Running the Examples

This gem includes example scripts that demonstrate various features:

# Basic usage examples
ruby examples/basic_usage.rb

# Advanced usage with longer documents
ruby examples/advanced_usage.rb

Benchmarks 📊

You can run the included benchmark to test performance:

ruby test/bench.rb

The Ruby implementation maintains similar performance characteristics to the Python version:

  • Efficient binary search for optimal split points
  • O(n log n) complexity for chunking
  • Fast token count lookups with memoization
  • Low memory overhead

The benchmark tests chunking multiple texts with various chunk sizes and provides detailed performance metrics.

Differences from Python Version

This Ruby port maintains feature parity with the Python version, with a few notes:

  • Multiprocessing support is not yet implemented (processes parameter)
  • Progress bar support is not yet implemented (progress parameter)
  • String tokenizer names (like "gpt-4") are not yet supported
  • Otherwise, the API and behavior match the Python version

See MIGRATION.md for a detailed guide on migrating from the Python version.

Support

If you want to report a bug or have ideas, feedback or questions about the gem, let me know via GitHub issues and I will do my best to provide a helpful answer. Happy hacking!

License

The gem is available as open source under the terms of the MIT License.

Code of conduct

Everyone interacting in this project's codebases, issue trackers, chat rooms and mailing lists is expected to follow the code of conduct.

Contribution guide

Pull requests are welcome!