diff --git a/README.md b/README.md
Lines changed: 1 addition & 0 deletions
diff --git a/crates/string-offsets/Cargo.toml b/crates/string-offsets/Cargo.toml
Lines changed: 14 additions & 0 deletions
diff --git a/crates/string-offsets/README.md b/crates/string-offsets/README.md
Lines changed: 45 additions & 0 deletions
@@ -4,6 +4,7 @@ A collection of useful algorithms written in Rust. Currently contains:
 
 - [`geo_filters`](crates/geo_filters): probabilistic data structures that solve the [Distinct Count Problem](https://en.wikipedia.org/wiki/Count-distinct_problem) using geometric filters.
 - [`bpe`](crates/bpe): fast, correct, and novel algorithms for the [Byte Pair Encoding Algorithm](https://en.wikipedia.org/wiki/Large_language_model#BPE) which are particularly useful for chunking of documents.
+- [`string-offsets`](crates/string-offsets): converts string positions between bytes, chars, UTF-16 code units, and line numbers. Useful when sending string indices across language boundaries.
 
 ## Background
 
 
@@ -0,0 +1,14 @@
+[package]
+name = "string-offsets"
+authors = ["The blackbird team <[email protected]>"]
+version = "0.1.0"
+edition = "2021"
+description = "Converts string offsets between UTF-8 bytes, UTF-16 code units, Unicode code points, and lines."
+repository = "https://github.com/github/rust-gems"
+license = "MIT"
+keywords = ["unicode", "positions", "utf16", "characters", "lines"]
+categories = ["algorithms", "data-structures", "text-processing", "development-tools::ffi"]
+
+[dev-dependencies]
+rand = "0.8"
+rand_chacha = "0.3"
@@ -0,0 +1,45 @@
+# string-offsets
+
+Converts string offsets between UTF-8 bytes, UTF-16 code units, Unicode code points, and lines.
+
+Rust strings are UTF-8, but JavaScript has UTF-16 strings, and in Python, strings are sequences of
+Unicode code points. It's therefore necessary to adjust string offsets when communicating across
+programming language boundaries. [`StringOffsets`] does these adjustments.
+
+Each `StringOffsets` instance contains offset information for a single string. [Building the data
+structure](StringOffsets::new) takes O(n) time and memory, but then most conversions are O(1).
+
+["UTF-8 Conversions with BitRank"](https://adaptivepatchwork.com/2023/07/10/utf-conversion/) is a
+blog post explaining the implementation.
+
+## Usage
+
+Add this to your `Cargo.toml`:
+
+```toml
+[dependencies]
+string-offsets = "0.1"
+```
+
+Then:
+
+```rust
+use string_offsets::StringOffsets;
+
+let s = "☀️hello\n🗺️world\n";
+let offsets = StringOffsets::new(s);
+
+// Find offsets where lines begin and end.
+assert_eq!(offsets.line_to_utf8s(0), 0..12);  // note: 0-based line numbers
+
+// Translate string offsets between UTF-8 and other encodings.
+// This map emoji is 7 UTF-8 bytes...
+assert_eq!(&s[12..19], "🗺️");
+// ...but only 3 UTF-16 code units...
+assert_eq!(offsets.utf8_to_utf16(12), 8);
+assert_eq!(offsets.utf8_to_utf16(19), 11);
+// ...and only 2 Unicode code points (`char`s).
+assert_eq!(offsets.utf8s_to_chars(12..19), 8..10);
+```
+
+See [the documentation](https://docs.rs/string-offsets/latest/string_offsets/struct.StringOffsets.html) for more.
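
For comparison, the offsets in the README example above can be cross-checked with the standard library alone. The following is a minimal sketch, not the crate's implementation: the helper names `utf8_to_utf16_naive` and `utf8_to_chars_naive` are hypothetical, and each call rescans the prefix in O(n), whereas the README states that `StringOffsets` answers most conversions in O(1) after an O(n) build. The emoji are written as explicit escapes so the variation selectors (U+FE0F) are visible.

```rust
/// Naive O(n) conversion from a UTF-8 byte offset to a UTF-16 code unit offset.
/// Hypothetical helper for illustration; panics if `byte_offset` is not on a
/// char boundary.
fn utf8_to_utf16_naive(s: &str, byte_offset: usize) -> usize {
    s[..byte_offset].encode_utf16().count()
}

/// Naive O(n) conversion from a UTF-8 byte offset to a code point offset.
/// Hypothetical helper for illustration; panics if `byte_offset` is not on a
/// char boundary.
fn utf8_to_chars_naive(s: &str, byte_offset: usize) -> usize {
    s[..byte_offset].chars().count()
}

fn main() {
    // Same text as the README example: sun + VS16, "hello\n", map + VS16, "world\n".
    let s = "\u{2600}\u{FE0F}hello\n\u{1F5FA}\u{FE0F}world\n";

    // The map emoji occupies bytes 12..19 (4 bytes for U+1F5FA, 3 for U+FE0F).
    assert_eq!(&s[12..19], "\u{1F5FA}\u{FE0F}");

    // 3 UTF-16 code units: a surrogate pair plus the variation selector.
    assert_eq!(utf8_to_utf16_naive(s, 12), 8);
    assert_eq!(utf8_to_utf16_naive(s, 19), 11);

    // 2 code points.
    assert_eq!(utf8_to_chars_naive(s, 12), 8);
    assert_eq!(utf8_to_chars_naive(s, 19), 10);
}
```

The naive versions agree with the values asserted in the README, which makes them a convenient reference when testing a faster index against arbitrary inputs.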