Merge pull request #38 from github/jorendorff/utf8-converter
WIP: Add `string-offsets`
jorendorff authored Nov 13, 2024
2 parents ace76c1 + a2735a4 commit cab0842
Showing 5 changed files with 1,037 additions and 0 deletions.
1 change: 1 addition & 0 deletions README.md
@@ -4,6 +4,7 @@ A collection of useful algorithms written in Rust. Currently contains:

- [`geo_filters`](crates/geo_filters): probabilistic data structures that solve the [Distinct Count Problem](https://en.wikipedia.org/wiki/Count-distinct_problem) using geometric filters.
- [`bpe`](crates/bpe): fast, correct, and novel algorithms for the [Byte Pair Encoding Algorithm](https://en.wikipedia.org/wiki/Large_language_model#BPE) which are particularly useful for chunking of documents.
- [`string-offsets`](crates/string-offsets): converts string positions between bytes, chars, UTF-16 code units, and line numbers. Useful when sending string indices across language boundaries.

## Background

14 changes: 14 additions & 0 deletions crates/string-offsets/Cargo.toml
@@ -0,0 +1,14 @@
[package]
name = "string-offsets"
authors = ["The blackbird team <[email protected]>"]
version = "0.1.0"
edition = "2021"
description = "Converts string offsets between UTF-8 bytes, UTF-16 code units, Unicode code points, and lines."
repository = "https://github.com/github/rust-gems"
license = "MIT"
keywords = ["unicode", "positions", "utf16", "characters", "lines"]
categories = ["algorithms", "data-structures", "text-processing", "development-tools::ffi"]

[dev-dependencies]
rand = "0.8"
rand_chacha = "0.3"
45 changes: 45 additions & 0 deletions crates/string-offsets/README.md
@@ -0,0 +1,45 @@
# string-offsets

Converts string offsets between UTF-8 bytes, UTF-16 code units, Unicode code points, and lines.

Rust strings are UTF-8, but JavaScript has UTF-16 strings, and in Python, strings are sequences of
Unicode code points. It's therefore necessary to adjust string offsets when communicating across
programming language boundaries. [`StringOffsets`] does these adjustments.
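
To see why the same position has a different offset in each language, the sketch below counts one string in all three units using only the Rust standard library. `unit_counts` is a hypothetical helper written for illustration and is not part of this crate.

```rust
/// Hypothetical helper (not part of `string-offsets`): returns the length of
/// `s` measured in UTF-8 bytes, UTF-16 code units, and Unicode code points.
fn unit_counts(s: &str) -> (usize, usize, usize) {
    (
        s.len(),                  // UTF-8 bytes, as Rust indexes strings
        s.encode_utf16().count(), // UTF-16 code units, as JavaScript does
        s.chars().count(),        // code points, as Python does
    )
}
```

For example, `unit_counts("\u{1F5FA}\u{FE0F}")` (the map emoji followed by a variation selector) gives `(7, 3, 2)`, while a pure ASCII string such as `"hello"` gives `(5, 5, 5)`.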

Each `StringOffsets` instance contains offset information for a single string. [Building the data
structure](StringOffsets::new) takes O(n) time and memory, but then most conversions are O(1).

["UTF-8 Conversions with BitRank"](https://adaptivepatchwork.com/2023/07/10/utf-conversion/) is a
blog post explaining the implementation.
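
As a rough illustration of the "build once in O(n), then convert in O(1)" trade-off, here is a deliberately naive lookup table built with the standard library only. It is an assumption-laden sketch, not the BitRank approach used by this crate: `NaiveUtf16Table` is a hypothetical type that stores one `usize` per input byte, so it uses far more memory than a real implementation would want.

```rust
/// Hypothetical type for illustration only; not part of `string-offsets`.
struct NaiveUtf16Table {
    /// `utf16_at[i]` is the number of UTF-16 code units needed to encode the
    /// characters that start before byte offset `i`.
    utf16_at: Vec<usize>,
}

impl NaiveUtf16Table {
    /// One pass over the string: O(n) time and O(n) memory.
    fn new(s: &str) -> Self {
        let mut utf16_at = vec![0; s.len() + 1];
        let mut units = 0;
        for (start, ch) in s.char_indices() {
            // Offsets that fall inside a character map to that character's start.
            for i in start..start + ch.len_utf8() {
                utf16_at[i] = units;
            }
            units += ch.len_utf16();
        }
        utf16_at[s.len()] = units;
        Self { utf16_at }
    }

    /// A single array lookup: O(1). Panics if `byte_offset > s.len()`.
    fn utf8_to_utf16(&self, byte_offset: usize) -> usize {
        self.utf16_at[byte_offset]
    }
}
```

The design choice shown here is simply to pay for the scan once, up front, so that repeated conversions on the same string do not each rescan it.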

## Usage

Add this to your `Cargo.toml`:

```toml
[dependencies]
string-offsets = "0.1"
```

Then:

```rust
use string_offsets::StringOffsets;

let s = "☀️hello\n🗺️world\n";
let offsets = StringOffsets::new(s);

// Find offsets where lines begin and end.
assert_eq!(offsets.line_to_utf8s(0), 0..12); // note: 0-based line numbers

// Translate string offsets between UTF-8 and other encodings.
// This map emoji is 7 UTF-8 bytes...
assert_eq!(&s[12..19], "🗺️");
// ...but only 3 UTF-16 code units...
assert_eq!(offsets.utf8_to_utf16(12), 8);
assert_eq!(offsets.utf8_to_utf16(19), 11);
// ...and only 2 Unicode characters.
assert_eq!(offsets.utf8s_to_chars(12..19), 8..10);
```

See [the documentation](https://docs.rs/string-offsets/latest/string_offsets/struct.StringOffsets.html) for more.