Commit cab0842

Merge pull request #38 from github/jorendorff/utf8-converter
WIP: Add `string-offsets`
2 parents ace76c1 + a2735a4 commit cab0842

5 files changed: +1037 -0 lines changed
README.md

Lines changed: 1 addition & 0 deletions
@@ -4,6 +4,7 @@ A collection of useful algorithms written in Rust. Currently contains:
 
 - [`geo_filters`](crates/geo_filters): probabilistic data structures that solve the [Distinct Count Problem](https://en.wikipedia.org/wiki/Count-distinct_problem) using geometric filters.
 - [`bpe`](crates/bpe): fast, correct, and novel algorithms for the [Byte Pair Encoding Algorithm](https://en.wikipedia.org/wiki/Large_language_model#BPE) which are particularly useful for chunking of documents.
+- [`string-offsets`](crates/string-offsets): converts string positions between bytes, chars, UTF-16 code units, and line numbers. Useful when sending string indices across language boundaries.
 
 ## Background

crates/string-offsets/Cargo.toml

Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
[package]
name = "string-offsets"
authors = ["The blackbird team <[email protected]>"]
version = "0.1.0"
edition = "2021"
description = "Converts string offsets between UTF-8 bytes, UTF-16 code units, Unicode code points, and lines."
repository = "https://github.com/github/rust-gems"
license = "MIT"
keywords = ["unicode", "positions", "utf16", "characters", "lines"]
categories = ["algorithms", "data-structures", "text-processing", "development-tools::ffi"]

[dev-dependencies]
rand = "0.8"
rand_chacha = "0.3"

crates/string-offsets/README.md

Lines changed: 45 additions & 0 deletions
@@ -0,0 +1,45 @@
# string-offsets

Converts string offsets between UTF-8 bytes, UTF-16 code units, Unicode code points, and lines.

Rust strings are UTF-8, but JavaScript has UTF-16 strings, and in Python, strings are sequences of
Unicode code points. It's therefore necessary to adjust string offsets when communicating across
programming language boundaries. [`StringOffsets`] does these adjustments.

Each `StringOffsets` instance contains offset information for a single string. [Building the data
structure](StringOffsets::new) takes O(n) time and memory, but then most conversions are O(1).

["UTF-8 Conversions with BitRank"](https://adaptivepatchwork.com/2023/07/10/utf-conversion/) is a
blog post explaining the implementation.

## Usage

Add this to your `Cargo.toml`:

```toml
[dependencies]
string-offsets = "0.1"
```

Then:

```rust
use string_offsets::StringOffsets;

let s = "☀️hello\n🗺️world\n";
let offsets = StringOffsets::new(s);

// Find offsets where lines begin and end.
assert_eq!(offsets.line_to_utf8s(0), 0..12); // note: 0-based line numbers

// Translate string offsets between UTF-8 and other encodings.
// This map emoji is 7 UTF-8 bytes...
assert_eq!(&s[12..19], "🗺️");
// ...but only 3 UTF-16 code units...
assert_eq!(offsets.utf8_to_utf16(12), 8);
assert_eq!(offsets.utf8_to_utf16(19), 11);
// ...and only 2 Unicode characters.
assert_eq!(offsets.utf8s_to_chars(12..19), 8..10);
```

See [the documentation](https://docs.rs/string-offsets/latest/string_offsets/struct.StringOffsets.html) for more.
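
The numbers in the README example can be cross-checked with the standard library alone. The sketch below is not the crate's implementation: it rescans the string on every call, so each conversion is O(n), whereas the README describes `StringOffsets` as answering most conversions in O(1) after an O(n) build. The helper names `naive_utf8_to_utf16` and `naive_utf8_to_char` are hypothetical and exist only for illustration. The string is written with explicit escapes so the variation selectors (U+FE0F) are visible.

```rust
/// Hypothetical helper (not part of `string-offsets`): converts a UTF-8 byte
/// offset into a UTF-16 code unit offset by rescanning the prefix. O(n) per call.
/// Panics if `byte_offset` is not on a char boundary or is out of range.
fn naive_utf8_to_utf16(s: &str, byte_offset: usize) -> usize {
    s[..byte_offset].chars().map(char::len_utf16).sum()
}

/// Hypothetical helper: converts a UTF-8 byte offset into a count of Unicode
/// code points (Rust `char`s) by rescanning the prefix. O(n) per call.
fn naive_utf8_to_char(s: &str, byte_offset: usize) -> usize {
    s[..byte_offset].chars().count()
}

fn main() {
    // Same text as the README example: sun + VS16, "hello\n", map + VS16, "world\n".
    let s = "\u{2600}\u{FE0F}hello\n\u{1F5FA}\u{FE0F}world\n";

    // The first line, including its newline, spans bytes 0..12.
    let first_line_end = s.find('\n').map(|i| i + 1).unwrap();
    assert_eq!(first_line_end, 12);

    // The map emoji occupies bytes 12..19 (4 bytes + 3 bytes for VS16).
    assert_eq!(s[12..19].len(), 7);

    // UTF-16 offsets: 8 at the start of the emoji, 11 at its end (3 code units).
    assert_eq!(naive_utf8_to_utf16(s, 12), 8);
    assert_eq!(naive_utf8_to_utf16(s, 19), 11);

    // Code point offsets: 8..10 (2 code points).
    assert_eq!(naive_utf8_to_char(s, 12), 8);
    assert_eq!(naive_utf8_to_char(s, 19), 10);
}
```

A linear rescan like this is fine for a one-off check, but repeated conversions on a large document are where a precomputed structure pays for itself.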
