|
| 1 | +# string-offsets |
| 2 | + |
| 3 | +Converts string offsets between UTF-8 bytes, UTF-16 code units, Unicode code points, and lines. |
| 4 | + |
| 5 | +Rust strings are UTF-8, but JavaScript has UTF-16 strings, and in Python, strings are sequences of |
| 6 | +Unicode code points. It's therefore necessary to adjust string offsets when communicating across |
| 7 | +programming language boundaries. [`StringOffsets`] does these adjustments. |
| 8 | + |
| 9 | +Each `StringOffsets` instance contains offset information for a single string. [Building the data |
| 10 | +structure](StringOffsets::new) takes O(n) time and memory in the length of the string, after which most conversions are O(1). |
| 11 | + |
| 12 | +["UTF-8 Conversions with BitRank"](https://adaptivepatchwork.com/2023/07/10/utf-conversion/) is a |
| 13 | +blog post explaining the implementation. |
| 14 | + |
| 15 | +## Usage |
| 16 | + |
| 17 | +Add this to your `Cargo.toml`: |
| 18 | + |
| 19 | +```toml |
| 20 | +[dependencies] |
| 21 | +string-offsets = "0.1" |
| 22 | +``` |
| 23 | + |
| 24 | +Then: |
| 25 | + |
| 26 | +```rust |
| 27 | +use string_offsets::StringOffsets; |
| 28 | + |
| 29 | +let s = "☀️hello\n🗺️world\n"; |
| 30 | +let offsets = StringOffsets::new(s); |
| 31 | + |
| 32 | +// Find offsets where lines begin and end. |
| 33 | +assert_eq!(offsets.line_to_utf8s(0), 0..12); // note: 0-based line numbers |
| 34 | + |
| 35 | +// Translate string offsets between UTF-8 and other encodings. |
| 36 | +// This map emoji is 7 UTF-8 bytes... |
| 37 | +assert_eq!(&s[12..19], "🗺️"); |
| 38 | +// ...but only 3 UTF-16 code units... |
| 39 | +assert_eq!(offsets.utf8_to_utf16(12), 8); |
| 40 | +assert_eq!(offsets.utf8_to_utf16(19), 11); |
| 41 | +// ...and only 2 Unicode code points. |
| 42 | +assert_eq!(offsets.utf8s_to_chars(12..19), 8..10); |
| 43 | +``` |
| 44 | + |
| 45 | +See [the documentation](https://docs.rs/string-offsets/latest/string_offsets/struct.StringOffsets.html) for more. |
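
As a sanity check on the offsets in the usage example, the same conversions can be reproduced with the Rust standard library alone. This is only a minimal sketch and not how `StringOffsets` is implemented: the helper names `naive_utf8_to_utf16` and `naive_utf8_to_chars` are hypothetical, and each call rescans the prefix of the string in O(n) time, which is the per-query cost the precomputed structure is meant to avoid.

```rust
/// Hypothetical helper: converts a UTF-8 byte offset to a UTF-16 code unit
/// offset by re-encoding the prefix. O(n) per call.
/// Panics if `byte_offset` is not on a char boundary.
fn naive_utf8_to_utf16(s: &str, byte_offset: usize) -> usize {
    s[..byte_offset].encode_utf16().count()
}

/// Hypothetical helper: converts a UTF-8 byte offset to a code point offset
/// by counting the chars in the prefix. O(n) per call.
/// Panics if `byte_offset` is not on a char boundary.
fn naive_utf8_to_chars(s: &str, byte_offset: usize) -> usize {
    s[..byte_offset].chars().count()
}

fn main() {
    // Same string as the README example, with the emoji spelled out as
    // escapes: each is a base code point followed by U+FE0F
    // (variation selector-16).
    let s = "\u{2600}\u{FE0F}hello\n\u{1F5FA}\u{FE0F}world\n";

    // Line 0 ends just after the first newline, at byte 12.
    assert_eq!(s.find('\n').map(|i| i + 1), Some(12));

    // The map emoji spans bytes 12..19: 4 bytes + 3 bytes in UTF-8.
    assert_eq!(naive_utf8_to_utf16(s, 12), 8);
    assert_eq!(naive_utf8_to_utf16(s, 19), 11); // 3 UTF-16 code units
    assert_eq!(naive_utf8_to_chars(s, 12), 8);
    assert_eq!(naive_utf8_to_chars(s, 19), 10); // 2 code points
}
```

The results agree with the values asserted in the README example, which is a quick way to confirm what each offset kind counts before relying on the faster conversions.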