Converts string offsets between UTF-8 bytes, UTF-16 code units, Unicode code points, and lines.
Rust strings are UTF-8, but JavaScript has UTF-16 strings, and in Python, strings are sequences of
Unicode code points. It's therefore necessary to adjust string offsets when communicating across
programming language boundaries. StringOffsets does these adjustments.
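To make the mismatch concrete, here is a minimal sketch that measures the same string in all three units using only the Rust standard library. The helper name `lengths` is hypothetical and is not part of the crate.

```rust
/// Lengths of the same string in three different units.
/// (Hypothetical helper for illustration; not part of string-offsets.)
fn lengths(s: &str) -> (usize, usize, usize) {
    let utf8_bytes = s.len(); // UTF-8 bytes, as Rust indexes strings
    let utf16_units = s.encode_utf16().count(); // UTF-16 code units, as JavaScript does
    let code_points = s.chars().count(); // Unicode code points, as Python does
    (utf8_bytes, utf16_units, code_points)
}

fn main() {
    // U+1F5FA (4 UTF-8 bytes, 2 UTF-16 units) followed by
    // U+FE0F (3 UTF-8 bytes, 1 UTF-16 unit): the map emoji.
    let map = "\u{1F5FA}\u{FE0F}";
    assert_eq!(lengths(map), (7, 3, 2));

    // Pure ASCII text has the same length in all three units.
    assert_eq!(lengths("hello"), (5, 5, 5));
}
```

An offset that points just past the emoji is therefore 7 in Rust, 3 in JavaScript, and 2 in Python, which is why offsets have to be translated at the boundary.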
Each StringOffsets instance contains offset information for a single string. Building the data
structure takes O(n) time and memory, but then most conversions are O(1).
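The build-once, query-many trade-off can be illustrated with a deliberately naive sketch. It assumes a plain `Vec` lookup table with one entry per byte, which is not how the crate is implemented and uses far more memory than a compact structure would. The type `NaiveOffsets` is hypothetical.

```rust
/// Hypothetical, simplified stand-in for illustration only;
/// this is NOT the crate's data structure.
struct NaiveOffsets {
    /// utf16_at[b] = number of UTF-16 code units before UTF-8 byte offset b.
    utf16_at: Vec<usize>,
}

impl NaiveOffsets {
    /// O(n) time and memory: one table entry per byte, plus one for the end.
    fn new(s: &str) -> Self {
        let mut utf16_at = Vec::with_capacity(s.len() + 1);
        let mut units = 0;
        for ch in s.chars() {
            // Every byte of this char maps to the char's starting UTF-16 offset.
            for _ in 0..ch.len_utf8() {
                utf16_at.push(units);
            }
            units += ch.len_utf16();
        }
        utf16_at.push(units); // offset one past the end of the string
        NaiveOffsets { utf16_at }
    }

    /// O(1): a single table lookup. Panics if `byte_offset` exceeds the string length.
    fn utf8_to_utf16(&self, byte_offset: usize) -> usize {
        self.utf16_at[byte_offset]
    }
}

fn main() {
    // Sun emoji (U+2600 U+FE0F), "hello\n", map emoji (U+1F5FA U+FE0F), "world\n".
    let s = "\u{2600}\u{FE0F}hello\n\u{1F5FA}\u{FE0F}world\n";
    let offsets = NaiveOffsets::new(s);
    assert_eq!(offsets.utf8_to_utf16(12), 8);
    assert_eq!(offsets.utf8_to_utf16(19), 11);
}
```

The table is built in one pass over the string, after which each query is a constant-time index, mirroring the cost profile described above.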
"UTF-8 Conversions with BitRank" is a blog post explaining the implementation.
Add this to your Cargo.toml:
[dependencies]
string-offsets = "0.1"
Then:
use string_offsets::StringOffsets;
let s = "☀️hello\n🗺️world\n";
let offsets = StringOffsets::new(s);
// Find offsets where lines begin and end.
assert_eq!(offsets.line_to_utf8s(0), 0..12); // note: 0-based line numbers
// Translate string offsets between UTF-8 and other encodings.
// This map emoji is 7 UTF-8 bytes...
assert_eq!(&s[12..19], "🗺️");
// ...but only 3 UTF-16 code units...
assert_eq!(offsets.utf8_to_utf16(12), 8);
assert_eq!(offsets.utf8_to_utf16(19), 11);
// ...and only 2 Unicode code points (Rust `char`s).
assert_eq!(offsets.utf8s_to_chars(12..19), 8..10);
See the documentation for more.