string-offsets

Converts string offsets between UTF-8 bytes, UTF-16 code units, Unicode code points, and lines.

Rust strings are UTF-8, JavaScript strings are UTF-16, and Python strings are sequences of Unicode code points, so the same position in a string can have a different offset in each language. Offsets therefore have to be converted when they cross programming language boundaries. StringOffsets does these conversions.
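To make the mismatch concrete, here is a small sketch that uses only the Rust standard library to measure one string in all three units. The `lengths` helper is hypothetical and exists only for this illustration; it is not part of string-offsets.

```rust
/// Hypothetical helper (not part of string-offsets): measures one string
/// in UTF-8 bytes, UTF-16 code units, and Unicode code points.
fn lengths(s: &str) -> (usize, usize, usize) {
    let utf8_bytes = s.len(); // Rust's `len` counts UTF-8 bytes
    let utf16_units = s.encode_utf16().count(); // what JavaScript's `length` counts
    let code_points = s.chars().count(); // what Python's `len` counts
    (utf8_bytes, utf16_units, code_points)
}

fn main() {
    // 'é' (U+00E9) is 2 UTF-8 bytes; the map emoji (U+1F5FA) is 4 UTF-8 bytes
    // and 2 UTF-16 code units.
    let (bytes, units, chars) = lengths("h\u{e9}llo\u{1F5FA}");
    // prints: 10 bytes, 7 UTF-16 units, 6 code points
    println!("{bytes} bytes, {units} UTF-16 units, {chars} code points");
}
```

Because the three counts diverge as soon as the text leaves ASCII, an offset that is valid in one language generally points at the wrong place in another.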

Each StringOffsets instance contains offset information for a single string. Building the data structure takes O(n) time and memory, where n is the length of the string; after that, most conversions are O(1).
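The build-once, query-in-constant-time trade-off can be sketched with a deliberately naive lookup table. This is not how string-offsets is implemented (the blog post mentioned below covers the actual approach); `NaiveOffsets` and its methods are hypothetical names, and rounding offsets inside a multi-byte character down to the start of that character is an assumption made for this sketch.

```rust
/// Illustrative only: a naive prefix table, NOT the string-offsets implementation.
struct NaiveOffsets {
    /// utf16_at[b] = number of UTF-16 code units before UTF-8 byte offset b.
    utf16_at: Vec<usize>,
    /// chars_at[b] = number of Unicode code points before UTF-8 byte offset b.
    chars_at: Vec<usize>,
}

impl NaiveOffsets {
    /// O(n) time and memory: one pass over the string, one table entry per byte.
    fn new(s: &str) -> Self {
        let mut utf16_at = vec![0; s.len() + 1];
        let mut chars_at = vec![0; s.len() + 1];
        let (mut units, mut chars) = (0, 0);
        for (byte_pos, ch) in s.char_indices() {
            // Assumption: bytes inside a character map to the character's start.
            for b in byte_pos..byte_pos + ch.len_utf8() {
                utf16_at[b] = units;
                chars_at[b] = chars;
            }
            units += ch.len_utf16();
            chars += 1;
        }
        // The end-of-string offset is a valid position too.
        utf16_at[s.len()] = units;
        chars_at[s.len()] = chars;
        Self { utf16_at, chars_at }
    }

    /// O(1): a single table lookup.
    fn utf8_to_utf16(&self, byte_offset: usize) -> usize {
        self.utf16_at[byte_offset]
    }

    /// O(1): a single table lookup.
    fn utf8_to_chars(&self, byte_offset: usize) -> usize {
        self.chars_at[byte_offset]
    }
}

fn main() {
    let table = NaiveOffsets::new("h\u{e9}llo\u{1F5FA}");
    // Byte offset 6 is where the map emoji starts: 5 code units and 5 code points precede it.
    assert_eq!(table.utf8_to_utf16(6), 5);
    assert_eq!(table.utf8_to_chars(6), 5);
}
```

A table like this spends two machine words per input byte, so it only illustrates the asymptotics; a practical implementation would want a far more compact representation.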

"UTF-8 Conversions with BitRank" is a blog post explaining the implementation.

Usage

Add this to your Cargo.toml:

[dependencies]
string-offsets = "0.1"

Then:

use string_offsets::StringOffsets;

let s = "☀️hello\n🗺️world\n";
let offsets = StringOffsets::new(s);

// Find offsets where lines begin and end.
assert_eq!(offsets.line_to_utf8s(0), 0..12);  // note: 0-based line numbers

// Translate string offsets between UTF-8 and other encodings.
// This map emoji is 7 UTF-8 bytes...
assert_eq!(&s[12..19], "🗺️");
// ...but only 3 UTF-16 code units...
assert_eq!(offsets.utf8_to_utf16(12), 8);
assert_eq!(offsets.utf8_to_utf16(19), 11);
// ...and only 2 Unicode code points (U+1F5FA plus the variation selector U+FE0F).
assert_eq!(offsets.utf8s_to_chars(12..19), 8..10);
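The line range in the example above can be cross-checked without the crate. The sketch below uses a hypothetical std-only helper, `line_ranges`, that scans for `\n` and reports each line's UTF-8 byte range with the newline included, matching the `0..12` shown for line 0. How string-offsets itself treats an unterminated last line or the position after a trailing newline is not covered here; the handling below is an assumption of this sketch.

```rust
use std::ops::Range;

/// Hypothetical std-only helper (not part of string-offsets): UTF-8 byte
/// ranges of each line, with the terminating '\n' included in the range.
fn line_ranges(s: &str) -> Vec<Range<usize>> {
    let mut ranges = Vec::new();
    let mut start = 0;
    for (i, byte) in s.bytes().enumerate() {
        if byte == b'\n' {
            ranges.push(start..i + 1);
            start = i + 1;
        }
    }
    // Assumption: text after the last newline counts as a final, unterminated line.
    if start < s.len() {
        ranges.push(start..s.len());
    }
    ranges
}

fn main() {
    // Same string as the usage example, written with escapes.
    let s = "\u{2600}\u{FE0F}hello\n\u{1F5FA}\u{FE0F}world\n";
    let ranges = line_ranges(s);
    // The first line spans bytes 0..12, as in the usage example.
    assert_eq!(ranges[0], 0..12);
    println!("{ranges:?}");
}
```

This linear scan runs on every call, which is the repeated cost a precomputed structure avoids.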

See the documentation for more.
