Flesh out string-offsets README
jorendorff committed Nov 12, 2024
1 parent fd056fb commit ef32575
Showing 2 changed files with 56 additions and 1 deletion.
35 changes: 34 additions & 1 deletion crates/string-offsets/README.md
@@ -1,6 +1,16 @@
# string-offsets

This crate converts string positions between Rust style (UTF-8 byte offsets) and styles used by other programming languages, as well as line numbers.
Offset calculator to convert between byte, char, and line offsets in a string.

Rust strings are UTF-8, but JavaScript has UTF-16 strings, and in Python, strings are sequences of
Unicode code points. It's therefore necessary to adjust string offsets when communicating across
programming language boundaries. [`StringOffsets`] does these adjustments.
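
To make the mismatch concrete, here is a small sketch that uses only the Rust standard library, not this crate. The helper name `encoded_lengths` is hypothetical; it measures one string in all three units:

```rust
/// Returns the length of `s` in (UTF-8 bytes, UTF-16 code units, Unicode code points).
/// Hypothetical helper for illustration only; it is not part of string-offsets.
fn encoded_lengths(s: &str) -> (usize, usize, usize) {
    let utf8_bytes = s.len(); // what Rust's `str::len` reports
    let utf16_units = s.encode_utf16().count(); // what JavaScript's `length` reports
    let code_points = s.chars().count(); // what Python's `len` reports
    (utf8_bytes, utf16_units, code_points)
}
```

For `"\u{1F5FA}\u{FE0F}hello"` (the map emoji followed by `hello`) this returns `(12, 8, 7)`: one string, three different lengths, so an offset into it has to be translated rather than passed through unchanged.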

Each `StringOffsets` value contains offset information for a single string. [Building the data
structure](StringOffsets::new) takes O(n) time and memory, but then each conversion is fast.
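
The build-once, query-many shape described above can be sketched with a plain lookup table. This is a deliberately naive stand-in, assuming one table entry per byte; `NaiveOffsets` is a hypothetical type and is not how the crate itself stores its data:

```rust
/// Naive illustration of "O(n) to build, then fast to query".
/// Hypothetical type; NOT the crate's actual implementation.
struct NaiveOffsets {
    /// `utf16_at[b]` is the number of UTF-16 code units that precede the
    /// character containing byte offset `b`. The final entry is the total.
    utf16_at: Vec<usize>,
}

impl NaiveOffsets {
    /// Single pass over the string: O(n) time and O(n) memory.
    fn new(s: &str) -> Self {
        let mut utf16_at = Vec::with_capacity(s.len() + 1);
        let mut units = 0;
        for ch in s.chars() {
            // Every byte of this character maps to the character's start.
            for _ in 0..ch.len_utf8() {
                utf16_at.push(units);
            }
            units += ch.len_utf16();
        }
        utf16_at.push(units); // entry for the offset one past the end
        NaiveOffsets { utf16_at }
    }

    /// Constant-time lookup after the table is built.
    /// Panics if `byte_offset` is greater than the string's length.
    fn utf8_to_utf16(&self, byte_offset: usize) -> usize {
        self.utf16_at[byte_offset]
    }
}
```

A table like this spends a full `usize` per byte of input, which is the kind of overhead a more compact representation would avoid.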

["UTF-8 Conversions with BitRank"](https://adaptivepatchwork.com/2023/07/10/utf-conversion/) is a
blog post explaining the implementation.

## Usage

@@ -10,3 +20,26 @@ Add this to your `Cargo.toml`:
[dependencies]
string-offsets = "0.1"
```

Then:

```rust
use string_offsets::StringOffsets;

let s = "☀️hello\n🗺️world\n";
let offsets = StringOffsets::new(s);

// Find offsets where lines begin and end.
assert_eq!(offsets.line_to_utf8s(0), 0..12); // note: 0-based line numbers

// Translate string offsets between UTF-8 and other encodings.
// This map emoji is 7 UTF-8 bytes...
assert_eq!(&s[12..19], "🗺️");
// ...but only 3 UTF-16 code units...
assert_eq!(offsets.utf8_to_utf16(12), 8);
assert_eq!(offsets.utf8_to_utf16(19), 11);
// ...and only 2 Unicode characters.
assert_eq!(offsets.utf8s_to_chars(12..19), 8..10);
```
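
As a cross-check on the line range in the example above, the same `0..12` can be derived with the standard library alone. This is a minimal sketch, assuming lines end at `\n` and that the range includes the newline; `line_to_utf8s_naive` is a hypothetical name and rescans the string on every call, which is the cost `StringOffsets` is built to avoid:

```rust
use std::ops::Range;

/// Byte range of the 0-based `line` in `s`, including its trailing '\n'.
/// Returns `None` if `s` has no such line. Hypothetical helper that scans
/// the string from the start each time; it is not part of string-offsets.
fn line_to_utf8s_naive(s: &str, line: usize) -> Option<Range<usize>> {
    let mut start = 0;
    for (i, chunk) in s.split_inclusive('\n').enumerate() {
        let end = start + chunk.len();
        if i == line {
            return Some(start..end);
        }
        start = end;
    }
    None
}
```

With the example string, line 0 yields `Some(0..12)` and line 1 yields `Some(12..25)`.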

See [the documentation](https://docs.rs/string-offsets/latest/string_offsets/struct.StringOffsets.html) for more.
22 changes: 22 additions & 0 deletions crates/string-offsets/src/lib.rs
@@ -1,5 +1,27 @@
//! Offset calculator to convert between byte, char, and line offsets in a string.
//!
//!
//! # Example
//!
//! ```
//! use string_offsets::StringOffsets;
//!
//! let s = "☀️hello\n🗺️world\n";
//! let offsets = StringOffsets::new(s);
//!
//! // Find offsets where lines begin and end.
//! assert_eq!(offsets.line_to_utf8s(0), 0..12); // note: 0-based line numbers
//!
//! // Translate string offsets between UTF-8 and other encodings.
//! // This map emoji is 7 UTF-8 bytes...
//! assert_eq!(&s[12..19], "🗺️");
//! // ...but only 3 UTF-16 code units...
//! assert_eq!(offsets.utf8_to_utf16(12), 8);
//! assert_eq!(offsets.utf8_to_utf16(19), 11);
//! // ...and only 2 Unicode characters.
//! assert_eq!(offsets.utf8s_to_chars(12..19), 8..10);
//! ```
//!
//! See [`StringOffsets`] for details.
#![deny(missing_docs)]
