Multibyte modifiers and high-byte sets

mnemnion · mnemnion · commit 1ad1d12a48be · 2024-11-30T20:43:25.000-05:00
This release makes modifiers apply to a whole multibyte character rather
than to its final byte, so regexen like `λ+` match `"λλλλ"`.  It also adds
high-byte character classes via the `\x` escape syntax, which, with some
care, allow construction of useful character sets over _reasonably_ dense
runs of codepoints.
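
As a sanity check on the byte ranges quoted in the README change, here is a
minimal sketch that uses only `std.unicode.utf8Encode`, not mvzr itself.
The test names and loop bounds are illustrative assumptions, not part of
this commit.  It confirms that U+0391 through U+03A9 encode as `0xce`
followed by a byte in `0x91`-`0xa9`, and that the C1 controls encode as
`0xc2` followed by a byte in `0x80`-`0x9f`:

```zig
const std = @import("std");

// Illustrative only: checks the UTF-8 byte layout that patterns such as
// `(\xce[\x91-\xa9])+` and `\xc2[\x80-\x9f]` rely on.  Does not call mvzr.
test "uppercase Greek block encodes as 0xCE 0x91..0xA9" {
    var buf: [4]u8 = undefined;
    // U+0391 (Α) through U+03A9 (Ω).  U+03A2 is unassigned, but it falls
    // inside the same byte range, so the byte-level set also admits it.
    var cp: u21 = 0x0391;
    while (cp <= 0x03A9) : (cp += 1) {
        const len = try std.unicode.utf8Encode(cp, &buf);
        try std.testing.expectEqual(@as(u3, 2), len);
        try std.testing.expectEqual(@as(u8, 0xCE), buf[0]);
        try std.testing.expect(buf[1] >= 0x91 and buf[1] <= 0xA9);
    }
}

test "C1 control codes encode as 0xC2 0x80..0x9F" {
    var buf: [4]u8 = undefined;
    var cp: u21 = 0x0080;
    while (cp <= 0x009F) : (cp += 1) {
        const len = try std.unicode.utf8Encode(cp, &buf);
        try std.testing.expectEqual(@as(u3, 2), len);
        try std.testing.expectEqual(@as(u8, 0xC2), buf[0]);
        try std.testing.expect(buf[1] >= 0x80 and buf[1] <= 0x9F);
    }
}
```

Run with `zig test` on the file.  A range that crosses a lead-byte boundary
would need one alternative per lead byte, which is presumably where the
"some care" comes in.
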
diff --git a/README.md b/README.md
@@ -2,7 +2,7 @@
 
 Finding myself in need of a regular expressions library for a Zig project, and needing it to build regex at runtime, not just comptime, I ended up speedrunning a little library for just that purpose.
 
-This is that library.  It's a simple bytecode-based Commander Pike-style VM.  Just under 1500 lines of load-bearing code, no dependencies other than `std`.
+This is that library.  It's a simple bytecode-based Commander Pike-style VM.  Under 2000 lines of load-bearing code, no dependencies other than `std`.
 
 The provided Regex type allows 64 'operations' and 8 unique ASCII character sets.  If you would like more, or less, you can call `SizedRegex(num_ops, num_sets)` to customize the type.
 
@@ -11,7 +11,7 @@ The provided Regex type allows 64 'operations' and 8 unique ASCII character sets
 Drop the file into your project, or use the Zig build system:
 
 ```zig
-zig fetch --save "https://github.com/mnemnion/mvzr/archive/refs/tags/v0.2.5.tar.gz"
+zig fetch --save "https://github.com/mnemnion/mvzr/archive/refs/tags/v0.3.0.tar.gz"
 ```
 
 I'll do my best to keep that URL fresh, but it pays to check over here: ➔
@@ -38,13 +38,17 @@ For the latest release version.
 
 ## Limitations and Quirks
 
-- No Unicode support to speak of
+- Minimal multibyte / Unicode support
+    - This has improved somewhat.  A regex like `λ?` now matches an optional lambda, not just
+      an optional final byte.  Additionally, ranges of bytes greater than 0x7f are now supported;
+      with some care, these can match certain sets: for instance `(\xce[\x91-\xa9])+` will match
+      a string of uppercase Greek letters, `\xc2[\x80-\x9f]` matches a C1 control code, and so on.
+      But you'll still need to work at the byte level, and use `\x` format, to do these tasks.
 - No fancy modifiers (you want case-insensitive, great, lowercase your string)
 - `.` matches any one byte.  `[^\n\r]` works fine if that's not what you want
     - Or split into lines first, divide and conquer
     - Note: `$` permits a final newline, but `^` must be the beginning of a string, and `$` _only_ matches a final newline.
-- Backtracks (sorry. For this to work without backtracking, we need async back)
-    - Preliminary tests indicate that this backtracking is non-catastrophic
+- Backtracks (sorry. For this design to work without backtracking, we need async back)
 - Compiler does some best-effort validation but I haven't really pounded on it
 - No capture groups.  Divide and conquer
 
diff --git a/build.zig.zon b/build.zig.zon
@@ -7,7 +7,7 @@
 
     // This is a [Semantic Version](https://semver.org/).
     // In a future version of Zig it will be used for package deduplication.
-    .version = "0.2.5",
+    .version = "0.3.0",
 
     // This field is optional.
     // This is currently advisory only; Zig does not yet do anything
diff --git a/src/mvzr.zig b/src/mvzr.zig