Commit 1ad1d12

Multibyte modifiers and high-byte sets
This release makes regexen like `λ+` match `"λλλλ"`. It also adds high-byte character classes, using the `\x` escaping syntax, which, with some care, allow construction of useful character sets of _reasonably_ dense codepoints.
1 parent 1c09e09 commit 1ad1d12

3 files changed

+269
-100
lines changed

README.md

+9-5
@@ -2,7 +2,7 @@

 Finding myself in need of a regular expressions library for a Zig project, and needing it to build regex at runtime, not just comptime, I ended up speedrunning a little library for just that purpose.

-This is that library. It's a simple bytecode-based Commander Pike-style VM. Just under 1500 lines of load-bearing code, no dependencies other than `std`.
+This is that library. It's a simple bytecode-based Commander Pike-style VM. Under 2000 lines of load-bearing code, no dependencies other than `std`.

 The provided Regex type allows 64 'operations' and 8 unique ASCII character sets. If you would like more, or less, you can call `SizedRegex(num_ops, num_sets)` to customize the type.

@@ -11,7 +11,7 @@ The provided Regex type allows 64 'operations' and 8 unique ASCII character sets
 Drop the file into your project, or use the Zig build system:

 ```zig
-zig fetch --save "https://github.com/mnemnion/mvzr/archive/refs/tags/v0.2.5.tar.gz"
+zig fetch --save "https://github.com/mnemnion/mvzr/archive/refs/tags/v0.3.0.tar.gz"
 ```

 I'll do my best to keep that URL fresh, but it pays to check over here: ➔
@@ -38,13 +38,17 @@ For the latest release version.

 ## Limitations and Quirks

-- No Unicode support to speak of
+- Minimal multibyte / Unicode support
+- This has improved somewhat. A regex like `λ?` now matches an optional lambda, not just
+an optional final byte. Additionally, ranges of bytes greater than 0x7f are now supported,
+this (with some care) can match certain sets: for instance `(\xce[\x91-\xa9])+` will match
+a string of uppercase Greek letters, `\xc2[\x80-\x9f]` matches a C1 control code, and so on.
+But you'll still need to work at the byte level, and use `\x` format, to do these tasks.
 - No fancy modifiers (you want case-insensitive, great, lowercase your string)
 - `.` matches any one byte. `[^\n\r]` works fine if that's not what you want
 - Or split into lines first, divide and conquer
 - Note: `$` permits a final newline, but `^` must be the beginning of a string, and `$` _only_ matches a final newline.
-- Backtracks (sorry. For this to work without backtracking, we need async back)
-- Preliminary tests indicate that this backtracking is non-catastrophic
+- Backtracks (sorry. For this design to work without backtracking, we need async back)
 - Compiler does some best-effort validation but I haven't really pounded on it
 - No capture groups. Divide and conquer
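The byte-level patterns added to the README in this commit depend on how UTF-8 lays out two-byte sequences. As a rough illustration only, the sketch below uses Python's standard `re` module on `bytes` as a stand-in. It does not call mvzr, and the helper `utf8_hex` and the pattern names are hypothetical, invented for this example. The non-capturing group in `GROUPED_LAMBDAS` only approximates the behaviour the commit message describes for `λ+`; how mvzr implements it internally is not shown here.

```python
import re


def utf8_hex(text: str) -> str:
    """Return the UTF-8 encoding of `text` as space-separated hex bytes."""
    return " ".join(f"{b:02x}" for b in text.encode("utf-8"))


# U+0391 (Α) through U+03A9 (Ω) share the lead byte 0xCE, with the second
# byte running from 0x91 to 0xA9. That range also covers 0xA2, which would
# be U+03A2, an unassigned codepoint: hence "with some care".
GREEK_UPPER = re.compile(rb"(?:\xce[\x91-\xa9])+")

# The C1 control codes U+0080..U+009F encode as 0xC2 followed by 0x80..0x9F.
C1_CONTROL = re.compile(rb"\xc2[\x80-\x9f]")

# λ is U+03BB, encoded as the two bytes 0xCE 0xBB. If `+` applied only to the
# final byte, the pattern would mean "0xCE, then one or more 0xBB".
NAIVE_LAMBDAS = re.compile(rb"\xce\xbb+")

# Applying `+` to the whole two-byte sequence gives the intended meaning.
GROUPED_LAMBDAS = re.compile(rb"(?:\xce\xbb)+")

if __name__ == "__main__":
    print(utf8_hex("Α"), "|", utf8_hex("Ω"), "|", utf8_hex("λ"))
    lambdas = "λλλλ".encode("utf-8")
    print(NAIVE_LAMBDAS.fullmatch(lambdas) is not None)
    print(GROUPED_LAMBDAS.fullmatch(lambdas) is not None)
```

The same reasoning extends to other two-byte blocks: encode the first and last codepoint of the range, check that they share a lead byte, and use the second bytes as the endpoints of the `\x` range.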

build.zig.zon

+1-1
@@ -7,7 +7,7 @@

 // This is a [Semantic Version](https://semver.org/).
 // In a future version of Zig it will be used for package deduplication.
-.version = "0.2.5",
+.version = "0.3.0",

 // This field is optional.
 // This is currently advisory only; Zig does not yet do anything
