Why does iteration with bytes::Regex yield empty matches that can split a codepoint, even when Unicode mode is enabled? #1276

Answered by BurntSushi
IsaacOscar asked this question in Q&A
This is very intentional behavior, although I note that it isn't documented. A great deal of care and attention is paid to this in the implementation inside of regex-automata. Notably, the semantics you want are called "UTF-8 mode," which is distinct from, and orthogonal to, Unicode mode. UTF-8 mode is about guaranteeing that match spans always fall on UTF-8 boundaries. Unicode mode is about whether Unicode features are available in the regex pattern. For example, see:

And special handling of empty matches when UTF-8 mode is enabled versus not:

#[inline]
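To make the distinction concrete, here is a minimal sketch that uses only the standard library. It is not the regex-automata implementation, and both function names are hypothetical. It models where an empty pattern could match: at every byte offset when UTF-8 mode is off, and only at offsets that `str::is_char_boundary` accepts when UTF-8 mode is on.

```rust
/// Hypothetical model of UTF-8 mode being disabled: an empty pattern
/// may match at every byte offset, including inside a codepoint.
fn empty_match_offsets_bytes(haystack: &[u8]) -> Vec<usize> {
    (0..=haystack.len()).collect()
}

/// Hypothetical model of UTF-8 mode being enabled: empty matches are
/// only kept at offsets that fall on a UTF-8 codepoint boundary.
fn empty_match_offsets_utf8(haystack: &str) -> Vec<usize> {
    (0..=haystack.len())
        .filter(|&i| haystack.is_char_boundary(i))
        .collect()
}

fn main() {
    // 'a' is 1 byte and '☃' (U+2603) is 3 bytes, so the haystack is 4 bytes long.
    let haystack = "a☃";

    // Offsets 2 and 3 fall inside the encoding of '☃'.
    println!("bytes: {:?}", empty_match_offsets_bytes(haystack.as_bytes()));

    // Only offsets on codepoint boundaries remain.
    println!("utf8:  {:?}", empty_match_offsets_utf8(haystack));
}
```

In the regex crate itself, the analogous comparison would be iterating an empty pattern with `regex::bytes::Regex` over `haystack.as_bytes()` versus `regex::Regex` over the `&str`. As discussed above, only the former can report empty matches that split a codepoint.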

Answer selected by BurntSushi
This discussion was converted from issue #1275 on August 05, 2025 14:04.