This directory contains a Python runner program for benchmarking both Python's
standard library re module and the third-party regex module.
The Python runner program takes advantage of the fact that the API of the
third-party regex module is drop-in compatible with the re module. So all it
does is import regex as re instead of import re when benchmarking the regex
module.
Otherwise, the runner program makes the following decisions:
- Only one pattern is supported.
- When the regex module is used, regex.DEFAULT_VERSION is set to
  regex.VERSION1. This is done because it has better Unicode support, and is
  presumably the more interesting thing to measure.
- When Unicode mode is enabled, the runner program reports an error if the
  haystack is invalid UTF-8. (See below.)
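
The import swap and the VERSION1 setting described above can be sketched roughly
as follows. This is only an illustration, not the runner's actual code: the
load_engine function and its name argument are hypothetical. The
regex.DEFAULT_VERSION and regex.VERSION1 attributes come from the third-party
regex module, which must be installed for the "regex" branch to work.

```python
import importlib


def load_engine(name):
    """Return a module exposing the re API for the requested engine.

    `name` is a hypothetical selector: either "re" or "regex".
    """
    if name == "re":
        return importlib.import_module("re")
    if name == "regex":
        # Third-party module; raises ImportError if it is not installed.
        engine = importlib.import_module("regex")
        # Opt in to VERSION1 behavior for patterns compiled afterwards.
        engine.DEFAULT_VERSION = engine.VERSION1
        return engine
    raise ValueError(f"unknown engine: {name!r}")


# Because the APIs are drop-in compatible, the calling code is identical
# no matter which module was loaded.
engine = load_engine("re")
print(engine.search(r"\d+", "abc 123").group())
```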
Both the re and regex modules have two different types of regexes: ASCII
regexes and Unicode regexes. ASCII regexes can be built from either byte
string patterns or Unicode string patterns, and an ASCII regex can only search
haystacks of the same type as its pattern. Conversely, Unicode
regexes can only be built from Unicode strings, and Unicode regexes can only
search Unicode strings.
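
As a small illustration of these rules using only the standard library re
module (the variable and function names are made up for this sketch):

```python
import re

# ASCII regex built from a byte string pattern: searches byte strings only.
bytes_re = re.compile(rb"\w+")
# ASCII regex built from a Unicode string pattern: searches Unicode strings only.
ascii_str_re = re.compile(r"\w+", re.ASCII)
# Unicode regex: the default for Unicode string patterns.
unicode_re = re.compile(r"\w+")

haystack = "n\u00e9"  # "n" followed by U+00E9 (e with acute accent)

# In ASCII mode, \w does not match the non-ASCII letter.
ascii_bytes_match = bytes_re.search(haystack.encode("utf-8")).group()
ascii_str_match = ascii_str_re.search(haystack).group()
# In Unicode mode, \w matches it.
unicode_match = unicode_re.search(haystack).group()


def mixing_types_is_rejected():
    """A regex built from a Unicode string cannot search a byte string."""
    try:
        unicode_re.search(b"abc")
    except TypeError:
        return True
    return False


def unicode_bytes_pattern_is_rejected():
    """A Unicode regex cannot be built from a byte string pattern."""
    try:
        re.compile(rb"\w+", re.UNICODE)
    except ValueError:
        return True
    return False
```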
Since Unicode strings cannot be losslessly constructed from invalid UTF-8,
it follows that neither the re nor the regex module can search invalid
UTF-8 while Unicode mode is enabled. This is why enabling Unicode mode
for these regex engines requires the haystack to be valid UTF-8.
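
A minimal sketch of what such a check might look like follows. The
decode_haystack helper is hypothetical and is not taken from the runner
program. It relies on bytes.decode being strict by default.

```python
def decode_haystack(raw: bytes) -> str:
    """Decode a haystack for Unicode mode, rejecting invalid UTF-8."""
    try:
        # The default error handler is "strict", so invalid bytes raise.
        return raw.decode("utf-8")
    except UnicodeDecodeError as err:
        raise ValueError(f"haystack is not valid UTF-8: {err}") from err


# A lossy decode "works", but the original bytes cannot be recovered:
# the invalid byte is replaced with U+FFFD.
invalid = b"a\xffb"
lossy = invalid.decode("utf-8", errors="replace")
round_trips = lossy.encode("utf-8") == invalid
```

Here round_trips ends up False, which is the lossiness the paragraph above
refers to.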
The "use bytes or Unicode" split actually infects pretty much everything about the regex APIs in Python-land. Once you make your choice about the type of your pattern, everything you then pass into the regex engine must have the same type.
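
For example, a substitution involves three values (pattern, replacement and
haystack), and all three need to agree. The sub_same_type wrapper below is a
hypothetical helper, written only to make that constraint explicit before
calling re.sub:

```python
import re


def sub_same_type(pattern, repl, haystack):
    """Call re.sub only if all three arguments are str or all are bytes."""
    kinds = {type(pattern), type(repl), type(haystack)}
    if kinds != {str} and kinds != {bytes}:
        names = sorted(k.__name__ for k in kinds)
        raise TypeError(f"mixed or unsupported argument types: {names}")
    return re.sub(pattern, repl, haystack)


# All Unicode strings, or all byte strings: both are fine.
text_result = sub_same_type(r"\d+", "N", "a1b22")
bytes_result = sub_same_type(rb"\d+", b"N", b"a1b22")
```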