PikeVM: alternative prefix acceleration strategy #1339
Open
shilangyu wants to merge 5 commits into rust-lang:master from
This PR introduces a new opt-in strategy for performing prefix acceleration with prefilters in the PikeVM. In benchmarks it matches or outperforms the existing strategy, with speedups of up to 20$\times$. The new algorithm upholds this crate's linear-time runtime complexity guarantee.
Idea
Currently the PikeVM simulates the `(?s:.)*?` prefix to find matches anywhere in the string. It does so by computing the epsilon closure of the initial NFA state at every input position. However, we can leverage information from prefilters to avoid creating new threads at positions where we know a match cannot exist.

For example, the existing implementation does poorly with the regex `/ab*c/` and the input `"abbbb...bbbb"` (one a, many b's). The PikeVM currently creates new threads at every position, despite none of them leading to a match. The extracted literal for that regex is `I("a")`, which immediately tells us the only potential match can start at the very beginning of the input. Under the new strategy, no new threads would be created at any of the b positions.

The new strategy is better whenever we are looking for all matches. For finding the first match it can have drawbacks, but is beneficial on average. It could be considered to make this the default strategy for match-all while leaving it opt-in for find-first; I am happy to contribute that change of default.
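The effect of the literal prefilter can be sketched in plain Rust. This is a toy illustration, not the crate's internals: with the extracted literal `I("a")`, only positions holding an `a` are candidate match starts, and thread creation is skipped everywhere else.

```rust
// Toy sketch (not the crate's internals): enumerate the positions a
// literal prefilter would mark as candidate match starts.
fn candidate_starts(haystack: &str, literal: char) -> Vec<usize> {
    haystack
        .char_indices()
        .filter(|&(_, c)| c == literal)
        .map(|(i, _)| i)
        .collect()
}
```

For `/ab*c/` on `"abbbb"`, `candidate_starts("abbbb", 'a')` returns `[0]`: the epsilon closure of the start state only needs to be computed once, instead of at all five positions.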
Implementation
In the existing prefix acceleration we run a prefilter search whenever we are out of PikeVM states to explore. The new strategy runs the prefilter earlier, proactively determining the next matching position in advance. The advantage of doing this eagerly is that whenever the PikeVM simulates the `(?s:.)*?` prefix, it can skip that work at positions the prefilter did not mark. This is sound by the prefilter's correctness: if the prefilter did not mark a position, no match can start there. We rerun the prefilter whenever we are past the input position it last matched.

A new config option controls the prefiltering strategy; the default remains the existing approach. We also cache the computed prefilter position between runs to avoid redundant work when finding all matches. The cache is invalidated in two cases: we are (1) searching a different haystack, or (2) searching the same haystack at an earlier position. Pointer equality determines haystack identity.
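The caching rule above can be sketched as follows. This is a minimal sketch with illustrative names, not the crate's actual types, and the "prefilter" is reduced to a single-byte search: the cached hit is reused only when the haystack pointer matches and the search has not rewound, and a cached "no match ahead" answer lets later searches return immediately.

```rust
// Illustrative sketch of the prefilter-position cache; not the crate's
// actual types. The toy prefilter looks for a single byte.
struct PrefilterCache {
    haystack_ptr: *const u8, // identity of the haystack last searched
    searched_from: usize,    // position that search started at
    hit: Option<usize>,      // position it found, if any
}

impl PrefilterCache {
    fn new() -> Self {
        Self { haystack_ptr: std::ptr::null(), searched_from: 0, hit: None }
    }

    /// Next candidate start at or after `at`, rerunning the toy prefilter
    /// only when the cached answer is stale.
    fn next_candidate(&mut self, haystack: &[u8], at: usize, byte: u8) -> Option<usize> {
        let fresh = std::ptr::eq(self.haystack_ptr, haystack.as_ptr())
            && at >= self.searched_from;
        match (fresh, self.hit) {
            // Cached "no match anywhere ahead": return immediately.
            (true, None) => None,
            // Cached hit is still at or ahead of `at`: reuse it.
            (true, Some(p)) if p >= at => Some(p),
            // Stale: different haystack, rewound, or hit already passed.
            _ => {
                self.haystack_ptr = haystack.as_ptr();
                self.searched_from = at;
                self.hit = haystack[at..]
                    .iter()
                    .position(|&b| b == byte)
                    .map(|i| at + i);
                self.hit
            }
        }
    }
}
```

The "no match ahead" fast path is also what makes repeated searches of an exhausted haystack nearly free, which matters for the benchmark caveat discussed below.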
Benchmarks
To benchmark we use rebar (fork: https://github.com/LindenRegex/rebar/tree/mw/prefix-acc-bench) and enable three engines:

- `rust/regex/pikevm`
- `rust/regex/pikevm/accEmptyStates`
- `rust/regex/pikevm/accOneAhead`

The first and second engines execute the same algorithm; both were included to check that no regressions were introduced.
The new algorithm performs noticeably better, either matching or outperforming the existing strategy.
The full benchmarking results are here with the report here. A few benchmarks report a very large speedup (470-5800$\times$). This is due to rebar not clearing the cache between runs, so searches return immediately, already knowing that there is no match.
Drawbacks
The new approach does suffer from degenerate cases. In particular, the new strategy loses the streaming property of matching with the PikeVM: it might look at characters beyond a potential match, something the existing strategy would never do. For example, take the haystack `"aabbbb...bbbaa"` (two a's, many b's, two a's) and the regex `/aa/`: the prefilter on the literal `I("aa")` finds the first match at the beginning of the haystack, but the new strategy then searches again and finds the next `"aa"` at the end of the haystack. The entire haystack gets scanned despite a valid match existing at the start. The existing prefix acceleration strategy does not exhibit this issue.

The rebar benchmarks do not showcase this problem, as they only test the match-all case (for which the new strategy is better). For find-first-match the new strategy can still perform better on average, though we do not have a comprehensive set of benchmarks to confirm this.
This effort was supervised by Aurèle Barrière and Clément Pit-Claudel at EPFL's SYSTEMF.