PikeVM: alternative prefix acceleration strategy#1339

Open
shilangyu wants to merge 5 commits intorust-lang:masterfrom
LindenRegex:mw/prefix-acc-cmp

@shilangyu

This PR introduces a new opt-in strategy for performing prefix acceleration with prefilters in the PikeVM. In benchmarks it outperforms the existing strategy, with speedups of up to 20$\times$. The new algorithm upholds the linear-time complexity guarantee of this crate.

Idea

Currently the PikeVM simulates an implicit (?s:.)*? prefix to find matches anywhere in the haystack. It does so by computing the epsilon closure of the initial NFA state at every input position. However, we could leverage the information from prefilters to avoid creating new threads at positions where we know no match can start.

For example, the existing implementation does poorly with the regex /ab*c/ and input "abbbb...bbbb" (one a, many b's). The PikeVM currently creates new threads at every position, despite none of them leading to a match. The extracted literal for that regex is I("a"), which immediately tells us the only potential match can start at the very beginning of the input. In the new strategy, no new threads would be created at any of the positions of b's.
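To make the difference concrete, here is a minimal sketch that only counts how many positions would get a fresh thread for /ab*c/ under each approach. It is not code from this PR: `next_candidate` is a hypothetical stand-in for a prefilter on the literal "a", and both counting functions are illustrative names.

```rust
/// Hypothetical stand-in for a prefilter on the literal "a": returns the
/// first offset >= `at` where the literal occurs, if any.
fn next_candidate(haystack: &[u8], at: usize) -> Option<usize> {
    haystack.get(at..)?.iter().position(|&b| b == b'a').map(|i| at + i)
}

/// Without prefilter information, a new thread is seeded at every position.
fn seeds_every_position(haystack: &[u8]) -> usize {
    haystack.len()
}

/// With prefilter information, a new thread is seeded only at the
/// positions the prefilter reports as candidates.
fn seeds_with_prefilter(haystack: &[u8]) -> usize {
    let mut count = 0;
    let mut at = 0;
    while let Some(pos) = next_candidate(haystack, at) {
        count += 1;
        at = pos + 1;
    }
    count
}
```

For a haystack of one a followed by 1000 b's, the first function returns 1001 and the second returns 1.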

The new strategy is better whenever looking for all matches. For finding the first match, it can have drawbacks but is beneficial on average. One option is to make it the default strategy for match-all and keep it opt-in for find-first. I am happy to contribute this change of default.

Implementation

In the existing prefix acceleration we do a prefilter search whenever we are out of PikeVM states to explore. The new strategy runs the prefilter earlier, proactively determining the next candidate position in advance. The advantage of doing this eagerly is that whenever we simulate the (?s:.)*? in the PikeVM, we can skip this work at positions the prefilter did not mark. This is sound by the prefilter's correctness: if the prefilter didn't mark a position, no match can start there. We rerun the prefilter whenever we are past the input position it matched last.
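The loop described above can be sketched on a toy model. The following is not the PR's implementation: it hardcodes the regex /ab*c/ as a tiny thread list, and `prefilter` and `find_one_ahead` are hypothetical names. It only illustrates where the eager prefilter call, the refresh, and the restricted seeding would sit.

```rust
/// Hypothetical prefilter for the literal "a": first offset >= `at`.
fn prefilter(haystack: &[u8], at: usize) -> Option<usize> {
    haystack.get(at..)?.iter().position(|&b| b == b'a').map(|i| at + i)
}

/// Toy thread-list simulation of /ab*c/ that finds the leftmost match.
/// Returns the match span (if any) and the number of threads seeded.
fn find_one_ahead(haystack: &[u8]) -> (Option<(usize, usize)>, usize) {
    // Each thread stores the start offset of a partial match that has
    // consumed 'a' followed by zero or more 'b's.
    let mut threads: Vec<usize> = Vec::new();
    let mut seeded = 0;
    // Run the prefilter eagerly, before stepping the simulation.
    let mut candidate = prefilter(haystack, 0);
    let mut at = 0;
    while at < haystack.len() {
        // Refresh the candidate once the search has moved past it.
        if let Some(c) = candidate {
            if c < at {
                candidate = prefilter(haystack, at);
            }
        }
        // With no live threads, jump straight to the next candidate.
        if threads.is_empty() {
            match candidate {
                None => break,
                Some(c) => at = c,
            }
        }
        let byte = haystack[at];
        let mut next = Vec::new();
        for &start in &threads {
            match byte {
                b'b' => next.push(start),
                b'c' => return (Some((start, at + 1)), seeded),
                _ => {} // the thread dies
            }
        }
        // Seed a new thread only where the prefilter marked a candidate.
        if candidate == Some(at) {
            next.push(at);
            seeded += 1;
        }
        threads = next;
        at += 1;
    }
    (None, seeded)
}
```

In this model, `find_one_ahead(b"abbbbc")` returns `(Some((0, 6)), 1)`: a single thread is seeded at offset 0 and none at the b's.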

A new config option controls the prefiltering strategy. The default remains the existing approach. We also cache the computed prefilter position between runs to avoid redundant work when finding all matches. The cache is invalidated in two cases: we are (1) searching a different haystack, or (2) searching the same haystack at an earlier position. Pointer equality determines haystack identity.
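The two invalidation rules can be written down as a small check. This is a sketch under assumptions, not the PR's data structure: `PrefilterCache`, its fields, and `is_valid_for` are hypothetical names, and comparing the length in addition to the pointer is an extra precaution added here for illustration.

```rust
/// Hypothetical cache of the last prefilter result.
struct PrefilterCache {
    /// Address of the haystack the result was computed for.
    haystack_ptr: *const u8,
    /// Length of that haystack (an assumption; the PR only mentions
    /// pointer equality).
    haystack_len: usize,
    /// Position the prefilter search started from.
    searched_from: usize,
    /// Candidate found at or after `searched_from`, if any.
    candidate: Option<usize>,
}

impl PrefilterCache {
    fn new(haystack: &[u8], searched_from: usize, candidate: Option<usize>) -> Self {
        PrefilterCache {
            haystack_ptr: haystack.as_ptr(),
            haystack_len: haystack.len(),
            searched_from,
            candidate,
        }
    }

    /// The cache is reusable only for the same haystack (by identity, not
    /// by content) and only for searches that do not start earlier.
    fn is_valid_for(&self, haystack: &[u8], at: usize) -> bool {
        let same_haystack = std::ptr::eq(self.haystack_ptr, haystack.as_ptr())
            && self.haystack_len == haystack.len();
        same_haystack && at >= self.searched_from
    }
}
```

Identity comparison is cheap, but it means a different buffer with identical contents is treated as a different haystack and triggers a fresh prefilter run.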

Benchmarks

To benchmark we use rebar (fork: https://github.com/LindenRegex/rebar/tree/mw/prefix-acc-bench) and enable three engines:

  1. The original PikeVM implementation (called rust/regex/pikevm)
  2. The new PikeVM implementation with the strategy set to the existing one (called rust/regex/pikevm/accEmptyStates)
  3. The new PikeVM implementation with the strategy set to the new one (called rust/regex/pikevm/accOneAhead)

The first and second engines execute the same algorithm. Both were included to see if any regressions were introduced.

| Engine | Version | Geometric mean of speed ratios | Benchmark count |
|---|---|---|---|
| rust/regex/pikevm/accOneAhead | 0.4.14 | 1.03 | 307 |
| rust/regex/pikevm/accEmptyStates | 0.4.14 | 1.22 | 307 |
| rust/regex/pikevm | 0.4.14 | 1.22 | 307 |

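The new algorithm performs noticeably better, either matching or outperforming the existing strategy: its geometric mean of speed ratios is 1.03, against 1.22 for both variants of the existing strategy (lower is better).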

The full benchmarking results are here with the report here. A few benchmarks report a very large speedup (470-5800$\times$). This is because rebar does not clear the cache between runs, so those searches return immediately, already knowing that there is no match.

Drawbacks

The new approach does suffer from degenerate cases. Namely, it loses the streaming property of matching with the PikeVM: it might look at characters that come after a potential match, something the existing strategy would not do. For example, take the haystack "aabbbb...bbbaa" (two a's, many b's, two a's) and the regex /aa/. The prefilter on literal I("aa") finds the first candidate at the beginning of the haystack. But the new strategy then searches again and finds the next "aa" at the end of the haystack, so the entire haystack gets scanned despite a valid match existing at the start. The existing prefix acceleration strategy does not exhibit this issue.
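The cost of this degenerate case can be estimated with a small model that tracks how far into the haystack each approach has looked before the first match of /aa/ is reported. Everything here is hypothetical and simplified: `find_aa` is a naive stand-in for the prefilter, and the two `inspected_*` functions model only the prefilter's reads, not the PikeVM itself.

```rust
/// Naive search for the literal "aa" starting at `at`. Returns the match
/// offset (if any) and the exclusive end of the region it inspected.
fn find_aa(haystack: &[u8], at: usize) -> (Option<usize>, usize) {
    let mut i = at;
    while i + 1 < haystack.len() {
        if haystack[i] == b'a' && haystack[i + 1] == b'a' {
            return (Some(i), i + 2);
        }
        i += 1;
    }
    (None, haystack.len())
}

/// Prefilter run only on demand: one search from the start suffices to
/// report the first match.
fn inspected_on_demand(haystack: &[u8]) -> usize {
    find_aa(haystack, 0).1
}

/// Prefilter run one step ahead: once the search position passes the first
/// candidate, the prefilter is run again before the match is reported.
fn inspected_one_ahead(haystack: &[u8]) -> usize {
    let (first, seen) = find_aa(haystack, 0);
    match first {
        None => seen,
        Some(start) => {
            let (_, seen_again) = find_aa(haystack, start + 1);
            seen.max(seen_again)
        }
    }
}
```

For two a's, 1000 b's, and two a's (1004 bytes), the on-demand model inspects 2 bytes while the one-ahead model inspects all 1004.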

The rebar benchmarks do not showcase this problem, as they only test the match-all case (for which the new strategy is better). For find-first-match, the new strategy can still potentially perform better, though we do not have a comprehensive set of benchmarks to confirm this.


This effort was supervised by Aurèle Barrière and Clément Pit-Claudel at EPFL's SYSTEMF.
