
reflex report exceeds complexity limits #213

Open
haihuayang opened this issue Oct 28, 2024 · 6 comments
Labels
question A technical question that has or needs clarification

Comments

@haihuayang

Hi,
I ran some tests to measure the performance of reflex. In the test below, reflex is very slow and eventually reports the error "exceeds complexity limits":

%%
^.*0yGzqGtqP6C7WmFcWo4C.*58CZqoCgRH5f1SnTjoYc.*gmpl55gVTrpCmb9sGktn     |
^.*82lS7xASLLHEG7YYIbSm.*UTIFP0YeJE8pvumxDuQO.*VqROVkTHYUyxwGzPTEYP     |
^.*Dn1m5zW7AlYdZh2f1fXm.*zkctITsNzQNlAUYrIi1W.*AzeYznQqsWNjwW0cxKHN     |
^.*fKYQD1rZelcMTJviZ6n7.*ntI3oSSVDmOmvQFgmKx3.*OIATTMBy45kqhqDY9MSE     |
^.*txA1P3XxndVSEzEL2rWT.*pDPBeJhCirz2FabZjRG8.*21CNK5EXVf28bJRCvM4G     |
^.*ZqWAQvg9bGkbiAK3RNXS.*hLvkAp2cJbCDIHRXUH7J.*iPm9bqneWI1pxcbYSWtw     |
^.*dZZoCOpOXYhhRBExj1ED.*DuydMZl2LkdjNiGwAXd3.*Us6zpeA02jrVTP6tVNMp     return true;
.       return false;
%%

$ /usr/local/nutanix/minerva/bin/reflex --full   --outfile=test21.cpp test21.l
reflex: error: malformed regular expression
error at position 497
djNiGwAXd3.*Us6zpeA02jrVTP6tVNMp)|(.)
        exceeds complexity limits___/

But if I modify the rules as below, which is equivalent for this use case, then reflex completes successfully and is much faster:

%%
0yGzqGtqP6C7WmFcWo4C.*58CZqoCgRH5f1SnTjoYc.*gmpl55gVTrpCmb9sGktn        |
82lS7xASLLHEG7YYIbSm.*UTIFP0YeJE8pvumxDuQO.*VqROVkTHYUyxwGzPTEYP        |
Dn1m5zW7AlYdZh2f1fXm.*zkctITsNzQNlAUYrIi1W.*AzeYznQqsWNjwW0cxKHN        |
fKYQD1rZelcMTJviZ6n7.*ntI3oSSVDmOmvQFgmKx3.*OIATTMBy45kqhqDY9MSE        |
txA1P3XxndVSEzEL2rWT.*pDPBeJhCirz2FabZjRG8.*21CNK5EXVf28bJRCvM4G        |
ZqWAQvg9bGkbiAK3RNXS.*hLvkAp2cJbCDIHRXUH7J.*iPm9bqneWI1pxcbYSWtw        |
dZZoCOpOXYhhRBExj1ED.*DuydMZl2LkdjNiGwAXd3.*Us6zpeA02jrVTP6tVNMp        return true;
.       return false;
%%

I am wondering if this is a bug.

Thanks,
Haihua

@genivia-inc
Member

Thanks, will take a look at this example.

Please note that DFA construction of regex patterns may, in extreme pathological cases, result in DFA state explosion, where the number of states is exponentially large (search online for "DFA state explosion"). This is well known and theoretically proven, but in practice it is unlikely to happen, except for such pathological cases, which are easy to find online.

When that happens, NFA is the better option and also supported by RE/flex using %o matcher=pcre2_perl or with command-line option -m pcre2_perl then link with PCRE2. Or use %o matcher=boost for POSIX matching with Boost.Regex. NFA matching is generally slower than DFA matching. Lex and Flex lexers are always POSIX matchers and DFA-based for speed, since DFA construction time is part of the compile time to create a lexer. Perl matching is not POSIX matching and not always suitable for lexical analysis, see the explanation in the RE/flex documentation.
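For example, switching to the PCRE2 matcher takes one option line in the lexer specification (assuming PCRE2 is installed; per the RE/flex documentation the generated lexer is then linked against PCRE2, typically with -lpcre2-8):

```lex
/* use NFA-based Perl matching via PCRE2 instead of DFA construction;
   equivalent to the reflex command-line option -m pcre2_perl;
   link the generated lexer with PCRE2 (typically -lpcre2-8) */
%o matcher=pcre2_perl
```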

So this is likely not a "bug" you found, but a "feature" that we can't fundamentally change when using DFAs for pattern matching.

I am working on a RE/flex update that will be released soon. The update speeds up searching with RE/flex as already implemented in ugrep v7, but it won't impact the DFA construction process.

@genivia-inc genivia-inc added the question A technical question that has or needs clarification label Oct 28, 2024
@genivia-inc
Member

To show why this is so, you can try the following yourself. Download reflex on a Linux box and build it with ./build.sh as per instructions. In the reflex/tests directory enter

./test.sh

Then type a regex, such as .*ab.*cd.*ef or .*ab.*cd.*ef|.*st.*uv.*wx, which already produce large DFAs, as the generated PDF will show (a dump.pdf is generated by the test.sh script). You can see that the DFA states and edges grow substantially to represent the pattern, which is due to the .* between the letter sequences. The longer the letter sequences in the pattern, the exponentially more states are produced.

Now, DFA construction time depends on the method used, e.g. Thompson construction or direct DFA construction, so there are subtle differences in time and size, but in the end the state explosion is something that no method can avoid. State minimization is possible, but that is done after construction. Also, the DFAs for .*ab.*cd.*ef and .*ab.*cd.*ef|.*st.*uv.*wx are minimal as far as I can tell, so there is no possible size reduction.

@genivia-inc
Member

genivia-inc commented Oct 29, 2024

Happy to help you out, so let me suggest that this "test" is not a good way to measure performance, and that there is a better approach for this kind of regex problem.

Regex is not a magical tool that can do whatever you want and still run fast. Writing a regex is like writing a program: some programs (regexes) are blunt hammers that will perform poorly. In this example the patterns are separated by .*. That is not the way to find these patterns and then check whether they occur on the same line. It also does not report results accurately, because a line can contain any combination of the three strings that make up each of the seven patterns we look for.

Rather, you want to use regex to find patterns in the input and then combine the findings for each line to check what was found on that line. This is much faster and not hard to do, at least not with RE/flex using find to generate a search engine (instead of a tokenizer):

%{
#include <stdio.h>
#include <vector>
size_t last_lineno = 1;
std::vector<int> hits;
void found(size_t lineno, int num)
{
  if (lineno != last_lineno)
  {
    int mask[7] = {}; // one 3-bit mask per pattern of 3 strings
    for (std::vector<int>::iterator it = hits.begin(); it != hits.end(); ++it)
    {
      int n = *it - 1;
      int j = n / 3, k = n % 3; // pattern j, string k of that pattern
      // only set bit k if the bits of the preceding strings are already set
      int p = ((1 << k) - 1);
      if ((mask[j] & p) == p)
        mask[j] |= 1 << k;
    }
    printf("%zu:", last_lineno);
    for (int i = 0; i < 7; ++i)
      if (mask[i] == 7) // all three strings found, in order
        printf(" %d", i + 1);
    printf("\n");
    last_lineno = lineno;
    hits.clear();
  }
  if (num > 0) // num == 0 is the EOF sentinel, not a pattern string
    hits.push_back(num);
}
%}
%option fast find flex noyywrap
%%
0yGzqGtqP6C7WmFcWo4C    { found(yylineno, 1); }
58CZqoCgRH5f1SnTjoYc    { found(yylineno, 2); }
gmpl55gVTrpCmb9sGktn    { found(yylineno, 3); }
82lS7xASLLHEG7YYIbSm    { found(yylineno, 4); }
UTIFP0YeJE8pvumxDuQO    { found(yylineno, 5); }
VqROVkTHYUyxwGzPTEYP    { found(yylineno, 6); }
Dn1m5zW7AlYdZh2f1fXm    { found(yylineno, 7); }
zkctITsNzQNlAUYrIi1W    { found(yylineno, 8); }
AzeYznQqsWNjwW0cxKHN    { found(yylineno, 9); }
fKYQD1rZelcMTJviZ6n7    { found(yylineno, 10); }
ntI3oSSVDmOmvQFgmKx3    { found(yylineno, 11); }
OIATTMBy45kqhqDY9MSE    { found(yylineno, 12); }
txA1P3XxndVSEzEL2rWT    { found(yylineno, 13); }
pDPBeJhCirz2FabZjRG8    { found(yylineno, 14); }
21CNK5EXVf28bJRCvM4G    { found(yylineno, 15); }
ZqWAQvg9bGkbiAK3RNXS    { found(yylineno, 16); }
hLvkAp2cJbCDIHRXUH7J    { found(yylineno, 17); }
iPm9bqneWI1pxcbYSWtw    { found(yylineno, 18); }
dZZoCOpOXYhhRBExj1ED    { found(yylineno, 19); }
DuydMZl2LkdjNiGwAXd3    { found(yylineno, 20); }
Us6zpeA02jrVTP6tVNMp    { found(yylineno, 21); }
<<EOF>>                 { /* flush the results of the last line */ found(yylineno + 1, 0); return 0; }
%%
int main()
{
  return yyFlexLexer().yylex();
}

This uses a vector of ints, where each int refers to a pattern string found on a line. When we move to a new line, we output the results for the previous line. A bit mask array collects the three bits corresponding to the three strings of each of the 7 patterns. A bit is only set when the lower bits are already set, which ensures that we count the strings found in the order we need (strings found out of order are ignored).

Benchmarks are only useful if they tell you what to expect for common regex pattern search problems in general, not for problems where using regex in that particular way is a poor fit. One can choose DFA or NFA matching to see which method is faster, but that is pointless because the trade-off is well understood (a DFA takes more time to construct, an NFA takes more time to match input), so the outcome is not surprising.

EDIT: to show how fast this is, let me demonstrate on a 100MB file. It takes only 80ms to search it:

$ /usr/bin/time ./matcher < enwik8
1:
        0.08 real         0.07 user         0.00 sys

In this case there aren't any matches, but every line is searched. The fastest possible way to search that file with SIMD AArch64 or AVX2 instructions is about 20ms.

@haihuayang
Author

Thank you for the detailed information and suggested workaround.
In my use case, the number of rules can be quite large, with each rule containing a varying number of wildcards. To assess the limits of rule capacity, I created tests to measure how many rules can be supported. My approach is to divide the rules into multiple groups when complexity becomes high, then generate a match function for each group and call each function sequentially until one returns successfully. Since having too many groups impacts runtime performance, I aim to minimize the number of groups for the input rules while also keeping build time reasonable (e.g., within minutes).

@genivia-inc
Member

Not sure if I understand exactly what your setup is and why this would test regex matching in a way that makes the performance results generally useful. It sounds very specialized to me. The extensive use of .* wildcards can be problematic for regex engines, either exploding the number of states (DFA) or exploding the search/match time (NFA). This is well understood and accepted by the community. In fact, wildcard matching (or the lack of understanding of it) has caused significant incidents, such as the recent outage: "The selection of data in the channel file was done manually and included a regex wildcard matching criterion in the 21st field for all Template Instances, meaning that execution of these tests during development and release builds did not expose the latent out-of-bounds read in the Content Interpreter when provided with 20 rather than 21 inputs." and also this one, and there are other instances. All were caused by backtracking NFA-based regex engines. But a DFA that avoids these problems might be too large to construct.

@haihuayang
Author

Thank you for the information. In my use case, I need to consider both the time and memory required for generating the regex engine as well as performing searches with it. Given a regular expression, is there a way to estimate the size of the generated DFA?
