
reflex report exceeds complexity limits #213

Open
haihuayang opened this issue Oct 28, 2024 · 6 comments
Labels
question A technical question that has or needs clarification

Comments

@haihuayang

Hi,
I ran some tests to measure the performance of reflex. In the test below, reflex is very slow and eventually reports the error "exceeds complexity limits":

%%
^.*0yGzqGtqP6C7WmFcWo4C.*58CZqoCgRH5f1SnTjoYc.*gmpl55gVTrpCmb9sGktn     |
^.*82lS7xASLLHEG7YYIbSm.*UTIFP0YeJE8pvumxDuQO.*VqROVkTHYUyxwGzPTEYP     |
^.*Dn1m5zW7AlYdZh2f1fXm.*zkctITsNzQNlAUYrIi1W.*AzeYznQqsWNjwW0cxKHN     |
^.*fKYQD1rZelcMTJviZ6n7.*ntI3oSSVDmOmvQFgmKx3.*OIATTMBy45kqhqDY9MSE     |
^.*txA1P3XxndVSEzEL2rWT.*pDPBeJhCirz2FabZjRG8.*21CNK5EXVf28bJRCvM4G     |
^.*ZqWAQvg9bGkbiAK3RNXS.*hLvkAp2cJbCDIHRXUH7J.*iPm9bqneWI1pxcbYSWtw     |
^.*dZZoCOpOXYhhRBExj1ED.*DuydMZl2LkdjNiGwAXd3.*Us6zpeA02jrVTP6tVNMp     return true;
.       return false;
%%

$ /usr/local/nutanix/minerva/bin/reflex --full   --outfile=test21.cpp test21.l
reflex: error: malformed regular expression
error at position 497
djNiGwAXd3.*Us6zpeA02jrVTP6tVNMp)|(.)
        exceeds complexity limits___/

But if I modify the rules as below, which is equivalent for this use case, then reflex completes successfully and is much faster:

%%
0yGzqGtqP6C7WmFcWo4C.*58CZqoCgRH5f1SnTjoYc.*gmpl55gVTrpCmb9sGktn        |
82lS7xASLLHEG7YYIbSm.*UTIFP0YeJE8pvumxDuQO.*VqROVkTHYUyxwGzPTEYP        |
Dn1m5zW7AlYdZh2f1fXm.*zkctITsNzQNlAUYrIi1W.*AzeYznQqsWNjwW0cxKHN        |
fKYQD1rZelcMTJviZ6n7.*ntI3oSSVDmOmvQFgmKx3.*OIATTMBy45kqhqDY9MSE        |
txA1P3XxndVSEzEL2rWT.*pDPBeJhCirz2FabZjRG8.*21CNK5EXVf28bJRCvM4G        |
ZqWAQvg9bGkbiAK3RNXS.*hLvkAp2cJbCDIHRXUH7J.*iPm9bqneWI1pxcbYSWtw        |
dZZoCOpOXYhhRBExj1ED.*DuydMZl2LkdjNiGwAXd3.*Us6zpeA02jrVTP6tVNMp        return true;
.       return false;
%%

I am wondering if this is a bug.

Thanks,
Haihua

@genivia-inc
Member

Thanks, will take a look at this example.

Please note that DFA construction of regex patterns may, in extreme pathological cases, result in DFA state explosion, where the number of states is exponentially large (search online for "DFA state explosion"). This is well known and theoretically proven, but in practice it is unlikely to happen, except for such pathological cases, which are easy to find online.

When that happens, NFA is the better option and also supported by RE/flex using %o matcher=pcre2_perl or with command-line option -m pcre2_perl then link with PCRE2. Or use %o matcher=boost for POSIX matching with Boost.Regex. NFA matching is generally slower than DFA matching. Lex and Flex lexers are always POSIX matchers and DFA-based for speed, since DFA construction time is part of the compile time to create a lexer. Perl matching is not POSIX matching and not always suitable for lexical analysis, see the explanation in the RE/flex documentation.
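For example, switching to the PCRE2 matcher takes one option line in the lexer specification (assuming PCRE2 is installed; per the RE/flex documentation the generated lexer is then linked against PCRE2, typically with -lpcre2-8):

```lex
/* use NFA-based Perl matching via PCRE2 instead of DFA construction;
   equivalent to the reflex command-line option -m pcre2_perl;
   link the generated lexer with PCRE2 (typically -lpcre2-8) */
%o matcher=pcre2_perl
```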

So this is likely not a "bug" you found, but a "feature" that we can't fundamentally change when using DFAs for pattern matching.

I am working on a RE/flex update that will be released soon. The update speeds up searching with RE/flex as already implemented in ugrep v7, but it won't impact the DFA construction process.

@genivia-inc genivia-inc added the question A technical question that has or needs clarification label Oct 28, 2024
@genivia-inc
Member

To show why this is so, you can try the following yourself. Download reflex on a Linux box and build it with ./build.sh as per instructions. In the reflex/tests directory enter

./test.sh

Then type a regex, such as .*ab.*cd.*ef or .*ab.*cd.*ef|.*st.*uv.*wx, which already produce large DFAs, as the generated PDF will show (a dump.pdf is generated by the test.sh script). You can see that the DFA states and edges grow substantially to represent the pattern, which is due to the .* between the letter sequences. The longer the letter sequences in the pattern, the exponentially more states are produced.

Now, DFA construction time depends on the method used, e.g. Thompson construction or direct DFA construction, so there are subtle differences in time and size, but in the end the state explosion is something that no method can avoid. State minimization is possible, but that is done after construction. Also, the DFAs for .*ab.*cd.*ef and .*ab.*cd.*ef|.*st.*uv.*wx are minimal as far as I can tell, so there is no possible size reduction.

@genivia-inc
Member

genivia-inc commented Oct 29, 2024

Happy to help you out, so let me suggest that this "test" is not a good way to measure performance, and that there is a better approach for this kind of regex problem.

Regex is not a magical tool that can do whatever you want and still run fast. Writing a regex is like writing a program: some programs (regexes) are blunt hammers that will perform poorly. In this example the patterns are separated by .*. That is not the way to find these patterns and then check whether they occur on the same line. It also does not report results accurately, because a line can contain any combination of the three strings that make up each of the seven patterns we look for.

Rather, you want to use regex to find patterns in the input and then combine the findings for each line to check what was found on that line. This is much faster and not hard to do, at least not with RE/flex using find to generate a search engine (instead of a tokenizer):

%{
#include <stdio.h>
#include <vector>
size_t last_lineno = 1;
std::vector<int> hits;
void found(size_t lineno, int num)
{
  if (lineno != last_lineno)
  {
    int mask[7] = {}; // one 3-bit mask per pattern of 3 strings
    for (std::vector<int>::iterator it = hits.begin(); it != hits.end(); ++it)
    {
      int n = *it - 1;
      int j = n / 3, k = n % 3; // pattern j, string k of that pattern
      // only set bit k if the bits of the preceding strings are already set
      int p = ((1 << k) - 1);
      if ((mask[j] & p) == p)
        mask[j] |= 1 << k;
    }
    printf("%zu:", last_lineno);
    for (int i = 0; i < 7; ++i)
      if (mask[i] == 7) // all three strings found, in order
        printf(" %d", i + 1);
    printf("\n");
    last_lineno = lineno;
    hits.clear();
  }
  if (num > 0) // num == 0 is the EOF sentinel, not a pattern string
    hits.push_back(num);
}
%}
%option fast find flex noyywrap
%%
0yGzqGtqP6C7WmFcWo4C    { found(yylineno, 1); }
58CZqoCgRH5f1SnTjoYc    { found(yylineno, 2); }
gmpl55gVTrpCmb9sGktn    { found(yylineno, 3); }
82lS7xASLLHEG7YYIbSm    { found(yylineno, 4); }
UTIFP0YeJE8pvumxDuQO    { found(yylineno, 5); }
VqROVkTHYUyxwGzPTEYP    { found(yylineno, 6); }
Dn1m5zW7AlYdZh2f1fXm    { found(yylineno, 7); }
zkctITsNzQNlAUYrIi1W    { found(yylineno, 8); }
AzeYznQqsWNjwW0cxKHN    { found(yylineno, 9); }
fKYQD1rZelcMTJviZ6n7    { found(yylineno, 10); }
ntI3oSSVDmOmvQFgmKx3    { found(yylineno, 11); }
OIATTMBy45kqhqDY9MSE    { found(yylineno, 12); }
txA1P3XxndVSEzEL2rWT    { found(yylineno, 13); }
pDPBeJhCirz2FabZjRG8    { found(yylineno, 14); }
21CNK5EXVf28bJRCvM4G    { found(yylineno, 15); }
ZqWAQvg9bGkbiAK3RNXS    { found(yylineno, 16); }
hLvkAp2cJbCDIHRXUH7J    { found(yylineno, 17); }
iPm9bqneWI1pxcbYSWtw    { found(yylineno, 18); }
dZZoCOpOXYhhRBExj1ED    { found(yylineno, 19); }
DuydMZl2LkdjNiGwAXd3    { found(yylineno, 20); }
Us6zpeA02jrVTP6tVNMp    { found(yylineno, 21); }
<<EOF>>                 { /* flush the results of the last line */ found(yylineno + 1, 0); return 0; }
%%
int main()
{
  return yyFlexLexer().yylex();
}

This uses a vector of ints, where each int refers to a pattern string found on a line. When we move to a new line, we output the results for the previous line. A bit mask array collects the three bits corresponding to the three strings of each of the 7 patterns. A bit is only set when the lower bits are already set, which ensures that we count the strings found in the order we need (strings found out of order are ignored).

Benchmarks are only useful if they tell you what to expect for common regex pattern search problems in general, not for problems where using regex in that particular way is a poor fit. One can choose DFA or NFA matching to see which method is faster, but that is pointless because the trade-off is well understood (a DFA takes more time to construct, an NFA takes more time to match input), so the outcome is not surprising.

EDIT: to show how fast this is, let me demonstrate on a 100MB file. It takes only 80ms to search it:

$ /usr/bin/time ./matcher < enwik8
1:
        0.08 real         0.07 user         0.00 sys

In this case there aren't any matches, but every line is searched. The fastest possible way to search that file with SIMD AArch64 or AVX2 instructions is about 20ms.

@haihuayang
Author

Thank you for the detailed information and suggested workaround.
In my use case, the number of rules can be quite large, with each rule containing a varying number of wildcards. To assess the limits of rule capacity, I created tests to measure how many rules can be supported. My approach is to divide the rules into multiple groups when complexity becomes high, then generate a match function for each group and call each function sequentially until one returns successfully. Since having too many groups impacts runtime performance, I aim to minimize the number of groups for the input rules while also keeping build time reasonable (e.g., within minutes).

@genivia-inc
Member

Not sure if I understand exactly what your setup is and why this would test regex matching in a way that makes the performance results generally useful. It sounds very specialized to me. The extensive use of .* wildcards can be problematic for regex engines, either exploding the number of states (DFA) or exploding the search/match time (NFA). This is well understood and accepted by the community. In fact, wildcard matching (or the lack of understanding of it) has caused significant incidents, such as the recent outage: "The selection of data in the channel file was done manually and included a regex wildcard matching criterion in the 21st field for all Template Instances, meaning that execution of these tests during development and release builds did not expose the latent out-of-bounds read in the Content Interpreter when provided with 20 rather than 21 inputs." and also this one, and there are other instances. All were caused by backtracking NFA-based regex engines. But a DFA that avoids these problems might be too large to construct.

@haihuayang
Author

Thank you for the information. In my use case, I need to consider both the time and memory required for generating the regex engine as well as performing searches with it. Given a regular expression, is there a way to estimate the size of the generated DFA?
