reflex report exceeds complexity limits #213
Comments
Thanks, will take a look at this example. Please note that DFA construction of regex patterns may, in extreme pathological cases, result in DFA state explosion, where the number of states is exponentially large (search online for "DFA state explosion"). This is well known and theoretically proven, but in practice it is unlikely to happen except for pathological cases of the kind that are easy to find online. When that happens, NFA is the better option and also supported by RE/flex using So this is likely not a "bug" you found, but a "feature" that we can't fundamentally change when using DFAs for pattern matching. I am working on a RE/flex update that will be released soon. The update speeds up searching with RE/flex as already implemented in ugrep v7, but that update won't impact the DFA construction process.
To show why this is so, you can try the following yourself. Download reflex on a Linux box and build it with
Then type a regex, such as Now, DFA construction time depends on the method used, e.g. Thompson construction or direct DFA construction, so there are subtle differences in time and size, but in the end the state explosion is something that no method can avoid. State minimization is possible, but that is something that is done after construction. Also, the DFAs for
Happy to help you out, so let me suggest that this "test" is not a good one for testing performance, and that there is a better way to handle this kind of regex problem. Regex is not a magical tool that can do whatever you want and still be expected to work fast. Writing a regex is like writing a program. Some programs (regexes) are just not great when used as a "blunt hammer" and will perform poorly. In this example you have the Rather, you want to use regex to find patterns in the input and then combine the findings for each line to check what was found on that line. This is much faster and not hard to do, at least not with RE/flex using
%{
#include <stdio.h>
#include <vector>
size_t last_lineno = 1;
std::vector<int> hits;
void found(size_t lineno, int num)
{
  if (lineno != last_lineno)
  {
    int mask[7] = {}; // 7 patterns of 3 strings
    // memset(mask, 0, sizeof(mask));
    for (std::vector<int>::iterator it = hits.begin(); it != hits.end(); ++it)
    {
      int n = *it - 1;
      int j = n / 3, k = n % 3;
      // check if we found previous strings, then set bit
      int p = ((1 << k) - 1);
      if ((mask[j] & p) == p)
        mask[j] |= 1 << k;
    }
    printf("%zu:", last_lineno);
    for (int i = 0; i < 7; ++i)
      if (mask[i] == 7)
        printf(" %d", i + 1);
    printf("\n");
    last_lineno = lineno;
    hits.clear();
  }
  hits.push_back(num);
}
%}
%option fast find flex noyywrap
%%
0yGzqGtqP6C7WmFcWo4C { found(yylineno, 1); }
58CZqoCgRH5f1SnTjoYc { found(yylineno, 2); }
gmpl55gVTrpCmb9sGktn { found(yylineno, 3); }
82lS7xASLLHEG7YYIbSm { found(yylineno, 4); }
UTIFP0YeJE8pvumxDuQO { found(yylineno, 5); }
VqROVkTHYUyxwGzPTEYP { found(yylineno, 6); }
Dn1m5zW7AlYdZh2f1fXm { found(yylineno, 7); }
zkctITsNzQNlAUYrIi1W { found(yylineno, 8); }
AzeYznQqsWNjwW0cxKHN { found(yylineno, 9); }
fKYQD1rZelcMTJviZ6n7 { found(yylineno, 10); }
ntI3oSSVDmOmvQFgmKx3 { found(yylineno, 11); }
OIATTMBy45kqhqDY9MSE { found(yylineno, 12); }
txA1P3XxndVSEzEL2rWT { found(yylineno, 13); }
pDPBeJhCirz2FabZjRG8 { found(yylineno, 14); }
21CNK5EXVf28bJRCvM4G { found(yylineno, 15); }
ZqWAQvg9bGkbiAK3RNXS { found(yylineno, 16); }
hLvkAp2cJbCDIHRXUH7J { found(yylineno, 17); }
iPm9bqneWI1pxcbYSWtw { found(yylineno, 18); }
dZZoCOpOXYhhRBExj1ED { found(yylineno, 19); }
DuydMZl2LkdjNiGwAXd3 { found(yylineno, 20); }
Us6zpeA02jrVTP6tVNMp { found(yylineno, 21); }
<<EOF>> { found(yylineno, 0); return 0; }
%%
int main()
{
  return yyFlexLexer().yylex();
}
This uses a vector of ints where each int refers to a pattern found on a line. Then when we switch lines, we output the results of the previous line. A bit
Benchmarks are only useful if they tell you what to expect in general for common regex pattern search problems, not for problems where using regex in that specific way is a poor fit. One can choose DFA or NFA matching to see which method works faster, but that is pointless because we know these differences very well (DFA takes more time to construct, NFA takes more time to match input), so the outcome is not surprising.
EDIT: to show how fast this is, let me demonstrate on a 100MB file. It takes only 80ms to search it:
$ /usr/bin/time ./matcher < enwik8
1:
0.08 real 0.07 user 0.00 sys
In this case there aren't any matches, but every line is searched. The fastest possible way to search that file with SIMD AArch64 or AVX2 instructions takes about 20ms.
Thank you for the detailed information and suggested workaround.
Not sure if I understand exactly what your setup is and why this would test regex matching such that the performance results would be generally useful. It sounds very specialized to me. The extensive use of
Thank you for the information. In my use case, I need to consider both the time and memory required for generating the regex engine as well as performing searches with it. Given a regular expression, is there a way to estimate the size of the generated DFA?
Hi,
I ran some tests to measure the performance of reflex. In the test below, reflex is very slow and eventually reports the error "exceeds complexity limits"
but if I modify the rule to the one below, which is equivalent for this use case, then reflex completes successfully and much faster
I am wondering if this is a bug.
Thanks,
Haihua