Benchmark is unfair. find(1) should not be used for grepping. #1395
Comments
Unfortunately, the regex implementation in GNU
This is on a checkout of rust-lang/rust (see the
I agree that the benchmarks should probably be updated, and we should make the comparisons as fair as possible. #893 is relevant. |
Relevant stuff: I understand that people use them out of habit, but the design of the program is rather dubious. But yeah, I could change the "should not be used" to maybe "are suboptimal, both in performance terms, and in simplicity". Piping to grep(1), you get faster results, and you don't even need to check the find(1) manual page for all the similar but different options that it has for grepping files.
The pipe shouldn't be a bottleneck. The bottleneck is usually I/O.
$ time find >/dev/null
real 0m0.327s
user 0m0.091s
sys 0m0.233s
$ time find | grep -i '[0-9]\.c$' >/dev/null
real 0m0.335s
user 0m0.098s
sys 0m0.250s
You can see that piping all filenames only adds a little bit of time to a simple find(1). Also, real time is not the only thing that matters. fdfind(1), with the appropriate patches and optimizations, may beat
alx@debian:~$ time find | grep -i '[0-9]\.c$' >/dev/null
real 0m0.358s
user 0m0.098s
sys 0m0.276s
alx@debian:~$ time fdfind -u | grep -i '[0-9]\.c$' >/dev/null
real 0m0.390s
user 0m1.831s
sys 0m6.067s
alx@debian:~$ time fdfind -u '[0-9]\.c$' >/dev/null
real 0m0.385s
user 0m1.846s
sys 0m6.222s
That's fine if the bottleneck is in find(1), but if I pipe this command to a more consuming pipeline where find(1) is not the bottleneck, I fear that it may actually be slower, and will occupy the CPU that could be running other tasks. If I have other heavy tasks at the same time, like compiling some software, it will also probably affect the performance of fdfind(1) significantly, while
Those filenames are already broken (they are unportable, according to POSIX https://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap04.html#tag_04_06). Please don't use them. :) If you need them, though, you can use
Thanks. |
Thanks for this link, very cool!
This is often true, but in my personal uses the whole tree is usually in cache (dcache, or at least buffer cache). Most of the overhead then comes from kernel and syscall overhead.
That benchmark still has
Anyway, more important than things like
Agreed.
|
:-)
A 7% loss on the pipe seems quite reasonable. How much does fdfind(1) take on, say,
Yup, those things must run under find(1), but they are actually fast, aren't they? The problem with find(1)'s performance, AFAIK, is just with name filtering. Do you have benchmarks for those things comparing to find(1)?
Interesting; |
Unless you have an incredibly large number of executables installed, /usr/bin won't be very large, relatively speaking. It is also pretty flat, which I think hinders the parallelizability. |
Well, it's nearly 3,000 executables. |
That's a small thing, actually. Compare to this:
alx@debian:~$ time find ~ | wc -l
660487
real 0m0.338s
user 0m0.080s
sys 0m0.273s
alx@debian:~$ time find ~ -type f | wc -l
603577
real 0m0.344s
user 0m0.096s
sys 0m0.262s
alx@debian:~$ time find ~ -type d | wc -l
53914
real 0m0.314s
user 0m0.091s
sys 0m0.223s
:) |
Then why does |
Because its inventor was told to do it that way, it seems. Here's the (funny) history of the design of find(1): |
@alejandro-colomar gotcha. So in that case we shouldn't benchmark |
The difference is that fd(1) is intended for grepping. So we need to benchmark what this project is selling. But I agree that it would be interesting to also test I'll benchmark it when I arrive home. |
Cc: @vegerot New benchmark below in this message. TL;DR:
For some reason, this time fdfind(1) wins by far in elapsed time. However, this is because it parallelizes the load. That's unfair, because in non-trivial tasks where the other CPUs are loaded, fdfind(1) won't have such an advantage. Another result of my new benchmark is that piping However, the command that generates less load is Here's the new benchmark. (I ran a few runs of each command before the ones I show, to cache stuff to be fair.)
alx@devuan:~$ /bin/time find ~/src/ -iregex '.*[0-9]\.c$' | wc -l
1.08user 0.26system 0:01.34elapsed 100%CPU (0avgtext+0avgdata 6236maxresident)k
0inputs+0outputs (0major+2487minor)pagefaults 0swaps
95023
alx@devuan:~$ /bin/time find ~/src/ -iregex '.*[0-9]\.c$' | wc -l
1.09user 0.23system 0:01.33elapsed 99%CPU (0avgtext+0avgdata 6244maxresident)k
0inputs+0outputs (0major+2484minor)pagefaults 0swaps
95023
alx@devuan:~$ /bin/time find ~/src/ -iregex '.*[0-9]\.c$' | wc -l
1.11user 0.22system 0:01.33elapsed 99%CPU (0avgtext+0avgdata 6272maxresident)k
0inputs+0outputs (0major+2487minor)pagefaults 0swaps
95023
alx@devuan:~$ /bin/time find ~/src/ | grep -i '/[^/]*[0-9]\.c$' | wc -l
0.10user 0.28system 0:00.53elapsed 71%CPU (0avgtext+0avgdata 5784maxresident)k
0inputs+0outputs (0major+1031minor)pagefaults 0swaps
95023
alx@devuan:~$ /bin/time find ~/src/ | grep -i '/[^/]*[0-9]\.c$' | wc -l
0.14user 0.24system 0:00.54elapsed 71%CPU (0avgtext+0avgdata 5892maxresident)k
0inputs+0outputs (0major+1032minor)pagefaults 0swaps
95023
alx@devuan:~$ /bin/time find ~/src/ | grep -i '/[^/]*[0-9]\.c$' | wc -l
0.10user 0.28system 0:00.54elapsed 71%CPU (0avgtext+0avgdata 6020maxresident)k
0inputs+0outputs (0major+1033minor)pagefaults 0swaps
95023
alx@devuan:~$ /bin/time fdfind -HI '.*[0-9]\.c$' ~/src/ | wc -l
0.77user 0.48system 0:00.06elapsed 1870%CPU (0avgtext+0avgdata 35400maxresident)k
0inputs+0outputs (0major+9712minor)pagefaults 0swaps
95023
alx@devuan:~$ /bin/time fdfind -HI '.*[0-9]\.c$' ~/src/ | wc -l
0.69user 0.54system 0:00.06elapsed 1868%CPU (0avgtext+0avgdata 42100maxresident)k
0inputs+0outputs (0major+10872minor)pagefaults 0swaps
95023
alx@devuan:~$ /bin/time fdfind -HI '.*[0-9]\.c$' ~/src/ | wc -l
0.68user 0.45system 0:00.05elapsed 2063%CPU (0avgtext+0avgdata 32700maxresident)k
0inputs+0outputs (0major+8965minor)pagefaults 0swaps
95023
alx@devuan:~$ /bin/time fdfind -HI . ~/src/ | grep -i '/[^/]*[0-9]\.c$' | wc -l
0.49user 0.47system 0:00.38elapsed 251%CPU (0avgtext+0avgdata 39260maxresident)k
0inputs+0outputs (0major+10958minor)pagefaults 0swaps
95023
alx@devuan:~$ /bin/time fdfind -HI . ~/src/ | grep -i '/[^/]*[0-9]\.c$' | wc -l
0.48user 0.50system 0:00.39elapsed 250%CPU (0avgtext+0avgdata 39620maxresident)k
0inputs+0outputs (0major+10665minor)pagefaults 0swaps
95023
alx@devuan:~$ /bin/time fdfind -HI . ~/src/ | grep -i '/[^/]*[0-9]\.c$' | wc -l
0.48user 0.51system 0:00.39elapsed 256%CPU (0avgtext+0avgdata 40668maxresident)k
0inputs+0outputs (0major+11167minor)pagefaults 0swaps
95023
Here's a similar benchmark run in several directories across different file systems, which shows no significant differences.
alx@devuan:~$ sudo /bin/time find ~ ~/src/ ~/mail/ /boot/ -iregex '.*[0-9]\.c$' | wc -l
1.33user 0.29system 0:01.63elapsed 99%CPU (0avgtext+0avgdata 10324maxresident)k
0inputs+0outputs (0major+2081minor)pagefaults 0swaps
95866
alx@devuan:~$ sudo /bin/time find ~ ~/src/ ~/mail/ /boot/ -iregex '.*[0-9]\.c$' | wc -l
1.32user 0.28system 0:01.61elapsed 99%CPU (0avgtext+0avgdata 10268maxresident)k
0inputs+0outputs (0major+2081minor)pagefaults 0swaps
95866
alx@devuan:~$ sudo /bin/time find ~ ~/src/ ~/mail/ /boot/ -iregex '.*[0-9]\.c$' | wc -l
1.30user 0.29system 0:01.60elapsed 99%CPU (0avgtext+0avgdata 10204maxresident)k
0inputs+0outputs (0major+2082minor)pagefaults 0swaps
95866
alx@devuan:~$ sudo /bin/time find ~ ~/src/ ~/mail/ /boot/ | grep -i '/[^/]*[0-9]\.c$' | wc -l
0.09user 0.34system 0:00.61elapsed 72%CPU (0avgtext+0avgdata 9904maxresident)k
0inputs+0outputs (0major+2016minor)pagefaults 0swaps
95866
alx@devuan:~$ sudo /bin/time find ~ ~/src/ ~/mail/ /boot/ | grep -i '/[^/]*[0-9]\.c$' | wc -l
0.11user 0.32system 0:00.61elapsed 72%CPU (0avgtext+0avgdata 9784maxresident)k
0inputs+0outputs (0major+2013minor)pagefaults 0swaps
95866
alx@devuan:~$ sudo /bin/time find ~ ~/src/ ~/mail/ /boot/ | grep -i '/[^/]*[0-9]\.c$' | wc -l
0.12user 0.32system 0:00.61elapsed 72%CPU (0avgtext+0avgdata 9860maxresident)k
0inputs+0outputs (0major+2015minor)pagefaults 0swaps
95866
alx@devuan:~$ sudo /bin/time fdfind -HI '.*[0-9]\.c$' ~ ~/src/ ~/mail/ /boot/ | wc -l
0.83user 0.67system 0:00.07elapsed 1975%CPU (0avgtext+0avgdata 43120maxresident)k
0inputs+0outputs (0major+13598minor)pagefaults 0swaps
95866
alx@devuan:~$ sudo /bin/time fdfind -HI '.*[0-9]\.c$' ~ ~/src/ ~/mail/ /boot/ | wc -l
0.86user 0.62system 0:00.07elapsed 1932%CPU (0avgtext+0avgdata 48424maxresident)k
0inputs+0outputs (0major+12847minor)pagefaults 0swaps
95866
alx@devuan:~$ sudo /bin/time fdfind -HI '.*[0-9]\.c$' ~ ~/src/ ~/mail/ /boot/ | wc -l
0.81user 0.65system 0:00.07elapsed 1871%CPU (0avgtext+0avgdata 50764maxresident)k
0inputs+0outputs (0major+13807minor)pagefaults 0swaps
95866
alx@devuan:~$ sudo /bin/time fdfind -HI . ~ ~/src/ ~/mail/ /boot/ | grep -i '/[^/]*[0-9]\.c$' | wc -l
0.58user 0.62system 0:00.42elapsed 282%CPU (0avgtext+0avgdata 61820maxresident)k
0inputs+0outputs (0major+16526minor)pagefaults 0swaps
95866
alx@devuan:~$ sudo /bin/time fdfind -HI . ~ ~/src/ ~/mail/ /boot/ | grep -i '/[^/]*[0-9]\.c$' | wc -l
0.59user 0.68system 0:00.42elapsed 300%CPU (0avgtext+0avgdata 57180maxresident)k
0inputs+0outputs (0major+17473minor)pagefaults 0swaps
95866
alx@devuan:~$ sudo /bin/time fdfind -HI . ~ ~/src/ ~/mail/ /boot/ | grep -i '/[^/]*[0-9]\.c$' | wc -l
0.55user 0.57system 0:00.41elapsed 274%CPU (0avgtext+0avgdata 62628maxresident)k
0inputs+0outputs (0major+16952minor)pagefaults 0swaps
95866 |
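The two filtering styles compared above can be checked for equivalence on a small scratch tree before timing them on real data. This is only a sketch: the file names are made up for illustration, and `-iregex` is assumed to be available (it is a GNU findutils extension, not POSIX).

```sh
#!/bin/sh
# Build a small throwaway tree; these file names are hypothetical.
tree=$(mktemp -d)
mkdir -p "$tree/a/b"
touch "$tree/a/foo1.c" "$tree/a/b/BAR2.C" "$tree/a/b/baz.c" "$tree/a/readme9.txt"

# Filter inside find(1): case-insensitive regex matched against the whole path.
n_regex=$(find "$tree" -iregex '.*[0-9]\.c$' | wc -l | tr -d ' ')

# Filter in grep(1): find(1) only lists paths, grep(1) selects on the last component.
n_pipe=$(find "$tree" | grep -i '/[^/]*[0-9]\.c$' | wc -l | tr -d ' ')

echo "find -iregex: $n_regex"
echo "find | grep:  $n_pipe"

rm -rf "$tree"
```

Both counts should agree; if they do not, the two patterns are not selecting the same files and the timings are not comparable.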
Sounds good.👍 Let's open a patch to GNU Coreutils to delete that functionality from GNU |
Heh! I think that's going to find a wall of concrete. That would break scripts, which means it will never be removed. BTW, find(1) is from GNU findutils, not coreutils. Blame the people at Bell Labs who wrote find(1), or its specification. We now have it. But let's not use it. We have better tools, like grep(1). |
On the other hand, it might be interesting to fork find(1) into a new program f(1), which would get rid of all the crap from find(1). Just find stuff. It looks like a fun project for me. |
Thanks for doing those. Here's my attempt to replicate them, with
True, but for me
That's not surprising to me. With What does surprise me is that
Don't forget about
Personally I use But |
Hmmm. Thanks! For completeness, here are the results of fdfind(1) with -j1 in my system. In my system, And
alx@devuan:~$ /bin/time fdfind -j1 -HI '.*[0-9]\.c$' ~/src/ | wc -l
0.25user 0.22system 0:00.43elapsed 109%CPU (0avgtext+0avgdata 9860maxresident)k
0inputs+0outputs (0major+2695minor)pagefaults 0swaps
95023
alx@devuan:~$ /bin/time fdfind -j1 -HI '.*[0-9]\.c$' ~/src/ | wc -l
0.26user 0.23system 0:00.44elapsed 110%CPU (0avgtext+0avgdata 10072maxresident)k
0inputs+0outputs (0major+2799minor)pagefaults 0swaps
95023
alx@devuan:~$ /bin/time fdfind -j1 -HI '.*[0-9]\.c$' ~/src/ | wc -l
0.27user 0.22system 0:00.45elapsed 109%CPU (0avgtext+0avgdata 10488maxresident)k
0inputs+0outputs (0major+2813minor)pagefaults 0swaps
95023
alx@devuan:~$ /bin/time fdfind -j1 -HI . ~/src/ | grep -i '/[^/]*[0-9]\.c$' | wc -l
0.70user 0.38system 0:00.77elapsed 140%CPU (0avgtext+0avgdata 10400maxresident)k
0inputs+0outputs (0major+2628minor)pagefaults 0swaps
95023
alx@devuan:~$ /bin/time fdfind -j1 -HI . ~/src/ | grep -i '/[^/]*[0-9]\.c$' | wc -l
0.68user 0.37system 0:00.75elapsed 139%CPU (0avgtext+0avgdata 11720maxresident)k
0inputs+0outputs (0major+3005minor)pagefaults 0swaps
95023
alx@devuan:~$ /bin/time fdfind -j1 -HI . ~/src/ | grep -i '/[^/]*[0-9]\.c$' | wc -l
0.67user 0.41system 0:00.77elapsed 140%CPU (0avgtext+0avgdata 11748maxresident)k
0inputs+0outputs (0major+3143minor)pagefaults 0swaps
95023
And with several dirs:
alx@devuan:~$ /bin/time fdfind -j1 -HI '.*[0-9]\.c$' ~ ~/src/ ~/mail/ /boot/ | wc -l
0.27user 0.31system 0:00.63elapsed 93%CPU (0avgtext+0avgdata 15572maxresident)k
2080inputs+0outputs (0major+5396minor)pagefaults 0swaps
95866
alx@devuan:~$ /bin/time fdfind -j1 -HI '.*[0-9]\.c$' ~ ~/src/ ~/mail/ /boot/ | wc -l
0.25user 0.31system 0:00.52elapsed 108%CPU (0avgtext+0avgdata 15672maxresident)k
0inputs+0outputs (0major+5398minor)pagefaults 0swaps
95866
alx@devuan:~$ /bin/time fdfind -j1 -HI '.*[0-9]\.c$' ~ ~/src/ ~/mail/ /boot/ | wc -l
0.28user 0.27system 0:00.51elapsed 109%CPU (0avgtext+0avgdata 15972maxresident)k
0inputs+0outputs (0major+8012minor)pagefaults 0swaps
95866
alx@devuan:~$ /bin/time fdfind -j1 -HI . ~ ~/src/ ~/mail/ /boot/ | grep -i '/[^/]*[0-9]\.c$' | wc -l
0.79user 0.46system 0:00.89elapsed 140%CPU (0avgtext+0avgdata 15824maxresident)k
0inputs+0outputs (0major+5679minor)pagefaults 0swaps
95866
alx@devuan:~$ /bin/time fdfind -j1 -HI . ~ ~/src/ ~/mail/ /boot/ | grep -i '/[^/]*[0-9]\.c$' | wc -l
0.80user 0.40system 0:00.86elapsed 139%CPU (0avgtext+0avgdata 14976maxresident)k
0inputs+0outputs (0major+5552minor)pagefaults 0swaps
95866
alx@devuan:~$ /bin/time fdfind -j1 -HI . ~ ~/src/ ~/mail/ /boot/ | grep -i '/[^/]*[0-9]\.c$' | wc -l
0.80user 0.44system 0:00.89elapsed 139%CPU (0avgtext+0avgdata 15204maxresident)k
0inputs+0outputs (0major+8036minor)pagefaults 0swaps
95866 |
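Since the argument in this thread is about CPU load rather than elapsed time, it may help to add up user and system time for the first `~/src/` run of each command reported above. The sketch below only does arithmetic on figures already posted; the row labels are shorthand invented here. One caveat, as far as I can tell: `/bin/time` at the head of a pipeline only accounts for the first command, so the `grep` and `wc` stages are not included in any of these totals.

```sh
#!/bin/sh
# Columns: label (shorthand, made up here), user seconds, system seconds.
# Values are the first ~/src/ run of each command as posted in this thread.
cpu_table=$(LC_ALL=C awk '{ printf "%s %.2f\n", $1, $2 + $3 }' <<'EOF'
find-iregex 1.08 0.26
find-then-grep 0.10 0.28
fdfind-regex 0.77 0.48
fdfind-then-grep 0.49 0.47
fdfind-j1-regex 0.25 0.22
fdfind-j1-then-grep 0.70 0.38
EOF
)

# Print label and total CPU seconds (user + sys) of the timed process.
printf '%s\n' "$cpu_table"
```

Read this way, the multi-threaded fdfind run that finishes in 0.06 s of wall time still spends about as much CPU as `find -iregex`, which is the point being made about loaded machines.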
This may be a resurgence of / incomplete fix for #1313:
$ strace -cf -e 'write' -- fd -j1 -u | grep -i '/[^/]*[0-9]\.c$' | wc -l
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00 0.392255 11 32957 write
$ strace -cf -e 'write' -- find | grep -i '/[^/]*[0-9]\.c$' | wc -l
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00 0.023571 7 3104 write
I think my reasoning in #1452 doesn't really apply to the |
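To put the two strace summaries side by side, the ratios can be computed directly from the reported figures. This is plain arithmetic on the numbers above and makes no claim about why fd issues more writes.

```sh
#!/bin/sh
# Figures copied from the two strace -c summaries above.
fd_writes=32957;  fd_secs=0.392255
find_writes=3104; find_secs=0.023571

# Ratio of write(2) calls and of time spent in write(2), fd relative to find.
call_ratio=$(LC_ALL=C awk -v a="$fd_writes" -v b="$find_writes" 'BEGIN { printf "%.1f", a / b }')
time_ratio=$(LC_ALL=C awk -v a="$fd_secs" -v b="$find_secs" 'BEGIN { printf "%.1f", a / b }')

echo "write calls: fd makes ${call_ratio}x as many as find"
echo "write time:  fd spends ${time_ratio}x as long as find"
```

So in this run fd made roughly ten times as many `write` calls as find, which would be consistent with smaller writes per call.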
BTW, maybe you should default to -j1 when writing to a pipe? Usually, the pipe will be slower consuming the path names than find(1) producing them. |
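Until something like that exists in fd itself, the suggestion can be approximated from a wrapper script. The helper below is hypothetical (`pick_threads_flag` and `threads_flag` are names made up here, not part of fd), and it assumes `/dev/stdout` is available, as on Linux. It only decides which flag to pass; whether single-threaded output is actually preferable depends on the consumer.

```sh
#!/bin/sh
# Hypothetical helper, not part of fd: sets $threads_flag to '-j1' when
# standard output is a pipe, and to the empty string otherwise.
pick_threads_flag() {
    if [ -p /dev/stdout ]; then
        threads_flag='-j1'
    else
        threads_flag=''
    fi
}

# Example use, commented out so the sketch runs without fd installed:
#   pick_threads_flag
#   fdfind $threads_flag -HI '[0-9]\.c$' ~/src/
```

The helper sets a variable instead of printing the flag, because calling it inside `$(...)` would always see a pipe on standard output.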
There are two important cases where this isn't true:
|
Hi,
I believe the benchmarks you provide compared to find(1) are unfair. find(1) should not be used for grepping; following the Unix principles, find(1) should just find, and grep(1) should be responsible for filtering the output of find(1).
If we pipe find(1) to grep(1), the performance is significantly faster than just using find(1):
find | grep seems to be faster than fdfind in my own simple test:
Can you please provide benchmarks against this pipeline in your readme?
What version of fd are you using?
[paste the output of fd --version here]
$ dpkg -l | grep fd-find
ii fd-find 8.7.0-3+b1 amd64 Simple, fast and user-friendly alternative to find
Just for completeness, here's my CPU:
Thanks!