Optimize CRC16 using SSE CLMUL instruction #2691
Hi @roshkhatri , I would like to request your help of triggering the |
eefa2e0 to b93048a
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files

@@             Coverage Diff              @@
##           unstable    #2691      +/-   ##
============================================
+ Coverage     72.19%   72.59%    +0.40%
============================================
  Files           128      128
  Lines         71005    71290     +285
============================================
+ Hits          51263    51756     +493
+ Misses        19742    19534     -208
|
Benchmark ran on this commit:
Benchmark Comparison by Configuration (result tables not preserved)
Configuration: GET, MGET, MSET, SET
Configuration: GET, SET
Configuration: GET, SET
|
I believe the benchmarks done with valkey-benchmark in cluster mode use keys of the form … So in this sense, this is maybe not a fair way to benchmark the crc16 implementation. |
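For context, Valkey cluster maps a key to one of 16384 slots using CRC16 (the XMODEM variant, polynomial 0x1021), and when a key contains a non-empty `{...}` hash tag, only the bytes inside the braces are hashed. The following is a minimal sketch of that rule, not the Valkey source; the function names `crc16_bitwise` and `key_hash_slot` are hypothetical and chosen only for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Bit-by-bit CRC16/XMODEM (polynomial 0x1021, initial value 0).
 * Slow but easy to verify; used here only as a reference. */
static uint16_t crc16_bitwise(const char *buf, size_t len) {
    uint16_t crc = 0;
    for (size_t i = 0; i < len; i++) {
        crc ^= (uint16_t)((unsigned char)buf[i] << 8);
        for (int bit = 0; bit < 8; bit++)
            crc = (crc & 0x8000) ? (uint16_t)((crc << 1) ^ 0x1021)
                                 : (uint16_t)(crc << 1);
    }
    return crc;
}

/* Hypothetical helper: slot of a key, honoring the {hash tag} rule. */
static unsigned key_hash_slot(const char *key, size_t keylen) {
    size_t s, e;
    for (s = 0; s < keylen; s++)
        if (key[s] == '{') break;
    if (s == keylen) /* no '{': hash the whole key */
        return crc16_bitwise(key, keylen) & 0x3FFF;
    for (e = s + 1; e < keylen; e++)
        if (key[e] == '}') break;
    if (e == keylen || e == s + 1) /* no '}' or empty tag: hash the whole key */
        return crc16_bitwise(key, keylen) & 0x3FFF;
    /* Non-empty tag: hash only the bytes between '{' and '}'. */
    return crc16_bitwise(key + s + 1, e - s - 1) & 0x3FFF;
}
```

With a made-up key such as `key:{abc}:000000000001`, only the three bytes `abc` go through CRC16, so a benchmark built on such keys barely exercises the checksum code.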
Hi @zuiderkwast ,
So are there any issues about refining the benchmark? |
Optimizing the CRC16 is the only scenario where this matters, so I guess it's not worth changing the way valkey-benchmark creates keys to route them to each cluster node. (It does this in a somewhat reverse way: it selects a cluster node and then creates a key that maps to that node.)

What you can do instead is to use a single valkey node running in cluster mode and owning all the slots. Then run valkey-benchmark in non-cluster mode. Valkey-benchmark will not be aware that it's connected to a cluster, so the keys will not contain the curly braces. They will look like …

Something like this:

$ cd src
$ rm dump.rdb nodes.conf # clean up any old data files (if they exist)
$ ./valkey-server --cluster-enabled yes --save '' &
(...)
2622423:M 07 Oct 2025 17:48:58.872 * Server initialized
2622423:M 07 Oct 2025 17:48:58.873 * Ready to accept connections tcp
$ ./valkey-cli cluster addslotsrange 0 16383
OK
$ ./valkey-cli cluster info | head -3
cluster_state:ok
cluster_slots_assigned:16384
cluster_slots_ok:16384

Now it should be possible to run valkey-benchmark against this node.

$ ./valkey-benchmark --threads 3 -P 10 -n 10000000 -r 1000000 -t set,get -q

Note that the benchmark creates keys and doesn't delete them afterwards, so you can either restart the server or delete all keys using …

An idea: For testing the CRC16, it should be enough to run only the GET command with no keys stored on the server. The server will still need to run the CRC16 calculation. For this, use … |
Hi @zuiderkwast, thanks for the benchmarking procedure. I would like to show some benchmark results: Environment
Baseline (commit
|
I guess CLMUL is using more instructions than the table lookup. Maybe the current lookup table implementation is actually optimal for our use case...? |
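To make the comparison concrete, here is a minimal sketch of a byte-at-a-time table lookup for CRC16/XMODEM (polynomial 0x1021, initial value 0). It is not copied from the Valkey source; the names `crc16_table`, `crc16_table_init` and `crc16_lookup` are hypothetical, and the table is generated at run time here only to keep the example short.

```c
#include <stddef.h>
#include <stdint.h>

static uint16_t crc16_table[256]; /* 256 entries * 2 bytes = 512 bytes */

/* Fill the table: entry i is the CRC of the single byte i. */
static void crc16_table_init(void) {
    for (int i = 0; i < 256; i++) {
        uint16_t crc = (uint16_t)(i << 8);
        for (int bit = 0; bit < 8; bit++)
            crc = (crc & 0x8000) ? (uint16_t)((crc << 1) ^ 0x1021)
                                 : (uint16_t)(crc << 1);
        crc16_table[i] = crc;
    }
}

/* Per input byte: one table load, two shifts and two XORs. */
static uint16_t crc16_lookup(const char *buf, size_t len) {
    uint16_t crc = 0;
    for (size_t i = 0; i < len; i++) {
        unsigned idx = ((crc >> 8) ^ (unsigned char)buf[i]) & 0xFF;
        crc = (uint16_t)((crc << 8) ^ crc16_table[idx]);
    }
    return crc;
}
```

For a key of 10 to 20 bytes this loop runs only 10 to 20 short iterations, which leaves little room for a vectorized variant to amortize its setup cost.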
... or will you try to get rid of the bit reverse code and then maybe CLMUL is actually faster? You can try if you want. Did you forget to |
Hi @zuiderkwast,
As the input length is just 10~20 bytes, I think we can try a multi-byte lookup table (CLMUL is usually applied to much longer inputs, such as 4 KB).
I found a possible solution that does not need bit reversal (https://github.com/awesomized/crc-fast-rust/blob/main/src/crc32/algorithm.rs#L164). I will run some experiments to check whether bit shifting can eliminate the bit reversal.
Thanks for reminding me. I will commit later. |
05b1c2c to 6c91556
Hi @zuiderkwast, I would like to share some experiment results.
I think the forward implementation performs worse than the baseline because the input message is only 10 to 20 bytes long (the CLMUL instruction needs 3 to 8 cycles). |
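As background for the cycle counts mentioned above: the CLMUL instruction (PCLMULQDQ) multiplies two 64-bit operands as polynomials over GF(2), that is, a multiplication where partial products are combined with XOR instead of addition. The sketch below shows the same operation in portable C on 32-bit operands; `clmul32` is a hypothetical name and this is only an illustration of what the instruction computes, not the code from this PR.

```c
#include <stdint.h>

/* Portable carry-less multiply: treat a and b as polynomials over GF(2)
 * and return their product. Partial products are XORed, never added,
 * so no carries propagate between bit positions. */
static uint64_t clmul32(uint32_t a, uint32_t b) {
    uint64_t result = 0;
    for (int i = 0; i < 32; i++) {
        if ((b >> i) & 1)
            result ^= (uint64_t)a << i;
    }
    return result;
}
```

For example, `clmul32(3, 3)` is 5, because (x + 1)(x + 1) = x^2 + 1 over GF(2). A CRC built on this primitive needs several such multiplications plus a final reduction, which pays off on long buffers but is hard to amortize over a key of 10 to 20 bytes.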
Yes, it makes sense. I found in this code https://github.com/TobiasBengtsson/crc-fast-rs/blob/master/crc-fast-gen/src/lib.rs#L222 they use the table lookup for payload < 128 and they use SIMD algorithms only for larger payload. So maybe the table lookup is optimal for our use case? That is my guess. |
Hi @zuiderkwast ,
I agree. I think I will work on a table lookup that processes multiple bytes at once. If there are no further concerns, I will close this PR. |
A table with multiple bytes means a much larger table than the current 512 byte table? I think it will fill up the L1 cache, causing evictions of other important data that we want to keep in the L1 cache. I'm not sure it's a good idea. Are you sure you want to try it?
Yes, you can close this PR. Thank you for your effort! |
Yes. Processing 4 bytes at once requires four precomputed tables, which take 2 KB in total (512 * 4).
I will give it a try, and if I get an inferior benchmark result, I will just leave my findings in the related issue so that we have a record. |
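The four-table idea can be sketched as a "slicing-by-4" variant of CRC16/XMODEM. This is an illustration under assumptions, not the implementation proposed in this thread: the names `crc16_slice_tables`, `crc16_slice4_init` and `crc16_slice4` are hypothetical, and the tables are built at run time only to keep the example short.

```c
#include <stddef.h>
#include <stdint.h>

/* Table k holds the CRC of one byte followed by k zero bytes.
 * 4 tables * 256 entries * 2 bytes = 2048 bytes. */
static uint16_t crc16_slice_tables[4][256];

static void crc16_slice4_init(void) {
    for (int i = 0; i < 256; i++) {
        uint16_t crc = (uint16_t)(i << 8);
        for (int bit = 0; bit < 8; bit++)
            crc = (crc & 0x8000) ? (uint16_t)((crc << 1) ^ 0x1021)
                                 : (uint16_t)(crc << 1);
        crc16_slice_tables[0][i] = crc;
    }
    for (int k = 1; k < 4; k++) {
        for (int i = 0; i < 256; i++) {
            /* Advance the previous table's entry over one more zero byte. */
            uint16_t prev = crc16_slice_tables[k - 1][i];
            crc16_slice_tables[k][i] =
                (uint16_t)((prev << 8) ^ crc16_slice_tables[0][prev >> 8]);
        }
    }
}

static uint16_t crc16_slice4(const char *buf, size_t len) {
    const unsigned char *p = (const unsigned char *)buf;
    uint16_t crc = 0;
    /* Main loop: four independent table loads per 4 input bytes. The
     * 16-bit state is folded into the first two bytes of each group. */
    while (len >= 4) {
        crc = (uint16_t)(crc16_slice_tables[3][p[0] ^ (crc >> 8)] ^
                         crc16_slice_tables[2][p[1] ^ (crc & 0xFF)] ^
                         crc16_slice_tables[1][p[2]] ^
                         crc16_slice_tables[0][p[3]]);
        p += 4;
        len -= 4;
    }
    /* Tail: fall back to one byte at a time. */
    while (len--)
        crc = (uint16_t)((crc << 8) ^ crc16_slice_tables[0][(crc >> 8) ^ *p++]);
    return crc;
}
```

The four loads in the main loop do not depend on each other, which is where a speedup could come from; whether it outweighs the larger cache footprint raised above is something only a benchmark can show.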