Conversation

Cuda-Chen

No description provided.

@Cuda-Chen
Author

Hi @roshkhatri ,

I would like to request your help with triggering the run-benchmark CI, thanks!

@Cuda-Chen Cuda-Chen force-pushed the crc16-x86-barrent branch 3 times, most recently from eefa2e0 to b93048a on October 6, 2025 14:55

codecov bot commented Oct 6, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 72.59%. Comparing base (d5bb986) to head (05b1c2c).
⚠️ Report is 16 commits behind head on unstable.

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #2691      +/-   ##
============================================
+ Coverage     72.19%   72.59%   +0.40%     
============================================
  Files           128      128              
  Lines         71005    71290     +285     
============================================
+ Hits          51263    51756     +493     
+ Misses        19742    19534     -208     
| Files with missing lines | Coverage Δ |
|---|---|
| src/crc16.c | 100.00% <100.00%> (ø) |

... and 26 files with indirect coverage changes



github-actions bot commented Oct 6, 2025

Benchmark ran on this commit: b93048a

Benchmark Comparison by Configuration

Configuration:

  • clients: 50
  • cluster_mode: False
  • data_size: 16
  • pipeline: 10
  • requests: 100000000
  • tls: False
  • warmup: 120

GET

| Metric | Baseline | PR | Diff | % Change |
|---|---|---|---|---|
| rps | 831158.44 | 836393.19 | 5234.75 | +0.63% |
| latency_avg_ms | 0.54 | 0.53 | -0.01 | -0.93% |
| latency_p50_ms | 0.52 | 0.51 | -0.01 | -1.54% |
| latency_p95_ms | 0.72 | 0.71 | -0.01 | -1.11% |
| latency_p99_ms | 0.81 | 0.80 | -0.01 | -0.99% |

MGET

| Metric | Baseline | PR | Diff | % Change |
|---|---|---|---|---|
| rps | 298839.31 | 299801.53 | 962.22 | +0.32% |
| latency_avg_ms | 1.55 | 1.54 | -0.01 | -0.52% |
| latency_p50_ms | 1.57 | 1.56 | -0.02 | -1.02% |
| latency_p95_ms | 1.90 | 1.89 | -0.01 | -0.42% |
| latency_p99_ms | 1.97 | 1.95 | -0.02 | -0.81% |

MSET

| Metric | Baseline | PR | Diff | % Change |
|---|---|---|---|---|
| rps | 226919.45 | 227264.97 | 345.52 | +0.15% |
| latency_avg_ms | 2.12 | 2.12 | -0.00 | -0.14% |
| latency_p50_ms | 2.33 | 2.33 | 0.00 | +0.00% |
| latency_p95_ms | 2.60 | 2.60 | 0.00 | +0.00% |
| latency_p99_ms | 2.68 | 2.68 | 0.00 | +0.00% |

SET

| Metric | Baseline | PR | Diff | % Change |
|---|---|---|---|---|
| rps | 756212.25 | 754221.75 | -1990.50 | -0.26% |
| latency_avg_ms | 0.59 | 0.60 | 0.00 | +0.34% |
| latency_p50_ms | 0.57 | 0.57 | 0.00 | +0.00% |
| latency_p95_ms | 0.79 | 0.79 | 0.00 | +0.00% |
| latency_p99_ms | 0.87 | 0.87 | 0.00 | +0.00% |

Configuration:

  • clients: 50
  • cluster_mode: True
  • data_size: 16
  • pipeline: 10
  • requests: 100000000
  • tls: False
  • warmup: 120

GET

| Metric | Baseline | PR | Diff | % Change |
|---|---|---|---|---|
| rps | 746642.00 | 711642.44 | -34999.56 | -4.69% |
| latency_avg_ms | 0.60 | 0.64 | 0.03 | +5.29% |
| latency_p50_ms | 0.57 | 0.61 | 0.03 | +5.57% |
| latency_p95_ms | 0.80 | 0.84 | 0.04 | +5.01% |
| latency_p99_ms | 0.89 | 0.92 | 0.03 | +3.61% |

SET

| Metric | Baseline | PR | Diff | % Change |
|---|---|---|---|---|
| rps | 650296.50 | 621067.88 | -29228.62 | -4.49% |
| latency_avg_ms | 0.70 | 0.74 | 0.04 | +4.96% |
| latency_p50_ms | 0.68 | 0.73 | 0.05 | +7.07% |
| latency_p95_ms | 0.94 | 0.97 | 0.04 | +4.28% |
| latency_p99_ms | 1.10 | 1.12 | 0.02 | +1.45% |

Configuration:

  • clients: 650
  • cluster_mode: False
  • data_size: 512
  • io_threads: 8
  • pipeline: 10
  • requests: 100000000
  • tls: False
  • warmup: 180

GET

| Metric | Baseline | PR | Diff | % Change |
|---|---|---|---|---|
| rps | 808924.06 | 800531.56 | -8392.50 | -1.04% |
| latency_avg_ms | 4.06 | 4.11 | 0.04 | +1.06% |
| latency_p50_ms | 4.06 | 4.11 | 0.05 | +1.18% |
| latency_p95_ms | 4.47 | 4.54 | 0.07 | +1.61% |
| latency_p99_ms | 4.58 | 4.65 | 0.07 | +1.57% |

SET

| Metric | Baseline | PR | Diff | % Change |
|---|---|---|---|---|
| rps | 815361.44 | 809198.94 | -6162.50 | -0.76% |
| latency_avg_ms | 4.05 | 4.08 | 0.03 | +0.77% |
| latency_p50_ms | 4.05 | 4.07 | 0.02 | +0.59% |
| latency_p95_ms | 4.38 | 4.38 | 0.00 | +0.00% |
| latency_p99_ms | 4.61 | 4.54 | -0.07 | -1.56% |

@zuiderkwast
Contributor

I believe the benchmarks done with valkey-benchmark in cluster mode use keys of the form key{XYZ}:1234567890. When curly braces are used within a key name, only the characters between {...} are hashed with CRC16, instead of the whole key. The {...} syntax is a common and useful way to hash different keys to the same cluster slot so they are stored on the same cluster node, but in real-world scenarios the part between the curly braces is typically longer, say 10-20 bytes. Valkey-benchmark only puts 2-3 characters between the braces.

In this sense, this is perhaps not a fair way to benchmark the CRC16 implementation.

@Cuda-Chen
Author

Hi @zuiderkwast ,

In this sense, this is perhaps not a fair way to benchmark the CRC16 implementation.

Are there any existing issues about refining the benchmark?
I think I can help with those.
Or should I design another benchmarking scenario?

@zuiderkwast
Contributor

zuiderkwast commented Oct 7, 2025

Hi @zuiderkwast ,

In this sense, this is perhaps not a fair way to benchmark the CRC16 implementation.

Are there any existing issues about refining the benchmark? I think I can help with those. Or should I design another benchmarking scenario?

Optimizing the CRC16 is the only scenario where this matters, so I guess it's not worth changing the way valkey-benchmark creates keys to route them to each cluster node. (It does this in a somewhat reverse way: It selects a cluster node and then it creates a key that maps to that node.)

What you can do instead is to use a single valkey node running in cluster mode and owning all the slots. Then run valkey-benchmark in non-cluster mode. Valkey-benchmark will not be aware that it's connected to a cluster, so the keys will not contain curly braces. They will look like key:1234567890 instead. The valkey node will however still need to compute the CRC16 of each key to look up the cluster slot. I think this will be more realistic in terms of CRC16 input length.

Something like this:

$ cd src
$ rm dump.rdb nodes.conf # clean up any old data files (if they exist)
$ ./valkey-server --cluster-enabled yes --save '' &
(...)
2622423:M 07 Oct 2025 17:48:58.872 * Server initialized
2622423:M 07 Oct 2025 17:48:58.873 * Ready to accept connections tcp

$ ./valkey-cli cluster addslotsrange 0 16383
OK
$ ./valkey-cli cluster info | head -3
cluster_state:ok
cluster_slots_assigned:16384
cluster_slots_ok:16384

Now it should be possible to run valkey-benchmark against this node.

$ ./valkey-benchmark --threads 3 -P 10 -n 10000000 -r 1000000 -t set,get -q

Note that the benchmark creates keys and doesn't delete them afterwards, so either restart the server or delete all keys using ./valkey-cli flushall before running the benchmark again.

An idea: For testing the CRC16, it should be enough to run only the GET command with no keys stored on the server. The server will still need to run the CRC16 calculation. For this, use -t get instead of -t set,get.

@Cuda-Chen
Author

Hi @zuiderkwast ,

Thanks for the benchmarking procedure.

I would like to show some benchmark results:

Environment

  • CPU: Intel(R) Core(TM) i5-3230M CPU @ 2.60GHz
  • OS: Linux 5.15.0-157-generic

Baseline (commit d5bb986fd592733330ff4042c011f7c64da543ec on unstable branch)

# -t set,get
SET: 293151.97 requests per second, p50=1.455 msec                    
GET: 369248.94 requests per second, p50=1.167 msec

# -t get
GET: 458442.22 requests per second, p50=0.983 msec

PR

# -t set,get
SET: 252194.08 requests per second, p50=1.615 msec                    
GET: 309042.56 requests per second, p50=1.319 msec  

# -t get
GET: 335615.53 requests per second, p50=1.167 msec 

Some Thoughts

  • The CLMUL implementation uses a bit-reflected algorithm, which leaves room for improvement.

@zuiderkwast
Contributor

I guess CLMUL is using more instructions than the table lookup. Maybe the current lookup table implementation is actually optimal for our use case...?

@zuiderkwast
Contributor

... or will you try to get rid of the bit reverse code and then maybe CLMUL is actually faster? You can try if you want.

Did you forget to git add src/unit/test_crc16.c?

@Cuda-Chen
Author

Hi @zuiderkwast ,

I guess CLMUL is using more instructions than the table lookup. Maybe the current lookup table implementation is actually optimal for our use case...?

As the input length is just 10~20 bytes, I think we can try a multi-byte table lookup (CLMUL is usually applied to much longer inputs, such as 4 KB).

... or will you try to get rid of the bit reverse code and then maybe CLMUL is actually faster? You can try if you want.

I found a possible solution that avoids bit reversal (https://github.com/awesomized/crc-fast-rust/blob/main/src/crc32/algorithm.rs#L164). I will run some experiments to check whether bit shifting can eliminate the bit reversal.

Did you forget to git add src/unit/test_crc16.c?

Thanks for reminding me. I will commit later.

@Cuda-Chen
Author

Cuda-Chen commented Oct 14, 2025

Hi @zuiderkwast ,

I would like to share some experiment results.
I implemented the forward calculation (with some optimizations); the following block shows the benchmark:

# -t set,get
SET: 265674.81 requests per second, p50=1.615 msec                    
GET: 299652.38 requests per second, p50=1.351 msec

# -t get
GET: 366501.75 requests per second, p50=1.167 msec 

I think the reason the forward implementation performs worse than the baseline is that the input message length is just 10~20 bytes (the CLMUL instruction has a latency of 3~8 cycles).

@zuiderkwast
Contributor

I think the reason the forward implementation performs worse than the baseline is that the input message length is just 10~20 bytes (the CLMUL instruction has a latency of 3~8 cycles).

Yes, it makes sense.

I found that this code https://github.com/TobiasBengtsson/crc-fast-rs/blob/master/crc-fast-gen/src/lib.rs#L222 uses table lookup for payloads < 128 bytes and SIMD algorithms only for larger payloads. So maybe table lookup is optimal for our use case? That is my guess.

@Cuda-Chen
Author

Hi @zuiderkwast ,

So maybe the table lookup is optimal for our use case? That is my guess.

I agree. I think I will work on a multi-byte table lookup.

If there are no further concerns, I will close this PR.

@zuiderkwast
Contributor

I agree. I think I will work on a multi-byte table lookup.

A multi-byte table means a much larger table than the current 512-byte one? I think it will fill up the L1 cache, evicting other important data we want to keep there. I'm not sure it's a good idea. Are you sure you want to try it?

If there are no further concerns, I will close this PR.

Yes, you can close this PR. Thank you for your effort!

@Cuda-Chen
Author

A multi-byte table means a much larger table than the current 512-byte one?

Yes. Taking processing 4 bytes at a time as an example, this requires four pre-computed tables, which comes to 2 KB (512 * 4).

I think it will fill up the L1 cache, causing evictions of other important data that we want to keep in the L1 cache. I'm not sure it's a good idea. Are you sure you want to try it?

I will give it a try, and if I get an inferior benchmark result, I will leave my findings in a related issue so that we have a record.

@Cuda-Chen Cuda-Chen closed this Oct 14, 2025
