Conversation

Cuda-Chen

No description provided.

@Cuda-Chen
Author

Hi @roshkhatri ,

I would like to request your help with triggering the run-benchmark CI, thanks!

@Cuda-Chen Cuda-Chen force-pushed the crc16-x86-barrent branch 3 times, most recently from eefa2e0 to b93048a on October 6, 2025 14:55

codecov bot commented Oct 6, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 72.59%. Comparing base (d5bb986) to head (05b1c2c).
⚠️ Report is 16 commits behind head on unstable.

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #2691      +/-   ##
============================================
+ Coverage     72.19%   72.59%   +0.40%     
============================================
  Files           128      128              
  Lines         71005    71290     +285     
============================================
+ Hits          51263    51756     +493     
+ Misses        19742    19534     -208     
| Files with missing lines | Coverage Δ |
|---|---|
| src/crc16.c | 100.00% <100.00%> (ø) |

... and 26 files with indirect coverage changes



github-actions bot commented Oct 6, 2025

Benchmark ran on this commit: b93048a

Benchmark Comparison by Configuration

Configuration:

  • clients: 50
  • cluster_mode: False
  • data_size: 16
  • pipeline: 10
  • requests: 100000000
  • tls: False
  • warmup: 120

GET

| Metric | Baseline | PR | Diff | % Change |
|---|---|---|---|---|
| rps | 831158.44 | 836393.19 | 5234.75 | +0.63% |
| latency_avg_ms | 0.54 | 0.53 | -0.01 | -0.93% |
| latency_p50_ms | 0.52 | 0.51 | -0.01 | -1.54% |
| latency_p95_ms | 0.72 | 0.71 | -0.01 | -1.11% |
| latency_p99_ms | 0.81 | 0.80 | -0.01 | -0.99% |

MGET

| Metric | Baseline | PR | Diff | % Change |
|---|---|---|---|---|
| rps | 298839.31 | 299801.53 | 962.22 | +0.32% |
| latency_avg_ms | 1.55 | 1.54 | -0.01 | -0.52% |
| latency_p50_ms | 1.57 | 1.56 | -0.02 | -1.02% |
| latency_p95_ms | 1.90 | 1.89 | -0.01 | -0.42% |
| latency_p99_ms | 1.97 | 1.95 | -0.02 | -0.81% |

MSET

| Metric | Baseline | PR | Diff | % Change |
|---|---|---|---|---|
| rps | 226919.45 | 227264.97 | 345.52 | +0.15% |
| latency_avg_ms | 2.12 | 2.12 | -0.00 | -0.14% |
| latency_p50_ms | 2.33 | 2.33 | 0.00 | +0.00% |
| latency_p95_ms | 2.60 | 2.60 | 0.00 | +0.00% |
| latency_p99_ms | 2.68 | 2.68 | 0.00 | +0.00% |

SET

| Metric | Baseline | PR | Diff | % Change |
|---|---|---|---|---|
| rps | 756212.25 | 754221.75 | -1990.50 | -0.26% |
| latency_avg_ms | 0.59 | 0.60 | 0.00 | +0.34% |
| latency_p50_ms | 0.57 | 0.57 | 0.00 | +0.00% |
| latency_p95_ms | 0.79 | 0.79 | 0.00 | +0.00% |
| latency_p99_ms | 0.87 | 0.87 | 0.00 | +0.00% |

Configuration:

  • clients: 50
  • cluster_mode: True
  • data_size: 16
  • pipeline: 10
  • requests: 100000000
  • tls: False
  • warmup: 120

GET

| Metric | Baseline | PR | Diff | % Change |
|---|---|---|---|---|
| rps | 746642.00 | 711642.44 | -34999.56 | -4.69% |
| latency_avg_ms | 0.60 | 0.64 | 0.03 | +5.29% |
| latency_p50_ms | 0.57 | 0.61 | 0.03 | +5.57% |
| latency_p95_ms | 0.80 | 0.84 | 0.04 | +5.01% |
| latency_p99_ms | 0.89 | 0.92 | 0.03 | +3.61% |

SET

| Metric | Baseline | PR | Diff | % Change |
|---|---|---|---|---|
| rps | 650296.50 | 621067.88 | -29228.62 | -4.49% |
| latency_avg_ms | 0.70 | 0.74 | 0.04 | +4.96% |
| latency_p50_ms | 0.68 | 0.73 | 0.05 | +7.07% |
| latency_p95_ms | 0.94 | 0.97 | 0.04 | +4.28% |
| latency_p99_ms | 1.10 | 1.12 | 0.02 | +1.45% |

Configuration:

  • clients: 650
  • cluster_mode: False
  • data_size: 512
  • io_threads: 8
  • pipeline: 10
  • requests: 100000000
  • tls: False
  • warmup: 180

GET

| Metric | Baseline | PR | Diff | % Change |
|---|---|---|---|---|
| rps | 808924.06 | 800531.56 | -8392.50 | -1.04% |
| latency_avg_ms | 4.06 | 4.11 | 0.04 | +1.06% |
| latency_p50_ms | 4.06 | 4.11 | 0.05 | +1.18% |
| latency_p95_ms | 4.47 | 4.54 | 0.07 | +1.61% |
| latency_p99_ms | 4.58 | 4.65 | 0.07 | +1.57% |

SET

| Metric | Baseline | PR | Diff | % Change |
|---|---|---|---|---|
| rps | 815361.44 | 809198.94 | -6162.50 | -0.76% |
| latency_avg_ms | 4.05 | 4.08 | 0.03 | +0.77% |
| latency_p50_ms | 4.05 | 4.07 | 0.02 | +0.59% |
| latency_p95_ms | 4.38 | 4.38 | 0.00 | +0.00% |
| latency_p99_ms | 4.61 | 4.54 | -0.07 | -1.56% |

@zuiderkwast
Contributor

I believe the benchmarks done with valkey-benchmark in cluster mode use keys of the form key{XYZ}:1234567890. When curly braces are used within a key name, only the characters between {...} are hashed with CRC16, instead of the whole key. The {...} syntax is a common and useful way to hash different keys to the same cluster slot so they are stored on the same cluster node, but in real-world scenarios the part between the curly braces is typically longer, say 10-20 bytes. Valkey-benchmark only puts 2-3 characters between the braces.

In this sense, this is perhaps not a fair way to benchmark the CRC16 implementation.

@Cuda-Chen
Author

Hi @zuiderkwast ,

In this sense, this is perhaps not a fair way to benchmark the CRC16 implementation.

Are there any existing issues about refining the benchmark?
I think I can help with those.
Or should I design another benchmarking scenario?

@zuiderkwast
Contributor

zuiderkwast commented Oct 7, 2025

Hi @zuiderkwast ,

In this sense, this is perhaps not a fair way to benchmark the CRC16 implementation.

Are there any existing issues about refining the benchmark? I think I can help with those. Or should I design another benchmarking scenario?

Optimizing the CRC16 is the only scenario where this matters, so I guess it's not worth changing the way valkey-benchmark creates keys to route them to each cluster node. (It does this in a somewhat reverse way: It selects a cluster node and then it creates a key that maps to that node.)

What you can do instead is to use a single valkey node running in cluster mode and owning all the slots. Then run valkey-benchmark in non-cluster mode. Valkey-benchmark will not be aware that it's connected to a cluster, so the keys will not contain curly braces. They will look like key:1234567890 instead. The valkey node will however still need to compute the CRC16 of each key to look up the cluster slot. I think this will be more realistic in terms of CRC16 input length.

Something like this:

$ cd src
$ rm dump.rdb nodes.conf # clean up any old data files (if they exist)
$ ./valkey-server --cluster-enabled yes --save '' &
(...)
2622423:M 07 Oct 2025 17:48:58.872 * Server initialized
2622423:M 07 Oct 2025 17:48:58.873 * Ready to accept connections tcp

$ ./valkey-cli cluster addslotsrange 0 16383
OK
$ ./valkey-cli cluster info | head -3
cluster_state:ok
cluster_slots_assigned:16384
cluster_slots_ok:16384

Now it should be possible to run valkey-benchmark against this node.

$ ./valkey-benchmark --threads 3 -P 10 -n 10000000 -r 1000000 -t set,get -q

Note that the benchmark creates keys and doesn't delete them afterwards, so either restart the server or delete all keys using ./valkey-cli flushall before running the benchmark again.

An idea: For testing the CRC16, it should be enough to run only the GET command with no keys stored on the server. The server will still need to run the CRC16 calculation. For this, use -t get instead of -t set,get.

@Cuda-Chen
Author

Hi @zuiderkwast ,

Thanks for the benchmarking procedure.

I would like to show some benchmark results:

Environment

  • CPU: Intel(R) Core(TM) i5-3230M CPU @ 2.60GHz
  • OS: Linux 5.15.0-157-generic

Baseline (commit d5bb986fd592733330ff4042c011f7c64da543ec on unstable branch)

# -t set,get
SET: 293151.97 requests per second, p50=1.455 msec                    
GET: 369248.94 requests per second, p50=1.167 msec

# -t get
GET: 458442.22 requests per second, p50=0.983 msec

PR

# -t set,get
SET: 252194.08 requests per second, p50=1.615 msec                    
GET: 309042.56 requests per second, p50=1.319 msec  

# -t get
GET: 335615.53 requests per second, p50=1.167 msec 

Some Thoughts

  • The CLMUL implementation uses a bit-reflected algorithm, which leaves room for improvement.

@zuiderkwast
Contributor

I guess CLMUL is using more instructions than the table lookup. Maybe the current lookup table implementation is actually optimal for our use case...?

@zuiderkwast
Contributor

... or will you try to get rid of the bit reverse code and then maybe CLMUL is actually faster? You can try if you want.

Did you forget to git add src/unit/test_crc16.c?

@Cuda-Chen
Author

Hi @zuiderkwast ,

I guess CLMUL is using more instructions than the table lookup. Maybe the current lookup table implementation is actually optimal for our use case...?

As the input length is just 10~20 bytes, I think we can try a multi-byte table lookup (CLMUL is usually applied to much longer inputs, such as 4 KB).

... or will you try to get rid of the bit reverse code and then maybe CLMUL is actually faster? You can try if you want.

I found a possible solution that avoids bit reversal (https://github.com/awesomized/crc-fast-rust/blob/main/src/crc32/algorithm.rs#L164). I will run some experiments to check whether bit shifting can eliminate the bit reversal.

Did you forget to git add src/unit/test_crc16.c?

Thanks for reminding me. I will commit later.

@Cuda-Chen
Author

Cuda-Chen commented Oct 14, 2025

Hi @zuiderkwast ,

I would like to share some experiment results.
I implemented the forward calculation (with some optimizations); the following block shows the benchmark:

# -t set,get
SET: 265674.81 requests per second, p50=1.615 msec                    
GET: 299652.38 requests per second, p50=1.351 msec

# -t get
GET: 366501.75 requests per second, p50=1.167 msec 

I think the reason the forward implementation performs worse than the baseline is that the input message length is just 10~20 bytes (the CLMUL instruction has a latency of 3~8 cycles).

@zuiderkwast
Contributor

I think the reason the forward implementation performs worse than the baseline is that the input message length is just 10~20 bytes (the CLMUL instruction has a latency of 3~8 cycles).

Yes, it makes sense.

I found that this code https://github.com/TobiasBengtsson/crc-fast-rs/blob/master/crc-fast-gen/src/lib.rs#L222 uses table lookup for payloads < 128 bytes and SIMD algorithms only for larger payloads. So maybe table lookup is optimal for our use case? That is my guess.

@Cuda-Chen
Author

Hi @zuiderkwast ,

So maybe the table lookup is optimal for our use case? That is my guess.

I agree. I think I will work on a multi-byte table lookup.

If there are no further concerns, I will close this PR.

@zuiderkwast
Contributor

I agree. I think I will work on a multi-byte table lookup.

A multi-byte table means a much larger table than the current 512-byte one? I think it will fill up the L1 cache, evicting other important data we want to keep there. I'm not sure it's a good idea. Are you sure you want to try it?

If there are no further concerns, I will close this PR.

Yes, you can close this PR. Thank you for your effort!

@Cuda-Chen
Author

A multi-byte table means a much larger table than the current 512-byte one?

Yes. Taking processing 4 bytes at a time as an example, this requires four pre-computed tables, which comes to 2 KB (512 * 4).

I think it will fill up the L1 cache, causing evictions of other important data that we want to keep in the L1 cache. I'm not sure it's a good idea. Are you sure you want to try it?

I will give it a try, and if I get an inferior benchmark result, I will leave my findings in a related issue so that we have a record.

@Cuda-Chen Cuda-Chen closed this Oct 14, 2025
