Use XxHash3 in String.GetNonRandomizedHashCode #116057


Closed · wants to merge 8 commits

Conversation

EgorBo (Member) commented May 28, 2025

Let's see how this goes.

Presumably, it's not only faster for large inputs, but it also provides much better distribution (which is poor in the current implementation).

Binary size impact on Release corelib on x64 (should be less on arm64):

IL-only size: +6 KB (a +0.13% size regression)
R2R'd size: +12 KB (a +0.08% size regression)

At the moment it is limited to x64 and arm64 targets, hence no impact on browser/mobile targets.

We can also replace Marvin with XxHash32: #85206 (but not in this PR)

@github-actions bot added the needs-area-label label May 28, 2025
@EgorBo added the area-System.Runtime label and removed the needs-area-label label May 28, 2025
Contributor

Tagging subscribers to this area: @dotnet/area-system-runtime
See info in area-owners.md if you want to be subscribed.

EgorBo (Member, Author) commented May 28, 2025

@EgorBot -amd -intel -arm

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(Bench).Assembly).Run(args);

public class Bench
{
    private static Dictionary<string, int> Dictionary
        // We don't need to fill the dictionary as we're only benchmarking the hashcode operation.
        = new() {{ "1", 2 }};

    [Benchmark]
    [ArgumentsSource(nameof(GetKeys))]
    public bool Contains(string key) => Dictionary.ContainsKey(key);

    private static IEnumerable<string> GetKeys()
    {
        foreach (int len in (int[])[1, 10, 20, 30, 40, 50, 75, 100, 1000, 10000])
            yield return new string('a', len);
    }
}

neon-sunset (Contributor) commented May 28, 2025

I think the cost for short lengths (especially at cut-off boundaries) can be somewhat improved by massaging the dispatch inside XXH3 (right now it's 3 or so calls across the various length ranges) and the inlining of the methods it dispatches to. The setup is a bit costly, though, and is still likely to lose to FNV-1a on short lengths.

@EgorBo EgorBo force-pushed the xxhash3-corelib2 branch from 982d0d3 to 2ac83d3 Compare May 28, 2025 12:45
@EgorBo EgorBo marked this pull request as ready for review May 28, 2025 14:03
@Copilot Copilot AI review requested due to automatic review settings May 28, 2025 14:03
Copilot AI (Contributor) left a comment

Pull Request Overview

This PR integrates XXH3 into String.GetNonRandomizedHashCode on x64/ARM64, replacing Marvin for large inputs to improve distribution and performance. Shared non-cryptographic hash implementations are linked into CoreLib and the separate System.IO.Hashing package.

  • Add USE_XXHASH3 macro, thresholds, and conditional XXH3 usage in String.GetNonRandomizedHashCode
  • Link shared hashing algorithm sources into CoreLib and System.IO.Hashing projects
  • Introduce a new resource string (NotSupported_GetHashCode) in Strings.resx

Reviewed Changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.

Summary per file:

  • src/libraries/System.Private.CoreLib/src/System/String.Comparison.cs — Use XXH3 for non-randomized string hashing when span length ≥ threshold
  • src/libraries/System.Private.CoreLib/src/System.Private.CoreLib.Shared.projitems — Include common hashing sources in the CoreLib build
  • src/libraries/System.Private.CoreLib/src/Resources/Strings.resx — Add new “NotSupported_GetHashCode” resource
  • src/libraries/System.IO.Hashing/src/System.IO.Hashing.csproj — Link common hashing sources into the System.IO.Hashing project
  • src/libraries/Common/src/System/IO/Hashing/XxHashShared.cs — Conditional documentation for the CORELIB build
  • src/libraries/Common/src/System/IO/Hashing/XxHash64*.cs, XxHash3.cs & NonCryptographicHashAlgorithm.cs — Adjust visibility and add XXH3 entry point for CORELIB
Comments suppressed due to low confidence (2)

src/libraries/System.Private.CoreLib/src/System/String.Comparison.cs:864

  • There are no existing tests covering the new XXH3 code path for different span lengths. Add unit tests that verify correct hash values when length is both below and above XxHash3Threshold.
if (length >= XxHash3Threshold)

src/libraries/System.Private.CoreLib/src/Resources/Strings.resx:4364

  • The new NotSupported_GetHashCode resource is added but never referenced in code. Either remove it or update the implementation to throw or use this message where appropriate.
<data name="NotSupported_GetHashCode" xml:space="preserve">

EgorBo (Member, Author) commented May 28, 2025

@stephentoub @jkotas does this look acceptable? Presumably the same code can be reused to replace Marvin (but I'm not sure I can prove it is as good as Marvin in terms of collisions and DDoS protection).

Here is the latest benchmark run: EgorBot/runtime-utils#363 (comment)

I've unified GetNonRandomizedHashCode() with GetNonRandomizedHashCode(ROS<char>). The latter is a bit more optimized towards lengths 1, 2, 3, hence a minor improvement there. There is some small level of variance, but I'm not sure ±0.5 ns can be taken seriously.

I suspect we might want to optimize XxHash3 for arm64 e.g. with double vectors to slightly reduce the gap.

EgorBo (Member, Author) commented May 28, 2025

I wrote a simple app that splits large strings into random chunks (1..1000) and calculates the number of collisions in a Dictionary for Marvin vs XxHash3. The number of collisions depends on the initial seed (the variance is pretty small), but overall it seems that both implementations are on par in terms of collisions.
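The collision-counting app itself isn't shown in the thread; a minimal sketch of that kind of experiment might look like the following. It uses the public System.IO.Hashing NuGet package for XXH3; CoreLib's Marvin implementation is internal, so the randomized string.GetHashCode stands in for it here. All names and parameters are illustrative, not the PR author's actual code.

```csharp
using System;
using System.Collections.Generic;
using System.IO.Hashing; // NuGet package: System.IO.Hashing
using System.Runtime.InteropServices;

// Split a large random string into random-length chunks (1..1000) and count
// collisions as (distinct keys - distinct hash values) for each hash function.
var rng = new Random(12345);
var chars = new char[1_000_000];
for (int i = 0; i < chars.Length; i++)
    chars[i] = (char)rng.Next('a', 'z' + 1);
string source = new string(chars);

var keys = new HashSet<string>();
for (int pos = 0; pos < source.Length;)
{
    int len = Math.Min(rng.Next(1, 1001), source.Length - pos);
    keys.Add(source.Substring(pos, len));
    pos += len;
}

var xxh3 = new HashSet<ulong>();
var marvin = new HashSet<int>();
foreach (string key in keys)
{
    // Hash the raw UTF-16 bytes, as the PR does for string contents.
    xxh3.Add(XxHash3.HashToUInt64(MemoryMarshal.AsBytes(key.AsSpan())));
    marvin.Add(key.GetHashCode()); // randomized Marvin in CoreLib
}

Console.WriteLine($"keys={keys.Count}, " +
                  $"XxHash3 collisions={keys.Count - xxh3.Count}, " +
                  $"Marvin collisions={keys.Count - marvin.Count}");
```

With a fixed seed the chunking is deterministic, so repeated runs compare the same key set; varying the seed (as described above) changes the counts slightly.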

if (length >= XxHash3Threshold)
{
    uint byteLength = (uint)length * 2; // never overflows
    return unchecked((int)System.IO.Hashing.XxHash3.NonRandomizedHashToUInt64((byte*)src, byteLength));
Reviewer (Member) commented:

Is there any way we can improve the implementation of XxHash3 to the point where we could always use it, rather than having fallbacks for < XxHash3Threshold? Or is that simply not possible because the number of instructions required is inherently higher?

EgorBo (Member, Author) replied May 28, 2025:

To be fair, I am not sure; it seems to be quite noticeably slower for small inputs. Do you mind if I investigate that in a separate PR?

Reviewer (Member) commented:

It's a bit of a slippery slope to mix different hash algorithms for different string lengths. Once you start mixing strings with lengths below and above the threshold, any guarantees about collisions become weaker and less predictable.

EgorBo (Member, Author) replied:

Yeah, I've just pushed an unconditional switch to XxHash3; we'll see how it goes.

@EgorBo EgorBo force-pushed the xxhash3-corelib2 branch from 5627695 to da97666 Compare May 28, 2025 19:54
@EgorBo EgorBo closed this May 28, 2025
@EgorBo EgorBo reopened this May 28, 2025
return current;
<= 128 => HashLength17To128(sourcePtr, length, 0UL),
<= MidSizeMaxBytes => HashLength129To240(sourcePtr, length, 0UL),
_ => HashLengthOver240(sourcePtr, length, 0UL)
Reviewer (Member) commented:

Nit: Why change the style from what's used by HashToUInt64 above?

EgorBo (Member, Author) replied:

Changed

@@ -0,0 +1,170 @@
// Licensed to the .NET Foundation under one or more agreements.
Reviewer (Member) commented:

This file seems misnamed. The other Xx.State.cs files are named such because they contain the corresponding Xx.State types. This, however, is the public XxHash3.

EgorBo (Member, Author) replied:

Renamed the common part to .Common.cs (we use this suffix for many other shared files).


#if SYSTEM_PRIVATE_CORELIB
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static int NonRandomizedHashToInt32(byte* sourcePtr, uint length)
Reviewer (Member) commented:

Why do we still need this method?

EgorBo (Member, Author) commented May 29, 2025

@EgorBot -arm -amd -windows_intel

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(Bench).Assembly).Run(args);

public class Bench
{
    private static Dictionary<string, int> Dictionary
        // We don't need to fill the dictionary as we're only benchmarking the hashcode operation.
        = new() { { "1", 2 } };

    [Benchmark]
    [ArgumentsSource(nameof(GetKeys))]
    public bool Contains(string key) => Dictionary.ContainsKey(key);

    public static IEnumerable<string> GetKeys()
    {
        foreach (int len in (int[])[1, 3, 5, 8, 10, 16, 20, 32, 40, 50, 100, 1000, 10000])
            yield return new string('a', len);
    }
}

jkotas (Member) commented May 29, 2025

> use jump table

Try to benchmark it with an input that varies randomly. Jump tables are great for trivial micro-benchmarks, but they are less great for real-world code.

EgorBo (Member, Author) commented May 29, 2025

> use jump table
>
> Try to benchmark it with an input that varies randomly. Jump tables are great for trivial micro-benchmarks, but they are less great for real-world code.

I am not sure I can make this work even theoretically, given the complexity difference between the current and xxhash versions for small inputs. Below a certain threshold there will always be regressions; e.g., here is the codegen for lengths 1-3:

       movzx    rax, byte  ptr [rcx]
       shl      eax, 16
       mov      r10d, edx
       shr      r10d, 1
       movzx    r10, byte  ptr [rcx+r10]
       shl      r10d, 24
       or       eax, r10d
       lea      r10d, [rdx-0x01]
       movzx    rcx, byte  ptr [rcx+r10]
       or       eax, ecx
       shl      edx, 8
       or       eax, edx
       mov      ecx, 0x87275A9B
       add      rcx, r8
       xor      rax, rcx
       mov      rcx, rax
       shr      rcx, 33
       xor      rcx, rax
       mov      rax, 0xC2B2AE3D27D4EB4F
       imul     rax, rcx
       mov      rcx, rax
       shr      rcx, 29
       xor      rcx, rax
       mov      rax, 0x165667B19E3779F9
       imul     rax, rcx
       mov      rcx, rax
       shr      rcx, 32
       xor      rax, rcx

It's a bit simpler if I hard-code and duplicate it for lengths 1, 2, 3, but I can't do that for all lengths anyway.
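For illustration, the "hard-code and duplicate for 1, 2, 3" idea could be sketched like this. This is a hypothetical reconstruction from the codegen above (which computes d[0]<<16 | d[len>>1]<<24 | d[len-1] | len<<8 for XXH3's 1..3-byte path), not the PR's actual code; CombineGeneric and CombineSpecialized are made-up names.

```csharp
using System;

// Generic XXH3-style 1..3-byte combine, matching the codegen above:
// d[0]<<16 | d[len>>1]<<24 | d[len-1] | len<<8.
static uint CombineGeneric(ReadOnlySpan<byte> d)
{
    int len = d.Length;
    return (uint)d[0] << 16 | (uint)d[len >> 1] << 24 | d[len - 1] | (uint)len << 8;
}

// Hypothetical per-length specialization: the indices and the length term
// become constants, removing the shr/lea address arithmetic seen above.
static uint CombineSpecialized(ReadOnlySpan<byte> d) => d.Length switch
{
    1 => (uint)d[0] << 16 | (uint)d[0] << 24 | d[0] | 1u << 8,
    2 => (uint)d[0] << 16 | (uint)d[1] << 24 | d[1] | 2u << 8,
    3 => (uint)d[0] << 16 | (uint)d[1] << 24 | d[2] | 3u << 8,
    _ => throw new ArgumentOutOfRangeException(nameof(d)),
};

byte[] bytes = { 0x12, 0x34, 0x56 };
for (int len = 1; len <= 3; len++)
    Console.WriteLine(CombineGeneric(bytes.AsSpan(0, len)) ==
                      CombineSpecialized(bytes.AsSpan(0, len)));
// Prints True three times.
```

As the comment notes, this saves only a few address-math instructions per case; the multiply/xor avalanche steps that dominate the listing above remain either way, which is why the regression versus FNV-1a persists.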

EgorBo (Member, Author) commented May 29, 2025

That being said, I think there is nothing we can do.

@EgorBo EgorBo closed this May 29, 2025
filipnavara (Member) commented May 29, 2025

> I am not sure I can make this work even theoretically, given the complexity difference between the current and xxhash versions for small inputs. Below a certain threshold there will always be regressions; e.g., here is the codegen for lengths 1-3:

I guess it's expected to have some regression in a micro-benchmark like this. However, I remember you said that hash quality for short strings is pretty bad, so perhaps the extra cycles spent on hashing can save collisions in some more real-world scenarios and make this worth it?

EgorBo (Member, Author) commented May 29, 2025

> I am not sure I can make this work even theoretically, given the complexity difference between the current and xxhash versions for small inputs. Below a certain threshold there will always be regressions; e.g., here is the codegen for lengths 1-3:
>
> I guess it's expected to have some regression in a micro-benchmark like this. However, I remember you said that hash quality for short strings is pretty bad, so perhaps the extra cycles spent on hashing can save collisions in some more real-world scenarios and make this worth it?

Some numbers are pretty bad; I'm not sure we can sell this as an improvement, given that strings are usually small rather than large. I don't see obvious issues in the codegen to fix (besides duplicating code for each length constant).

Maybe with some (static) PGO magic we could switch comparers depending on the input somehow; no idea what that would look like.

am11 (Member) commented May 29, 2025

#85206 (comment)

Perhaps worth trying out https://github.com/ogxd/gxhash-csharp/blob/main/GxHash/GxHash.cs. The original source is also MIT-licensed: https://github.com/ogxd/gxhash.

filipnavara (Member) commented May 29, 2025

> Some numbers are pretty bad; I'm not sure we can sell this as an improvement, given that strings are usually small rather than large. I don't see obvious issues in the codegen to fix (besides duplicating code for each length constant).

Yes, they are, but it's an extremely unrealistic benchmark: you are benchmarking just the hash function in isolation. It would be more reasonable to show how it performs on some realistic workload. For example, counting digraph/trigraph frequencies in a Dictionary<string, int> prefilled with some data (say, digraphs and trigraphs collected from a random Wikipedia article) is something one could use to estimate language similarity. If the current hash is calculated fast but gives poor collision behavior, then xxHash may still be a win. This scenario is reasonably easy to write and benchmark: both building the index and then checking some input against it.
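The suggested workload could be sketched roughly as follows. This is a hedged illustration of the idea, not a ready benchmark: the corpus string is a tiny placeholder (a real test would prefill from a large text such as a Wikipedia article), and every dictionary operation here hashes a short 2-3 character key, the exact case under discussion.

```csharp
using System;
using System.Collections.Generic;

// Build a Dictionary<string, int> of digraph/trigraph frequencies from a
// corpus, then probe it with n-grams from an input. Every insert and lookup
// hashes a short (2-3 char) string key.
string corpus = "the quick brown fox jumps over the lazy dog"; // placeholder corpus
var index = new Dictionary<string, int>();
for (int n = 2; n <= 3; n++)
    for (int i = 0; i + n <= corpus.Length; i++)
    {
        string gram = corpus.Substring(i, n);
        index[gram] = index.GetValueOrDefault(gram) + 1;
    }

// Score an input by how many of its n-grams appear in the index.
string input = "the lazy fox";
int hits = 0, total = 0;
for (int n = 2; n <= 3; n++)
    for (int i = 0; i + n <= input.Length; i++, total++)
        if (index.ContainsKey(input.Substring(i, n)))
            hits++;

Console.WriteLine($"{hits}/{total} n-grams matched");
```

Timing both the index-build loop and the probe loop (e.g. under BenchmarkDotNet) would exercise the short-key hashing path against real collision behavior, rather than the hash function in isolation.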

6 participants