Use XxHash3 in String.GetNonRandomizedHashCode #116057


Closed · wants to merge 8 commits

Conversation

EgorBo (Member) commented May 28, 2025

Let's see how this goes.

Presumably, it's not only faster for large inputs, but it also provides much better distribution (which is poor in the current implementation).

Binary size impact on Release corelib on x64 (should be less on arm64):

IL-only size: +6 KB (a +0.13% size regression)
R2R'd size: +12 KB (a +0.08% size regression)

At the moment it is limited to x64 and arm64 targets, hence no impact on browser/mobile targets.

We can also replace Marvin with XxHash32: #85206 (but not in this PR)

@github-actions bot added the needs-area-label label May 28, 2025
@EgorBo added the area-System.Runtime label and removed the needs-area-label label May 28, 2025
Contributor

Tagging subscribers to this area: @dotnet/area-system-runtime
See info in area-owners.md if you want to be subscribed.

EgorBo (Member, Author) commented May 28, 2025

@EgorBot -amd -intel -arm

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(Bench).Assembly).Run(args);

public class Bench
{
    private static Dictionary<string, int> Dictionary
        // We don't need to fill the dictionary as we're only benchmarking the hashcode operation.
        = new() {{ "1", 2 }};

    [Benchmark]
    [ArgumentsSource(nameof(GetKeys))]
    public bool Contains(string key) => Dictionary.ContainsKey(key);

    private static IEnumerable<string> GetKeys()
    {
        foreach (int len in (int[])[1, 10, 20, 30, 40, 50, 75, 100, 1000, 10000])
            yield return new string('a', len);
    }
}

neon-sunset (Contributor) commented May 28, 2025

I think the cost for short lengths (especially at cut-off boundaries) can be somewhat improved by massaging the dispatch inside XXH3 (right now it's 3 or so calls across the various length ranges) and the inlining of the methods it dispatches to. The setup is a bit costly, though, and is still likely to lose to FNV-1a on short lengths.

@EgorBo EgorBo force-pushed the xxhash3-corelib2 branch from 982d0d3 to 2ac83d3 Compare May 28, 2025 12:45
@EgorBo EgorBo marked this pull request as ready for review May 28, 2025 14:03
@Copilot Copilot AI review requested due to automatic review settings May 28, 2025 14:03
Copilot AI (Contributor) left a comment

Pull Request Overview

This PR integrates XXH3 into String.GetNonRandomizedHashCode on x64/ARM64, replacing Marvin for large inputs to improve distribution and performance. Shared non-cryptographic hash implementations are linked into CoreLib and the separate System.IO.Hashing package.

  • Add USE_XXHASH3 macro, thresholds, and conditional XXH3 usage in String.GetNonRandomizedHashCode
  • Link shared hashing algorithm sources into CoreLib and System.IO.Hashing projects
  • Introduce a new resource string (NotSupported_GetHashCode) in Strings.resx

Reviewed Changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.

Summary per file:

  • src/libraries/System.Private.CoreLib/src/System/String.Comparison.cs — Use XXH3 for non-randomized string hashing when span length ≥ threshold
  • src/libraries/System.Private.CoreLib/src/System.Private.CoreLib.Shared.projitems — Include common hashing sources in the CoreLib build
  • src/libraries/System.Private.CoreLib/src/Resources/Strings.resx — Add new “NotSupported_GetHashCode” resource
  • src/libraries/System.IO.Hashing/src/System.IO.Hashing.csproj — Link common hashing sources into the System.IO.Hashing project
  • src/libraries/Common/src/System/IO/Hashing/XxHashShared.cs — Conditional documentation for the CORELIB build
  • src/libraries/Common/src/System/IO/Hashing/XxHash64*.cs, XxHash3.cs & NonCryptographicHashAlgorithm.cs — Adjust visibility and add XXH3 entry point for CORELIB
Comments suppressed due to low confidence (2)

src/libraries/System.Private.CoreLib/src/System/String.Comparison.cs:864

  • There are no existing tests covering the new XXH3 code path for different span lengths. Add unit tests that verify correct hash values when length is both below and above XxHash3Threshold.
if (length >= XxHash3Threshold)

src/libraries/System.Private.CoreLib/src/Resources/Strings.resx:4364

  • The new NotSupported_GetHashCode resource is added but never referenced in code. Either remove it or update the implementation to throw or use this message where appropriate.
<data name="NotSupported_GetHashCode" xml:space="preserve">

EgorBo (Member, Author) commented May 28, 2025

@stephentoub @jkotas does this look acceptable? Presumably the same code can be reused to replace Marvin (but I'm not sure I can prove it is as good as Marvin in terms of collisions and DDoS protection).

Here is the latest benchmark run: EgorBot/runtime-utils#363 (comment)

I've unified GetNonRandomizedHashCode() with GetNonRandomizedHashCode(ROS<char>). The latter is a bit more optimized towards lengths 1, 2, 3, hence a minor improvement there. There is some small level of variance, but I'm not sure ±0.5 ns can be taken seriously.

I suspect we might want to optimize XxHash3 for arm64 e.g. with double vectors to slightly reduce the gap.

EgorBo (Member, Author) commented May 28, 2025

I wrote a simple app that splits large strings into random chunks (1..1000) and calculates the number of collisions in a Dictionary for Marvin vs XxHash3. The number of collisions depends on the initial seed (the variance is pretty small), but overall it seems that both implementations are on par in terms of collisions.
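The collision-counting app itself isn't shown in the thread; a minimal sketch of that kind of experiment might look like the following. It uses the public System.IO.Hashing NuGet package for XXH3; CoreLib's Marvin implementation is internal, so the randomized string.GetHashCode stands in for it here. All names and parameters are illustrative, not the PR author's actual code.

```csharp
using System;
using System.Collections.Generic;
using System.IO.Hashing; // NuGet package: System.IO.Hashing
using System.Runtime.InteropServices;

// Split a large random string into random-length chunks (1..1000) and count
// collisions as (distinct keys - distinct hash values) for each hash function.
var rng = new Random(12345);
var chars = new char[1_000_000];
for (int i = 0; i < chars.Length; i++)
    chars[i] = (char)rng.Next('a', 'z' + 1);
string source = new string(chars);

var keys = new HashSet<string>();
for (int pos = 0; pos < source.Length;)
{
    int len = Math.Min(rng.Next(1, 1001), source.Length - pos);
    keys.Add(source.Substring(pos, len));
    pos += len;
}

var xxh3 = new HashSet<ulong>();
var marvin = new HashSet<int>();
foreach (string key in keys)
{
    // Hash the raw UTF-16 bytes, as the PR does for string contents.
    xxh3.Add(XxHash3.HashToUInt64(MemoryMarshal.AsBytes(key.AsSpan())));
    marvin.Add(key.GetHashCode()); // randomized Marvin in CoreLib
}

Console.WriteLine($"keys={keys.Count}, " +
                  $"XxHash3 collisions={keys.Count - xxh3.Count}, " +
                  $"Marvin collisions={keys.Count - marvin.Count}");
```

With a fixed seed the chunking is deterministic, so repeated runs compare the same key set; varying the seed (as described above) changes the counts slightly.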

if (length >= XxHash3Threshold)
{
    uint byteLength = (uint)length * 2; // never overflows
    return unchecked((int)System.IO.Hashing.XxHash3.NonRandomizedHashToUInt64((byte*)src, byteLength));
Reviewer (Member) commented:

Is there any way we can improve the implementation of XxHash3 to the point where we could always use it, rather than having fallbacks for < XxHash3Threshold? Or is that simply not possible because the number of instructions required is inherently higher?

EgorBo (Member, Author) replied May 28, 2025:

To be fair, I am not sure; it seems to be quite noticeably slower for small inputs. Do you mind if I investigate that in a separate PR?

Reviewer (Member) commented:

It's a bit of a slippery slope to mix different hash algorithms for different string lengths. Once you start mixing strings with lengths below and above the threshold, any guarantees about collisions become weaker and less predictable.

EgorBo (Member, Author) replied:

Yeah, I've just pushed an unconditional switch to XxHash3; we'll see how it goes.

@EgorBo EgorBo force-pushed the xxhash3-corelib2 branch from 5627695 to da97666 Compare May 28, 2025 19:54
@EgorBo EgorBo closed this May 28, 2025
@EgorBo EgorBo reopened this May 28, 2025
return current;
<= 128 => HashLength17To128(sourcePtr, length, 0UL),
<= MidSizeMaxBytes => HashLength129To240(sourcePtr, length, 0UL),
_ => HashLengthOver240(sourcePtr, length, 0UL)
Reviewer (Member) commented:

Nit: Why change the style from what's used by HashToUInt64 above?

EgorBo (Member, Author) replied:

Changed

@@ -0,0 +1,170 @@
// Licensed to the .NET Foundation under one or more agreements.
Reviewer (Member) commented:

This file seems misnamed. The other Xx.State.cs files are named such because they contain the corresponding Xx.State types. This, however, is the public XxHash3.

EgorBo (Member, Author) replied:

Renamed the common part to .Common.cs (we use this suffix for many other shared files).


#if SYSTEM_PRIVATE_CORELIB
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static int NonRandomizedHashToInt32(byte* sourcePtr, uint length)
Reviewer (Member) commented:

Why do we still need this method?

EgorBo (Member, Author) commented May 29, 2025

@EgorBot -arm -amd -windows_intel

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(Bench).Assembly).Run(args);

public class Bench
{
    private static Dictionary<string, int> Dictionary
        // We don't need to fill the dictionary as we're only benchmarking the hashcode operation.
        = new() { { "1", 2 } };

    [Benchmark]
    [ArgumentsSource(nameof(GetKeys))]
    public bool Contains(string key) => Dictionary.ContainsKey(key);

    public static IEnumerable<string> GetKeys()
    {
        foreach (int len in (int[])[1, 3, 5, 8, 10, 16, 20, 32, 40, 50, 100, 1000, 10000])
            yield return new string('a', len);
    }
}

jkotas (Member) commented May 29, 2025

> use jump table

Try to benchmark it with an input that varies randomly. Jump tables are great for trivial micro-benchmarks, but they are less great for real-world code.

EgorBo (Member, Author) commented May 29, 2025

> use jump table
>
> Try to benchmark it with an input that varies randomly. Jump tables are great for trivial micro-benchmarks, but they are less great for real-world code.

I am not sure I can make this work even theoretically, given the complexity difference between the current and xxhash versions for small inputs. Below a certain threshold there will always be regressions; e.g., here is the codegen for lengths 1-3:

       movzx    rax, byte  ptr [rcx]
       shl      eax, 16
       mov      r10d, edx
       shr      r10d, 1
       movzx    r10, byte  ptr [rcx+r10]
       shl      r10d, 24
       or       eax, r10d
       lea      r10d, [rdx-0x01]
       movzx    rcx, byte  ptr [rcx+r10]
       or       eax, ecx
       shl      edx, 8
       or       eax, edx
       mov      ecx, 0x87275A9B
       add      rcx, r8
       xor      rax, rcx
       mov      rcx, rax
       shr      rcx, 33
       xor      rcx, rax
       mov      rax, 0xC2B2AE3D27D4EB4F
       imul     rax, rcx
       mov      rcx, rax
       shr      rcx, 29
       xor      rcx, rax
       mov      rax, 0x165667B19E3779F9
       imul     rax, rcx
       mov      rcx, rax
       shr      rcx, 32
       xor      rax, rcx

It's a bit simpler if I hard-code and duplicate it for lengths 1, 2, 3, but I can't do that for all lengths anyway.
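For illustration, the "hard-code and duplicate for 1, 2, 3" idea could be sketched like this. This is a hypothetical reconstruction from the codegen above (which computes d[0]<<16 | d[len>>1]<<24 | d[len-1] | len<<8 for XXH3's 1..3-byte path), not the PR's actual code; CombineGeneric and CombineSpecialized are made-up names.

```csharp
using System;

// Generic XXH3-style 1..3-byte combine, matching the codegen above:
// d[0]<<16 | d[len>>1]<<24 | d[len-1] | len<<8.
static uint CombineGeneric(ReadOnlySpan<byte> d)
{
    int len = d.Length;
    return (uint)d[0] << 16 | (uint)d[len >> 1] << 24 | d[len - 1] | (uint)len << 8;
}

// Hypothetical per-length specialization: the indices and the length term
// become constants, removing the shr/lea address arithmetic seen above.
static uint CombineSpecialized(ReadOnlySpan<byte> d) => d.Length switch
{
    1 => (uint)d[0] << 16 | (uint)d[0] << 24 | d[0] | 1u << 8,
    2 => (uint)d[0] << 16 | (uint)d[1] << 24 | d[1] | 2u << 8,
    3 => (uint)d[0] << 16 | (uint)d[1] << 24 | d[2] | 3u << 8,
    _ => throw new ArgumentOutOfRangeException(nameof(d)),
};

byte[] bytes = { 0x12, 0x34, 0x56 };
for (int len = 1; len <= 3; len++)
    Console.WriteLine(CombineGeneric(bytes.AsSpan(0, len)) ==
                      CombineSpecialized(bytes.AsSpan(0, len)));
// Prints True three times.
```

As the comment notes, this saves only a few address-math instructions per case; the multiply/xor avalanche steps that dominate the listing above remain either way, which is why the regression versus FNV-1a persists.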

EgorBo (Member, Author) commented May 29, 2025

That being said, I think there is nothing we can do.

@EgorBo EgorBo closed this May 29, 2025
filipnavara (Member) commented May 29, 2025

> I am not sure I can make this work even theoretically, given the complexity difference between the current and xxhash versions for small inputs. Below a certain threshold there will always be regressions; e.g., here is the codegen for lengths 1-3:

I guess it's expected to have some regression in a micro-benchmark like this. However, I remember you said that hash quality for short strings is pretty bad, so perhaps the extra cycles spent on hashing can save collisions in some more real-world scenarios and make this worth it?

EgorBo (Member, Author) commented May 29, 2025

> I am not sure I can make this work even theoretically, given the complexity difference between the current and xxhash versions for small inputs. Below a certain threshold there will always be regressions; e.g., here is the codegen for lengths 1-3:
>
> I guess it's expected to have some regression in a micro-benchmark like this. However, I remember you said that hash quality for short strings is pretty bad, so perhaps the extra cycles spent on hashing can save collisions in some more real-world scenarios and make this worth it?

Some numbers are pretty bad; I'm not sure we can sell this as an improvement, given that strings are usually small rather than large. I don't see obvious issues in the codegen to fix (besides duplicating code for each length constant).

Maybe with some (static) PGO magic we could switch comparers depending on the input somehow; no idea what that would look like.

am11 (Member) commented May 29, 2025

#85206 (comment)

Perhaps worth trying out https://github.com/ogxd/gxhash-csharp/blob/main/GxHash/GxHash.cs. The original source is also MIT-licensed: https://github.com/ogxd/gxhash.

filipnavara (Member) commented May 29, 2025

> Some numbers are pretty bad; I'm not sure we can sell this as an improvement, given that strings are usually small rather than large. I don't see obvious issues in the codegen to fix (besides duplicating code for each length constant).

Yes, they are, but it's an extremely unrealistic benchmark: you are benchmarking just the hash function in isolation. It would be more reasonable to show how it performs on some realistic workload. For example, counting digraph/trigraph frequencies in a Dictionary<string, int> prefilled with some data (say, digraphs and trigraphs collected from a random Wikipedia article) is something one could use to estimate language similarity. If the current hash is calculated fast but gives poor collision behavior, then xxHash may still be a win. This scenario is reasonably easy to write and benchmark: both building the index and then checking some input against it.
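The suggested workload could be sketched roughly as follows. This is a hedged illustration of the idea, not a ready benchmark: the corpus string is a tiny placeholder (a real test would prefill from a large text such as a Wikipedia article), and every dictionary operation here hashes a short 2-3 character key, the exact case under discussion.

```csharp
using System;
using System.Collections.Generic;

// Build a Dictionary<string, int> of digraph/trigraph frequencies from a
// corpus, then probe it with n-grams from an input. Every insert and lookup
// hashes a short (2-3 char) string key.
string corpus = "the quick brown fox jumps over the lazy dog"; // placeholder corpus
var index = new Dictionary<string, int>();
for (int n = 2; n <= 3; n++)
    for (int i = 0; i + n <= corpus.Length; i++)
    {
        string gram = corpus.Substring(i, n);
        index[gram] = index.GetValueOrDefault(gram) + 1;
    }

// Score an input by how many of its n-grams appear in the index.
string input = "the lazy fox";
int hits = 0, total = 0;
for (int n = 2; n <= 3; n++)
    for (int i = 0; i + n <= input.Length; i++, total++)
        if (index.ContainsKey(input.Substring(i, n)))
            hits++;

Console.WriteLine($"{hits}/{total} n-grams matched");
```

Timing both the index-build loop and the probe loop (e.g. under BenchmarkDotNet) would exercise the short-key hashing path against real collision behavior, rather than the hash function in isolation.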

6 participants