Use XxHash3 in String.GetNonRandomizedHashCode #116057
Conversation
Tagging subscribers to this area: @dotnet/area-system-runtime
@EgorBot -amd -intel -arm

```csharp
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(Bench).Assembly).Run(args);

public class Bench
{
    // We don't need to fill the dictionary as we're only benchmarking the hashcode operation.
    private static Dictionary<string, int> Dictionary = new() { { "1", 2 } };

    [Benchmark]
    [ArgumentsSource(nameof(GetKeys))]
    public bool Contains(string key) => Dictionary.ContainsKey(key);

    // Must be public: BenchmarkDotNet requires argument-source methods to be public.
    public static IEnumerable<string> GetKeys()
    {
        foreach (int len in (int[])[1, 10, 20, 30, 40, 50, 75, 100, 1000, 10000])
            yield return new string('a', len);
    }
}
```
I think the cost for short lengths (especially at cut-off boundaries) can be somewhat improved by massaging the dispatch inside XXH3 (right now it's 3 or so calls for various length ranges) and the inlining of the methods it dispatches to. The setup is a bit costly and is likely to still lose to FNV-1a on short lengths.
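For context on why FNV-1a is hard to beat on very short inputs: it has essentially no setup cost and no length-based dispatch. Below is a minimal illustrative sketch of an FNV-1a-style hash over a string's UTF-16 code units. This is not the exact CoreLib loop (which processes characters in pairs); it only shows the per-character cost profile being compared against.

```csharp
using System;

// Illustrative FNV-1a over UTF-16 code units: one xor and one multiply per
// character, with no setup or length dispatch, unlike XXH3.
static int Fnv1aHash(ReadOnlySpan<char> s)
{
    const uint OffsetBasis = 2166136261;
    const uint Prime = 16777619;
    uint hash = OffsetBasis;
    foreach (char c in s)
    {
        hash ^= c;
        hash *= Prime; // wraps on overflow, which is intended
    }
    return unchecked((int)hash);
}

Console.WriteLine($"{Fnv1aHash("hello"):X8}");
```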
Pull Request Overview
This PR integrates XXH3 into String.GetNonRandomizedHashCode on x64/ARM64, replacing Marvin for large inputs to improve distribution and performance. Shared non-cryptographic hash implementations are linked into CoreLib and the separate System.IO.Hashing package.
- Add the USE_XXHASH3 macro, thresholds, and conditional XXH3 usage in String.GetNonRandomizedHashCode
- Link shared hashing algorithm sources into CoreLib and System.IO.Hashing projects
- Introduce a new resource string (NotSupported_GetHashCode) in Strings.resx
Reviewed Changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.
File | Description |
---|---|
src/libraries/System.Private.CoreLib/src/System/String.Comparison.cs | Use XXH3 for non-randomized string hashing when span length ≥ threshold |
src/libraries/System.Private.CoreLib/src/System.Private.CoreLib.Shared.projitems | Include common hashing sources in CoreLib build |
src/libraries/System.Private.CoreLib/src/Resources/Strings.resx | Add new “NotSupported_GetHashCode” resource |
src/libraries/System.IO.Hashing/src/System.IO.Hashing.csproj | Link common hashing sources into the System.IO.Hashing project |
src/libraries/Common/src/System/IO/Hashing/XxHashShared.cs | Conditional documentation for CORELIB build |
src/libraries/Common/src/System/IO/Hashing/XxHash64*.cs & XxHash3.cs & NonCryptographicHashAlgorithm.cs | Adjust visibility and add XXH3 entry point for CORELIB |
Comments suppressed due to low confidence (2)
src/libraries/System.Private.CoreLib/src/System/String.Comparison.cs:864
- There are no existing tests covering the new XXH3 code path for different span lengths. Add unit tests that verify correct hash values when length is both below and above XxHash3Threshold.
if (length >= XxHash3Threshold)
src/libraries/System.Private.CoreLib/src/Resources/Strings.resx:4364
- The new NotSupported_GetHashCode resource is added but never referenced in code. Either remove it or update the implementation to throw or use this message where appropriate.
<data name="NotSupported_GetHashCode" xml:space="preserve">
src/libraries/System.Private.CoreLib/src/System/String.Comparison.cs
@stephentoub @jkotas does this look acceptable? Presumably the same code can be re-used to replace Marvin (but I'm not sure I can prove it is as good as Marvin in terms of collisions and DDoS protection). Here is the latest benchmark run: EgorBot/runtime-utils#363 (comment). I've unified. I suspect we might want to optimize XxHash3 for arm64, e.g. with double vectors, to slightly reduce the gap.
I wrote a simple app that splits large strings into random chunks (1..1000) and calculates the number of collisions in a Dictionary for Marvin vs XxHash3. The number of collisions depends on the initial seed (the variance is pretty small), but overall it seems that both implementations are on par in terms of collisions.
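A rough reconstruction of that experiment might look like the following. This is a hypothetical harness: CountHashCollisions and the chunking policy are my own choices, and the actual Marvin/XxHash3 entry points are passed in rather than shown.

```csharp
using System;
using System.Collections.Generic;

// Split a corpus into random chunks of length 1..1000 and count how many
// chunks hash to an already-seen 32-bit value under a given hash function.
// Note: identical chunks also count as "collisions" here, so a varied
// corpus is needed when comparing real hash functions.
static int CountHashCollisions(string corpus, Func<string, int> hash, int seed)
{
    var rng = new Random(seed);
    var seen = new HashSet<int>();
    int collisions = 0;
    for (int pos = 0; pos < corpus.Length;)
    {
        int len = Math.Min(rng.Next(1, 1001), corpus.Length - pos);
        if (!seen.Add(hash(corpus.Substring(pos, len))))
            collisions++;
        pos += len;
    }
    return collisions;
}

// A constant "hash" collides on every chunk after the first.
Console.WriteLine(CountHashCollisions(new string('x', 100_000), _ => 0, seed: 1));
```

Running this with string.GetHashCode vs a custom comparer wrapping each candidate hash gives a simple collisions-per-seed comparison.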
```csharp
if (length >= XxHash3Threshold)
{
    uint byteLength = (uint)length * 2; // never overflows
    return unchecked((int)System.IO.Hashing.XxHash3.NonRandomizedHashToUInt64((byte*)src, byteLength));
```
Is there any way we can improve the implementation of XxHash3 to the point where we could just always use it rather than having the fallbacks for < XxHash3Threshold? Or is that simply not possible because the number of instructions required is higher?
To be fair, I am not sure; it seems to be quite noticeably slower for small inputs. Do you mind if I investigate that in a separate PR?
It's a bit of a slippery slope to mix different hash algorithms for different string lengths. Once you start mixing strings with lengths below and above the threshold, any guarantees about collisions become weaker and less predictable.
Yeah, I've just pushed an unconditional switch to XxHash3; we'll see how it goes.
src/libraries/Common/src/System/IO/Hashing/NonCryptographicHashAlgorithm.cs
```csharp
        return current;
    <= 128 => HashLength17To128(sourcePtr, length, 0UL),
    <= MidSizeMaxBytes => HashLength129To240(sourcePtr, length, 0UL),
    _ => HashLengthOver240(sourcePtr, length, 0UL)
```
Nit: Why change the style from what's used by HashToUInt64 above?
Changed
```csharp
// Licensed to the .NET Foundation under one or more agreements.
```
This file seems misnamed. The other Xx.State.cs files are named such because they contain the corresponding Xx.State types. This, however, is the public XxHash3.
Renamed the common part to .Common.cs (we use this suffix for many other shared files).
```csharp
#if SYSTEM_PRIVATE_CORELIB
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static int NonRandomizedHashToInt32(byte* sourcePtr, uint length)
```
Why do we still need this method?
@EgorBot -arm -amd -windows_intel

```csharp
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(Bench).Assembly).Run(args);

public class Bench
{
    // We don't need to fill the dictionary as we're only benchmarking the hashcode operation.
    private static Dictionary<string, int> Dictionary = new() { { "1", 2 } };

    [Benchmark]
    [ArgumentsSource(nameof(GetKeys))]
    public bool Contains(string key) => Dictionary.ContainsKey(key);

    public static IEnumerable<string> GetKeys()
    {
        foreach (int len in (int[])[1, 3, 5, 8, 10, 16, 20, 32, 40, 50, 100, 1000, 10000])
            yield return new string('a', len);
    }
}
```
Try to benchmark it with an input that varies randomly. Jump tables are great for trivial micro-benchmarks, but they are less great for real-world code.
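One way to approximate that is to pre-generate keys with randomly mixed lengths and cycle through them, so any length-based dispatch inside the hash function cannot be trivially predicted. This is a hedged, self-contained sketch rather than a proper BenchmarkDotNet setup; the 1024-key table and the length mix are arbitrary choices.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

// Pre-generate keys with randomly mixed lengths, then time ContainsKey while
// cycling through them. Unlike one-length-per-case benchmarks, this keeps the
// length-dispatch branch inside the hash function unpredictable.
var rng = new Random(12345);
int[] lengths = [1, 3, 5, 8, 10, 16, 20, 32, 40, 50, 100];
var keys = new string[1024];
for (int i = 0; i < keys.Length; i++)
    keys[i] = new string('a', lengths[rng.Next(lengths.Length)]);

var dict = new Dictionary<string, int> { { "1", 2 } };

bool sink = false;
for (int i = 0; i < 100_000; i++) // warmup
    sink ^= dict.ContainsKey(keys[i & 1023]);

const int Iters = 10_000_000;
var sw = Stopwatch.StartNew();
for (int i = 0; i < Iters; i++)
    sink ^= dict.ContainsKey(keys[i & 1023]);
sw.Stop();

Console.WriteLine($"{sw.Elapsed.TotalNanoseconds / Iters:F2} ns/op (sink={sink})");
```

A real comparison would still run this key set through BenchmarkDotNet; the sketch only shows the shape of the measurement.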
I am not sure I can make it work even theoretically, given the complexity difference between the current and xxhash versions for small inputs. Under a certain threshold there will always be regressions; e.g. here is the codegen for len 1-3:

```asm
movzx    rax, byte ptr [rcx]
shl      eax, 16
mov      r10d, edx
shr      r10d, 1
movzx    r10, byte ptr [rcx+r10]
shl      r10d, 24
or       eax, r10d
lea      r10d, [rdx-0x01]
movzx    rcx, byte ptr [rcx+r10]
or       eax, ecx
shl      edx, 8
or       eax, edx
mov      ecx, 0x87275A9B
add      rcx, r8
xor      rax, rcx
mov      rcx, rax
shr      rcx, 33
xor      rcx, rax
mov      rax, 0xC2B2AE3D27D4EB4F
imul     rax, rcx
mov      rcx, rax
shr      rcx, 29
xor      rcx, rax
mov      rax, 0x165667B19E3779F9
imul     rax, rcx
mov      rcx, rax
shr      rcx, 32
xor      rax, rcx
```

It's a bit simpler if I hard-code & duplicate it for 1, 2, 3, but I can't do that for all lengths anyway.
That being said, I think there is nothing we can do.
I guess it's expected to have some regression in a micro-benchmark like this. However, I remember you said that hash quality for short strings is pretty bad, so perhaps the extra cycles spent on hashing can save collisions in some more real-world scenario and make this worth it?
Some numbers are pretty bad; I'm not sure we can sell that as an improvement, given strings are usually small rather than large. I don't see obvious issues in the codegen to fix (besides duplicating code for each length constant). Maybe with some (static) PGO magic we could switch comparers depending on the input, though I have no idea what that would look like.
Perhaps worth trying out https://github.com/ogxd/gxhash-csharp/blob/main/GxHash/GxHash.cs. Original source is also MIT licensed: https://github.com/ogxd/gxhash.
Yes, they are, but it's an extremely unrealistic benchmark. You are benchmarking just the hash function in isolation. It would be more reasonable to show how it performs on some realistic workloads, for example calculating the probability of digraphs/trigraphs in a
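The digraph/trigraph workload suggested above is easy to sketch. This is a hypothetical example, not code from the PR: it hammers a dictionary with short string keys in a realistic access pattern, so throughput is dominated by short-string hashing.

```csharp
using System;
using System.Collections.Generic;

// Count n-gram frequencies in a text with a string-keyed dictionary.
// Every lookup hashes a short substring, making this a realistic stress
// test for short-string hash performance.
static Dictionary<string, int> CountNGrams(string text, int n)
{
    var counts = new Dictionary<string, int>();
    for (int i = 0; i + n <= text.Length; i++)
    {
        string gram = text.Substring(i, n);
        counts[gram] = counts.TryGetValue(gram, out int c) ? c + 1 : 1;
    }
    return counts;
}

var digraphs = CountNGrams("the theme thereof", 2);
Console.WriteLine(digraphs["th"]); // "th" occurs 3 times
```

Dividing each count by the total number of n-grams turns the counts into the digraph/trigraph probabilities mentioned above.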
Let's see how this goes.
Presumably, it's not only better for large inputs, but also provides much better distribution (which is terrible in the current impl).
Binary size impact on Release corelib on x64 (should be less on arm64):
At the moment it is limited to x64 and arm64 targets, hence, no impact on browser/mobiles.
We can also replace Marvin with XxHash32: #85206 (but not in this PR)