
[PERFORMANCE QUESTION] Why is DashMap faster with less threads? #177

Open
redactedontop opened this issue Feb 16, 2025 · 21 comments
Labels
enhancement New feature or request

Comments

@redactedontop

redactedontop commented Feb 16, 2025

Heya!

As shown on the graph below, DashMap is a lot faster (throughput; 10 mega operations faster than scc) for 1 thread (scc eventually surpasses it). I'm a complete noob in optimization, and this is a genuine (and hopefully not rude) question: what are the big differences in implementations that make SCC's HashMap slower short-term but scale better?

[Image: throughput benchmark comparing DashMap and scc across thread counts]

Thanks,
Alex.

@redactedontop
Author

redactedontop commented Feb 16, 2025

Maybe the use of atomics in SCC?

@wvwwvwwv
Owner

Hi @redactedontop,

DashMap is based on hashbrown, which is aggressively optimized for single-threaded workloads, whereas SCC focuses more on predictability and scalability.

  • The primary use case of SCC is RDBMS workloads, where predictable latency, especially tail latency, is of utmost importance, no matter what happens.

If you need good single-threaded performance and latency spikes don't matter much, then DashMap is the way to go. If good read performance is essential and the cost of frequent (per-entry) heap allocation is not a problem, papaya is the best choice.

  • SCC is a good fit when highly concurrent access to the hash map is expected and predictable latencies are essential.

@wvwwvwwv wvwwvwwv added the question Further information is requested label Feb 16, 2025
@wvwwvwwv
Owner

I forgot to answer the main part of the question: SCC's single-threaded performance is not as good as the others' because I haven't paid much attention to it :-( I'll try to optimize it when I have some time.

@redactedontop
Author

redactedontop commented Feb 17, 2025

Hiya!

I'll gladly help you, as I enjoy this project a lot: give me a checklist and some days, and I'll see what I can do ;)

Thanks,
Alex <3
PS: It helps both you (less work), me (highly parallelized chess engine), and users (free performance!!!1!1)

@wvwwvwwv
Owner

wvwwvwwv commented Feb 17, 2025

@redactedontop awesome! The first step of optimization is usually profiling. On macOS, the profiling tool (Instruments) bundled with Xcode is pretty powerful, so once a benchmark scenario is prepared (vanilla conc-map-bench is too short; a longer one is needed to get meaningful results), it will be possible to figure out where in the code to optimize.
-> No pressure at all!

@redactedontop
Author

Would it be selfish to ask you for a flamegraph (cargo flamegraph) of a bigger benchmark for me to analyse? I currently can't make benchmarks myself (studying + homework + personal projects).

@redactedontop
Author

And do you have discord (quicker answers)?

@wvwwvwwv
Owner

I can do the profiling, but I'm not sure when - possibly early March? / No, I don't use Discord.

@wvwwvwwv
Owner

wvwwvwwv commented Mar 5, 2025

[Images: Instruments profiler output]

I dug into the issue and figured out that thread_local in sdd causes the difference.

  • tlv_get_addr.

It's interesting that the cost of tlv_get_addr has never been detected in sdd's benchmark suite. I'll need to spend some time thinking about how to optimize calls to tlv_get_addr.

@changgyoopark-db changgyoopark-db added enhancement New feature or request and removed question Further information is requested labels Mar 5, 2025
@wvwwvwwv
Owner

wvwwvwwv commented Mar 5, 2025

Substituting thread_local! with #[thread_local] helps a lot, but the latter is gated behind an unstable feature flag.
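For reference, a minimal sketch of the two forms (the statics and the counter logic are illustrative, not taken from sdd):

```rust
use std::cell::Cell;

// Stable Rust: the thread_local! macro. Every access goes through
// LocalKey::with, which performs an initialization check and, on macOS,
// resolves the slot via tlv_get_addr.
thread_local! {
    static COUNTER: Cell<u64> = Cell::new(0);
}

// Nightly only: the #[thread_local] attribute maps the static directly to a
// TLS slot without the lazy-initialization wrapper. Shown as a comment
// because it is gated behind an unstable feature flag:
//
//     #![feature(thread_local)]
//     #[thread_local]
//     static RAW_COUNTER: Cell<u64> = Cell::new(0);

fn bump() -> u64 {
    COUNTER.with(|c| {
        c.set(c.get() + 1);
        c.get()
    })
}

fn main() {
    assert_eq!(bump(), 1);
    assert_eq!(bump(), 2);
}
```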

@redactedontop
Author

redactedontop commented Mar 8, 2025 via email


@redactedontop
Author

Also, what about atomics instead of sdd? You can prevent ABA in hash maps with double-checks (checking not just the value but other fields as well).

@wvwwvwwv
Owner

Thanks, @redactedontop. I substituted the thread-local variable's non-const assignment with a const block -> SDD 3.0.8.

Unfortunately, use-after-free is the problem, not ABA. Delayed memory reclamation prevents use-after-free when the hash map has been resized, and relying on thread-local storage is currently the optimal way to prevent it.

  • If SCC only targets Intel/PPC, we can use HTM.
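A small sketch of the const-block substitution described above (stable since Rust 1.59; the names are illustrative, not sdd's actual internals):

```rust
use std::cell::Cell;

// Non-const initializer: the value is computed lazily on first access, so
// every subsequent access still pays an "is it initialized yet?" check.
thread_local! {
    static LAZY: Cell<usize> = Cell::new(0);
}

// Const-block initializer: the value is known at compile time, which lets
// the macro emit a faster access path without the lazy-init bookkeeping.
thread_local! {
    static CONST_INIT: Cell<usize> = const { Cell::new(0) };
}

fn main() {
    LAZY.with(|c| c.set(c.get() + 1));
    CONST_INIT.with(|c| c.set(c.get() + 1));
    assert_eq!(LAZY.with(|c| c.get()), 1);
    assert_eq!(CONST_INIT.with(|c| c.get()), 1);
}
```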

@redactedontop
Author

How come? Can't you prevent UAF with atomics?

@wvwwvwwv
Owner

Not quite possible as far as I know: while a reader is in a loop of { read-pointer - do-something - validate }, anything can happen, including deallocation of the pointee. EBR/RCU/HTM/hazard pointers etc. prevent this from happening.
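A minimal sketch of why the { read-pointer, do-something, validate } pattern alone cannot rule out use-after-free (the slot and reader names are hypothetical):

```rust
use std::sync::atomic::{AtomicPtr, Ordering};

// Reader following { read-pointer, do-something, validate }. The re-check at
// the end can detect that the pointer changed, but only AFTER the unsafe
// dereference has already happened; if a writer swapped the pointer and freed
// the old allocation in between, that dereference was a use-after-free.
fn read_with_validate(slot: &AtomicPtr<u64>) -> Option<u64> {
    loop {
        let p = slot.load(Ordering::Acquire);
        if p.is_null() {
            return None;
        }
        // <-- a concurrent writer may run `drop(Box::from_raw(old))` right
        //     here; the dereference below would then touch freed memory.
        //     EBR/hazard pointers defer that drop until no reader holds `p`.
        let value = unsafe { *p };
        if slot.load(Ordering::Acquire) == p {
            return Some(value); // validation passed, but too late to help
        }
        // Pointer changed: retry. (An ABA swap would even pass validation.)
    }
}

fn main() {
    // Single-threaded demonstration only; no writer races with us here.
    let p = Box::into_raw(Box::new(42u64));
    let slot = AtomicPtr::new(p);
    assert_eq!(read_with_validate(&slot), Some(42));
    unsafe { drop(Box::from_raw(p)) };
}
```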

@redactedontop
Author

Then why not use HTM when supported instead of a full thread_local? Also, have you thought about the nightly #[thread_local] version?

@wvwwvwwv
Owner

Simple: there is no decent crate that enables easy use of HTM instructions, and #[thread_local] is still unstable.

@redactedontop
Author

Fair. Have you thought about making a nightly version of the crate?
