Description
Summary
On Windows systems where a NUMA node spans more than one processor group and the process is constrained to a single NUMA node, `initialize_hardware_concurrency_info` computes incorrect values. This leads to a wrong `max_allowed_parallelism` and wrong `numa_nodes`.
Version
2022.2.0
Environment
- Customer is running:
  - Windows Server 2025 Standard
  - Dual-socket AMD EPYC 9454
- We were able to reproduce with:
  - Windows Server 2025 Datacenter
  - Google Cloud n2d-highcpu-224 VM with visible cores set to 96 (192 vCPUs)
- Both are Windows systems with two CPUs, each providing 96 logical cores (48 physical cores with hyperthreading)
- There are two NUMA nodes, one for each CPU
- Windows splits the processors into 3 processor groups (0, 1, 2), so group 1 contains cores from both NUMA nodes:
  - Node 0: `0:0-63 1:0-31`
  - Node 1: `1:32-63 2:0-63`
- The process that uses TBB is constrained from the outside, via a job object, to one of the NUMA nodes (say node 0); a sketch of such a constraint follows this list
- This means that threads from our process are allowed to run on two processor groups, 0 and 1:
  - In group 0, we may run on any processor
  - In group 1, we may only run on half the processors (the other half belongs to NUMA node 1)
- Windows will randomly schedule us onto one of the two processor groups
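
For reference, this is roughly how such an external constraint can be applied. This is only a minimal sketch, assuming the node-0 layout above (all of group 0 plus the lower half of group 1) and omitting all error handling:

```cpp
#include <windows.h>
#include <vector>

int main() {
    HANDLE job = CreateJobObjectW(nullptr, nullptr);

    // GROUP_AFFINITY masks matching "Node 0: 0:0-63 1:0-31" from the layout above.
    std::vector<GROUP_AFFINITY> node0(2);
    node0[0] = {};
    node0[0].Group = 0;
    node0[0].Mask  = ~KAFFINITY(0);             // all 64 processors of group 0
    node0[1] = {};
    node0[1].Group = 1;
    node0[1].Mask  = KAFFINITY(0xFFFFFFFF);     // lower 32 processors of group 1

    // Restrict the job to these group affinities.
    SetInformationJobObject(job, JobObjectGroupInformationEx, node0.data(),
                            static_cast<DWORD>(node0.size() * sizeof(GROUP_AFFINITY)));

    // Put the current process (which later initializes TBB) into the job.
    AssignProcessToJobObject(job, GetCurrentProcess());

    // ... run the TBB workload here ...
}
```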
Observed Behavior
Depending on which group the starting thread is scheduled on, `initialize_hardware_concurrency_info` behaves in one of two ways, and both are wrong:
- If we are scheduled on group 0:
  - TBB sees a fully set affinity mask (for the current group 0) and wrongly deduces that the process is not constrained at all (in reality, we may not use group 2 and only half of group 1)
  - It then fetches the info for all processor groups, sets `max_allowed_parallelism` to 192, makes both NUMA nodes available, etc.
- If we are scheduled on group 1:
  - TBB sees a partial affinity mask (for the current group 1) and wrongly deduces that the process is constrained to this affinity mask and the current processor group (in reality, processor group 0 is also available to us)
  - It then skips fetching any group info and sets `max_allowed_parallelism` to 32
  - It also makes only one NUMA node available, but it is always node 0, even when the process is constrained to node 1
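
The ambiguity is easy to see directly: a thread only ever reports the affinity mask of the group it happens to be running in. A small diagnostic sketch (nothing TBB-specific, just the Win32 call); under the constraint above it prints either group 0 with a full 64-bit mask or group 1 with only the lower 32 bits set, depending on where Windows started the thread:

```cpp
#include <windows.h>
#include <cstdio>

int main() {
    GROUP_AFFINITY ga = {};
    if (GetThreadGroupAffinity(GetCurrentThread(), &ga)) {
        std::printf("group %u, mask 0x%016llx\n",
                    static_cast<unsigned>(ga.Group),
                    static_cast<unsigned long long>(ga.Mask));
    }
}
```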
Expected Behavior
TBB should set `max_allowed_parallelism` to 96 in our scenario and make only NUMA node 0 available (or only NUMA node 1 if constrained to the other node).
The NUMA node pinning of the task arena is particularly important here: each thread is bound to a processor group, and Windows will not reschedule a thread onto a different processor group. TBB therefore has to put 64 threads onto group 0 and 32 threads onto group 1.
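
For context, we consume the NUMA information with the usual oneTBB per-node arena pattern. This is only a sketch of the intended usage (not TBB internals); with correct values, the arena pinned to node 0 would get 96 slots, which TBB would then have to spread across groups 0 and 1 as described above:

```cpp
#include <oneapi/tbb/info.h>
#include <oneapi/tbb/task_arena.h>
#include <oneapi/tbb/task_group.h>
#include <cstddef>
#include <vector>

int main() {
    // One arena per NUMA node that the process is allowed to use.
    std::vector<tbb::numa_node_id> numa_nodes = tbb::info::numa_nodes();
    std::vector<tbb::task_arena> arenas(numa_nodes.size());
    std::vector<tbb::task_group> groups(numa_nodes.size());

    for (std::size_t i = 0; i < numa_nodes.size(); ++i) {
        // Pin the arena to the node; its concurrency should match
        // tbb::info::default_concurrency(numa_nodes[i]).
        arenas[i].initialize(tbb::task_arena::constraints(numa_nodes[i]));
        arenas[i].execute([&, i] { groups[i].run([] { /* per-node work */ }); });
    }
    for (std::size_t i = 0; i < numa_nodes.size(); ++i)
        arenas[i].execute([&, i] { groups[i].wait(); });
}
```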
Some hints as to how this could work:
- TBB should use `GetProcessGroupAffinity` to determine whether the current process is allowed to run on more than one processor group
- Unfortunately, there does not seem to be a good way to get the affinity masks for all allowed processor groups
- The best I could come up with is the following (see the sketch after this list):
  - `GetThreadGroupAffinity` gives you the affinity for the processor group of the current thread
  - Loop through the other allowed groups:
    - Loop through processors 0-63: set the thread affinity (`SetThreadGroupAffinity`) to exactly one processor in that group and record whether the call succeeds or fails
    - Build an affinity mask from those results
  - Restore the original affinity mask
- This is admittedly very ugly, but it seems to work
- hwloc is also no help here, unfortunately...
- NUMA nodes that we are not allowed to run on should then be removed from `tbb::info::numa_nodes()`
- The indexes should stay the same, though: if constrained to run on NUMA node 1, `tbb::info::numa_nodes()` should return `[1]`, not `[0]`
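
A sketch of the probing workaround described above. The main assumption is that `SetThreadGroupAffinity` fails for processors the process may not use (this is what makes the approach "seem to work"); error handling is omitted:

```cpp
#include <windows.h>
#include <cstdio>
#include <vector>

int main() {
    HANDLE thread = GetCurrentThread();

    // 1. Which processor groups may this process run on?
    USHORT group_count = 0;
    GetProcessGroupAffinity(GetCurrentProcess(), &group_count, nullptr);
    std::vector<USHORT> groups(group_count);
    GetProcessGroupAffinity(GetCurrentProcess(), &group_count, groups.data());

    // 2. Remember the current thread's affinity so it can be restored later.
    GROUP_AFFINITY original = {};
    GetThreadGroupAffinity(thread, &original);

    // 3. Probe every processor of every allowed group.
    for (USHORT group : groups) {
        KAFFINITY mask = 0;
        if (group == original.Group) {
            mask = original.Mask;  // already known for the current group
        } else {
            for (int cpu = 0; cpu < 64; ++cpu) {
                GROUP_AFFINITY probe = {};
                probe.Group = group;
                probe.Mask  = KAFFINITY(1) << cpu;
                if (SetThreadGroupAffinity(thread, &probe, nullptr))
                    mask |= probe.Mask;  // this processor is allowed
            }
        }
        std::printf("group %u: mask 0x%016llx\n",
                    static_cast<unsigned>(group),
                    static_cast<unsigned long long>(mask));
    }

    // 4. Restore the original affinity.
    SetThreadGroupAffinity(thread, &original, nullptr);
}
```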
Steps To Reproduce
Just call the following on a system as described above and observe the incorrect values:
- `tbb::global_control::active_value(tbb::global_control::max_allowed_parallelism)`
- `tbb::info::numa_nodes()`
- `tbb::info::default_concurrency()`
- `tbb::info::default_concurrency(numa_id)`
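
For convenience, a complete snippet of the check (standard oneTBB headers). In our scenario, constrained to node 0, we would expect `max_allowed_parallelism` and the default concurrencies to be 96 and `numa_nodes()` to contain only node 0, but we get the values described under Observed Behavior:

```cpp
#include <oneapi/tbb/global_control.h>
#include <oneapi/tbb/info.h>
#include <cstdio>

int main() {
    std::printf("max_allowed_parallelism: %zu\n",
                tbb::global_control::active_value(
                    tbb::global_control::max_allowed_parallelism));
    std::printf("default_concurrency():   %d\n", tbb::info::default_concurrency());

    for (tbb::numa_node_id id : tbb::info::numa_nodes())
        std::printf("numa node %d: default_concurrency = %d\n",
                    id, tbb::info::default_concurrency(id));
}
```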