
initialize_hardware_concurrency_info is broken on Windows in certain configurations #1837

@mika-fischer

Description

Summary

On Windows systems where NUMA nodes span more than one processor group and the process is constrained to a single NUMA node, initialize_hardware_concurrency_info computes incorrect values. This leads to wrong max_allowed_parallelism and numa_nodes values.

Version

2022.2.0

Environment

  • Customer is running:
    • Windows Server 2025 Standard
    • Dual socket AMD EPYC 9454
  • We were able to reproduce with
    • Windows Server 2025 Datacenter
    • Google Cloud n2d-highcpu-224 VM with visible cores set to 96 (192 vCPUs)
  • These are both Windows systems with two CPUs with 96 logical processors each (48 physical cores with hyperthreading)
  • There are two NUMA nodes, one for each CPU
  • Windows splits the processors into three processor groups (0, 1, 2), so that group 1 contains cores from both NUMA nodes (notation is group:processor range; the sketch after this list shows how to dump this layout):
    • Node 0: 0:0-63, 1:0-31
    • Node 1: 1:32-63, 2:0-63
  • The process that uses TBB is constrained from the outside using a JobObject to one of the NUMA nodes (let's say node 0)
    • This means that threads from our process are allowed to run on two processor groups: 0 and 1
    • In group 0, we may run on any processor
    • In group 1, we may only run on half the processors (because the other half are in NUMA node 1)
    • Windows will randomly schedule us onto one of the two processor groups
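To make this layout concrete, here is a minimal sketch (error handling elided) that dumps the processor groups and the groups our process may run on, using the documented GetLogicalProcessorInformationEx and GetProcessGroupAffinity APIs:

    #include <windows.h>
    #include <cstdio>
    #include <vector>

    int main() {
        // Enumerate all processor groups and their active processor masks.
        DWORD len = 0;
        GetLogicalProcessorInformationEx(RelationGroup, nullptr, &len);
        std::vector<char> buf(len);
        auto* info = reinterpret_cast<SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX*>(buf.data());
        if (GetLogicalProcessorInformationEx(RelationGroup, info, &len)) {
            const GROUP_RELATIONSHIP& groups_info = info->Group;
            for (WORD i = 0; i < groups_info.ActiveGroupCount; ++i)
                std::printf("group %u: %u processors, mask %016llx\n", i,
                            groups_info.GroupInfo[i].ActiveProcessorCount,
                            (unsigned long long)groups_info.GroupInfo[i].ActiveProcessorMask);
        }

        // Groups this process may run on (restricted by the JobObject).
        USHORT count = 0;
        GetProcessGroupAffinity(GetCurrentProcess(), &count, nullptr);
        std::vector<USHORT> groups(count);
        if (GetProcessGroupAffinity(GetCurrentProcess(), &count, groups.data()))
            for (USHORT g : groups)
                std::printf("process may run on group %u\n", g);
    }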

Observed Behavior

Depending on which group we are scheduled on, initialize_hardware_concurrency_info behaves in one of two ways, both of which are wrong (a sketch of the misleading per-thread view follows the list below):

  • If we are scheduled on group 0:
    • TBB sees a fully set affinity mask (for the current group 0) and wrongly deduces that the process is not constrained at all (in reality, we may not use group 2 at all and may use only half of group 1)
    • It then fetches the info from all processor groups and sets the max_allowed_parallelism to 192, makes both numa nodes available, etc.
  • If we are scheduled on group 1:
    • TBB sees a partial affinity mask (for the current group 1) and wrongly deduces that the process is constrained to this affinity mask and the current processor group (in reality, processor group 0 is also available to us)
    • It then skips fetching any group info and sets the max_allowed_parallelism to 32
    • It also makes only one numa node available, but it is always node 0, even when constrained to node 1
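For illustration, this is the per-thread view that misleads TBB; the masks in the comments assume the machine and JobObject constraint described above:

    #include <windows.h>
    #include <cstdio>

    int main() {
        GROUP_AFFINITY ga = {};
        GetThreadGroupAffinity(GetCurrentThread(), &ga);
        // Constrained to NUMA node 0 on the machine above:
        //   landed in group 0 -> mask ffffffffffffffff (looks unconstrained)
        //   landed in group 1 -> mask 00000000ffffffff (looks group-constrained)
        std::printf("startup thread: group %u, mask %016llx\n",
                    ga.Group, (unsigned long long)ga.Mask);
    }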

Expected Behavior

TBB should set max_allowed_parallelism to 96 in our scenario and make only NUMA node 0 available (or only NUMA node 1 if constrained to the other NUMA node).

The NUMA node pinning of the task arena is particularly important here, since each thread is bound to a processor group and Windows will not reschedule a thread onto a different processor group. It is therefore important that TBB puts 64 threads onto group 0 and 32 threads onto group 1.
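For context, this is roughly how the NUMA information gets consumed downstream; set_numa_id is the standard oneTBB constraints API, and the work inside execute is a placeholder:

    #include <oneapi/tbb/info.h>
    #include <oneapi/tbb/task_arena.h>
    #include <vector>

    int main() {
        std::vector<tbb::task_arena> arenas;
        for (tbb::numa_node_id id : tbb::info::numa_nodes()) {
            // With this bug, the node list (and hence the pinning) is wrong,
            // so arena threads end up bound to the wrong processor groups.
            arenas.emplace_back(tbb::task_arena::constraints{}.set_numa_id(id));
        }
        for (auto& arena : arenas)
            arena.execute([] { /* NUMA-local work */ });
    }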

Some hints as to how this could work (a sketch of the probing approach follows this list):

  • TBB should use GetProcessGroupAffinity to determine if the current process is allowed to run on more than one processor group
  • Unfortunately, there does not seem to be a good way to get the affinity masks for all allowed processor groups
    • Best I could come up with is:
      • GetThreadGroupAffinity gives you the affinity for the processor group of the current thread
      • loop through the other allowed groups
        • loop through processors 0-63
          • SetThreadGroupAffinity to exactly one processor in that group and record if this succeeds or fails
        • Build an affinity mask that way
      • Restore original affinity mask
    • This is admittedly very ugly, but seems to work
    • hwloc is also no help here, unfortunately...
  • NUMA nodes that we are not allowed to run on should then be removed from tbb::info::numa_nodes()
  • But the indexes should stay the same, i.e. if constrained to run on NUMA node 1, tbb::info::numa_nodes() should return [1], not [0].
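Here is a rough sketch of that probing approach (error handling mostly elided; the loop bound of 64 assumes the usual maximum group size — this is the ugly workaround described above, not a proposed final implementation):

    #include <windows.h>
    #include <cstdio>
    #include <vector>

    int main() {
        HANDLE thread = GetCurrentThread();

        // Groups this process is allowed to run on (first call gets the count).
        USHORT group_count = 0;
        GetProcessGroupAffinity(GetCurrentProcess(), &group_count, nullptr);
        std::vector<USHORT> groups(group_count);
        GetProcessGroupAffinity(GetCurrentProcess(), &group_count, groups.data());

        // The affinity within the current thread's own group is directly available.
        GROUP_AFFINITY original = {};
        GetThreadGroupAffinity(thread, &original);

        for (USHORT group : groups) {
            KAFFINITY mask = 0;
            if (group == original.Group) {
                mask = original.Mask;
            } else {
                // Probe: try to pin the thread to each processor of the group in
                // turn; the OS rejects processors we are not allowed to run on.
                for (int cpu = 0; cpu < 64; ++cpu) {
                    GROUP_AFFINITY probe = {};
                    probe.Group = group;
                    probe.Mask = KAFFINITY(1) << cpu;
                    if (SetThreadGroupAffinity(thread, &probe, nullptr))
                        mask |= probe.Mask;
                }
            }
            std::printf("group %u: allowed mask %016llx\n",
                        group, (unsigned long long)mask);
        }

        // Restore the original affinity so the thread runs where it started.
        SetThreadGroupAffinity(thread, &original, nullptr);
    }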

Steps To Reproduce

Call the following on a system as described above and observe the incorrect values (a complete repro program follows the list):

  • tbb::global_control::active_value(tbb::global_control::max_allowed_parallelism)
  • tbb::info::numa_nodes()
  • tbb::info::default_concurrency()
  • tbb::info::default_concurrency(numa_id)
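A self-contained version of the repro; the expected and observed values in the comment assume the NUMA node 0 constraint described above:

    #include <oneapi/tbb/global_control.h>
    #include <oneapi/tbb/info.h>
    #include <cstdio>

    int main() {
        std::printf("max_allowed_parallelism: %zu\n",
                    tbb::global_control::active_value(
                        tbb::global_control::max_allowed_parallelism));
        std::printf("default_concurrency:     %d\n",
                    tbb::info::default_concurrency());
        for (tbb::numa_node_id id : tbb::info::numa_nodes())
            std::printf("numa node %d: default_concurrency %d\n",
                        (int)id, tbb::info::default_concurrency(id));
        // Expected here: max_allowed_parallelism 96, numa_nodes() == [0].
        // Observed: either 192 with both nodes visible, or 32 with only
        // node 0 visible, depending on the startup processor group.
    }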
