Description
Summary
On Windows systems where a NUMA node spans more than one processor group and the process is constrained to a single NUMA node, `initialize_hardware_concurrency_info` computes incorrect values. This leads to a wrong `max_allowed_parallelism` and wrong `numa_nodes`.
Version
2022.2.0
Environment
- Customer is running:
  - Windows Server 2025 Standard
  - Dual-socket AMD EPYC 9454
- We were able to reproduce with:
  - Windows Server 2025 Datacenter
  - Google Cloud n2d-highcpu-224 VM with visible cores set to 96 (192 vCPUs)
- Both are Windows systems with two CPUs, each providing 96 logical cores (48 physical cores with hyperthreading)
- There are two NUMA nodes, one for each CPU
- Windows splits the processors into 3 processor groups (0, 1, 2), so group 1 contains cores from both NUMA nodes:
  - Node 0: `0:0-63 1:0-31`
  - Node 1: `1:32-63 2:0-63`
- The process that uses TBB is constrained from the outside, via a job object, to one of the NUMA nodes (say node 0); a sketch of such a constraint follows this list
- This means that threads from our process are allowed to run on two processor groups, 0 and 1:
  - In group 0, we may run on any processor
  - In group 1, we may only run on half the processors (the other half belongs to NUMA node 1)
- Windows will randomly schedule us onto one of the two processor groups
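
For reference, this is roughly how such an external constraint can be applied. This is only a minimal sketch, assuming the node-0 layout above (all of group 0 plus the lower half of group 1) and omitting all error handling:

```cpp
#include <windows.h>
#include <vector>

int main() {
    HANDLE job = CreateJobObjectW(nullptr, nullptr);

    // GROUP_AFFINITY masks matching "Node 0: 0:0-63 1:0-31" from the layout above.
    std::vector<GROUP_AFFINITY> node0(2);
    node0[0] = {};
    node0[0].Group = 0;
    node0[0].Mask  = ~KAFFINITY(0);             // all 64 processors of group 0
    node0[1] = {};
    node0[1].Group = 1;
    node0[1].Mask  = KAFFINITY(0xFFFFFFFF);     // lower 32 processors of group 1

    // Restrict the job to these group affinities.
    SetInformationJobObject(job, JobObjectGroupInformationEx, node0.data(),
                            static_cast<DWORD>(node0.size() * sizeof(GROUP_AFFINITY)));

    // Put the current process (which later initializes TBB) into the job.
    AssignProcessToJobObject(job, GetCurrentProcess());

    // ... run the TBB workload here ...
}
```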
Observed Behavior
Depending on which group the starting thread is scheduled on, `initialize_hardware_concurrency_info` behaves in one of two ways, and both are wrong:
- If we are scheduled on group 0:
  - TBB sees a fully set affinity mask (for the current group 0) and wrongly deduces that the process is not constrained at all (in reality, we may not use group 2 and only half of group 1)
  - It then fetches the info for all processor groups, sets `max_allowed_parallelism` to 192, makes both NUMA nodes available, etc.
- If we are scheduled on group 1:
  - TBB sees a partial affinity mask (for the current group 1) and wrongly deduces that the process is constrained to this affinity mask and the current processor group (in reality, processor group 0 is also available to us)
  - It then skips fetching any group info and sets `max_allowed_parallelism` to 32
  - It also makes only one NUMA node available, but it is always node 0, even when the process is constrained to node 1
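
The ambiguity is easy to see directly: a thread only ever reports the affinity mask of the group it happens to be running in. A small diagnostic sketch (nothing TBB-specific, just the Win32 call); under the constraint above it prints either group 0 with a full 64-bit mask or group 1 with only the lower 32 bits set, depending on where Windows started the thread:

```cpp
#include <windows.h>
#include <cstdio>

int main() {
    GROUP_AFFINITY ga = {};
    if (GetThreadGroupAffinity(GetCurrentThread(), &ga)) {
        std::printf("group %u, mask 0x%016llx\n",
                    static_cast<unsigned>(ga.Group),
                    static_cast<unsigned long long>(ga.Mask));
    }
}
```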
Expected Behavior
TBB should set `max_allowed_parallelism` to 96 in our scenario and make only NUMA node 0 available (or only NUMA node 1 if constrained to the other node).
The NUMA node pinning of the task arena is particularly important here: each thread is bound to a processor group, and Windows will not reschedule a thread onto a different processor group. TBB therefore has to put 64 threads onto group 0 and 32 threads onto group 1.
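
For context, we consume the NUMA information with the usual oneTBB per-node arena pattern. This is only a sketch of the intended usage (not TBB internals); with correct values, the arena pinned to node 0 would get 96 slots, which TBB would then have to spread across groups 0 and 1 as described above:

```cpp
#include <oneapi/tbb/info.h>
#include <oneapi/tbb/task_arena.h>
#include <oneapi/tbb/task_group.h>
#include <cstddef>
#include <vector>

int main() {
    // One arena per NUMA node that the process is allowed to use.
    std::vector<tbb::numa_node_id> numa_nodes = tbb::info::numa_nodes();
    std::vector<tbb::task_arena> arenas(numa_nodes.size());
    std::vector<tbb::task_group> groups(numa_nodes.size());

    for (std::size_t i = 0; i < numa_nodes.size(); ++i) {
        // Pin the arena to the node; its concurrency should match
        // tbb::info::default_concurrency(numa_nodes[i]).
        arenas[i].initialize(tbb::task_arena::constraints(numa_nodes[i]));
        arenas[i].execute([&, i] { groups[i].run([] { /* per-node work */ }); });
    }
    for (std::size_t i = 0; i < numa_nodes.size(); ++i)
        arenas[i].execute([&, i] { groups[i].wait(); });
}
```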
Some hints as to how this could work:
- TBB should use `GetProcessGroupAffinity` to determine whether the current process is allowed to run on more than one processor group
- Unfortunately, there does not seem to be a good way to get the affinity masks for all allowed processor groups
- The best I could come up with is the following (see the sketch after this list):
  - `GetThreadGroupAffinity` gives you the affinity for the processor group of the current thread
  - Loop through the other allowed groups:
    - Loop through processors 0-63: set the thread affinity (`SetThreadGroupAffinity`) to exactly one processor in that group and record whether the call succeeds or fails
    - Build an affinity mask from those results
  - Restore the original affinity mask
- This is admittedly very ugly, but it seems to work
- hwloc is also no help here, unfortunately...
- NUMA nodes that we are not allowed to run on should then be removed from `tbb::info::numa_nodes()`
- The indexes should stay the same, though: if constrained to run on NUMA node 1, `tbb::info::numa_nodes()` should return `[1]`, not `[0]`
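
A sketch of the probing workaround described above. The main assumption is that `SetThreadGroupAffinity` fails for processors the process may not use (this is what makes the approach "seem to work"); error handling is omitted:

```cpp
#include <windows.h>
#include <cstdio>
#include <vector>

int main() {
    HANDLE thread = GetCurrentThread();

    // 1. Which processor groups may this process run on?
    USHORT group_count = 0;
    GetProcessGroupAffinity(GetCurrentProcess(), &group_count, nullptr);
    std::vector<USHORT> groups(group_count);
    GetProcessGroupAffinity(GetCurrentProcess(), &group_count, groups.data());

    // 2. Remember the current thread's affinity so it can be restored later.
    GROUP_AFFINITY original = {};
    GetThreadGroupAffinity(thread, &original);

    // 3. Probe every processor of every allowed group.
    for (USHORT group : groups) {
        KAFFINITY mask = 0;
        if (group == original.Group) {
            mask = original.Mask;  // already known for the current group
        } else {
            for (int cpu = 0; cpu < 64; ++cpu) {
                GROUP_AFFINITY probe = {};
                probe.Group = group;
                probe.Mask  = KAFFINITY(1) << cpu;
                if (SetThreadGroupAffinity(thread, &probe, nullptr))
                    mask |= probe.Mask;  // this processor is allowed
            }
        }
        std::printf("group %u: mask 0x%016llx\n",
                    static_cast<unsigned>(group),
                    static_cast<unsigned long long>(mask));
    }

    // 4. Restore the original affinity.
    SetThreadGroupAffinity(thread, &original, nullptr);
}
```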
Steps To Reproduce
Just call the following on a system as described above and observe the incorrect values:
- `tbb::global_control::active_value(tbb::global_control::max_allowed_parallelism)`
- `tbb::info::numa_nodes()`
- `tbb::info::default_concurrency()`
- `tbb::info::default_concurrency(numa_id)`
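
For convenience, a complete snippet of the check (standard oneTBB headers). In our scenario, constrained to node 0, we would expect `max_allowed_parallelism` and the default concurrencies to be 96 and `numa_nodes()` to contain only node 0, but we get the values described under Observed Behavior:

```cpp
#include <oneapi/tbb/global_control.h>
#include <oneapi/tbb/info.h>
#include <cstdio>

int main() {
    std::printf("max_allowed_parallelism: %zu\n",
                tbb::global_control::active_value(
                    tbb::global_control::max_allowed_parallelism));
    std::printf("default_concurrency():   %d\n", tbb::info::default_concurrency());

    for (tbb::numa_node_id id : tbb::info::numa_nodes())
        std::printf("numa node %d: default_concurrency = %d\n",
                    id, tbb::info::default_concurrency(id));
}
```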