Description
In porting our large application from TBB 2020.2 to oneTBB 2022.1, I noticed an unexpected increase in CPU usage on Intel i7 (and AMD Ryzen 9) processors that I wasn't able to reproduce on my Intel i9. I've reproduced the behaviour in the contrived code listed below: a loop (mimicking continuous mouse movement in the application) around a parallel_reduce over a range of 2 items, where each task does relatively little work.
When the test program below is linked against TBB 2020.2 on either the i7 or the Ryzen 9 test environment, CPU usage only goes up to about the 40-60% range. When it is linked against oneTBB 2022.1, CPU usage goes up to nearly 100%.
So really, the question here is: why? AFAICT, we hit 100% CPU usage in oneTBB because in each iteration of the loop, two different threads end up processing the tasks, which effectively keeps all cores busy. TBB 2020.2 does this too, but it seems to favour the main thread more often, leading to the lower CPU usage. I would have chalked this up to possibly better scheduling in oneTBB, but the odd thing is that it doesn't reproduce on the i9. And I see the same behaviour in both our Windows and Linux test environments, ruling out OS differences.
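If the extra CPU time really does come from otherwise-idle workers spinning in the scheduler, capping the process-wide thread count should bring usage back down. Here is a sketch of that experiment (the only TBB API it relies on beyond the repro below is tbb::global_control::max_allowed_parallelism; the cap of 2 is an arbitrary choice):

    // Sketch: cap TBB's thread count for the whole process and rerun the repro
    // below. If otherwise-idle workers spinning in the scheduler are what costs
    // the CPU time, usage should drop roughly in line with the cap.
    #include <tbb/global_control.h>

    int main()
    {
        // Allow at most 2 threads process-wide (main thread + 1 worker).
        tbb::global_control limit(tbb::global_control::max_allowed_parallelism, 2);

        // ... same measurement loop as in "Steps To Reproduce" below ...
        return 0;
    }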
In looking at the profiles of our actual code in Visual Studio, the hot path looks like this:
| Function Name | Total CPU [unit, %] | Self CPU [unit, %] | Module | Category |
|---|---|---|---|---|
| \|\| + tbb::detail::r1::rml::private_worker::thread_routine | 53465 (92.18%) | 0 (0.00%) | tbb12 | Kernel \| Runtime |
| \|\|\| + tbb::detail::r1::rml::private_worker::run | 53465 (92.18%) | 6 (0.01%) | tbb12 | Kernel \| Runtime |
| \|\|\|\| + tbb::detail::r1::thread_dispatcher::process | 53416 (92.09%) | 0 (0.00%) | tbb12 | Kernel \| Runtime |
| \|\|\|\|\| + tbb::detail::r1::arena::process | 53398 (92.06%) | 3 (0.01%) | tbb12 | Kernel \| Runtime |
| \|\|\|\|\|\| + tbb::detail::r1::task_dispatcher::local_wait_for_all<0,tbb::detail::r1::outermost_worker_waiter> | 53387 (92.04%) | 4 (0.01%) | tbb12 | Kernel \| Runtime |
| \|\|\|\|\|\|\| + tbb::detail::r1::task_dispatcher::receive_or_steal_task<0,tbb::detail::r1::outermost_worker_waiter> | 52153 (89.91%) | 21 (0.04%) | tbb12 | Kernel |
| \|\|\|\|\|\|\|\| + tbb::detail::r1::outermost_worker_waiter::continue_execution | 49745 (85.76%) | 839 (1.45%) | tbb12 | Kernel |
| \|\|\|\|\|\|\|\|\| - [External Call] SwitchToThread | 40180 (69.27%) | 40180 (69.27%) | kernelbase | |
This led me to look at the code in receive_or_steal_task(). One notable difference is that TBB 2020.2 merely does a prolonged_pause(), whereas oneTBB now takes a more complicated code path into waiter_base::pause(), which uses stealing_loop_backoff and seems more prone to spinning. However, I'm not really well versed in this code and could be barking up the wrong tree.
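For what it's worth, the shape of the behaviour I think the profile is showing is roughly the following. This is only my mental model, not the actual oneTBB code (idle_worker_wait and the spin threshold are made up for illustration): a worker with nothing to steal spins for a while and then repeatedly yields, which on Windows shows up as SwitchToThread and is still charged as CPU time, instead of blocking in the OS.

    #include <atomic>
    #include <thread>

    // Illustration only: an out-of-work worker busy-waits, then keeps yielding
    // rather than blocking, so it continues to accumulate CPU time while "idle".
    void idle_worker_wait(std::atomic<bool> &work_available)
    {
        int backoff = 0;
        while (!work_available.load(std::memory_order_relaxed))
        {
            if (backoff++ < 100)
                ;                            // short busy-wait (pause) phase
            else
                std::this_thread::yield();   // what shows up as SwitchToThread
        }
    }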
Environments
The issue was reproduced in these two environments (one Windows, one Linux):
- Intel i7-9700, Windows 11 24H2, MSVC 19.42.3444 (Visual Studio 2022 17.12.12)
- AMD Ryzen 9 9950X3D, Ubuntu 24.04, gcc 11.5
It could not be reproduced on:
- Intel i9-14900K, Windows 11 24H2, MSVC 19.42.3444 (Visual Studio 2022 17.12.12)
Additionally, the increased CPU usage was also observed on an Apple M3 (macOS, Xcode), but I didn't test extensively on that machine, so I've largely omitted it from the description. It's just a further data point that the behaviour isn't OS-specific.
Steps To Reproduce
Here's the test program:
// test_reduce.cpp
//
// Compile options:
//   gcc 11.2: g++ -std=c++17 -O2 test_reduce.cpp -o test_reduce -ltbb
//   MSVC:     cl /std:c++17 /O2 test_reduce.cpp
//
#include <tbb/blocked_range.h>
#include <tbb/concurrent_vector.h>
#include <tbb/parallel_reduce.h>
#include <tbb/task_arena.h>

#include <algorithm>
#include <chrono>
#include <iostream>
#include <vector>
int main()
{
    using namespace tbb;
    using range = blocked_range<int *>;
    using clock = std::chrono::high_resolution_clock;
    using std::chrono::seconds;
    using std::chrono::duration;

    int n = 2; //int(this_task_arena::max_concurrency() * 1.5);
    std::vector<int> values(n);
    concurrent_vector<int> threads;
    duration<double> total{};
    int laps = 0;
    volatile int result = 0;

    auto stop_time = clock::now() + seconds(10);
    while (clock::now() <= stop_time)
    {
        auto begin = clock::now();
        result += parallel_reduce(
            range(values.data(), values.data() + n, /*grain_size*/ 1),
            /*init*/ 0,
            [&threads](const range &r, int init) -> int
            {
                threads.push_back(this_task_arena::current_thread_index());
                for (int *a = r.begin(); a != r.end(); ++a)
                {
                    init += *a;
#if 1
                    for (int i = 0; i < 2000000; ++i)
                        init += i;
#endif
                }
                return init;
            },
            [](int x, int y) -> int
            {
                return x + y;
            }
        );
        total += clock::now() - begin;
        ++laps;
    }

    std::sort(threads.begin(), threads.end());
    auto last_unique = std::unique(threads.begin(), threads.end());
    int num_threads = int(std::distance(threads.begin(), last_unique));

    std::cout << laps / total.count() << " fps, laps = " << laps
              << ", num_threads = " << num_threads << "\n";
    return 0;
}
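As another data point, here is a sketch of the same measurement confined to an explicitly sized tbb::task_arena (this is not part of the original repro; the arena size of 2 and the fixed lap count are arbitrary choices for illustration). Unlike the global_control cap above, this limits how many threads may participate in the reduce rather than how many worker threads exist process-wide.

    // test_reduce_arena.cpp (sketch) -- same busy work as the repro above, but the
    // reduce runs inside a 2-slot task_arena, so at most one worker can join the
    // main thread for each lap.
    #include <tbb/blocked_range.h>
    #include <tbb/parallel_reduce.h>
    #include <tbb/task_arena.h>
    #include <iostream>
    #include <vector>

    int main()
    {
        using range = tbb::blocked_range<int *>;

        int n = 2;
        std::vector<int> values(n);
        tbb::task_arena arena(2);       // main thread + at most 1 worker
        volatile int result = 0;

        for (int lap = 0; lap < 100000; ++lap)
        {
            int lap_result = 0;
            arena.execute([&]
            {
                lap_result = tbb::parallel_reduce(
                    range(values.data(), values.data() + n, /*grain_size*/ 1),
                    /*init*/ 0,
                    [](const range &r, int init) -> int
                    {
                        for (int *a = r.begin(); a != r.end(); ++a)
                        {
                            init += *a;
                            for (int i = 0; i < 2000000; ++i)   // same busy work
                                init += i;
                        }
                        return init;
                    },
                    [](int x, int y) -> int { return x + y; });
            });
            result += lap_result;
        }

        std::cout << result << "\n";
        return 0;
    }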