Description
In porting our large application from TBB 2020.2 to oneTBB 2022.1, I noticed an unexpected increase in CPU usage on Intel i7 (and AMD Ryzen 9) processors that I wasn't able to reproduce on my Intel i9. I've reproduced the behaviour in the contrived code listed below: a loop (mimicking continuous mouse movement in the application) around a parallel_reduce over a range of 2 items, where each task does relatively little work.
When the test program below is linked against TBB 2020.2 on either the i7 or the Ryzen 9 test environment, CPU usage only goes up to about the 40-60% range. When it is linked against oneTBB 2022.1, CPU usage goes up to nearly 100%.
So really, the question here is: why? AFAICT, we hit 100% CPU usage in oneTBB because in each iteration of the loop, two different threads end up processing the tasks, which effectively keeps all cores busy. TBB 2020.2 does this too, but it seems to favour the main thread more often, leading to the lower CPU usage. I would have chalked this up to possibly better scheduling in oneTBB, but the odd thing is that it doesn't reproduce on the i9. And I see the same behaviour in both our Windows and Linux test environments, ruling out OS differences.
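If the extra CPU time really does come from otherwise-idle workers spinning in the scheduler, capping the process-wide thread count should bring usage back down. Here is a sketch of that experiment (the only TBB API it relies on beyond the repro below is tbb::global_control::max_allowed_parallelism; the cap of 2 is an arbitrary choice):

    // Sketch: cap TBB's thread count for the whole process and rerun the repro
    // below. If otherwise-idle workers spinning in the scheduler are what costs
    // the CPU time, usage should drop roughly in line with the cap.
    #include <tbb/global_control.h>

    int main()
    {
        // Allow at most 2 threads process-wide (main thread + 1 worker).
        tbb::global_control limit(tbb::global_control::max_allowed_parallelism, 2);

        // ... same measurement loop as in "Steps To Reproduce" below ...
        return 0;
    }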
In looking at the profiles of our actual code in Visual Studio, the hot path looks like this:
| Function Name | Total CPU [unit, %] | Self CPU [unit, %] | Module | Category |
|---|---|---|---|---|
| \|\| + tbb::detail::r1::rml::private_worker::thread_routine | 53465 (92.18%) | 0 (0.00%) | tbb12 | Kernel \| Runtime |
| \|\|\| + tbb::detail::r1::rml::private_worker::run | 53465 (92.18%) | 6 (0.01%) | tbb12 | Kernel \| Runtime |
| \|\|\|\| + tbb::detail::r1::thread_dispatcher::process | 53416 (92.09%) | 0 (0.00%) | tbb12 | Kernel \| Runtime |
| \|\|\|\|\| + tbb::detail::r1::arena::process | 53398 (92.06%) | 3 (0.01%) | tbb12 | Kernel \| Runtime |
| \|\|\|\|\|\| + tbb::detail::r1::task_dispatcher::local_wait_for_all<0,tbb::detail::r1::outermost_worker_waiter> | 53387 (92.04%) | 4 (0.01%) | tbb12 | Kernel \| Runtime |
| \|\|\|\|\|\|\| + tbb::detail::r1::task_dispatcher::receive_or_steal_task<0,tbb::detail::r1::outermost_worker_waiter> | 52153 (89.91%) | 21 (0.04%) | tbb12 | Kernel |
| \|\|\|\|\|\|\|\| + tbb::detail::r1::outermost_worker_waiter::continue_execution | 49745 (85.76%) | 839 (1.45%) | tbb12 | Kernel |
| \|\|\|\|\|\|\|\|\| - [External Call] SwitchToThread | 40180 (69.27%) | 40180 (69.27%) | kernelbase | |
This led me to look at the code in receive_or_steal_task(). One notable difference is that TBB 2020.2 merely does a prolonged_pause(), whereas oneTBB now takes a more complicated code path into waiter_base::pause(), which uses stealing_loop_backoff and seems more prone to spinning. However, I'm not really well versed in this code and could be barking up the wrong tree.
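For what it's worth, the shape of the behaviour I think the profile is showing is roughly the following. This is only my mental model, not the actual oneTBB code (idle_worker_wait and the spin threshold are made up for illustration): a worker with nothing to steal spins for a while and then repeatedly yields, which on Windows shows up as SwitchToThread and is still charged as CPU time, instead of blocking in the OS.

    #include <atomic>
    #include <thread>

    // Illustration only: an out-of-work worker busy-waits, then keeps yielding
    // rather than blocking, so it continues to accumulate CPU time while "idle".
    void idle_worker_wait(std::atomic<bool> &work_available)
    {
        int backoff = 0;
        while (!work_available.load(std::memory_order_relaxed))
        {
            if (backoff++ < 100)
                ;                            // short busy-wait (pause) phase
            else
                std::this_thread::yield();   // what shows up as SwitchToThread
        }
    }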
Environments
The issue was reproduced in these two environments (one Windows, one Linux):
- Intel i7-9700, Windows 11 24H2, MSVC 19.42.3444 (Visual Studio 2022 17.12.12)
- AMD Ryzen 9 9950X3D, Ubuntu 24.04, gcc 11.5
It could not be reproduced on:
- Intel i9-14900K, Windows 11 24H2, MSVC 19.42.3444 (Visual Studio 2022 17.12.12)
Additionally, the increased CPU usage was also observed on an Apple M3 (macOS, Xcode), but I didn't test extensively on that machine, so I've largely omitted it from the description. It's just a further data point that the behaviour isn't OS-specific.
Steps To Reproduce
Here's the test program:
// test_reduce.cpp
//
// Compile options:
//   gcc 11.2: g++ -std=c++17 -O2 test_reduce.cpp -o test_reduce -ltbb
//   MSVC:     cl /std:c++17 /O2 test_reduce.cpp
//
#include <tbb/blocked_range.h>
#include <tbb/concurrent_vector.h>
#include <tbb/parallel_reduce.h>
#include <tbb/task_arena.h>

#include <algorithm>
#include <chrono>
#include <iostream>
#include <vector>
int main()
{
    using namespace tbb;
    using range = blocked_range<int *>;
    using clock = std::chrono::high_resolution_clock;
    using std::chrono::seconds;
    using std::chrono::duration;

    int n = 2; //int(this_task_arena::max_concurrency() * 1.5);
    std::vector<int> values(n);
    concurrent_vector<int> threads;
    duration<double> total{};
    int laps = 0;
    volatile int result = 0;

    auto stop_time = clock::now() + seconds(10);
    while (clock::now() <= stop_time)
    {
        auto begin = clock::now();
        result += parallel_reduce(
            range(values.data(), values.data() + n, /*grain_size*/ 1),
            /*init*/ 0,
            [&threads](const range &r, int init) -> int
            {
                threads.push_back(this_task_arena::current_thread_index());
                for (int *a = r.begin(); a != r.end(); ++a)
                {
                    init += *a;
#if 1
                    for (int i = 0; i < 2000000; ++i)
                        init += i;
#endif
                }
                return init;
            },
            [](int x, int y) -> int
            {
                return x + y;
            }
        );
        total += clock::now() - begin;
        ++laps;
    }

    std::sort(threads.begin(), threads.end());
    auto last_unique = std::unique(threads.begin(), threads.end());
    int num_threads = int(std::distance(threads.begin(), last_unique));

    std::cout << laps / total.count() << " fps, laps = " << laps
              << ", num_threads = " << num_threads << "\n";
    return 0;
}
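As another data point, here is a sketch of the same measurement confined to an explicitly sized tbb::task_arena (this is not part of the original repro; the arena size of 2 and the fixed lap count are arbitrary choices for illustration). Unlike the global_control cap above, this limits how many threads may participate in the reduce rather than how many worker threads exist process-wide.

    // test_reduce_arena.cpp (sketch) -- same busy work as the repro above, but the
    // reduce runs inside a 2-slot task_arena, so at most one worker can join the
    // main thread for each lap.
    #include <tbb/blocked_range.h>
    #include <tbb/parallel_reduce.h>
    #include <tbb/task_arena.h>
    #include <iostream>
    #include <vector>

    int main()
    {
        using range = tbb::blocked_range<int *>;

        int n = 2;
        std::vector<int> values(n);
        tbb::task_arena arena(2);       // main thread + at most 1 worker
        volatile int result = 0;

        for (int lap = 0; lap < 100000; ++lap)
        {
            int lap_result = 0;
            arena.execute([&]
            {
                lap_result = tbb::parallel_reduce(
                    range(values.data(), values.data() + n, /*grain_size*/ 1),
                    /*init*/ 0,
                    [](const range &r, int init) -> int
                    {
                        for (int *a = r.begin(); a != r.end(); ++a)
                        {
                            init += *a;
                            for (int i = 0; i < 2000000; ++i)   // same busy work
                                init += i;
                        }
                        return init;
                    },
                    [](int x, int y) -> int { return x + y; });
            });
            result += lap_result;
        }

        std::cout << result << "\n";
        return 0;
    }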