
Increased CPU usage on parallel_reduce with small ranges #1871

@e4lam

Description


While porting our large application from TBB 2020.2 to oneTBB 2022.1, I noticed strange increased CPU usage on Intel i7 (and AMD Ryzen 9) processors that I wasn't able to reproduce on my Intel i9. I've reproduced the behaviour in the contrived code listed below: a loop (mimicking continuous mouse movement in the application) around a parallel_reduce over a range of 2 items, where each task does relatively little work.

When the test program below is linked against TBB 2020.2 on either the i7 or the Ryzen 9 test environment, CPU usage only goes up to roughly the 40-60% range. When it is linked against oneTBB 2022.1, CPU usage goes up to nearly 100%.

So really, the question here is: why? AFAICT, we hit 100% CPU usage in oneTBB because in each iteration of the loop two different threads end up processing the tasks, which effectively keeps all cores busy. The same thing happens in TBB, but there the scheduler seems to favour the main thread more often, leading to the reduced CPU usage. I would have chalked this up to better scheduling in oneTBB, except that it doesn't reproduce on the i9. I also see the same behaviour in both our Windows and Linux test environments, ruling out OS differences.

In looking at the profiles of our actual code in Visual Studio, the hot path looks like this:

Function Name Total CPU [unit, %] Self CPU [unit, %] Module Category
|| + tbb::detail::r1::rml::private_worker::thread_routine 53465 (92.18%) 0 (0.00%) tbb12 Kernel | Runtime
||| + tbb::detail::r1::rml::private_worker::run 53465 (92.18%) 6 (0.01%) tbb12 Kernel | Runtime
|||| + tbb::detail::r1::thread_dispatcher::process 53416 (92.09%) 0 (0.00%) tbb12 Kernel | Runtime
||||| + tbb::detail::r1::arena::process 53398 (92.06%) 3 (0.01%) tbb12 Kernel | Runtime
|||||| + tbb::detail::r1::task_dispatcher::local_wait_for_all<0,tbb::detail::r1::outermost_worker_waiter> 53387 (92.04%) 4 (0.01%) tbb12 Kernel | Runtime
||||||| + tbb::detail::r1::task_dispatcher::receive_or_steal_task<0,tbb::detail::r1::outermost_worker_waiter> 52153 (89.91%) 21 (0.04%) tbb12 Kernel
|||||||| + tbb::detail::r1::outermost_worker_waiter::continue_execution 49745 (85.76%) 839 (1.45%) tbb12 Kernel
||||||||| - [External Call] SwitchToThread 40180 (69.27%) 40180 (69.27%) kernelbase

This led me to look at the code in receive_or_steal_task(). One notable difference is that TBB merely does a prolonged_pause(), whereas oneTBB now has a more complicated code path into waiter_base::pause(), which uses stealing_loop_backoff and seems more prone to spinning. However, I'm not well versed in this code and could be barking up the wrong tree.
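
For readers less familiar with the scheduler internals, here is a minimal sketch of the two spin-wait patterns as I understand them. This is my own illustration, not TBB's actual sources; the names bounded_pause, stealing_backoff_sketch, and yield_threshold are made up. The point is that a backoff counter which is reset whenever any activity is observed can keep a worker spinning on yield calls much longer before it blocks:

// Sketch only -- not TBB code. Illustrates a bounded pause versus an
// exponential stealing backoff that spins longer before yielding.
#include <thread>

// TBB 2020-style: spin a fixed, small number of times, then return so the
// caller can go on to block or sleep.
void bounded_pause(int max_spins = 16)
{
    for (int i = 0; i < max_spins; ++i)
        std::this_thread::yield(); // stand-in for a hardware "pause" hint
}

// oneTBB-style (as I read it): exponential backoff while spinning, falling
// back to yield once a threshold is crossed.
struct stealing_backoff_sketch
{
    int count = 1;
    static constexpr int yield_threshold = 100;

    void pause()
    {
        if (count <= yield_threshold) {
            for (volatile int i = 0; i < count; ++i)
                ;                      // busy-spin (real code issues CPU pause hints)
            count *= 2;                // back off exponentially, but keep spinning
        } else {
            std::this_thread::yield(); // the SwitchToThread calls in the profile above
        }
    }
    void reset() { count = 1; }        // any observed activity restarts the spin phase
};

If the backoff is reset every time a worker glimpses a task (which would happen once per loop iteration here), the workers never reach the blocking path and a core stays pegged.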

Environments

The issue was reproduced under these 2 environments (one Windows, one Linux):

  • Intel i7-9700, Windows 11 24H2, MSVC 19.42.3444 (Visual Studio 2022 17.12.12)
  • AMD Ryzen 9 9950X3D, Ubuntu 24.04, gcc 11.5

The issue was not reproducible on:

  • Intel i9-14900K, Windows 11 24H2, MSVC 19.42.3444 (Visual Studio 2022 17.12.12)

Additionally, the increased CPU usage was also observed on an Apple M3 (macOS, Xcode), but I didn't test extensively on that machine, so I've largely omitted it from the description. It's just a further data point that the issue isn't OS-specific.

Steps To Reproduce

Here's the test program:

// test_reduce.cpp
//
// Compile options:
// gcc 11.2: g++ --std=c++17 -O2 test_reduce.cpp -ltbb
// MSVC:     cl /std:c++17 /O2 test_reduce.cpp tbb12.lib
//
#include <tbb/parallel_reduce.h>
#include <tbb/blocked_range.h>
#include <tbb/concurrent_vector.h>
#include <tbb/task_arena.h>
#include <algorithm>
#include <chrono>
#include <iostream>
#include <vector>

int main()
{
    using namespace tbb;
    using range = blocked_range<int *>;
    using clock = std::chrono::high_resolution_clock;
    using std::chrono::seconds;
    using std::chrono::duration;

    int n = 2; //int(this_task_arena::max_concurrency() * 1.5);
    std::vector<int> values(n);

    concurrent_vector<int> threads;
    duration<double> total{}; // zero-initialize: a default-constructed duration is uninitialized
    int laps = 0;
    volatile int result = 0;  // volatile so the reduction can't be optimized away
    auto stop_time = clock::now() + seconds(10);
    while (clock::now() <= stop_time)
    {
        auto begin = clock::now();
        result += parallel_reduce(
            range(values.data(), values.data() + n, /*grain_size*/ 1),
            /*init*/ 0,
            [&threads](const range &r, int init) -> int
            {
                threads.push_back(this_task_arena::current_thread_index());
                for (int *a = r.begin(); a != r.end(); ++a)
                {
                    init += *a;
#if 1
                    // Artificial busy work so each body invocation takes a
                    // small but non-trivial amount of time.
                    for (int i = 0; i < 2000000; ++i)
                        init += i;
#endif
                }
                return init;
            },
            [](int x, int y) -> int
            {
                return x + y;
            }
        );
        total += clock::now() - begin;
        ++laps;
    }

    std::sort(threads.begin(), threads.end());
    auto last_unique = std::unique(threads.begin(), threads.end());
    int num_threads = std::distance(threads.begin(), last_unique);

    std::cout << laps / total.count() << " fps, laps = " << laps
              << ", num_threads = " << num_threads << "\n";

    return 0;
}
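
As a possible follow-up experiment (untested; not part of the measurements above), capping the worker count with tbb::global_control should show whether spinning workers account for the extra CPU. With the cap in place I'd expect the oneTBB build to look much closer to the TBB 2020.2 numbers:

// Untested variant: limit the process to 2 threads before running the same
// loop as in test_reduce.cpp. The expected outcome stated above is a guess,
// not a measured result.
#include <tbb/global_control.h>

int main()
{
    // At most 2 threads total (main thread + 1 worker) for the process.
    tbb::global_control limit(
        tbb::global_control::max_allowed_parallelism, 2);

    // ... same parallel_reduce timing loop as in test_reduce.cpp ...
    return 0;
}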
