Summary
When short parallel phases are interleaved with single-threaded work, we see substantial performance regressions; up to 16% has been measured on laptops in real-world workflows. We believe this is due to the increased CPU load caused by PR #1352, and it is related to the report of increased CPU usage in #1871.
This was observed while testing a key interactive display workflow in a major application - it is not merely a "theoretical" slowdown on a synthetic example.
When testing synthetic examples it was obvious that the behaviour is extremely variable depending on the exact form of the workload; we did in fact observe a speedup (we presume from PR #1352) when the single-threaded portion of the workload was shorter.
Version
We are comparing TBB 2021.12.0 (before) with TBB 2022.0.0 (after)
Environment
Intel Xeon W-2255
Windows 11 24H2
MSVC 17.14.7
Observed Behavior
The synthetic example below runs approximately 8% slower in wallclock time in TBB 2022.0 than it did in TBB 2021.12. A crude estimate of CPU usage increases from ~32% to ~45% (which is expected given the new spinning/yield behaviour).
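For anyone reproducing the CPU usage estimate, here is a minimal sketch of one way to compute whole-process CPU usage on Windows with GetProcessTimes. This helper is my illustration only; the numbers above were just a crude estimate:

#include <windows.h>

// Total kernel + user CPU time consumed by this process so far, in seconds.
static double process_cpu_seconds()
{
    FILETIME creation_time, exit_time, kernel_time, user_time;
    GetProcessTimes(GetCurrentProcess(),
                    &creation_time, &exit_time, &kernel_time, &user_time);
    auto to_seconds = [](const FILETIME& ft) {
        ULARGE_INTEGER ticks;
        ticks.LowPart = ft.dwLowDateTime;
        ticks.HighPart = ft.dwHighDateTime;
        return ticks.QuadPart * 100e-9; // FILETIME ticks are 100 ns
    };
    return to_seconds(kernel_time) + to_seconds(user_time);
}

// Sample before and after the benchmark; then
// usage % = 100 * cpu_seconds_delta / (wall_seconds * logical_core_count).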
Expected Behavior
Minimal performance change (or improvement) between TBB 2021.12 and TBB 2022.0.0
Steps To Reproduce
I used this synthetic Mersenne Twister workload to reproduce the issue (I'm sure there are issues with this as a microbenchmark, but it was just a very simple attempt to reproduce the workload shape that was causing problems):
#include <chrono>
#include <iostream>
#include <random>
#include <vector>

#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>
#include <tbb/task_arena.h>

int main()
{
    constexpr int samples_per_mt = 1000000;
    constexpr int samples_per_st = 1000000;
    constexpr int samples_per_pass = samples_per_st + samples_per_mt;
    constexpr int pass_count = 1000;

    std::vector<double> results;
    results.resize(samples_per_pass, 0);

    struct randomizer
    {
        std::default_random_engine m_eng;
        std::uniform_int_distribution<int> m_dist;
        randomizer(int seed) : m_eng(seed), m_dist(0, 100000) {}
        int get() { return m_dist(m_eng); }
    };

    // One generator per worker slot so the parallel phase shares nothing.
    int worker_count = tbb::this_task_arena::max_concurrency();
    std::vector<randomizer> per_thread_randomizer;
    for (int idx = 0; idx < worker_count; ++idx)
        per_thread_randomizer.emplace_back(idx);

    auto time_start = std::chrono::high_resolution_clock::now();
    for (int idx = 0; idx < pass_count; ++idx)
    {
        // Single-threaded phase: fill the first half of the pass.
        int base_pass_idx = 0;
        for (int st_idx = 0; st_idx < samples_per_st; ++st_idx)
            results[base_pass_idx + st_idx] = per_thread_randomizer[0].get();
        base_pass_idx = base_pass_idx + samples_per_st;

        // Parallel phase: fill the second half.
        // Use simple_partitioner because the problem case did.
        tbb::parallel_for(
            tbb::blocked_range(0, samples_per_mt, 1000),
            [&](const tbb::blocked_range<int>& r) {
                int thread_idx = tbb::this_task_arena::current_thread_index();
                for (int mt_idx = r.begin(); mt_idx != r.end(); ++mt_idx)
                {
                    results[base_pass_idx + mt_idx] =
                        per_thread_randomizer[thread_idx].get();
                }
            },
            tbb::simple_partitioner());
    }
    auto time_end = std::chrono::high_resolution_clock::now();

    std::cout << "Time taken : "
              << std::chrono::duration<double>(time_end - time_start).count()
              << " s" << std::endl;
}
Note that 50% of the workload (and therefore, I expect, much more than 50% of the time) is spent single-threaded. Although this is obviously non-optimal, it does represent quite a common situation where it has only been practical to parallelize part of a single-threaded algorithm.
My belief is that the extra CPU usage caused by the spinning cores causes clock speeds to be lower during single-threaded execution than they would be if there were no execution on those cores at all. While the repeated yield does ensure that the spinning worker threads don't take precedence over other work the OS might schedule onto that core, they still have a global side effect on the performance of other cores when spinning with no other work scheduled. This is consistent with the largest effect being observed on laptops.
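One simple way to test this hypothesis (my suggestion, not part of the measurements above) would be to time only the single-threaded phase of each pass. Its code is identical in both TBB versions, so if it alone gets slower under 2022.0, the regression must come from what the worker threads are doing in the background, e.g. depressed clock speeds. A sketch, reusing the names from the repro:

std::chrono::duration<double> st_total{0};
for (int idx = 0; idx < pass_count; ++idx)
{
    // Time only the single-threaded half of the pass.
    auto st_start = std::chrono::steady_clock::now();
    for (int st_idx = 0; st_idx < samples_per_st; ++st_idx)
        results[st_idx] = per_thread_randomizer[0].get();
    st_total += std::chrono::steady_clock::now() - st_start;

    // ... parallel phase exactly as in the repro above ...
}
std::cout << "Single-threaded phase total : " << st_total.count() << " s\n";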
While I understand that parallel_phase is being introduced to add more user control over this behaviour, my preference would be for the new spinning/yield behaviour to be opt-in rather than opt-out. The current parallel_phase documentation also doesn't point out this interaction with single-threaded workloads.
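For reference, this is roughly what opting out appears to look like with the preview API that accompanies the parallel_phase work, as far as I can tell from the RFC: a task_arena can be given a "fast" leave policy so that workers stop spinning promptly when the arena runs out of work. The TBB_PREVIEW_PARALLEL_PHASE macro and the leave_policy names below are my reading of the preview/RFC and may not match the released headers exactly:

// Assumed preview API from the parallel_phase RFC; names and availability
// may differ in released oneTBB versions.
#define TBB_PREVIEW_PARALLEL_PHASE 1
#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>
#include <tbb/task_arena.h>

int main()
{
    // Ask workers to leave the arena promptly instead of spinning/yielding.
    tbb::task_arena arena(tbb::task_arena::automatic, 1,
                          tbb::task_arena::priority::normal,
                          tbb::task_arena::leave_policy::fast);
    arena.execute([] {
        tbb::parallel_for(tbb::blocked_range(0, 1000000, 1000),
                          [](const tbb::blocked_range<int>&) { /* work */ },
                          tbb::simple_partitioner());
    });
}

Even if something like this works, it still requires every affected caller to opt out arena by arena, which is why I would prefer the old behaviour as the default.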