# Host Backends Support for the Histogram APIs

## Introduction
The oneDPL library added histogram APIs, currently implemented only for device policies with the DPC++ backend. These
APIs are defined in the oneAPI Specification 1.4. Please see the
[oneAPI Specification](https://github.com/uxlfoundation/oneAPI-spec/blob/main/source/elements/oneDPL/source/parallel_api/algorithms.rst#parallel-algorithms)
for details. The host-side backends (serial, TBB, OpenMP) are not yet supported. This RFC proposes extending histogram
support to these backends.

The pull request for the proposed implementation exists [here](https://github.com/oneapi-src/oneDPL/pull/1974).

## Motivations
There are many use cases for a host-side serial or parallel implementation of histogram. Another motivation for adding
this support is simply to be compliant with the oneAPI specification.

## Design Considerations

### Key Requirements
Provide support for the `histogram` APIs with the following policies and backends:
- Policies: `seq`, `unseq`, `par`, `par_unseq`
- Backends: `serial`, `tbb`, `openmp`

Users have a choice of execution policies when calling oneDPL APIs. They also have a number of backend options which
they can select from when using oneDPL. It is important that all combinations of these options have support for the
`histogram` APIs.

### Performance
Histogram algorithms typically involve minimal computation and are likely to be memory-bound. So, the implementation
prioritizes reducing memory accesses and minimizing temporary memory traffic.

For CPU backends, we will focus on input sizes ranging from 32K to 4M elements and 32 - 4k histogram bins. Smaller
input sizes may be best suited for a serial histogram implementation, and very large sizes may be better suited for GPU
device targets. Histogram bin counts vary from use case to use case, but the most common rule of thumb is to size the
number of bins approximately to the cube root of the number of input elements. For our input size range this gives us
a range of 32 - 256. In practice, some users find the need to increase the number of bins beyond that rough rule.
For this reason, we have selected our histogram size range of 32 - 4k bins.

### Memory Footprint
There are no guidelines here from the standard library as this is an extension API. Still, we will minimize memory
footprint where possible.

### Code Reuse
We want to minimize adding requirements for parallel backends to implement, and lift as much as possible to the
algorithm implementation level. We should be able to avoid adding a `__parallel_histogram` call in the individual
backends, and instead rely upon `__parallel_for`.

### SIMD/OpenMP SIMD Implementation
Currently, oneDPL relies upon OpenMP SIMD to provide its vectorization, which is designed to provide vectorization
across loop iterations; oneDPL does not directly use any intrinsics.

There are a few parts of the histogram algorithm to consider. For the calculation that determines which bin to
increment, there are two APIs, even and custom range, which have significantly different methods to determine the bin.
For the even bin API, the calculation of the selected bin has some opportunity for vectorization, as the same
mathematical operations are applied to each input element. However, for the custom range API, each input element uses a
binary search through a list of bin boundaries to determine the selected bin. This operation will have a different
length and control flow based upon each input element and will be very difficult to vectorize.
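
To illustrate the difference between the two bin-selection methods, here is a minimal sketch. The helper names and
element type are illustrative assumptions, not the oneDPL internals, and each value is assumed to fall inside the
histogram's range:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Even bin API: the same arithmetic is applied to every element, so the loop body
// containing this call is a candidate for vectorization.
inline std::size_t select_bin_even(float value, float first, float last, std::size_t num_bins)
{
    // Assumes first <= value < last.
    return static_cast<std::size_t>((value - first) * num_bins / (last - first));
}

// Custom range API: a binary search over the user-provided boundaries; the trip count and
// control flow depend on the element, which makes vectorization impractical.
inline std::size_t select_bin_custom(float value, const std::vector<float>& boundaries)
{
    // Assumes boundaries.front() <= value < boundaries.back().
    auto it = std::upper_bound(boundaries.begin(), boundaries.end(), value);
    return static_cast<std::size_t>(it - boundaries.begin()) - 1;
}
```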

Next, let's consider the increment operation itself. This operation increments a data-dependent bin location, and may
result in conflicts between elements of the same vector. This increment operation is therefore unvectorizable without
more complex handling. Some hardware does implement SIMD conflict detection via specific intrinsics, but this is not
available via OpenMP SIMD. Alternatively, we could multiply our number of temporary histogram copies by a factor of the
vector width, but it is unclear if that is worth the overhead. OpenMP SIMD provides an `ordered` structured block which
we can use to exempt the increment from SIMD operations as well. However, this often results in vectorization being
refused by the compiler. The initial implementation will avoid vectorization of this main histogram loop.
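
The `ordered` idea mentioned above could look roughly like the following sketch. This is illustrative only: the
function name and the even-bin mapping are assumptions, and the initial implementation does not take this path:

```cpp
#include <cstddef>
#include <cstdint>

// Sketch: vectorize the bin-selection arithmetic while serializing the data-dependent
// increment per SIMD lane via an `ordered simd` block (assumes first <= in[i] < last).
void histogram_even_simd_sketch(const float* in, std::size_t n, std::uint32_t* bins,
                                std::size_t num_bins, float first, float last)
{
    const float scale = num_bins / (last - first);
#pragma omp simd
    for (std::size_t i = 0; i < n; ++i)
    {
        const std::size_t bin = static_cast<std::size_t>((in[i] - first) * scale);
#pragma omp ordered simd
        {
            ++bins[bin]; // executed in lane order, avoiding intra-vector conflicts
        }
    }
}
```

As noted above, compilers frequently refuse to vectorize such a loop, which is why the initial implementation leaves
this loop unvectorized.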

Last, for our proposed implementation below, there is the task of combining temporary histogram data into the global
output histogram. This is directly vectorizable via our existing `brick_walk` implementation, and will be vectorized
when a vector policy is used.

### Serial Backend
We plan to support a serial backend for histogram APIs in addition to OpenMP and TBB. This backend will handle all
policy types, but always provide a serial, unvectorized implementation. To make this backend compatible with the other
approaches, we will use a single temporary histogram copy, which is then copied to the final global histogram. In our
benchmarking, using a temporary copy performs similarly to initializing and then accumulating directly into the output
global histogram. There seems to be no performance-motivated reason to special case the serial algorithm to use the
global histogram directly.

## Existing APIs / Patterns

### count_if
`histogram` is similar to `count_if` in that it conditionally increments a number of counters based upon the data in a
sequence. `count_if` relies upon the `transform_reduce` pattern internally, returns a scalar-typed value, and doesn't
provide any way to modify the variable being incremented. Using `count_if` without significant modification would
require us to loop through the entire sequence for each output bin in the histogram. From a memory bandwidth
perspective, this is untenable. Similarly, using a `histogram` pattern to implement `count_if` is unlikely to provide a
well-performing result in the end, as contention should be far higher, and `transform_reduce` is a very well-matched
pattern performance-wise.

### parallel_for
`parallel_for` is an interesting pattern in that it is very generic and embarrassingly parallel. This is close to what
we need for `histogram`. However, we cannot simply use it without any added infrastructure. If we were to just use
`parallel_for` alone, there would be a race condition between threads when incrementing the values in the output
histogram. We should be able to use `parallel_for` as a building block for our implementation, but it requires some way
to synchronize and accumulate between threads.

## Alternative Approaches

### Atomics
This method uses atomic operations to remove the race conditions during accumulation. With atomic increments of the
output histogram data, we can merely run a `parallel_for` pattern.

To deal with atomics appropriately, we have some limitations. We must either use standard library atomics, atomics
specific to a backend, or custom atomics specific to a compiler. `C++17` provides `std::atomic<T>`; however, this can
only provide atomicity for data which is created with atomics in mind. This means allocating temporary data and then
copying it to the output data. `C++20` provides `std::atomic_ref<T>`, which would allow us to wrap user-provided output
data in an atomic wrapper, but we cannot assume `C++20` for all users. OpenMP provides atomic operations, but that is
only available for the OpenMP backend. The working plan was to implement a macro like `_ONEDPL_ATOMIC_INCREMENT(var)`
which uses an `std::atomic_ref` if available, and otherwise uses compiler builtins like `InterlockedAdd` or
`__atomic_fetch_add_n`. In a proof of concept implementation, this seemed to work, but it reaches further into
compiler / OS specific details than is desired for implementations prior to `C++20`.
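
A rough sketch of what such a macro could look like is shown below. This is an assumption-laden illustration, not the
proof of concept code: a real version would need proper feature detection and an MSVC branch using the `Interlocked*`
intrinsics.

```cpp
#include <atomic>

// Illustrative sketch of a _ONEDPL_ATOMIC_INCREMENT(var) macro as described above.
#if __cplusplus >= 202002L
// C++20: wrap the user-provided output bin in std::atomic_ref and increment atomically.
#    define _ONEDPL_ATOMIC_INCREMENT(var) (std::atomic_ref((var)).fetch_add(1))
#elif defined(__GNUC__) || defined(__clang__)
// Pre-C++20 fallback: GCC/Clang atomic builtin on the raw storage.
#    define _ONEDPL_ATOMIC_INCREMENT(var) (__atomic_fetch_add(&(var), 1, __ATOMIC_RELAXED))
#endif
```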

After experimenting with a proof of concept of this approach, it seems that the atomic implementation has very limited
applicability to real cases. We explored a spectrum of numbers of elements combined with numbers of bins with both
OpenMP and TBB. There was some subset of cases for which the atomics implementation outperformed the proposed
implementation (below). However, this was generally limited to specific cases where the number of bins was very large
(~1 million), and even for this subset, significant benefit was only found for cases with a small number of input
elements relative to the number of bins. This makes sense because the atomic implementation is able to avoid the
overhead of allocating and initializing temporary histogram copies, which is largest when the number of bins is large
compared to the number of input elements. With many bins, contention on atomics is also limited as compared to the
embarrassingly parallel proposal, which does experience this contention.

When we examine the real-world utility of these cases, we find that they are uncommon and unlikely to be the important
use cases. Histograms are generally used to categorize large images or arrays into a smaller number of bins to
characterize the result. Cases with a similar or greater number of bins than input elements are rare in practice. The
maintenance and complexity cost associated with supporting a second implementation to serve this subset of cases does
not seem to be justified. Therefore, this implementation has been discarded at this time.

### Other Unexplored Approaches
* One could consider some sort of locking approach which locks mutexes for subsections of the output histogram prior to
  modifying them. It's possible such an approach could provide benefits similar to atomics, but with different overhead
  trade-offs. It seems quite likely that this would result in more overhead, but it could be worth exploring.

* Another possible approach could be to do something like the proposed implementation, but with some sparse
  representation of output data. However, I think the general assumptions we can make about the normal case make this
  less likely to be beneficial. It is quite likely that `n` is much larger than the output histograms, and that a large
  percentage of the output histogram may be occupied, even when considering dividing the input amongst multiple
  threads. This could be explored if we find temporary storage is too large for some cases and the atomic approach does
  not provide a good fallback.

## Proposal
After exploring the above implementation for `histogram`, the following proposal better represents the use cases which
are important, and provides reasonable performance for most cases.
144+
145+
### Embarrassingly Parallel Via Temporary Histograms
146+
This method uses temporary storage and a pair of calls to backend specific `parallel_for` functions to accomplish the
147+
`histogram`. These calls will use the existing infrastructure to provide properly composable parallelism, without extra
148+
histogram-specific patterns in the implementation of a backend.

This algorithm does, however, require that each parallel backend add an
`__enumerable_thread_local_storage<_StoredType>` struct which provides the following (a rough interface sketch follows
this list):
* a constructor which takes a variadic list of args to pass to the constructor of each thread's object
* `get_for_current_thread()` returns a reference to the current thread's stored object
* `get_with_id(int i)` returns a reference to the stored object for an index
* `size()` returns the number of stored objects
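
The interface described above might look roughly like the following declaration-only sketch. The member signatures are
assumptions for illustration; the actual storage details are backend-specific:

```cpp
#include <cstddef>

// Rough interface sketch only; actual member signatures and storage differ per backend
// (TBB wraps tbb::enumerable_thread_specific, OpenMP uses its own lazy per-thread storage).
template <typename _StoredType>
struct __enumerable_thread_local_storage
{
    // Arguments are forwarded to the constructor of each thread's object
    // (e.g. the number of bins and an initial value for a histogram).
    template <typename... _Args>
    __enumerable_thread_local_storage(_Args&&... __args);

    // Reference to the calling thread's stored object, created on first use.
    _StoredType& get_for_current_thread();

    // Reference to the stored object at a given index, used by the accumulation pass.
    _StoredType& get_with_id(int __i);

    // Number of stored objects (threads which have touched the storage).
    std::size_t size() const;
};
```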

In the TBB backend, this will use `enumerable_thread_specific` internally. For OpenMP, we implement our own similar
thread local storage, which will allocate and initialize the thread local storage at the first usage for each active
thread, similar to TBB. The serial backend will merely create a single copy of the temporary object for use. The serial
backend does not technically need any thread-specific storage, but to avoid special casing the serial backend, we use a
single copy of the histogram. In practice, our benchmarking reports little difference in performance between this
implementation and the original, which directly accumulated to the output histogram.
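
One hypothetical way the OpenMP lazy-initialization described above could be structured is sketched below. The struct
name, the tuple-based argument storage, and the slot-per-thread layout are all assumptions, not the actual oneDPL code:

```cpp
#include <memory>
#include <tuple>
#include <utility>
#include <vector>
#include <omp.h>

// Sketch: one slot per possible OpenMP thread, constructed lazily on first touch.
template <typename _StoredType, typename... _Args>
struct __omp_thread_local_storage_sketch
{
    std::tuple<_Args...> __ctor_args;                     // args replayed for each thread's object
    std::vector<std::unique_ptr<_StoredType>> __slots;    // one entry per possible thread

    __omp_thread_local_storage_sketch(_Args... __args)
        : __ctor_args(std::move(__args)...), __slots(omp_get_max_threads())
    {
    }

    _StoredType& get_for_current_thread()
    {
        auto& __slot = __slots[omp_get_thread_num()];
        if (!__slot) // first touch by this thread: allocate and initialize lazily
            __slot = std::apply(
                [](const auto&... __a) { return std::make_unique<_StoredType>(__a...); },
                __ctor_args);
        return *__slot;
    }
};
```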

With this new structure, we will use the following algorithm (a simplified sketch follows the list):

1) Run a `parallel_for` pattern which performs a `histogram` on the input sequence where each thread accumulates into
   its own temporary histogram returned by `__enumerable_thread_local_storage`. The parallelism is divided on the input
   element axis, and we rely upon the existing `parallel_for` to implement chunk size and thread composability.
2) Run a second `parallel_for` over the `histogram` output sequence which accumulates all temporary copies of the
   histogram created within `__enumerable_thread_local_storage` into the output histogram sequence. The parallelism is
   divided on the histogram bin axis, and each chunk loops through all temporary histograms to accumulate into the
   output histogram.
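
The following standalone sketch shows the shape of the two passes. It is illustrative only: it uses `std::thread` and a
fixed even-bin mapping in place of the oneDPL backends and `__enumerable_thread_local_storage`, and it assumes every
input value lies in `[first, last)`:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

void histogram_even_two_pass(const std::vector<float>& input, std::vector<std::uint32_t>& output,
                             float first, float last)
{
    const std::size_t num_bins = output.size();
    const std::size_t num_threads = std::max<std::size_t>(1, std::thread::hardware_concurrency());
    // One temporary histogram per thread (stand-in for the thread-local storage struct).
    std::vector<std::vector<std::uint32_t>> locals(num_threads,
                                                   std::vector<std::uint32_t>(num_bins, 0));

    // Pass 1: parallel over input elements; each thread fills only its own temporary histogram.
    {
        std::vector<std::thread> workers;
        const std::size_t chunk = (input.size() + num_threads - 1) / num_threads;
        for (std::size_t t = 0; t < num_threads; ++t)
            workers.emplace_back([&, t] {
                const std::size_t begin = t * chunk;
                const std::size_t end = std::min(input.size(), begin + chunk);
                for (std::size_t i = begin; i < end; ++i)
                {
                    auto bin = static_cast<std::size_t>((input[i] - first) * num_bins / (last - first));
                    ++locals[t][bin];
                }
            });
        for (auto& w : workers)
            w.join();
    }

    // Pass 2: parallel over bins; each bin sums the contributions of all temporary histograms.
    {
        std::vector<std::thread> workers;
        const std::size_t chunk = (num_bins + num_threads - 1) / num_threads;
        for (std::size_t t = 0; t < num_threads; ++t)
            workers.emplace_back([&, t] {
                const std::size_t begin = t * chunk;
                const std::size_t end = std::min(num_bins, begin + chunk);
                for (std::size_t b = begin; b < end; ++b)
                {
                    std::uint32_t sum = 0;
                    for (std::size_t s = 0; s < locals.size(); ++s)
                        sum += locals[s][b];
                    output[b] = sum;
                }
            });
        for (auto& w : workers)
            w.join();
    }
}
```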

With the overhead associated with this algorithm, the implementation of each `parallel_for` may fall back to a serial
implementation. It makes sense to include this as part of a future improvement of `parallel_for`, where a user could
provide extra information in the call to influence details of the backend implementation from the non-backend-specific
implementation code. Such details could include grain size or a functor to determine fallback to a serial
implementation.

### Temporary Memory Requirements
Both algorithms should have temporary memory complexity of `O(num_bins)`, and specifically will allocate `num_bins`
output histogram typed elements for each thread used. Depending on the number of input elements, all available threads
may not be used.

### Computational Complexity
#### Even Bin API
The proposed algorithm should have `O(N) + O(num_bins)` operations, where `N` is the number of input elements and
`num_bins` is the number of histogram bins.

#### Custom Range Bin API
The proposed algorithm should have `O(N * log(num_bins)) + O(num_bins)` operations, where `N` is the number of input
elements and `num_bins` is the number of histogram bins.
