# Host Backends Support for the Histogram APIs

## Introduction
The oneDPL library added histogram APIs, currently implemented only for device policies with the DPC++ backend. These
APIs are defined in the oneAPI Specification 1.4. Please see the
[oneAPI Specification](https://github.com/uxlfoundation/oneAPI-spec/blob/main/source/elements/oneDPL/source/parallel_api/algorithms.rst#parallel-algorithms)
for details. The host-side backends (serial, TBB, OpenMP) are not yet supported. This RFC proposes extending histogram
support to these backends.

The pull request for the proposed implementation exists [here](https://github.com/oneapi-src/oneDPL/pull/1974).

## Motivations
There are many use cases for a serial or parallel host-side implementation of histogram. Another motivation for adding
this support is simply to be compliant with the oneAPI specification.

## Design Considerations

### Key Requirements
Provide support for the `histogram` APIs with the following policies and backends:
- Policies: `seq`, `unseq`, `par`, `par_unseq`
- Backends: `serial`, `tbb`, `openmp`

Users have a choice of execution policies when calling oneDPL APIs, and also a number of backends to select from. It is
important that every combination of policy and backend supports the `histogram` APIs.

### Performance
Histogram algorithms typically involve minimal computation and are likely to be memory-bound. So, the implementation
prioritizes reducing memory accesses and minimizing temporary memory traffic.

For CPU backends, we will focus on input sizes ranging from 32K to 4M elements and 32 - 4k histogram bins. Smaller
inputs may be best suited for a serial histogram implementation, and very large inputs may be better suited for GPU
device targets. Histogram bin counts vary from use case to use case, but a common rule of thumb is to size the number
of bins to approximately the cube root of the number of input elements; for our input size range this gives roughly
32 - 256 bins. In practice, some users need more bins than that rough rule suggests. For this reason, we have selected
a histogram bin range of 32 - 4k.

### Memory Footprint
There are no guidelines here from the standard library as this is an extension API. Still, we will minimize memory
footprint where possible.

### Code Reuse
We want to minimize adding requirements for parallel backends to implement, and lift as much as possible to the
algorithm implementation level. We should be able to avoid adding a `__parallel_histogram` call in the individual
backends, and instead rely upon `__parallel_for`.

### SIMD / OpenMP SIMD Implementation
Currently, oneDPL relies upon OpenMP SIMD to provide its vectorization, which is designed to vectorize across loop
iterations; oneDPL does not directly use any intrinsics.

There are a few parts of the histogram algorithm to consider. First is the calculation that determines which bin to
increment. The two APIs, even bins and custom-range bins, use significantly different methods for this calculation. For
the even bin API, the bin calculation offers some opportunity for vectorization, as the same mathematical operations
are applied to each input element. For the custom range API, however, each input element performs a binary search
through a list of bin boundaries to determine the selected bin. This operation has a different length and control flow
for each input element and will be very difficult to vectorize.

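To make the contrast concrete, here is a minimal sketch of the two bin-selection strategies (hypothetical helper names
and simplified boundary handling, not the oneDPL source): the even-bin case is uniform arithmetic per element, while
the custom-range case is a data-dependent binary search.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Even bins: identical arithmetic for every element, so the compiler can
// vectorize the bin calculation across loop iterations.
inline std::size_t select_bin_even(float x, float first, float last, std::size_t num_bins)
{
    return std::min<std::size_t>(
        static_cast<std::size_t>((x - first) / (last - first) * num_bins), num_bins - 1);
}

// Custom ranges: binary search through the boundary list (num_bins + 1 entries,
// x assumed to lie within [boundaries.front(), boundaries.back())); control flow
// depends on the data, which makes vectorization difficult.
inline std::size_t select_bin_custom(float x, const std::vector<float>& boundaries)
{
    return static_cast<std::size_t>(
        std::upper_bound(boundaries.begin(), boundaries.end(), x) - boundaries.begin() - 1);
}
```
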
Next, let's consider the increment operation itself. This operation increments a data-dependent bin location and may
result in conflicts between elements of the same vector, so it is unvectorizable without more complex handling. Some
hardware implements SIMD conflict detection via specific intrinsics, but these are not available via OpenMP SIMD.
Alternatively, we could multiply the number of temporary histogram copies by the vector width, but it is unclear
whether that is worth the overhead. OpenMP SIMD also provides an `ordered` structured block which we could use to
exempt the increment from SIMD operation; however, this often causes the compiler to refuse vectorization altogether.
The initial implementation will avoid vectorizing this main histogram loop.

Last, for the implementation proposed below there is the task of combining the temporary histogram data into the
global output histogram. This is directly vectorizable via our existing brick_walk implementation and will be
vectorized when a vector policy is used.

### Serial Backend
We plan to support a serial backend for the histogram APIs in addition to OpenMP and TBB. This backend will handle all
policy types, but will always provide a serial, unvectorized implementation. To keep this backend compatible with the
other approaches, we will use a single temporary histogram copy, which is then copied to the final global histogram.
In our benchmarking, using a temporary copy performs similarly to initializing and then accumulating directly into the
output global histogram, so there is no performance-motivated reason to special-case the serial algorithm to use the
global histogram directly.

## Existing APIs / Patterns

### count_if
`histogram` is similar to `count_if` in that it conditionally increments a number of counters based upon the data in a
sequence. However, `count_if` relies upon the `transform_reduce` pattern internally; it returns a scalar-typed value
and doesn't provide any way to modify the variable being incremented. Using `count_if` without significant modification
would require us to loop through the entire sequence once for each output bin in the histogram. From a memory bandwidth
perspective, this is untenable. Similarly, using a `histogram` pattern to implement `count_if` is unlikely to perform
well, as contention would be far higher, and `transform_reduce` is a very well-matched pattern performance-wise.

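For concreteness, a hypothetical `count_if`-based even-bin histogram (illustration only, not proposed code) would look
like the sketch below; it reads the entire input once per bin, i.e. `O(N * num_bins)` memory traffic, which is why this
route is rejected.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Illustration of the rejected approach: one full pass over the input per bin.
std::vector<std::uint64_t> histogram_via_count_if(const std::vector<float>& input, float first,
                                                  float last, std::size_t num_bins)
{
    std::vector<std::uint64_t> hist(num_bins, 0);
    const float width = (last - first) / num_bins;
    for (std::size_t b = 0; b < num_bins; ++b)
    {
        const float lo = first + b * width;
        const float hi = lo + width;
        hist[b] = static_cast<std::uint64_t>(
            std::count_if(input.begin(), input.end(), [=](float x) { return x >= lo && x < hi; }));
    }
    return hist;
}
```
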
### parallel_for
`parallel_for` is an interesting pattern in that it is very generic and embarrassingly parallel. This is close to what
we need for `histogram`. However, we cannot simply use it without any added infrastructure. If we were to just use
`parallel_for` alone, there would be a race condition between threads when incrementing the values in the output
histogram. We should be able to use `parallel_for` as a building block for our implementation, but it requires some way
to synchronize and accumulate between threads.

## Alternative Approaches

### Atomics
This method uses atomic operations to remove the race conditions during accumulation. With atomic increments of the
output histogram data, we can merely run a `parallel_for` pattern.

To use atomics appropriately, we have some constraints. We must use either standard library atomics, atomics specific
to a backend, or custom atomics specific to a compiler. `C++17` provides `std::atomic<T>`; however, this can only
provide atomicity for data which is created with atomics in mind, which means allocating temporary data and then
copying it to the output data. `C++20` provides `std::atomic_ref<T>`, which would allow us to wrap user-provided output
data in an atomic wrapper, but we cannot assume `C++20` for all users. OpenMP provides atomic operations, but only for
the OpenMP backend. The working plan was to implement a macro like `_ONEDPL_ATOMIC_INCREMENT(var)` which uses
`std::atomic_ref` if available, and otherwise uses compiler builtins like `InterlockedAdd` or `__atomic_fetch_add_n`.
In a proof-of-concept implementation this seemed to work, but it reaches further into compiler- and OS-specific details
than is desirable for implementations prior to `C++20`.

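A minimal sketch of what such a macro might look like is shown below. The name `_ONEDPL_ATOMIC_INCREMENT` comes from
the text above, but the feature detection and fallbacks here are assumptions for illustration, not the actual oneDPL
implementation (an MSVC branch would use `InterlockedAdd`-style intrinsics).

```cpp
#include <atomic>

// Sketch only: atomically increment an integral bin counter, preferring C++20
// std::atomic_ref and falling back to GCC/Clang builtins otherwise.
#if defined(__cpp_lib_atomic_ref)
#    define _ONEDPL_ATOMIC_INCREMENT(var) (std::atomic_ref((var)).fetch_add(1, std::memory_order_relaxed))
#elif defined(__GNUC__) || defined(__clang__)
#    define _ONEDPL_ATOMIC_INCREMENT(var) (__atomic_fetch_add(&(var), 1, __ATOMIC_RELAXED))
#else
#    error "No atomic increment available for this compiler"
#endif
```
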
After experimenting with a proof of concept of this approach, it seems that the atomic implementation has very limited
applicability to real cases. We explored a spectrum of element counts combined with bin counts with both OpenMP and
TBB. There was some subset of cases for which the atomics implementation outperformed the proposed implementation
(below). However, this was generally limited to specific cases where the number of bins was very large (~1 million),
and even for this subset significant benefit was only found for cases with a small number of input elements relative
to the number of bins. This makes sense because the atomic implementation avoids the overhead of allocating and
initializing temporary histogram copies, which is largest when the number of bins is large compared to the number of
input elements. With many bins, contention on the atomics is also limited, in contrast to the embarrassingly parallel
proposal, which does experience this contention.

When we examine the real-world utility of these cases, we find that they are uncommon and unlikely to be the important
use cases. Histograms are generally used to categorize large images or arrays into a smaller number of bins to
characterize the result. Cases where there are as many or more bins than input elements are rare in practice. The
maintenance and complexity cost of supporting a second implementation to serve this subset of cases does not seem
justified. Therefore, this implementation has been discarded at this time.

### Other Unexplored Approaches
* One could consider a locking approach which locks mutexes for subsections of the output histogram prior to modifying
them. Such an approach could provide benefits similar to atomics, but with different overhead trade-offs. It seems
quite likely that this would result in more overhead, but it could be worth exploring.

* Another possible approach could be something like the proposed implementation below, but with some sparse
representation of the output data. However, the general assumptions we can make about the normal case make this less
likely to be beneficial. It is quite likely that `n` is much larger than the output histogram, and that a large
percentage of the output histogram is occupied, even when dividing the input amongst multiple threads. This could be
explored if we find temporary storage is too large for some cases and the atomic approach does not provide a good
fallback.

## Proposal
After exploring the alternatives above, the following proposal better addresses the important use cases for
`histogram` and provides reasonable performance for most of them.

### Embarrassingly Parallel Via Temporary Histograms
This method uses temporary storage and a pair of calls to backend-specific `parallel_for` functions to accomplish the
`histogram`. These calls use the existing infrastructure to provide properly composable parallelism, without extra
histogram-specific patterns in the implementation of a backend.

This algorithm does, however, require that each parallel backend add an
`__enumerable_thread_local_storage<_StoredType>` struct which provides the following (see the interface sketch after
this list):
* a constructor which takes a variadic list of args to pass to the constructor of each thread's object
* `get_for_current_thread()`, which returns a reference to the current thread's stored object
* `get_with_id(int i)`, which returns a reference to the stored object for an index
* `size()`, which returns the number of stored objects

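A declaration-only sketch of the expected interface is shown below; the exact signatures and storage details are left
to each backend, so treat the types here as assumptions for illustration.

```cpp
#include <cstddef>

// Per-backend helper described above; in our use case _StoredType is the
// temporary histogram accumulated by each thread.
template <typename _StoredType>
struct __enumerable_thread_local_storage
{
    // Arguments are forwarded to the constructor of each thread's object,
    // which is created lazily on first use by that thread.
    template <typename... _Args>
    explicit __enumerable_thread_local_storage(_Args&&... __args);

    // Reference to the calling thread's stored object.
    _StoredType& get_for_current_thread();

    // Reference to the stored object at index i (used by the accumulation pass).
    _StoredType& get_with_id(int i);

    // Number of stored objects created so far.
    std::size_t size() const;
};
```
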
In the TBB backend, this will use `enumerable_thread_specific` internally. For OpenMP, we implement our own similar
thread-local storage, which allocates and initializes each thread's local storage at its first use, similar to TBB.
The serial backend will merely create a single copy of the temporary object. It does not technically need any
thread-specific storage, but to avoid special-casing the serial backend, we use a single copy of the histogram. In
practice, our benchmarking reports little difference in performance between this implementation and the original,
which accumulated directly into the output histogram.

With this new structure we will use the following algorithm (a sketch follows the list):

1) Run a `parallel_for` pattern which performs a `histogram` on the input sequence, where each thread accumulates into
its own temporary histogram returned by `__enumerable_thread_local_storage`. The parallelism is divided on the input
element axis, and we rely upon the existing `parallel_for` to provide chunk size and thread composability.
2) Run a second `parallel_for` over the `histogram` output sequence which accumulates all temporary copies of the
histogram created within `__enumerable_thread_local_storage` into the output histogram sequence. The parallelism is
divided on the histogram bin axis, and each chunk loops through all temporary histograms to accumulate into the output
histogram.

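To make the two passes concrete, here is a self-contained, OpenMP-flavored sketch of the even-bin case. The raw OpenMP
loops and per-thread vectors stand in for the backend `parallel_for` and `__enumerable_thread_local_storage`; all names
here are illustrative assumptions, not the oneDPL implementation.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>
#include <omp.h>

void histogram_even_sketch(const std::vector<float>& input, float first, float last,
                           std::vector<std::uint64_t>& out_hist)
{
    const std::size_t num_bins = out_hist.size();
    // One temporary histogram per thread, playing the role of the
    // enumerable thread-local storage.
    std::vector<std::vector<std::uint64_t>> tmp(
        static_cast<std::size_t>(omp_get_max_threads()), std::vector<std::uint64_t>(num_bins, 0));

    // Pass 1: parallelize over input elements; each thread accumulates into its
    // own temporary histogram, so there is no race on the bin counters.
    #pragma omp parallel for
    for (std::int64_t i = 0; i < static_cast<std::int64_t>(input.size()); ++i)
    {
        const float x = input[i];
        if (x >= first && x < last)
        {
            const std::size_t bin = std::min<std::size_t>(
                static_cast<std::size_t>((x - first) / (last - first) * num_bins), num_bins - 1);
            ++tmp[static_cast<std::size_t>(omp_get_thread_num())][bin];
        }
    }

    // Pass 2: parallelize over bins; each bin sums the per-thread temporaries
    // into the output histogram.
    #pragma omp parallel for
    for (std::int64_t b = 0; b < static_cast<std::int64_t>(num_bins); ++b)
    {
        const std::size_t bin = static_cast<std::size_t>(b);
        for (const auto& t : tmp)
            out_hist[bin] += t[bin];
    }
}
```
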
Given the overhead associated with this algorithm, the implementation of each `parallel_for` may fall back to a serial
implementation. It makes sense to include this as part of a future improvement of `parallel_for`, where a caller could
provide extra information to influence details of the backend implementation from the non-backend-specific
implementation code. Such details might include a grain size or a functor that decides when to fall back to a serial
implementation.

### Temporary Memory Requirements
Both algorithms should have a temporary memory complexity of `O(num_bins)`; specifically, they will allocate `num_bins`
elements of the output histogram's type for each thread used. Depending on the number of input elements, not all
available threads may be used.

### Computational Complexity
#### Even Bin API
The proposed algorithm should have `O(N) + O(num_bins)` operations where `N` is the number of input elements, and
`num_bins` is the number of histogram bins.

#### Custom Range Bin API
The proposed algorithm should have `O(N * log(num_bins)) + O(num_bins)` operations where `N` is the number of input
elements, and `num_bins` is the number of histogram bins.