|
| 1 | +Parallel STL Usage Instructions |
| 2 | +################################ |
| 3 | +Follow these steps to use Parallel STL. |
| 4 | + |
| 5 | +Use the DPC++ Policy |
| 6 | +===================== |
| 7 | +The DPC++ execution policy specifies where and how a Parallel STL algorithm runs. It inherits a standard C++ execution policy, encapsulates a SYCL* device or queue, and allows you to set an optional kernel name. DPC++ execution policies can be used with all standard C++ algorithms that support execution policies according to the ISO/IEC 14882:2017 standard. |
| 8 | + |
| 9 | +Use the policy: |
| 10 | + |
| 11 | +#. Add ``#include <dpstd/execution>`` to your code. |
| 12 | +#. Create a policy object by providing a standard policy type (currently, only the ``parallel_unsequenced_policy`` is supported), a class type for a unique kernel name as a template argument (this is optional, see the **Note** below), and one of the following constructor arguments: |
| 13 | + |
| 14 | + a. a SYCL queue; |
| 15 | + #. a SYCL device; |
| 16 | + #. a SYCL device selector; |
| 17 | + #. an existing policy object with a different kernel name. |
| 18 | + |
| 19 | +#. Pass the created policy object to a Parallel STL algorithm. |
| 20 | + |
| 21 | +``dpstd::execution::default_policy`` object is a predefined object of the ``device_policy`` class created with default kernel name and default queue. Use it to create customized policy objects, or pass directly when invoking an algorithm. |
| 22 | + |
| 23 | +:Note: Providing a kernel name for a policy is optional if the host code used to invoke the kernel is compiled with the Intel® oneAPI DPC++ Compiler. Otherwise you can instead add the ``-fsycl-unnamed-lambda`` option to the compilation command. This compilation option is required if you use the ``dpstd::execution::default_policy`` policy object in the code. |
| 24 | + |
| 25 | +DPC++ Policy Usage Examples |
| 26 | +============================ |
| 27 | +Code examples below assume ``using namespace dpstd::execution;`` and ``using namespace cl::sycl;`` directives when refer to policy classes and functions: |
| 28 | + |
| 29 | +.. code:: cpp |
| 30 | +
|
| 31 | + auto policy_a = device_policy<parallel_unsequenced_policy, class PolicyA> {queue{}}; |
| 32 | + std::for_each(policy_a, …); |
| 33 | + |
| 34 | +.. code:: cpp |
| 35 | +
|
| 36 | + auto policy_b = device_policy<parallel_unsequenced_policy, class PolicyB> {device{gpu_selector{}}}; |
| 37 | + std::for_each(policy_b, …); |
| 38 | +
|
| 39 | +.. code:: cpp |
| 40 | +
|
| 41 | + auto policy_c = device_policy<parallel_unsequenced_policy, class PolicyС> {default_selector{}}; |
| 42 | + std::for_each(policy_c, …); |
| 43 | +
|
| 44 | +.. code:: cpp |
| 45 | +
|
| 46 | + auto policy_d = make_device_policy<class PolicyD>(default_policy); |
| 47 | + std::for_each(policy_d, …); |
| 48 | +
|
| 49 | +.. code:: cpp |
| 50 | +
|
| 51 | + auto policy_e = make_device_policy<class PolicyE>(queue{}); |
| 52 | + std::for_each(policy_e, …); |
| 53 | +
|
| 54 | +.. code:: cpp |
| 55 | +
|
| 56 | + auto policy_f = make_device_policy<class PolicyF>(queue{property::queue::in_order()}); |
| 57 | + std::for_each(policy_f, …); |
| 58 | +
|
| 59 | +Use the FPGA policy |
| 60 | +==================== |
| 61 | +The ``fpga_device_policy`` class is a DPC++ policy tailored to achieve better performance of parallel algorithms on FPGA hardware devices. |
| 62 | + |
| 63 | +Use the policy when you're going to run the application on FPGA hardware device or FPGA emulation device: |
| 64 | + |
| 65 | +#. Define the ``_PSTL_FPGA_DEVICE`` macro to run on FPGA devices and additionally ``_PSTL_FPGA_EMU`` to run on FPGA emulation device. |
| 66 | +#. Add ``#include <dpstd/execution>`` to your code. |
| 67 | +#. Create a policy object by providing a class type for a unique kernel name and an unroll factor (see the **Note** below) as template arguments (both optional), and one of the following constructor arguments: |
| 68 | + |
| 69 | + a. A SYCL queue constructed for `the FPGA selector <https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/IntelFPGA/FPGASelector.md>`_ (the behavior is undefined with any other queue). |
| 70 | + #. An existing FPGA policy object with a different kernel name and/or unroll factor. |
| 71 | + |
| 72 | +#. Pass the created policy object to a Parallel STL algorithm. |
| 73 | + |
| 74 | +The default constructor of ``fpga_device_policy`` creates an object with a SYCL queue constructed for ``fpga_selector``, or for ``fpga_emulator_selector`` if ``_PSTL_FPGA_EMU`` is defined. |
| 75 | + |
| 76 | +``dpstd::execution::fpga_policy`` is a predefined object of the ``fpga_device_policy`` class created with default kernel name and default unroll factor. Use it to create customized policy objects, or pass directly when invoking an algorithm. |
| 77 | + |
| 78 | +:Note: Specifying unroll factor for a policy enables loop unrolling in the implementation of algorithms. Default value is 1. To find out how to choose a better value, you can refer to `unroll Pragma <https://software.intel.com/en-us/oneapi-fpga-optimization-guide-unroll-pragma>`_ and `Loops Analysis <https://software.intel.com/en-us/oneapi-fpga-optimization-guide-loops-analysis>`_ chapters of the `Intel(R) oneAPI DPC++ FPGA Optimization Guide <https://software.intel.com/en-us/oneapi-fpga-optimization-guide>`_. |
| 79 | + |
| 80 | +FPGA Policy Usage Examples |
| 81 | +=========================== |
| 82 | +The code below assumes ``using namespace dpstd::execution;`` for policies and ``using namespace cl::sycl;`` for queues and device selectors: |
| 83 | + |
| 84 | +.. code:: cpp |
| 85 | +
|
| 86 | + auto fpga_policy_a = fpga_device_policy<class FPGAPolicyA>{}; |
| 87 | + auto fpga_policy_b = make_fpga_policy(queue{intel::fpga_selector{}}); |
| 88 | + constexpr auto unroll_factor = 8; |
| 89 | + auto fpga_policy_c = make_fpga_policy<class FPGAPolicyC, unroll_factor>(fpga_policy); |
| 90 | +
|
| 91 | +Include Parallel STL Header Files |
| 92 | +================================== |
| 93 | +To include Parallel STL header files, add a subset of the following set of lines. These lines are dependent on the algorithms you intend to use: |
| 94 | + |
| 95 | +- ``#include <dpstd/algorithm>`` |
| 96 | +- ``#include <dpstd/numeric>`` |
| 97 | +- ``#include <dpstd/memory>`` |
| 98 | + |
| 99 | +Use dpstd::begin, dpstd::end Functions |
| 100 | +======================================= |
| 101 | + |
| 102 | +The ``dpstd::begin`` and ``dpstd::end`` are special helper functions that allow you to pass SYCL buffers to Parallel STL algorithms. These functions accept a SYCL buffer and return an object of an unspecified type that satisfies the following requirements: |
| 103 | + |
| 104 | +- Is ``CopyConstructible``, ``CopyAssignable``, and comparable with operators == and !=. |
| 105 | +- The following expressions are valid: ``a + n``, ``a - n``, and ``a - b``, where ``a`` and ``b`` are objects of the type, and ``n`` is an integer value. |
| 106 | +- Has a ``get_buffer`` method with no arguments. The method returns the SYCL buffer passed to ``dpstd::begin`` and ``dpstd::end`` functions. |
| 107 | + |
| 108 | +To use the functions, add ``#include <dpstd/iterator>`` to your code. |
| 109 | + |
| 110 | +Example: |
| 111 | + |
| 112 | +.. code:: cpp |
| 113 | +
|
| 114 | + #include <CL/sycl.hpp> |
| 115 | + #include <dpstd/execution> |
| 116 | + #include <dpstd/algorithm> |
| 117 | + #include <dpstd/iterator> |
| 118 | + int main(){ |
| 119 | + cl::sycl::queue q; |
| 120 | + cl::sycl::buffer<int> buf { 1000 }; |
| 121 | + auto buf_begin = dpstd::begin(buf); |
| 122 | + auto buf_end = dpstd::end(buf); |
| 123 | + auto policy = dpstd::execution::make_device_policy<class fill>( q ); |
| 124 | + std::fill(policy, buf_begin, buf_end, 42); |
| 125 | + return 0; |
| 126 | + } |
| 127 | +
|
| 128 | +:Note: Parallel STL algorithms can be called with ordinary (host-side) iterators, as seen in the code example below. In this case, a temporary SYCL buffer is created and the data is copied to this buffer. After processing of the temporary buffer on a device is complete, the data is copied back to the host. Working with SYCL buffers is recommended to reduce data copying between the host and device. |
| 129 | + |
| 130 | +Example: |
| 131 | + |
| 132 | +.. code:: cpp |
| 133 | +
|
| 134 | + #include <vector> |
| 135 | + #include <dpstd/execution> |
| 136 | + #include <dpstd/algorithm> |
| 137 | + int main(){ |
| 138 | + std::vector<int> v( 1000000 ); |
| 139 | + std::fill(dpstd::execution::default_policy, v.begin(), v.end(), 42); |
| 140 | + // each element of vec equals to 42 |
| 141 | + return 0; |
| 142 | + } |
| 143 | +
|
| 144 | +Use Parallel STL with Unified Shared Memory (USM) |
| 145 | +================================================== |
| 146 | +The following examples demonstrate two ways to use the Parallel STL algorithms with USM: |
| 147 | + |
| 148 | +- USM pointers |
| 149 | +- USM allocators |
| 150 | + |
| 151 | +If you have a USM-allocated buffer, pass the pointers to the start and past the end of the buffer to a parallel algorithm. Make sure that the execution policy and the buffer were created for the same queue or context. |
| 152 | + |
| 153 | +If the same buffer is processed by several algorithms, either use an ordered queue or explicitly wait for completion of each algorithm before passing the buffer to the next one. Also wait for completion before accessing the data at the host. |
| 154 | + |
| 155 | +.. code:: cpp |
| 156 | +
|
| 157 | + #include <CL/sycl.hpp> |
| 158 | + #include <dpstd/execution> |
| 159 | + #include <dpstd/algorithm> |
| 160 | + int main(){ |
| 161 | + cl::sycl::queue q; |
| 162 | + const int n = 1000; |
| 163 | + int* d_head = static_cast<int*>(cl::sycl::malloc_device(n * sizeof(int), |
| 164 | + q.get_device(), q.get_context())); |
| 165 | +
|
| 166 | + std::fill(dpstd::execution::make_device_policy(q), d_head, d_head + n, 42); |
| 167 | + q.wait(); |
| 168 | + cl::sycl::free(d_head, q.get_context()); |
| 169 | + return 0; |
| 170 | + } |
| 171 | +
|
| 172 | +Alternatively, use ``std::vector`` with a USM allocator: |
| 173 | + |
| 174 | +.. code:: cpp |
| 175 | +
|
| 176 | + #include <CL/sycl.hpp> |
| 177 | + #include <dpstd/execution> |
| 178 | + #include <dpstd/algorithm> |
| 179 | + int main(){ |
| 180 | + cl::sycl::queue q; |
| 181 | + const int n = 1000; |
| 182 | + cl::sycl::usm_allocator<int, cl::sycl::usm::alloc::shared> alloc(q.get_context(), q.get_device()); |
| 183 | + std::vector<int, decltype(alloc)> vec(n, alloc); |
| 184 | +
|
| 185 | + std::fill(dpstd::execution::make_device_policy(q), vec.begin(), vec.end(), 42); |
| 186 | + q.wait(); |
| 187 | +
|
| 188 | + return 0; |
| 189 | + } |
| 190 | +
|
| 191 | +Error handling with DPC++ execution policies |
| 192 | +============================================= |
| 193 | +The DPC++ error handling model supports two types of errors. In case of *synchronous* errors DPC++ host runtime libraries throw exceptions, while *asynchronous* errors may only be processed in a user-supplied error handler associated with a DPC++ queue. |
| 194 | + |
| 195 | +For Parallel STL algorithms executed with DPC++ policies, handling all errors, synchronous or asynchronous, is a responsibility of the caller. |
| 196 | +Specifically, |
| 197 | + |
| 198 | +* no exceptions are thrown explicitly by algorithms; |
| 199 | +* exceptions thrown by runtime libraries at the host CPU, including DPC++ synchronous exceptions, are passed through to the caller; |
| 200 | +* DPC++ asynchronous errors are not handled. |
| 201 | + |
| 202 | +In order to process DPC++ asynchronous errors, the queue associated with a DPC++ policy must be created with an error handler object. |
| 203 | +The predefined policy objects (``default_policy`` etc.) have no error handlers; do not use those if you need to process asynchronous errors. |
| 204 | + |
| 205 | +Additional Macros |
| 206 | +================== |
| 207 | + |
| 208 | +================================= ============================== |
| 209 | +Macro Description |
| 210 | +================================= ============================== |
| 211 | +``_PSTL_BACKEND_SYCL`` This macro enables the use of the DPC++ policy. (This is enabled by default when compiling with the Intel® oneAPI DPC++ Compiler, otherwise it is disabled.) |
| 212 | +--------------------------------- ------------------------------ |
| 213 | +``_PSTL_FPGA_DEVICE`` Use this macro to build your code containing Parallel STL algorithms for FPGA devices. (Disabled by default.) |
| 214 | +--------------------------------- ------------------------------ |
| 215 | +``_PSTL_FPGA_EMU`` Use this macro to build your code containing Parallel STL algorithms for FPGA emulation device. (Disabled by default.) |
| 216 | +--------------------------------- ------------------------------ |
| 217 | +``_PSTL_COMPILE_KERNEL`` Use this macro to get rid of the ``CL_OUT_OF_RESOURCES`` exception that may occur during some invocations of Parallel STL algorithms on CPU and FPGA devices. The macro may increase the execution time of the algorithms. (Enabled by default.) |
| 218 | +================================= ============================== |
| 219 | + |
| 220 | +:Note: Define both ``_PSTL_FPGA_DEVICE`` and ``_PSTL_FPGA_EMU`` macros in the same application to run on FPGA emulation device. To run on FPGA hardware device only ``_PSTL_FPGA_DEVICE`` should be defined. |
| 221 | + |
| 222 | +Build Your Code with Parallel STL for DPC++ |
| 223 | +============================================ |
| 224 | +Use these steps to build your code with Parallel STL for DPC++. |
| 225 | + |
| 226 | +#. To build with the Intel® oneAPI DPC++ Compiler, see the Get Started with the Intel® oneAPI DPC++ Compiler for details. |
| 227 | +#. Set the environment for oneAPI Data Parallel C++ Library and oneAPI Threading Building Blocks. |
| 228 | +#. To avoid naming device policy objects explicitly, add the ``–fsycl-unnamed-lambda`` option. |
| 229 | + |
| 230 | +Below is an example of a command line used to compile code that contains Parallel STL algorithms on Linux (depending on the code, parameters within [] could be unnecessary): |
| 231 | + |
| 232 | +.. code:: |
| 233 | +
|
| 234 | + dpcpp [–fsycl-unnamed-lambda] test.cpp [-ltbb] -o test |
0 commit comments