Commit 8697af5
Add library guide
Signed-off-by: gejin <[email protected]>
1 parent a140169

5 files changed: +664 -0 lines changed
Before You Begin
#################

Before you begin content for the Intel® oneAPI DPC++ Library (oneDPL).

Install the `Intel® oneAPI Base Toolkit <https://software.intel.com/en-us/oneapi/base-kit>`_ to use oneDPL.

To use Parallel STL or the Extension API, include the corresponding header files in your source code. All oneDPL header files are in the ``dpstd`` directory; use ``#include <dpstd/…>`` to include them. The Intel® oneAPI DPC++ Library uses the ``dpstd`` namespace for the Extension API classes and functions.

To use tested C++ standard APIs, include the corresponding C++ standard header files and use the ``std`` namespace.
Extension API
################################

A list of the Extension API algorithms and functional utility classes.

The Extension API currently includes six algorithms and three functional utility classes. The algorithms include segmented reduce, segmented scan, and vectorized search algorithms. Detailed descriptions of each algorithm are provided below.

- reduce_by_segment

  The ``reduce_by_segment`` algorithm performs partial reductions on a sequence's values and keys. Each reduction is computed with a given reduction operation for a contiguous subsequence of values, which is determined by keys being equal according to a predicate. The return value is a pair of iterators holding the end of the output sequences for keys and values.

  For correct computation, the reduction operation should be associative. If no operation is specified, the default reduction operation is ``std::plus``, and the default predicate is ``std::equal_to``. The algorithm requires that the type of the elements used for values be default constructible.

  Example::

    keys:          [0,0,0,1,1,1]
    values:        [1,2,3,4,5,6]
    output_keys:   [0,1]
    output_values: [1+2+3=6, 4+5+6=15]
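The semantics above can be sketched with a plain serial C++ reference (illustrative only: ``reduce_by_segment_ref`` and its signature are our own, not oneDPL's; the library algorithm takes an execution policy and iterators and runs in parallel):

.. code:: cpp

   #include <cassert>
   #include <functional>
   #include <utility>
   #include <vector>

   // Serial reference for the reduce_by_segment semantics.
   template <typename K, typename T>
   std::pair<std::vector<K>, std::vector<T>>
   reduce_by_segment_ref(const std::vector<K>& keys, const std::vector<T>& vals,
                         std::equal_to<K> pred = {}, std::plus<T> op = {}) {
       std::vector<K> out_keys;
       std::vector<T> out_vals;
       for (std::size_t i = 0; i < keys.size(); ++i) {
           if (i == 0 || !pred(keys[i - 1], keys[i])) {  // a new segment starts
               out_keys.push_back(keys[i]);
               out_vals.push_back(vals[i]);
           } else {                                      // fold into the current segment
               out_vals.back() = op(out_vals.back(), vals[i]);
           }
       }
       return {out_keys, out_vals};
   }

   int main() {
       auto r = reduce_by_segment_ref<int, int>({0, 0, 0, 1, 1, 1}, {1, 2, 3, 4, 5, 6});
       assert((r.first == std::vector<int>{0, 1}));
       assert((r.second == std::vector<int>{6, 15}));
       return 0;
   }
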
- inclusive_scan_by_segment

  The ``inclusive_scan_by_segment`` algorithm performs partial prefix scans on a sequence's values. Each scan applies to a contiguous subsequence of values, which is determined by the keys associated with the values being equal. The return value is an iterator targeting the end of the result sequence.

  For correct computation, the prefix scan operation should be associative. If no operation is specified, the default operation is ``std::plus``, and the default predicate is ``std::equal_to``. The algorithm requires that the type of the elements used for values be default constructible.

  Example::

    keys:   [0,0,0,1,1,1]
    values: [1,2,3,4,5,6]
    result: [1, 1+2=3, 1+2+3=6, 4, 4+5=9, 4+5+6=15]
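A serial sketch of these semantics with the default ``std::plus`` operation and ``std::equal_to`` predicate (the function name and signature are our own, for illustration only):

.. code:: cpp

   #include <cassert>
   #include <vector>

   // Serial reference for the inclusive_scan_by_segment semantics.
   std::vector<int> inclusive_scan_by_segment_ref(const std::vector<int>& keys,
                                                  const std::vector<int>& vals) {
       std::vector<int> res(vals.size());
       for (std::size_t i = 0; i < vals.size(); ++i)
           res[i] = (i > 0 && keys[i - 1] == keys[i])
                        ? res[i - 1] + vals[i]  // continue the running sum
                        : vals[i];              // a new segment restarts the scan
       return res;
   }

   int main() {
       auto r = inclusive_scan_by_segment_ref({0, 0, 0, 1, 1, 1}, {1, 2, 3, 4, 5, 6});
       assert((r == std::vector<int>{1, 3, 6, 4, 9, 15}));
       return 0;
   }
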
- exclusive_scan_by_segment

  The ``exclusive_scan_by_segment`` algorithm performs partial prefix scans on a sequence's values. Each scan applies to a contiguous subsequence of values that is determined by the keys associated with the values being equal, and sets the first element of each subsequence to the initial value provided. The return value is an iterator targeting the end of the result sequence.

  For correct computation, the prefix scan operation should be associative. If no operation is specified, the default operation is ``std::plus``, and the default predicate is ``std::equal_to``.

  Example::

    keys:          [0,0,0,1,1,1]
    values:        [1,2,3,4,5,6]
    initial value: [0]
    result:        [0, 0+1=1, 0+1+2=3, 0, 0+4=4, 0+4+5=9]
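A serial sketch of the exclusive variant (our own illustration, assuming the default ``std::plus`` and ``std::equal_to``): each segment starts from the initial value, and every later position holds the sum of the segment's values before it.

.. code:: cpp

   #include <cassert>
   #include <vector>

   // Serial reference for the exclusive_scan_by_segment semantics.
   std::vector<int> exclusive_scan_by_segment_ref(const std::vector<int>& keys,
                                                  const std::vector<int>& vals,
                                                  int init) {
       std::vector<int> res(vals.size());
       for (std::size_t i = 0; i < vals.size(); ++i)
           res[i] = (i > 0 && keys[i - 1] == keys[i])
                        ? res[i - 1] + vals[i - 1]  // sum excludes the current element
                        : init;                     // each segment starts from init
       return res;
   }

   int main() {
       auto r = exclusive_scan_by_segment_ref({0, 0, 0, 1, 1, 1}, {1, 2, 3, 4, 5, 6}, 0);
       assert((r == std::vector<int>{0, 1, 3, 0, 4, 9}));
       return 0;
   }
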
- binary_search

  The ``binary_search`` algorithm performs a binary search of the input sequence for each of the values in the search sequence provided. The result of the search for the *i-th* element of the search sequence, a boolean value indicating whether the search value was found in the input sequence, is assigned to the *i-th* element of the result sequence. The algorithm returns an iterator that points to one past the last element of the result sequence that was assigned a result. The algorithm assumes the input sequence has been sorted by the comparator provided. If no comparator is provided, a function object that uses ``operator<`` to compare the elements is used.

  Example::

    input sequence:  [0, 2, 2, 2, 3, 3, 3, 3, 6, 6]
    search sequence: [0, 2, 4, 7, 6]
    result sequence: [true, true, false, false, true]
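Serially, this is equivalent to one ``std::binary_search`` per element of the search sequence (a sketch; ``binary_search_ref`` is our own name, not the library's, and the real algorithm runs the searches in parallel):

.. code:: cpp

   #include <algorithm>
   #include <cassert>
   #include <vector>

   // Serial reference for the vectorized binary_search semantics.
   std::vector<bool> binary_search_ref(const std::vector<int>& in,
                                       const std::vector<int>& search) {
       std::vector<bool> res;
       for (int s : search)
           res.push_back(std::binary_search(in.begin(), in.end(), s));
       return res;
   }

   int main() {
       std::vector<int> in{0, 2, 2, 2, 3, 3, 3, 3, 6, 6};  // must be sorted
       auto r = binary_search_ref(in, {0, 2, 4, 7, 6});
       assert((r == std::vector<bool>{true, true, false, false, true}));
       return 0;
   }
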
- lower_bound

  The ``lower_bound`` algorithm performs a binary search of the input sequence for each value in the search sequence provided, to identify the lowest index in the input sequence where the search value could be inserted without violating the ordering provided by the comparator used to sort the input sequence. That index, the result of the search for the *i-th* element of the search sequence, is assigned to the *i-th* element of the result sequence. The algorithm returns an iterator that points to one past the last element of the result sequence that was assigned a result. If no comparator is provided, a function object that uses ``operator<`` to compare the elements is used.

  Example::

    input sequence:  [0, 2, 2, 2, 3, 3, 3, 3, 6, 6]
    search sequence: [0, 2, 4, 7, 6]
    result sequence: [0, 1, 8, 10, 8]
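Serially, each result is the index returned by ``std::lower_bound`` for one search value (a sketch with our own helper name, not the library API):

.. code:: cpp

   #include <algorithm>
   #include <cassert>
   #include <cstddef>
   #include <vector>

   // Serial reference for the vectorized lower_bound semantics.
   std::vector<std::ptrdiff_t> lower_bound_ref(const std::vector<int>& in,
                                               const std::vector<int>& search) {
       std::vector<std::ptrdiff_t> res;
       for (int s : search)  // index of the first position where s could be inserted
           res.push_back(std::lower_bound(in.begin(), in.end(), s) - in.begin());
       return res;
   }

   int main() {
       std::vector<int> in{0, 2, 2, 2, 3, 3, 3, 3, 6, 6};  // must be sorted
       auto r = lower_bound_ref(in, {0, 2, 4, 7, 6});
       assert((r == std::vector<std::ptrdiff_t>{0, 1, 8, 10, 8}));
       return 0;
   }
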
- upper_bound

  The ``upper_bound`` algorithm performs a binary search of the input sequence for each value in the search sequence provided, to identify the highest index in the input sequence where the search value could be inserted without violating the ordering provided by the comparator used to sort the input sequence. That index, the result of the search for the *i-th* element of the search sequence, is assigned to the *i-th* element of the result sequence. The algorithm returns an iterator that points to one past the last element of the result sequence that was assigned a result. If no comparator is provided, a function object that uses ``operator<`` to compare the elements is used.

  Example::

    input sequence:  [0, 2, 2, 2, 3, 3, 3, 3, 6, 6]
    search sequence: [0, 2, 4, 7, 6]
    result sequence: [1, 4, 8, 10, 10]
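Serially, each result is the index returned by ``std::upper_bound`` for one search value (again a sketch with our own helper name):

.. code:: cpp

   #include <algorithm>
   #include <cassert>
   #include <cstddef>
   #include <vector>

   // Serial reference for the vectorized upper_bound semantics.
   std::vector<std::ptrdiff_t> upper_bound_ref(const std::vector<int>& in,
                                               const std::vector<int>& search) {
       std::vector<std::ptrdiff_t> res;
       for (int s : search)  // index of the last position where s could be inserted
           res.push_back(std::upper_bound(in.begin(), in.end(), s) - in.begin());
       return res;
   }

   int main() {
       std::vector<int> in{0, 2, 2, 2, 3, 3, 3, 3, 6, 6};  // must be sorted
       auto r = upper_bound_ref(in, {0, 2, 4, 7, 6});
       assert((r == std::vector<std::ptrdiff_t>{1, 4, 8, 10, 10}));
       return 0;
   }
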
Here are the details for the functional utility classes:

- ``identity``: A C++11 implementation of the C++20 ``std::identity`` function object type, whose ``operator()`` returns its argument unchanged.
- ``minimum``: A function object type whose ``operator()`` applies ``std::less`` to its arguments, then returns the lesser argument unchanged.
- ``maximum``: A function object type whose ``operator()`` applies ``std::greater`` to its arguments, then returns the greater argument unchanged.
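These three classes can be illustrated with a minimal C++11-style sketch (our own, not the oneDPL source; the actual library types live in the ``dpstd`` namespace):

.. code:: cpp

   #include <cassert>
   #include <functional>
   #include <utility>

   // Minimal sketches of the three functional utility classes.
   struct identity {
       template <typename T>
       T&& operator()(T&& t) const noexcept { return std::forward<T>(t); }
   };

   struct minimum {
       template <typename T>
       const T& operator()(const T& a, const T& b) const {
           return std::less<T>()(a, b) ? a : b;  // lesser argument, unchanged
       }
   };

   struct maximum {
       template <typename T>
       const T& operator()(const T& a, const T& b) const {
           return std::greater<T>()(a, b) ? a : b;  // greater argument, unchanged
       }
   };

   int main() {
       assert(identity()(42) == 42);
       assert(minimum()(3, 7) == 3);
       assert(maximum()(3, 7) == 7);
       return 0;
   }

Such function objects are useful as the reduction operation of algorithms like ``reduce_by_segment``.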

documentation/library_guide/pstl.rst

Parallel STL
#############

An introduction to Parallel STL.

Parallel STL offers efficient support for both parallel and vectorized execution of algorithms on Intel® processors. For sequential execution on a host computer, it relies on an available implementation of the C++ Standard Library.

Visit the `Release Notes <https://software.intel.com/en-us/articles/intel-oneapi-dpcpp-library-release-notes-beta>`_ page for:

- Where to Find the Release
- Overview
- New in this Release
- Known Issues

Read the article `Get Started with Parallel STL <https://software.intel.com/en-us/articles/get-started-with-parallel-stl>`_ for prerequisites and general information related to Parallel STL.

Parallel STL for DPC++ is extended with support for DPC++ compliant devices, special DPC++ execution policies, and the ``dpstd::begin`` and ``dpstd::end`` functions.
Parallel STL Usage Instructions
################################

Follow these steps to use Parallel STL.

Use the DPC++ Policy
=====================

The DPC++ execution policy specifies where and how a Parallel STL algorithm runs. It inherits a standard C++ execution policy, encapsulates a SYCL* device or queue, and allows you to set an optional kernel name. DPC++ execution policies can be used with all standard C++ algorithms that support execution policies according to the ISO/IEC 14882:2017 standard.

To use the policy:

#. Add ``#include <dpstd/execution>`` to your code.
#. Create a policy object by providing a standard policy type (currently, only ``parallel_unsequenced_policy`` is supported), a class type for a unique kernel name as a template argument (optional, see the **Note** below), and one of the following constructor arguments:

   a. a SYCL queue;
   #. a SYCL device;
   #. a SYCL device selector;
   #. an existing policy object with a different kernel name.

#. Pass the created policy object to a Parallel STL algorithm.

The ``dpstd::execution::default_policy`` object is a predefined object of the ``device_policy`` class, created with the default kernel name and default queue. Use it to create customized policy objects, or pass it directly when invoking an algorithm.

:Note: Providing a kernel name for a policy is optional if the host code used to invoke the kernel is compiled with the Intel® oneAPI DPC++ Compiler; otherwise, add the ``-fsycl-unnamed-lambda`` option to the compilation command. This option is required if you use the ``dpstd::execution::default_policy`` policy object in your code.
DPC++ Policy Usage Examples
============================

The code examples below assume ``using namespace dpstd::execution;`` and ``using namespace cl::sycl;`` directives when referring to policy classes and functions:

.. code:: cpp

   auto policy_a = device_policy<parallel_unsequenced_policy, class PolicyA>{queue{}};
   std::for_each(policy_a, …);

.. code:: cpp

   auto policy_b = device_policy<parallel_unsequenced_policy, class PolicyB>{device{gpu_selector{}}};
   std::for_each(policy_b, …);

.. code:: cpp

   auto policy_c = device_policy<parallel_unsequenced_policy, class PolicyC>{default_selector{}};
   std::for_each(policy_c, …);

.. code:: cpp

   auto policy_d = make_device_policy<class PolicyD>(default_policy);
   std::for_each(policy_d, …);

.. code:: cpp

   auto policy_e = make_device_policy<class PolicyE>(queue{});
   std::for_each(policy_e, …);

.. code:: cpp

   auto policy_f = make_device_policy<class PolicyF>(queue{property::queue::in_order()});
   std::for_each(policy_f, …);
Use the FPGA Policy
====================

The ``fpga_device_policy`` class is a DPC++ policy tailored to achieve better performance of parallel algorithms on FPGA hardware devices.

Use the policy when you are going to run the application on an FPGA hardware device or the FPGA emulation device:

#. Define the ``_PSTL_FPGA_DEVICE`` macro to run on FPGA devices, and additionally ``_PSTL_FPGA_EMU`` to run on the FPGA emulation device.
#. Add ``#include <dpstd/execution>`` to your code.
#. Create a policy object by providing a class type for a unique kernel name and an unroll factor (see the **Note** below) as template arguments (both optional), and one of the following constructor arguments:

   a. A SYCL queue constructed for `the FPGA selector <https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/IntelFPGA/FPGASelector.md>`_ (the behavior is undefined with any other queue).
   #. An existing FPGA policy object with a different kernel name and/or unroll factor.

#. Pass the created policy object to a Parallel STL algorithm.

The default constructor of ``fpga_device_policy`` creates an object with a SYCL queue constructed for ``fpga_selector``, or for ``fpga_emulator_selector`` if ``_PSTL_FPGA_EMU`` is defined.

``dpstd::execution::fpga_policy`` is a predefined object of the ``fpga_device_policy`` class, created with the default kernel name and default unroll factor. Use it to create customized policy objects, or pass it directly when invoking an algorithm.

:Note: Specifying the unroll factor for a policy enables loop unrolling in the implementation of algorithms. The default value is 1. To find out how to choose a better value, refer to the `unroll Pragma <https://software.intel.com/en-us/oneapi-fpga-optimization-guide-unroll-pragma>`_ and `Loops Analysis <https://software.intel.com/en-us/oneapi-fpga-optimization-guide-loops-analysis>`_ chapters of the `Intel® oneAPI DPC++ FPGA Optimization Guide <https://software.intel.com/en-us/oneapi-fpga-optimization-guide>`_.

FPGA Policy Usage Examples
===========================

The code below assumes ``using namespace dpstd::execution;`` for policies and ``using namespace cl::sycl;`` for queues and device selectors:

.. code:: cpp

   auto fpga_policy_a = fpga_device_policy<class FPGAPolicyA>{};
   auto fpga_policy_b = make_fpga_policy(queue{intel::fpga_selector{}});
   constexpr auto unroll_factor = 8;
   auto fpga_policy_c = make_fpga_policy<class FPGAPolicyC, unroll_factor>(fpga_policy);
Include Parallel STL Header Files
==================================

To include Parallel STL header files, add a subset of the following lines, depending on the algorithms you intend to use:

- ``#include <dpstd/algorithm>``
- ``#include <dpstd/numeric>``
- ``#include <dpstd/memory>``
Use dpstd::begin, dpstd::end Functions
=======================================

``dpstd::begin`` and ``dpstd::end`` are special helper functions that allow you to pass SYCL buffers to Parallel STL algorithms. These functions accept a SYCL buffer and return an object of an unspecified type that satisfies the following requirements:

- It is ``CopyConstructible``, ``CopyAssignable``, and comparable with ``==`` and ``!=``.
- The following expressions are valid: ``a + n``, ``a - n``, and ``a - b``, where ``a`` and ``b`` are objects of the type, and ``n`` is an integer value.
- It has a ``get_buffer`` method with no arguments that returns the SYCL buffer passed to the ``dpstd::begin`` or ``dpstd::end`` function.

To use the functions, add ``#include <dpstd/iterator>`` to your code.

Example:

.. code:: cpp

   #include <CL/sycl.hpp>
   #include <dpstd/execution>
   #include <dpstd/algorithm>
   #include <dpstd/iterator>

   int main(){
       cl::sycl::queue q;
       cl::sycl::buffer<int> buf { 1000 };
       auto buf_begin = dpstd::begin(buf);
       auto buf_end = dpstd::end(buf);
       auto policy = dpstd::execution::make_device_policy<class fill>( q );
       std::fill(policy, buf_begin, buf_end, 42);
       return 0;
   }
:Note: Parallel STL algorithms can be called with ordinary (host-side) iterators, as seen in the code example below. In this case, a temporary SYCL buffer is created, and the data is copied to this buffer. After processing of the temporary buffer on a device is complete, the data is copied back to the host. Working with SYCL buffers is recommended to reduce data copying between the host and device.

Example:

.. code:: cpp

   #include <vector>
   #include <dpstd/execution>
   #include <dpstd/algorithm>

   int main(){
       std::vector<int> v( 1000000 );
       std::fill(dpstd::execution::default_policy, v.begin(), v.end(), 42);
       // each element of v now equals 42
       return 0;
   }
Use Parallel STL with Unified Shared Memory (USM)
==================================================

The following examples demonstrate two ways to use the Parallel STL algorithms with USM:

- USM pointers
- USM allocators

If you have a USM-allocated buffer, pass the pointers to the start and past the end of the buffer to a parallel algorithm. Make sure that the execution policy and the buffer were created for the same queue or context.

If the same buffer is processed by several algorithms, either use an ordered queue or explicitly wait for completion of each algorithm before passing the buffer to the next one. Also wait for completion before accessing the data on the host.

.. code:: cpp

   #include <CL/sycl.hpp>
   #include <dpstd/execution>
   #include <dpstd/algorithm>

   int main(){
       cl::sycl::queue q;
       const int n = 1000;
       int* d_head = static_cast<int*>(cl::sycl::malloc_device(n * sizeof(int),
                                                               q.get_device(), q.get_context()));

       std::fill(dpstd::execution::make_device_policy(q), d_head, d_head + n, 42);
       q.wait();

       cl::sycl::free(d_head, q.get_context());
       return 0;
   }
Alternatively, use ``std::vector`` with a USM allocator:

.. code:: cpp

   #include <CL/sycl.hpp>
   #include <dpstd/execution>
   #include <dpstd/algorithm>

   int main(){
       cl::sycl::queue q;
       const int n = 1000;
       cl::sycl::usm_allocator<int, cl::sycl::usm::alloc::shared> alloc(q.get_context(), q.get_device());
       std::vector<int, decltype(alloc)> vec(n, alloc);

       std::fill(dpstd::execution::make_device_policy(q), vec.begin(), vec.end(), 42);
       q.wait();

       return 0;
   }
Error handling with DPC++ execution policies
=============================================

The DPC++ error handling model supports two types of errors. For *synchronous* errors, the DPC++ host runtime libraries throw exceptions, while *asynchronous* errors may only be processed in a user-supplied error handler associated with a DPC++ queue.

For Parallel STL algorithms executed with DPC++ policies, handling all errors, synchronous or asynchronous, is the responsibility of the caller. Specifically:

* no exceptions are thrown explicitly by algorithms;
* exceptions thrown by runtime libraries at the host CPU, including DPC++ synchronous exceptions, are passed through to the caller;
* DPC++ asynchronous errors are not handled.

To process DPC++ asynchronous errors, the queue associated with a DPC++ policy must be created with an error handler object. The predefined policy objects (``default_policy``, etc.) have no error handlers; do not use them if you need to process asynchronous errors.
Additional Macros
==================

``_PSTL_BACKEND_SYCL``
   Enables the use of the DPC++ policy. (Enabled by default when compiling with the Intel® oneAPI DPC++ Compiler; otherwise disabled.)

``_PSTL_FPGA_DEVICE``
   Use this macro to build your code containing Parallel STL algorithms for FPGA devices. (Disabled by default.)

``_PSTL_FPGA_EMU``
   Use this macro to build your code containing Parallel STL algorithms for the FPGA emulation device. (Disabled by default.)

``_PSTL_COMPILE_KERNEL``
   Use this macro to avoid the ``CL_OUT_OF_RESOURCES`` exception that may occur during some invocations of Parallel STL algorithms on CPU and FPGA devices. The macro may increase the execution time of the algorithms. (Enabled by default.)

:Note: Define both the ``_PSTL_FPGA_DEVICE`` and ``_PSTL_FPGA_EMU`` macros in the same application to run on the FPGA emulation device. To run on an FPGA hardware device, define only ``_PSTL_FPGA_DEVICE``.
Build Your Code with Parallel STL for DPC++
============================================

Use these steps to build your code with Parallel STL for DPC++:

#. To build with the Intel® oneAPI DPC++ Compiler, see Get Started with the Intel® oneAPI DPC++ Compiler for details.
#. Set the environment for the oneAPI Data Parallel C++ Library and oneAPI Threading Building Blocks.
#. To avoid naming device policy objects explicitly, add the ``-fsycl-unnamed-lambda`` option.

Below is an example of a command line used to compile code that contains Parallel STL algorithms on Linux (depending on the code, the parameters within brackets may be unnecessary):

.. code::

   dpcpp [-fsycl-unnamed-lambda] test.cpp [-ltbb] -o test
