OpenCL documentation
The opencl namespace allows Boost.uBLAS to run computations on matrices and vectors on any device that supports OpenCL, such as GPUs, CPUs and FPGAs, which allows much greater performance on big data.

Standard uBLAS runs on the CPU, which operates on data sequentially, unlike devices such as GPUs (we will keep the GPU as our example). Matrix operations like the product (gemm, which is the heart of deep learning), element-wise operations and others, as well as vector operations, can be parallelized across many cores.

Running operations on the GPU through OpenCL incurs a fixed overhead, so it is not suitable for operations on small matrices.

For big data (as in deep learning) the situation is totally different. For example, multiplying two 2000x2000 matrices takes about 1,000,000 ms on the CPU (on my device, a Core i7 5500U), so the roughly 200 ms OpenCL overhead is negligible by comparison, and because the GPU has far more cores the same product took about 1200 ms on the GPU (including copying the data back to the host).

Here's a graph to give you an idea about the difference between CPU and GPU time versus matrix size (multiplying two matrices); it compares the performance of an Intel i7 5500U (CPU) and an AMD R5 255 (GPU).

Let's zoom out a bit to get an idea of how the performance scales with size in both cases.
- OpenCL: you must have the OpenCL SDK for the device you intend to run the OpenCL operations on
- clBLAS: you need to have the clBLAS library and build it on your system
- First you need to get the two dependencies described above (clBLAS & OpenCL):
  - OpenCL: download the OpenCL SDK provided by your vendor
  - clBLAS: download the library and use CMake to generate a build (for example a Visual Studio solution) with the options you need, then build it for your device
- You need to set their paths up in the Boost configuration file as follows:

```
using opencl : : <include>path/to/cl.h <search>path/to/openclLibrary ;
using clblas : : <include>path/to/clblas.h <search>path/to/clblasLibrary ;
```
- Before including <boost/numeric/ublas/matrix.hpp> you must write `#define BOOST_UBLAS_ENABLE_OPENCL` to enable including the OpenCL and clBLAS headers in the matrix.hpp file
- At the beginning of your code declare `boost::numeric::ublas::opencl::library lib;` so that its constructor and destructor initialize and tear down the clBLAS library
- Determine which device you want to use and use its context and command queue for the operations (a sketch for listing the available devices follows this list), like:

```cpp
compute::device device = compute::system::devices().at(DEVICE_NUMBER_ON_THE_SYSTEM);
compute::context context(device);
compute::command_queue queue(context, device);
```

- Congrats 🎉 you've got the OpenCL operations working

Note: this all might be unclear now, but refer to the tutorials below to get a clear understanding.
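If you're not sure which index to pass to devices().at(), a minimal sketch like the following (plain Boost.Compute, which is where these device/context/queue classes come from) prints every OpenCL device on the system together with its index:

```cpp
#include <iostream>
#include <vector>
#include <boost/compute/system.hpp>

int main()
{
    // enumerate the OpenCL devices Boost.Compute can see, with their indices
    std::vector<boost::compute::device> devices = boost::compute::system::devices();
    for (std::size_t i = 0; i < devices.size(); ++i)
        std::cout << i << ": " << devices[i].name() << std::endl;
}
```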
- Make sure you have set the clBLAS and OpenCL paths in the Boost configuration file to be able to build the OpenCL benchmarks
- Go to the benchmarks folder and run the Jamfile to build all the source files
- Note: if you want to change the benchmarked sizes, open the operation's source file and edit the vector 'times' with the sizes you want (see the sketch after this list)
- Note: in matrix operations a size means matrix(size, size), but in vector operations it means vector(size)
- Run the operation(s) you need to plot
- Each operation you run will produce a file containing its benchmarking data
- Run plot.py and pass the paths of the benchmarking-data files as arguments; you can pass as many files as you like:

```
python plot.py path/to/file1 path/to/file2
```

- You've got your graph!
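As a purely hypothetical illustration of the 'times' edit mentioned above (the exact element type and declaration in the benchmark sources may differ), changing the benchmarked sizes would look something like:

```cpp
#include <vector>

// sizes to benchmark: for matrix operations, 500 means a 500x500 matrix;
// for vector operations, a 500-element vector
std::vector<int> times = {500, 1000, 1500, 2000};
```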
- Make sure you have set the clBLAS and OpenCL paths in the Boost configuration file to be able to build the OpenCL tests
- Run b2 in the testing folder to build the tests; they will be built and run by default with the rest of the tests
The opencl namespace consists of 2 files:
- "opencl_core.hpp", which has the classes of the namespace (described in detail later)
- "operations.hpp", which has the functions that do operations on the matrices, like prod()
boost::numeric::ublas::opencl::storage

It is used as a tag to indicate that the data of a matrix or vector resides on a device that supports OpenCL.
boost::numeric::ublas::matrix<T, L, opencl::storage>

It is a special case of the boost::numeric::ublas::matrix<T, L, A> class which indicates that the data of this matrix is not on the host, but on a device that supports OpenCL.

It supports functions like:

void from_host(boost::numeric::ublas::matrix<T, L, A>& m, boost::compute::command_queue& queue)

which takes a matrix already on the host and copies it to this matrix on the device, using the command queue passed as a parameter and its device (the matrix on the device must have the same size1 and size2).

void to_host(boost::numeric::ublas::matrix<T, L, A>& m, boost::compute::command_queue& queue)

which takes a host matrix with the same size as the matrix on the device and copies the content from the device to the host.
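A minimal sketch of a round trip, using the namespace aliases from the tutorials below and assuming `context` and `queue` were created as in the setup above (the (size1, size2, context) constructor also appears in the second tutorial):

```cpp
ublas::matrix<float> m(100, 100, 1.0f); // host matrix filled with 1.0
ublas::matrix<float, ublas::basic_row_major<>, opencl::storage> m_device(100, 100, context);
m_device.from_host(m, queue); // copy host -> device
m_device.to_host(m, queue);   // copy device -> host
```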
boost::numeric::ublas::vector<T, opencl::storage>

It is a special case of boost::numeric::ublas::vector that works as a container for vectors on an OpenCL device and implements the same API as boost::numeric::ublas::matrix<T, L, opencl::storage>.
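Since the API mirrors the matrix one, a device-vector round trip would look like this sketch (assuming, by analogy with the matrix case, a (size, context) constructor; same aliases, context and queue as above):

```cpp
ublas::vector<float> v(1000);        // host vector
std::fill(v.begin(), v.end(), 1.0f); // fill with any value
ublas::vector<float, opencl::storage> v_device(1000, context);
v_device.from_host(v, queue); // host -> device
v_device.to_host(v, queue);   // device -> host
```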
Note: all OpenCL operations are smart enough to work with any combination of row_major and column_major matrices.
Each operation comes in three flavors:
1. takes 2 matrices (or vectors) already on an OpenCL device and outputs a matrix that is still on the same device
2. takes two matrices (or vectors) on the host, copies them to the device, does the operation and copies the result back to the host
3. as (2), but returns the result as a return value
**Here's the prod function API described in detail:**
ublas::matrix<T, F, A> prod(ublas::matrix<T, F, A>& a, ublas::matrix<T, F, A>& b, boost::compute::command_queue& queue)

It takes two matrices that are originally not on the GPU, moves them to the GPU, multiplies them on the queue's device and returns the result.

void prod(ublas::matrix<T, F, A>& a, ublas::matrix<T, F, A>& b, ublas::matrix<T, F, A>& result, boost::compute::command_queue& queue)

Does the same as the previous function, but takes a reference to the result matrix as input and puts the result values in it.

void prod(ublas::matrix<T, F, opencl::storage>& a, ublas::matrix<T, F, opencl::storage>& b, ublas::matrix<T, F, opencl::storage>& result, boost::compute::command_queue& queue)

Does the same as the previous function, but the 3 matrices a, b & result are all not on the host: all of them are on the same device, and the queue belongs to that same device too (it doesn't involve copying data to or from the host, so it's much faster).
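A compact sketch of the three overloads side by side (aliases, context and queue as in the tutorials below):

```cpp
ublas::matrix<float> a(500, 500, 1.0f), b(500, 500, 1.0f), c(500, 500);

// flavor 3: host matrices in, result returned by value
ublas::matrix<float> r = opencl::prod(a, b, queue);

// flavor 2: host matrices in, result written into c
opencl::prod(a, b, c, queue);

// flavor 1: everything stays on the device, no host copies involved
ublas::matrix<float, ublas::basic_row_major<>, opencl::storage>
    a_d(a, queue), b_d(b, queue), r_d(500, 500, context);
opencl::prod(a_d, b_d, r_d, queue);
```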
| operations | uBLAS | ublas::opencl support | clBLAS support |
|---|---|---|---|
| prod (matrix-matrix) | ✔️ | ✔️ | ✔️ |
| prod (matrix-vector) | ✔️ | ✔️ | ✔️ |
| prod (vector-matrix) | ✔️ | ✔️ | ✔️ |
| inner_prod | ✔️ | ✔️ | ✔️ |
| outer_prod | ✔️ | ✔️ | ✔️ |
| trans | ✔️ | ✔️ | ❌ |
| swap | ✔️ | ✔️ | ✔️ |
| element_prod | ✔️ | ✔️ | ❌ |
| element_div | ✔️ | ✔️ | ❌ |
| operator + (matrix-matrix) | ✔️ | ✔️ (as element_add) | ❌ |
| operator + (vector-vector) | ✔️ | ✔️ (as element_add) | ❌ |
| operator + (matrix-constant) | ❌ | ✔️ (as element_add) | ❌ |
| operator + (vector-constant) | ❌ | ✔️ (as element_add) | ❌ |
| operator - (matrix-matrix) | ✔️ | ✔️ (as element_sub) | ❌ |
| operator - (vector-vector) | ✔️ | ✔️ (as element_sub) | ❌ |
| operator - (matrix-constant) | ❌ | ✔️ (as element_sub) | ❌ |
| operator - (vector-constant) | ❌ | ✔️ (as element_sub) | ❌ |
| element_scale (matrix-constant) | ❌ | ✔️ (called element_scale and not element_prod because for complex numbers result(i,j).real = m(i,j).real * constant.real and result(i,j).imag = m(i,j).imag * constant.imag) | ❌ |
| element_scale (vector-constant) | ❌ | ✔️ (called element_scale and not element_prod because for complex numbers result[i].real = v[i].real * constant.real and result[i].imag = v[i].imag * constant.imag) | ❌ |
| construct plane rotation | ❌ | ❌ | ✔️ |
| apply given rotation | same as uBLAS | ✔️ | |
| norm1 | ✔️ | ✔️ (for vectors of double and float) | |
| norm2 | ✔️ | ✔️ | ✔️ |
Also, any element-wise operation is supported in ublas::opencl through the element_wise function.
1. Using the OpenCL operations on host matrices (data is copied to the device and back for each operation)

```cpp
// enable including "opencl_core.hpp" and "operations.hpp"
// (must be done before including matrix.hpp to get the opencl functionality)
#define BOOST_UBLAS_ENABLE_OPENCL
#include <boost/numeric/ublas/matrix.hpp>

namespace ublas = boost::numeric::ublas;
namespace opencl = boost::numeric::ublas::opencl;
namespace compute = boost::compute;

int main()
{
  opencl::library lib; // initializes the opencl api

  // choose the device you want to operate on and get its context and queue
  compute::device device = compute::system::devices().at(1); // change 1 to the device number you want, or use default_device()
  compute::context context(device);
  compute::command_queue queue(context, device);

  ublas::matrix<float> a(500, 500, 100); // initialize with any value (100 for example)
  ublas::matrix<float> b(500, 500, 100); // initialize with any value (100 for example)

  // pass the command_queue whose device should execute the operation
  ublas::matrix<float> result = opencl::prod(a, b, queue);
}
```

2. Using the OpenCL operations without copying data back and forth (the data is copied to the OpenCL device only once and kept there to do multiple operations on it)
```cpp
// enable including "opencl_core.hpp" and "operations.hpp"
// (must be done before including matrix.hpp to get the opencl functionality)
#define BOOST_UBLAS_ENABLE_OPENCL
#include <boost/numeric/ublas/matrix.hpp>

namespace ublas = boost::numeric::ublas;
namespace opencl = boost::numeric::ublas::opencl;
namespace compute = boost::compute;

typedef ublas::matrix<float, ublas::basic_row_major<>, opencl::storage> device_matrix;
typedef ublas::matrix<float> host_matrix;

int main()
{
  opencl::library lib; // initializes the opencl api

  // choose the device you want to operate on and get its context and queue
  compute::device device = compute::system::devices().at(1); // change 1 to the device number you want, or use default_device()
  compute::context context(device);
  compute::command_queue queue(context, device);

  host_matrix a(500, 500, 100);     // initialize with any value (100 for example)
  device_matrix a_device(a, queue); // queue is the command_queue that does the copying

  host_matrix b(500, 500, 100);     // initialize with any value (100 for example)
  device_matrix b_device(b, queue); // queue is the command_queue that does the copying

  // initialize result matrices on the device to hold the results;
  // note that no data is copied to or from the device here,
  // so you can do multiple operations without copying back and forth
  device_matrix result_prod_device(500, 500, context);
  device_matrix result_element_prod_device(500, 500, context);

  // pass the command_queue whose device should execute each operation
  opencl::prod(a_device, b_device, result_prod_device, queue);
  opencl::element_prod(a_device, b_device, result_element_prod_device, queue);

  // if you want the data back in a host matrix
  host_matrix result_prod_host(500, 500);
  result_prod_device.to_host(result_prod_host, queue);
}
```
