
OpenCL documentation

Fady Essam edited this page Jul 28, 2018 · 6 revisions


opencl namespace

The opencl namespace lets Boost.uBLAS run matrix and vector computations on any device that supports OpenCL, like GPUs, CPUs and FPGAs, which allows for much greater performance on big data.

Why should I use it?

The usual uBLAS runs on the CPU, which operates on data largely sequentially, unlike devices such as GPUs (we will keep the GPU as the running example). Matrix operations like the product (gemm, which is the heart of deep learning), element-wise operations and other operations, as well as vector operations, can be done in parallel across many cores.

Doing operations on the GPU using OpenCL incurs some fixed overhead, so it's not suitable for operations on small matrices.

But for big data (as in deep learning) the situation is totally different. Take for example the product of two 2000x2000 matrices: on the CPU (on my device, a Core i7 5500U) it takes about 1,000,000 ms, so the roughly 200 ms OpenCL overhead is negligible by comparison, and because the GPU has many more cores the same product took about 1200 ms on the GPU (including copying the data back to the host).

Here's a graph to give you an idea about the difference between the time on CPU and on GPU versus the size of the matrices (multiplying two matrices). It compares the performance of an Intel i7 5500U (CPU) and an AMD R5 255 (GPU).

Graph comparing cpu and gpu performance with matrix size

Let's zoom out a bit to get an idea of how the performance scales with size in both cases:

Graph comparing cpu and gpu performance with bigger matrix size

How to use it (Get started)

Dependencies

  1. OpenCL: you must have the OpenCL SDK for the device you intend to run the OpenCL operations on

  2. clBLAS: you need to have the clBLAS library and build it on your system

How to setup the machine to enable the opencl ublas:

  • first you need to get the two dependencies described above (clBLAS & OpenCL)
    • openCL : download the OpenCL SDK provided by your vendor
    • clBLAS : download the library and use CMake to generate a build (a Visual Studio solution, for example) with the options you need, then build it for your device
  • then you need to set their paths in the Boost configuration file as follows:
      using opencl : : <include>path/to/cl.h <search>path/to/openclLibrary ;

      using clblas : : <include>path/to/clblas.h <search>path/to/clblasLibrary ;

How to enable the opencl library functions in code:

  1. before including <boost/numeric/ublas/matrix.hpp> you must add "#define BOOST_UBLAS_ENABLE_OPENCL" to enable including the OpenCL and clBLAS libraries in the matrix.hpp file

  2. at the beginning of your code you should declare boost::numeric::ublas::opencl::library lib;

    its constructor and destructor initialize and tear down the clBLAS library

  3. determine which device you want to use, and pass its context and command queue to the operations, like:

     compute::device device = compute::system::devices().at(DEVICE_NUMBER_ON_THE_SYSTEM);
    
     compute::context context(device);
    
     compute::command_queue queue(context, device);
  4. congrats πŸ˜„ you got the opencl operations working

note: this might all be unclear now, but refer to the tutorials below to get a clear understanding

How to run benchmarks & generate a similar graph for your device vs cpu

  • make sure you have set the clBLAS and OpenCL paths in the Boost configuration file to be able to build the OpenCL benchmarks
  • go to the benchmarks folder and run the Jamfile to build all the source files
  • note: if you want to change the benchmarked sizes, open the operation's source file and edit the vector 'times' with the sizes you want
  • note: for matrix operations a size means matrix(size, size); for vector operations it means vector(size)
  • run the operation(s) you want to plot
  • each operation you run will produce a file containing its benchmarking data
  • run plot.py and pass the paths of the benchmarking-data files as arguments; you can pass as many files as you like:
python plot.py path/to/file1 path/to/file2
  • you got your graph!

How to run opencl testing

  • make sure you have set the clBLAS and OpenCL paths in the Boost configuration file to be able to build the OpenCL tests
  • run b2 in the testing folder to build the tests; they will be built and run by default along with the rest of the tests

Architecture

the opencl namespace consists of 2 files:

  1. "opencl_core.hpp" which has the classes of the namespace (described in detail later)
  2. "operations.hpp" which has the functions that operate on matrices and vectors, like prod()

Classes

  • boost::numeric::ublas::opencl::storage

    a tag type used to indicate that the data of a matrix or vector resides on a device that supports OpenCL

  • boost::numeric::ublas::matrix<T, L, opencl::storage>

    a specialization of the boost::numeric::ublas::matrix<T, L, A> class which indicates that the data of this matrix is not on the host, but on a device that supports OpenCL.

it supports functions like:

void from_host(boost::numeric::ublas::matrix<T,L,A>& m, boost::compute::command_queue& queue)

which takes a matrix already on the host and copies it to this matrix on the device, using the command queue passed as a parameter and its device (the matrix on the device must have the same size1 and size2)

void to_host(boost::numeric::ublas::matrix<T,L,A>& m, boost::compute::command_queue& queue)

which takes a host matrix with the same size as the matrix on the device and copies the content from device to host

  • boost::numeric::ublas::vector<T, opencl::storage>

    a specialization of boost::numeric::ublas::vector that works as a container for vectors on an OpenCL device and implements the same API as boost::numeric::ublas::matrix<T, L, opencl::storage>

Operations

note: all OpenCL operations work with any combination of row_major and column_major matrices

almost all operations implement an API with three overloads:

  1. takes 2 matrices (or vectors) already on an OpenCL device and outputs a result that stays on the same device
  2. takes two matrices (or vectors) on the host, copies them to the device, does the operation, and copies the result back to the host
  3. same as (2) but returns the result as a return value

**Here's the prod function API described in detail**

  • ublas::matrix<T, F, A> prod(ublas::matrix<T, F, A>& a, ublas::matrix<T, F, A>& b, boost::compute::command_queue& queue)

    takes two matrices that live on the host, moves them to the device, multiplies them on the given queue, and returns the result

  • void prod(ublas::matrix<T, F, A>& a, ublas::matrix<T, F, A>& b, ublas::matrix<T, F, A>& result, boost::compute::command_queue& queue)

    does the same as the previous function, but takes a reference to the result matrix as input and puts the result values in it

  • void prod(ublas::matrix<T, F, opencl::storage>& a, ublas::matrix<T, F, opencl::storage>& b, ublas::matrix<T, F, opencl::storage>& result, boost::compute::command_queue& queue)

    does the same as the previous function, but the 3 matrices a, b & result are all not on the host; they are all on the same device, and the queue belongs to that device too (no data is copied to or from the host, so it's much faster)

Supported operations

| operation | uBLAS | ublas::opencl | clBLAS |
| --- | --- | --- | --- |
| prod (matrix-matrix) | βœ”οΈ | βœ”οΈ | βœ”οΈ |
| prod (matrix-vector) | βœ”οΈ | βœ”οΈ | βœ”οΈ |
| prod (vector-matrix) | βœ”οΈ | βœ”οΈ | βœ”οΈ |
| inner_prod | βœ”οΈ | βœ”οΈ | βœ”οΈ |
| outer_prod | βœ”οΈ | βœ”οΈ | βœ”οΈ |
| trans | βœ”οΈ | βœ”οΈ | ❌ |
| swap | βœ”οΈ | βœ”οΈ | βœ”οΈ |
| element_prod | βœ”οΈ | βœ”οΈ | ❌ |
| element_div | βœ”οΈ | βœ”οΈ | ❌ |
| operator + (matrix-matrix) | βœ”οΈ | βœ”οΈ (as element_add) | ❌ |
| operator + (vector-vector) | βœ”οΈ | βœ”οΈ (as element_add) | ❌ |
| operator + (matrix-constant) | ❌ | βœ”οΈ (as element_add) | ❌ |
| operator + (vector-constant) | ❌ | βœ”οΈ (as element_add) | ❌ |
| operator - (matrix-matrix) | βœ”οΈ | βœ”οΈ (as element_sub) | ❌ |
| operator - (vector-vector) | βœ”οΈ | βœ”οΈ (as element_sub) | ❌ |
| operator - (matrix-constant) | ❌ | βœ”οΈ (as element_sub) | ❌ |
| operator - (vector-constant) | ❌ | βœ”οΈ (as element_sub) | ❌ |
| element_scale (matrix-constant) | ❌ | βœ”οΈ (called element_scale and not element_prod because for complex numbers result(i,j).real = m(i,j).real * constant.real and result(i,j).imag = m(i,j).imag * constant.imag) | ❌ |
| element_scale (vector-constant) | ❌ | βœ”οΈ (same as above, with result[i]) | ❌ |
| construct plane rotation | ❌ | ❌ | βœ”οΈ |
| apply plane rotation | ⚠️ (supported through some consecutive operations) | ⚠️ (same as uBLAS) | βœ”οΈ |
| norm_1 | βœ”οΈ | βœ”οΈ (for vectors of double and float) | ⚠️ (only the absolute sum of values; not norm_1 in the case of complex numbers) |
| norm_2 | βœ”οΈ | βœ”οΈ | βœ”οΈ |

also, any element-wise operation is supported in ublas::opencl through the element_wise function

Example projects

1. using the openCL operations with copying data to the device and back from it

//enable including "opencl_core.hpp" and "operations.hpp" (must be done before including matrix.hpp to get the opencl functionality)
#define BOOST_UBLAS_ENABLE_OPENCL 
#include <boost/numeric/ublas/matrix.hpp>

namespace ublas = boost::numeric::ublas;
namespace opencl = boost::numeric::ublas::opencl;
namespace compute = boost::compute;

int main()
{
  opencl::library lib; //to initialize the opencl api

  // choose the device you want to operate on and get its context and queue
  compute::device device = compute::system::devices().at(1); //change 1 to the device number you want or use default_device()
  compute::context context(device);
  compute::command_queue queue(context, device);


  ublas::matrix<float> a(500, 500, 100); //initialize it with any value (100 for example)
  ublas::matrix<float> b(500, 500, 100); //initialize it with any value (100 for example)


  ublas::matrix<float> result = opencl::prod(a, b, queue); //pass the command_queue you want to execute the operation on its device

}

2. using the openCL operations without copying data back and forth (the data is copied to the opencl device only once and kept there to do multiple operations on it)

//enable including "opencl_core.hpp" and "operations.hpp" (must be done before including matrix.hpp to get the opencl functionality)
#define BOOST_UBLAS_ENABLE_OPENCL
#include <boost/numeric/ublas/matrix.hpp>

namespace ublas = boost::numeric::ublas;
namespace opencl = boost::numeric::ublas::opencl;
namespace compute = boost::compute;
typedef ublas::matrix<float, ublas::basic_row_major<>, opencl::storage> device_matrix;
typedef ublas::matrix<float> host_matrix;

int main()
{
  opencl::library lib; //to initialize the opencl api

  // choose the device you want to operate on and get its context and queue
  compute::device device = compute::system::devices().at(1); //change 1 to the device number you want or use default_device()
  compute::context context(device);
  compute::command_queue queue(context, device);


  host_matrix a(500, 500, 100); //initialize it with any value (100 for example)
  device_matrix a_device(a, queue); //queue is the command_queue that does the copying

  host_matrix b(500, 500, 100); //initialize it with any value (100 for example)
  device_matrix b_device(b, queue); //queue is the command_queue that does the copying

  //initialize result matrices on device to hold the result
  device_matrix result_prod_device(500, 500, context);
  device_matrix result_element_prod_device(500, 500, context);


  //note that no data copying from or to device happen here
  //so you can do multiple operations without copying back and forth
  opencl::prod(a_device, b_device, result_prod_device, queue); //pass the command_queue you want to execute the operation on its device
  opencl::element_prod(a_device, b_device, result_element_prod_device, queue); //pass the command_queue you want to execute the operation on its device



  //if you want to get the data in  host matrix
  host_matrix result_prod_host(500, 500);

  result_prod_device.to_host(result_prod_host, queue);

}