Skip to content

Commit d1f056d

Browse files
committed
Update documentation with more extensive guides
1 parent 9abcd89 commit d1f056d

File tree

7 files changed

+614
-96
lines changed

7 files changed

+614
-96
lines changed

docs/guides.rst

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -5,5 +5,7 @@ Guides
55

66
guides/introduction.rst
77
guides/promotion.rst
8+
guides/memory.rst
89
guides/accuracy.rst
910
guides/constant.rst
11+
guides/example.md

docs/guides/accuracy.md

Lines changed: 93 additions & 30 deletions
Original file line numberDiff line numberDiff line change
@@ -1,49 +1,112 @@
1-
Accuracy level
2-
===
1+
# Accuracy Level
32

4-
Many of the functions in Kernel Float take an additional `Accuracy` option as a template parameter.
5-
This option can be used to increase the performance of certain operations, at the cost of lower accuracy.
3+
For certain operations, there might be alternative versions available that provide better performance at the cost of lower accuracy.
4+
In other words, they are faster but also have a small error.
65

7-
There are four possible values for this parameter:
6+
## Fast Math
87

9-
* `accurate_policy`: Use the most accurate version of the function available.
10-
* `fast_policy`: Use the "fast math" version (for example, `__sinf` for sin on CUDA devices). Falls back to `accurate_policy` if such a version is not available.
11-
* `approx_policy<N>`: Rough approximation using a polynomial of degree `N`. Falls back to `fast_policy` if no such polynomial exists.
12-
* `default_policy`: Use a global default policy (see the next section).
8+
For several operations in single precision (float32), there are "fast math" versions. These functions are faster since they are hardware accelerated, but are not IEEE compliant for all inputs.
139

10+
To use this functionality, use the `fast_*` functions from Kernel Float.
1411

15-
For example, consider this code:
12+
```cpp
13+
kf::vec<float, 4> x = {1.0f, 2.0f, 3.0f, 4.0f};
14+
15+
// Sine
16+
kf::vec<float, 4> a = kf::fast_sin(x);
17+
18+
// Square root
19+
kf::vec<float, 4> b = kf::fast_sqrt(x);
20+
21+
// Reciprocal `1/x`
22+
kf::vec<float, 4> c = kf::fast_rcp(x);
23+
24+
// Division `a/b`
25+
kf::vec<float, 4> d = kf::fast_div(a, b);
26+
```
27+
28+
These functions are only functional for 32-bit and 16-bit floats.
29+
For other input types, the operation falls back to the regular version.
30+
31+
## Approximate Math
32+
33+
For 16-bit floats, several approximate functions are provided.
34+
These use approximations (typically low-degree polynomials) to calculate rough estimates of the functions.
35+
This can be very fast but also less accurate.
36+
37+
38+
To use this functionality, use the `approx_*` functions from Kernel Float. For other input types, the operation falls back to the `fast_*` variant.
39+
40+
```cpp
41+
kf::vec<half, 4> x = {1.0, 2.0, 3.0, 4.0};
42+
43+
// Sine
44+
kf::vec<half, 4> a = kf::approx_sin(x);
1645
17-
```C++
46+
// Square root
47+
kf::vec<half, 4> b = kf::approx_sqrt(x);
1848
19-
#include "kernel_float.h"
20-
namespace kf = kernel_float;
49+
// Reciprocal `1/x`
50+
kf::vec<half, 4> c = kf::approx_rcp(x);
2151
52+
// Division `a/b`
53+
kf::vec<half, 4> d = kf::approx_div(a, b);
54+
```
55+
56+
You can adjust the degree of approximation by supplying an integer template parameter:
57+
58+
59+
```cpp
60+
// Sine approximation with polynomial of degree 1
61+
kf::vec<half, 4> a = kf::approx_sin<1>(x);
62+
63+
// Polynomial of degree 2
64+
kf::vec<half, 4> a = kf::approx_sin<2>(x);
65+
66+
// Polynomial of degree 3
67+
kf::vec<half, 4> a = kf::approx_sin<3>(x);
68+
```
69+
70+
## Tuning Accuracy Level
71+
72+
Many functions in Kernel Float accept an additional Accuracy option as a template parameter.
73+
This allows you to tune the accuracy level without changing the function name.
74+
75+
There are four possible values for this parameter:
76+
77+
- `kf::accurate_policy`: Use the most accurate version of the function available.
78+
- `kf::fast_policy`: Use the "fast math" version.
79+
- `kf::approx_policy<N>`: Use the approximate version with degree `N`.
80+
- `kf::default_policy`: Use a global default policy (see the next section).
81+
82+
For example, consider this code:
2283

23-
int main() {
24-
kf::vec<float, 2> input = {1.0f, 2.0f};
84+
```cpp
85+
kf::vec<float, 2> input = {1.0f, 2.0f};
2586

26-
// Use the default policy
27-
kf::vec<float, 2> A = kf::cos(input);
87+
// Use the default policy
88+
kf::vec<float, 2> a = kf::cos(input);
2889

29-
// Use the most accuracy policy
30-
kf::vec<float, 2> B = kf::cos<kf::accurate_policy>(input);
90+
// Use the default policy
91+
kf::vec<float, 2> b = kf::cos<kf::default_policy>(input);
3192

32-
// Use the fastest policy
33-
kf::vec<float, 2> C = kf::cos<kf::fast_policy>(input);
93+
// Use the most accurate policy
94+
kf::vec<float, 2> c = kf::cos<kf::accurate_policy>(input);
3495

35-
printf("A = %f, %f", A[0], A[1]);
36-
printf("B = %f, %f", B[0], B[1]);
37-
printf("C = %f, %f", C[0], C[1]);
96+
// Use the fastest policy
97+
kf::vec<float, 2> d = kf::cos<kf::fast_policy>(input);
3898

39-
return EXIT_SUCCESS;
40-
}
99+
// Use the approximate policy
100+
kf::vec<float, 2> e = kf::cos<kf::approx_policy<3>>(input);
41101

102+
// You can use aliases to define your own policy
103+
using my_own_policy = kf::fast_policy;
104+
kf::vec<float, 2> f = kf::cos<my_own_policy>(input);
42105
```
43106
107+
## Setting `default_policy`
44108
45-
Setting `default_policy`
46-
---
47-
By default, the value for `default_policy` is `accurate_policy`.
109+
By default, `kf::default_policy` is set to `kf::accurate_policy`.
48110
49-
Set the preprocessor option `KERNEL_FLOAT_FAST_MATH=1` to change the default policy to `fast_policy`.
111+
Set the preprocessor option `KERNEL_FLOAT_FAST_MATH=1` to change the default policy to `kf::fast_policy`.
112+
This will use fast math for all functions and data types that support it.

docs/guides/example.md

Lines changed: 112 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,112 @@
1+
# Full CUDA example
2+
3+
This page explains a CUDA program that estimates the value of pi using Kernel Float.
4+
5+
6+
## Overview
7+
8+
The program calculates Pi by generating random points within a unit square and counting how many fall inside the unit circle inscribed within that square. The ratio of points inside the circle to the total number of points approximates Pi/4.
9+
10+
The kernel is shown below:
11+
12+
13+
```c++
14+
namespace kf = kernel_float;
15+
16+
using float_type = float;
17+
static constexpr int VECTOR_SIZE = 4;
18+
19+
__global__ void calculate_pi_kernel(int nx, int ny, int* global_count) {
20+
int thread_x = blockIdx.x * blockDim.x + threadIdx.x;
21+
int thread_y = blockIdx.y * blockDim.y + threadIdx.y;
22+
23+
kf::vec<int, VECTOR_SIZE> xi = thread_x * VECTOR_SIZE + kf::range<int, VECTOR_SIZE>();
24+
kf::vec<int, VECTOR_SIZE> yi = thread_y;
25+
26+
kf::vec<float_type, VECTOR_SIZE> xf = kf::cast<float_type>(xi) / float_type(nx);
27+
kf::vec<float_type, VECTOR_SIZE> yf = kf::cast<float_type>(yi) / float_type(ny);
28+
29+
kf::vec<float_type, VECTOR_SIZE> dist_squared = xf * xf + yf * yf;
30+
kf::vec<float_type, VECTOR_SIZE> dist = kf::sqrt(dist_squared);
31+
32+
int n = kf::count(dist <= float_type(1));
33+
34+
if (n > 0) atomicAdd(global_count, n);
35+
}
36+
```
37+
38+
39+
## Code Explanation
40+
41+
Let's go through the code step by step.
42+
43+
```cpp
44+
// Alias `kernel_float` as `kf`
45+
namespace kf = kernel_float;
46+
```
47+
48+
This creates an alias for `kernel_float`.
49+
50+
```cpp
51+
// Define the float type and vector size
52+
using float_type = float;
53+
static constexpr int VECTOR_SIZE = 4;
54+
```
55+
56+
Define `float_type` as an alias for `float` to make it easy to change precision if needed.
57+
The vector size is set to 4, meaning each thread will process 4 data points.
58+
59+
```cpp
60+
__global__ void calculate_pi_kernel(int nx, int ny, int* global_count) {
61+
```
62+
63+
The CUDA kernel. There are `nx` points along the x axis and `ny` points along the y axis.
64+
65+
```cpp
66+
int thread_x = blockIdx.x * blockDim.x + threadIdx.x;
67+
int thread_y = blockIdx.y * blockDim.y + threadIdx.y;
68+
```
69+
70+
Compute the global x- and y-index of this thread.
71+
72+
```cpp
73+
kf::vec<int, VECTOR_SIZE> xi = thread_x * VECTOR_SIZE + kf::range<int, VECTOR_SIZE>();
74+
kf::vec<int, VECTOR_SIZE> yi = thread_y;
75+
```
76+
77+
Compute the points that this thread will process.
78+
The x coordinates start at `thread_x * VECTOR_SIZE` and then the vector `[0, 1, 2, ..., VECTOR_SIZE-1]`.
79+
The y coordinates are all `thread_y`.
80+
81+
```cpp
82+
kf::vec<float_type, VECTOR_SIZE> xf = kf::cast<float_type>(xi) / float_type(nx);
83+
kf::vec<float_type, VECTOR_SIZE> yf = kf::cast<float_type>(yi) / float_type(ny);
84+
```
85+
86+
Divide `xi` and `yi` by `nx` and `ny` to normalize them to `[0, 1]` range.
87+
88+
```cpp
89+
kf::vec<float_type, VECTOR_SIZE> dist_squared = xf * xf + yf * yf;
90+
```
91+
92+
Compute the squared distance from the origin (0, 0) to each point from `xf`,`yf`.
93+
94+
```cpp
95+
kf::vec<float_type, VECTOR_SIZE> dist = kf::sqrt(dist_squared);
96+
```
97+
98+
Take the element-wise square root.
99+
100+
```cpp
101+
int n = kf::count(dist <= float_type(1));
102+
```
103+
104+
Count the number of points in the unit circle (i.e., for which the distance is less than 4).
105+
The expression `dist <= 1` returns a vector of booleans and `kf::count` counts the number of `true` values.
106+
107+
```cpp
108+
atomicAdd(global_count, n);
109+
```
110+
111+
Add `n` to the `global_count` variable.
112+
This must be done using an atomic operation since multiple thread will write this variable simultaneously.

0 commit comments

Comments
 (0)