KernelTuner
diff --git a/‎docs/guides.rst
Lines changed: 2 additions & 0 deletions b/‎docs/guides.rst
Lines changed: 2 additions & 0 deletions
diff --git a/‎docs/guides/accuracy.md
Lines changed: 93 additions & 30 deletions b/‎docs/guides/accuracy.md
Lines changed: 93 additions & 30 deletions
diff --git a/‎docs/guides/example.md
Lines changed: 112 additions & 0 deletions b/‎docs/guides/example.md
Lines changed: 112 additions & 0 deletions
@@ -5,5 +5,7 @@ Guides
 
    guides/introduction.rst
    guides/promotion.rst
+   guides/memory.rst
    guides/accuracy.rst
    guides/constant.rst
+   guides/example.md
@@ -1,49 +1,112 @@
-Accuracy level
-===
+# Accuracy Level
 
-Many of the functions in Kernel Float take an additional `Accuracy` option as a template parameter.
-This option can be used to increase the performance of certain operations, at the cost of lower accuracy.
+For certain operations, there might be alternative versions available that provide better performance at the cost of lower accuracy.
+In other words, they are faster but also have a small error.
 
-There are four possible values for this parameter:
+## Fast Math
 
-* `accurate_policy`: Use the most accurate version of the function available.
-* `fast_policy`: Use the "fast math" version (for example, `__sinf` for sin on CUDA devices). Falls back to `accurate_policy` if such a version is not available.
-* `approx_policy<N>`: Rough approximation using a polynomial of degree `N`. Falls back to `fast_policy` if no such polynomial exists.
-* `default_policy`: Use a global default policy (see the next section).
+For several operations in single precision (float32), there are "fast math" versions. These functions are faster since they are hardware accelerated, but are not IEEE compliant for all inputs.
 
+To use this functionality, use the `fast_*` functions from Kernel Float.
 
-For example, consider this code:
+```cpp
+kf::vec<float, 4> x = {1.0f, 2.0f, 3.0f, 4.0f};
+
+// Sine
+kf::vec<float, 4> a = kf::fast_sin(x);
+
+// Square root
+kf::vec<float, 4> b = kf::fast_sqrt(x);
+
+// Reciprocal `1/x`
+kf::vec<float, 4> c = kf::fast_rcp(x);
+
+// Division `a/b`
+kf::vec<float, 4> d = kf::fast_div(a, b);
+```
+
+These functions are only functional for 32-bit and 16-bit floats. 
+For other input types, the operation falls back to the regular version.
+
+## Approximate Math
+
+For 16-bit floats, several approximate functions are provided. 
+These use approximations (typically low-degree polynomials) to calculate rough estimates of the functions. 
+This can be very fast but also less accurate.
+
+
+To use this functionality, use the `approx_*` functions from Kernel Float. For other input types, the operation falls back to the `fast_*` variant.
+
+```cpp
+kf::vec<half, 4> x = {1.0, 2.0, 3.0, 4.0};
+
+// Sine
+kf::vec<half, 4> a = kf::approx_sin(x);
 
-```C++
+// Square root
+kf::vec<half, 4> b = kf::approx_sqrt(x);
 
-#include "kernel_float.h"
-namespace kf = kernel_float;
+// Reciprocal `1/x`
+kf::vec<half, 4> c = kf::approx_rcp(x);
 
+// Division `a/b`
+kf::vec<half, 4> d = kf::approx_div(a, b);
+```
+
+You can adjust the degree of approximation by supplying an integer template parameter:
+
+
+```cpp
+// Sine approximation with polynomial of degree 1
+kf::vec<half, 4> a = kf::approx_sin<1>(x);
+
+// Polynomial of degree 2
+kf::vec<half, 4> a = kf::approx_sin<2>(x);
+
+// Polynomial of degree 3
+kf::vec<half, 4> a = kf::approx_sin<3>(x);
+```
+
+## Tuning Accuracy Level
+
+Many functions in Kernel Float accept an additional Accuracy option as a template parameter. 
+This allows you to tune the accuracy level without changing the function name.
+
+There are four possible values for this parameter:
+
+- `kf::accurate_policy`: Use the most accurate version of the function available.
+- `kf::fast_policy`: Use the "fast math" version.
+- `kf::approx_policy<N>`: Use the approximate version with degree `N`.
+- `kf::default_policy`: Use a global default policy (see the next section).
+
+For example, consider this code:
 
-int main() {
-  kf::vec<float, 2> input = {1.0f, 2.0f};
+```cpp
+kf::vec<float, 2> input = {1.0f, 2.0f};
 
-  // Use the default policy
-  kf::vec<float, 2> A = kf::cos(input);
+// Use the default policy
+kf::vec<float, 2> a = kf::cos(input);
 
-  // Use the most accuracy policy
-  kf::vec<float, 2> B = kf::cos<kf::accurate_policy>(input);
+// Use the default policy
+kf::vec<float, 2> b = kf::cos<kf::default_policy>(input);
 
-  // Use the fastest policy
-  kf::vec<float, 2> C = kf::cos<kf::fast_policy>(input);
+// Use the most accurate policy
+kf::vec<float, 2> c = kf::cos<kf::accurate_policy>(input);
 
-  printf("A = %f, %f", A[0], A[1]);
-  printf("B = %f, %f", B[0], B[1]);
-  printf("C = %f, %f", C[0], C[1]);
+// Use the fastest policy
+kf::vec<float, 2> d = kf::cos<kf::fast_policy>(input);
 
-  return EXIT_SUCCESS;
-}
+// Use the approximate policy
+kf::vec<float, 2> e = kf::cos<kf::approx_policy<3>>(input);
 
+// You can use aliases to define your own policy
+using my_own_policy = kf::fast_policy;
+kf::vec<float, 2> f = kf::cos<my_own_policy>(input);
 ```
 
+## Setting `default_policy`
 
-Setting `default_policy`
----
-By default, the value for `default_policy` is `accurate_policy`.
+By default, `kf::default_policy` is set to `kf::accurate_policy`.
 
-Set the preprocessor option `KERNEL_FLOAT_FAST_MATH=1` to change the default policy to `fast_policy`.
+Set the preprocessor option `KERNEL_FLOAT_FAST_MATH=1` to change the default policy to `kf::fast_policy`.
+This will use fast math for all functions and data types that support it.
@@ -0,0 +1,112 @@
+# Full CUDA example
+
+This page explains a CUDA program that estimates the value of pi using Kernel Float.
+
+
+## Overview
+
+The program calculates Pi by generating random points within a unit square and counting how many fall inside the unit circle inscribed within that square. The ratio of points inside the circle to the total number of points approximates Pi/4.
+
+The kernel is shown below:
+
+
+```c++
+namespace kf = kernel_float;
+
+using float_type = float;
+static constexpr int VECTOR_SIZE = 4;
+
+__global__ void calculate_pi_kernel(int nx, int ny, int* global_count) {
+    int thread_x = blockIdx.x * blockDim.x + threadIdx.x;
+    int thread_y = blockIdx.y * blockDim.y + threadIdx.y;
+
+    kf::vec<int, VECTOR_SIZE> xi = thread_x * VECTOR_SIZE + kf::range<int, VECTOR_SIZE>();
+    kf::vec<int, VECTOR_SIZE> yi = thread_y;
+
+    kf::vec<float_type, VECTOR_SIZE> xf = kf::cast<float_type>(xi) / float_type(nx);
+    kf::vec<float_type, VECTOR_SIZE> yf = kf::cast<float_type>(yi) / float_type(ny);
+
+    kf::vec<float_type, VECTOR_SIZE> dist_squared = xf * xf + yf * yf;
+    kf::vec<float_type, VECTOR_SIZE> dist = kf::sqrt(dist_squared);
+
+    int n = kf::count(dist <= float_type(1));
+
+    if (n > 0) atomicAdd(global_count, n);
+}
+```
+
+
+## Code Explanation
+
+Let's go through the code step by step.
+
+```cpp
+// Alias `kernel_float` as `kf`
+namespace kf = kernel_float;
+```
+
+This creates an alias for `kernel_float`.
+
+```cpp
+// Define the float type and vector size
+using float_type = float;
+static constexpr int VECTOR_SIZE = 4;
+```
+
+Define `float_type` as an alias for `float` to make it easy to change precision if needed.
+The vector size is set to 4, meaning each thread will process 4 data points.
+
+```cpp
+__global__ void calculate_pi_kernel(int nx, int ny, int* global_count) {
+```
+
+The CUDA kernel. There are `nx` points along the x axis and `ny` points along the y axis.
+
+```cpp
+    int thread_x = blockIdx.x * blockDim.x + threadIdx.x;
+    int thread_y = blockIdx.y * blockDim.y + threadIdx.y;
+```
+
+Compute the global x- and y-index of this thread.
+
+```cpp
+    kf::vec<int, VECTOR_SIZE> xi = thread_x * VECTOR_SIZE + kf::range<int, VECTOR_SIZE>();
+    kf::vec<int, VECTOR_SIZE> yi = thread_y;
+```
+
+Compute the points that this thread will process.
+The x coordinates start at `thread_x * VECTOR_SIZE` and then the vector `[0, 1, 2, ..., VECTOR_SIZE-1]`.
+The y coordinates are all `thread_y`.
+
+```cpp
+    kf::vec<float_type, VECTOR_SIZE> xf = kf::cast<float_type>(xi) / float_type(nx);
+    kf::vec<float_type, VECTOR_SIZE> yf = kf::cast<float_type>(yi) / float_type(ny);
+```
+
+Divide `xi` and `yi` by `nx` and `ny` to normalize them to `[0, 1]` range.
+
+```cpp
+    kf::vec<float_type, VECTOR_SIZE> dist_squared = xf * xf + yf * yf;
+```
+
+Compute the squared distance from the origin (0, 0) to each point from `xf`,`yf`.
+
+```cpp
+    kf::vec<float_type, VECTOR_SIZE> dist = kf::sqrt(dist_squared);
+```
+
+Take the element-wise square root.
+
+```cpp
+    int n = kf::count(dist <= float_type(1));
+```
+
+Count the number of points in the unit circle (i.e., for which the distance is less than 4).
+The expression `dist <= 1` returns a vector of booleans and `kf::count` counts the number of `true` values.
+
+```cpp
+atomicAdd(global_count, n);
+```
+
+Add `n` to the `global_count` variable.
+This must be done using an atomic operation since multiple thread will write this variable simultaneously.