gjbex · May 3, 2022
diff --git a/python_for_hpc.pptx b/python_for_hpc.pptx (131 KB)
diff --git a/source-code/README.md b/source-code/README.md (+6)
diff --git a/source-code/gpu/README.md b/source-code/gpu/README.md (+14)
diff --git a/source-code/gpu/curand.ipynb b/source-code/gpu/curand.ipynb (+199)
diff --git a/source-code/gpu/numba.ipynb b/source-code/gpu/numba.ipynb (+1,583)
diff --git a/source-code/gpu/pycuda.ipynb b/source-code/gpu/pycuda.ipynb (mode 100755 to 100644, +1,112 -941)
diff --git a/source-code/gpu/scikit_cuda.ipynb b/source-code/gpu/scikit_cuda.ipynb (+877)
diff --git a/source-code/mpi4py/halo.py b/source-code/mpi4py/halo.py (+2 -2)
diff --git a/source-code/numba/README.md b/source-code/numba/README.md (+3)
diff --git a/source-code/numba/computing_pi.ipynb b/source-code/numba/computing_pi.ipynb (+176)
diff --git a/source-code/numba/numba_parallel.ipynb b/source-code/numba/numba_parallel.ipynb (+438)
diff --git a/source-code/performance/README.md b/source-code/performance/README.md (+10)
diff --git a/source-code/performance/number_puzzle.ipynb b/source-code/performance/number_puzzle.ipynb (+1,474)
@@ -3,6 +3,7 @@
 This is source code that is either used in the presentation, or was developed
 to create it.  There is some material not covered in the presentation as well.
 
+
 ## Requirements
 
 * Python version: at least 3.6
@@ -18,6 +19,10 @@ to create it.  There is some material not covered in the presentation as well.
   * jupyter
   * ipywidgets
 
+* For the GPU code:
+  * pycuda
+  * scikit-cuda
+
 
 ## What is it?
 
@@ -39,3 +44,4 @@ to create it.  There is some material not covered in the presentation as well.
 1. `pypy`: code to experiment with the Pypy interpreter.
 1. `file-formats`: influence of file formats on performance.
 1. `gpu`: some examples of using GPUs.
+1. `performance`: general considerations about performance.
@@ -0,0 +1,14 @@
+# GPU
+
+Sample code for performing computations on a GPU.
+
+
+## What is it?
+
+1. `pycuda.ipynb`: jupyter notebook illustrating pyCUDA.
+1. `curand.ipynb`: jupyter notebook illustrating generating random
+   numbers on a GPU.
+1. `scikit_cuda.ipynb`: jupyter notebook illustrating linear algebra
+   on a GPU device.
+1. `numba.ipynb`: jupyter notebook illustrating using numba for
+   GPU computing.
@@ -0,0 +1,199 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "2a4181f6-f103-4025-a45c-5bd31bbc3d12",
+   "metadata": {},
+   "source": [
+    "# Requirements"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "id": "eac06215-e0dd-430b-a157-a5aab4904b65",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import math\n",
+    "import numpy as np\n",
+    "import pycuda.autoinit\n",
+    "from pycuda import gpuarray\n",
+    "from pycuda.compiler import SourceModule"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "5832d66d-0446-499c-a17e-ffcac9bdbd0a",
+   "metadata": {},
+   "source": [
+    "We determine $\\pi$ from the ratio between the area of a circle of radius 1 and the area of the square that circumscribes it. The area of the circle will be approximated by the number of randomly selected points that fall into it, compared to the (larger) number of points that fall into the square.\n",
+    "\n",
+    "If we choose $x$ and $y$ independently from a uniform distribution $[0, 1[$, then $(x, y)$ represents a point and lies in a circle with radius 1 if $x^2 + y^2 \\le 1$.  Since this is only one quarter of the circle and circumscribing square, we get $\\pi$ by dividing the number of points in the circle by the total number of points, and multiplying by 4."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "32ca5892-664f-4317-a41d-b4f7a4d3f156",
+   "metadata": {},
+   "source": [
+    "# Implementation"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "2bc4dbd6-5018-427d-9ee9-347bfe6cd9d9",
+   "metadata": {},
+   "source": [
+    "To generate random numbers for each thread on the GPU, we use the cuRAND library, which is a C++ library.  Since we use plain C for our CUDA code, we have to make sure that the header file is read outside of an `extern \"C\"` block.  By default, `SourceModule` wraps the source code in such a block, so we switch that off by passing `no_extern_c=True`.  In the source code, we implement our kernel in an explicit `extern \"C\"` block.\n",
+    "\n",
+    "The random number generator is initialized using the `curand_init` function that takes a seed as its first argument.  We make the seed differ between threads by adding the global thread index to the clock value.\n",
+    "\n",
+    "Random numbers are sampled from a uniform distribution using the `curand_uniform` function."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 48,
+   "id": "122a78d1-9d07-4484-8f30-a01ed58a9715",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "source_code = '''\n",
+    "    #include <curand_kernel.h>\n",
+    "    \n",
+    "    typedef unsigned long long cu_long;\n",
+    "    \n",
+    "    extern \"C\" {\n",
+    "        __global__ void estimate_pi(cu_long nr_tries, cu_long *nr_hits) {\n",
+    "            curandState rand_state;\n",
+    "            int thread_id = blockIdx.x*blockDim.x + threadIdx.x;\n",
+    "            curand_init((cu_long) clock() + (cu_long) thread_id,\n",
+    "                        (cu_long) 0, (cu_long) 0, &rand_state);\n",
+    "            float x, y;\n",
+    "            for (cu_long i = 0; i < nr_tries; ++i) {\n",
+    "                x = curand_uniform(&rand_state);\n",
+    "                y = curand_uniform(&rand_state);\n",
+    "                if (x*x + y*y < 1.0f) {\n",
+    "                    nr_hits[thread_id]++;\n",
+    "                }\n",
+    "            }\n",
+    "        }\n",
+    "    }\n",
+    "'''\n",
+    "\n",
+    "kernels = SourceModule(no_extern_c=True, source=source_code)\n",
+    "pi_kernel = kernels.get_function('estimate_pi')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "e54d0d19-9545-46b6-b79f-5a3a110e7a66",
+   "metadata": {},
+   "source": [
+    "Set the number of threads per block and the number of blocks per grid, and create an array of the appropriate size to store the counts for each thread.  Also specify the number of points to try per thread."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 44,
+   "id": "71134a08-5195-4205-ae1c-51c7298704e7",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "threads_per_block = 32\n",
+    "blocks_per_grid = 512\n",
+    "total_threads = threads_per_block*blocks_per_grid\n",
+    "nr_hits = gpuarray.zeros((total_threads, ), dtype=np.uint64)\n",
+    "nr_tries = np.uint64(2**24)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "84c1a177-7a4f-4b27-9a11-4b318601ac4d",
+   "metadata": {},
+   "source": [
+    "Now we can execute the kernel and compute the value of $\\pi$."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 45,
+   "id": "a872e28c-9256-4d50-b46c-f8381131eda1",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "pi_kernel(nr_tries, nr_hits, grid=(blocks_per_grid, 1, 1), block=(threads_per_block, 1, 1))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 46,
+   "id": "53ddbb43-0f22-427e-83a6-c6d0264e958e",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "pi_computed = 4.0*np.sum(nr_hits.get())/(nr_tries*total_threads)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "19bbb167-50b8-4bdc-8c92-1fb2f69d1e1b",
+   "metadata": {},
+   "source": [
+    "Checking the accuracy as compared with $\\pi$'s true value shows that it is correct up to a relative tolerance of $10^{-6}$."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 47,
+   "id": "f86f0258-64c8-44a0-b8b3-5a6f5d24e362",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "1.0e-01 True\n",
+      "1.0e-02 True\n",
+      "1.0e-03 True\n",
+      "1.0e-04 True\n",
+      "1.0e-05 True\n",
+      "1.0e-06 True\n",
+      "1.0e-07 False\n",
+      "1.0e-08 False\n",
+      "1.0e-09 False\n",
+      "1.0e-10 False\n",
+      "1.0e-11 False\n",
+      "1.0e-12 False\n"
+     ]
+    }
+   ],
+   "source": [
+    "for tol in np.logspace(-1, -12, num=12):\n",
+    "    print(f'{tol:.1e} {math.isclose(pi_computed, math.pi, rel_tol=tol)}')"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.9.7"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
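
The notebook above reports agreement with $\pi$ up to a relative tolerance of $10^{-6}$. As a rough plausibility check, the following pure-Python sketch mirrors the kernel's structure on the CPU (one hit counter per simulated thread, summed afterwards) and estimates the statistical error one would expect, assuming independent, uniformly distributed points. The function names, the `seed` parameter and the use of Python's `random` module in place of cuRAND are assumptions made for illustration; they are not part of the commit.

```python
import math
import random


def estimate_pi_per_thread(nr_threads, nr_tries, seed=1234):
    """CPU stand-in for the kernel: one hit counter per simulated thread."""
    nr_hits = [0] * nr_threads
    for thread_id in range(nr_threads):
        # Similar in spirit to the kernel: each "thread" gets its own generator
        # state by offsetting a base seed (hypothetical) with the thread index.
        rng = random.Random(seed + thread_id)
        for _ in range(nr_tries):
            x, y = rng.random(), rng.random()
            if x * x + y * y < 1.0:
                nr_hits[thread_id] += 1
    # Host-side reduction, analogous to np.sum(nr_hits.get()) in the notebook.
    return 4.0 * sum(nr_hits) / (nr_tries * nr_threads)


def expected_relative_error(total_points):
    """One-standard-deviation relative error of the estimate for N points."""
    p = math.pi / 4.0  # probability that a point lands in the quarter circle
    return 4.0 * math.sqrt(p * (1.0 - p) / total_points) / math.pi


if __name__ == "__main__":
    small = estimate_pi_per_thread(nr_threads=8, nr_tries=10_000)
    print(f"small CPU run: {small:.4f}")
    # Notebook configuration: 32 threads/block, 512 blocks, 2**24 tries each.
    total = 32 * 512 * 2**24
    print(f"expected relative error: {expected_relative_error(total):.1e}")
```

For the notebook's configuration ($2^{38}$ points in total) the expected relative error comes out on the order of $10^{-6}$, which is consistent with the tolerance at which the check in the notebook starts to fail.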
@@ -65,9 +65,9 @@ def halos_to_str(left, upper, right, lower):
         print(f'rank {rank}, left = {left}, up = {up}, right = {right}, down = {down}')
     # note that the column data is stored non-contiguously, and has to be copied
     # before sending it
-    send_buffer = np.array(matrix[:, 0])
+    send_buffer = np.array(matrix[:, 0]).copy()
     cart_comm.Sendrecv(send_buffer, left, recvbuf=right_halo, source=right)
-    send_buffer = np.array(matrix[:, -1])
+    send_buffer = np.array(matrix[:, -1]).copy()
     cart_comm.Sendrecv(send_buffer, right, recvbuf=left_halo, source=left)
     # row data is contiguous and can be used as a sendbuffer without copying
     cart_comm.Sendrecv(matrix[0, :], up, recvbuf=lower_halo, source=down)
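
The comments in this hunk distinguish column data (non-contiguous) from row data (contiguous). A minimal sketch of that distinction with a small row-major NumPy array is given below; the array shape and variable names are made up for illustration, and `np.ascontiguousarray` is used here as one possible way to obtain a packed buffer, not as the approach taken in the commit.

```python
import numpy as np

# Small row-major (C order) matrix; the shape is arbitrary.
matrix = np.arange(12, dtype=np.float64).reshape(3, 4)

column_view = matrix[:, 0]   # strided view: consecutive elements are a row apart
row_view = matrix[0, :]      # view on one row: elements are adjacent in memory

# Packed copy of the column, suitable as a contiguous send buffer.
column_copy = np.ascontiguousarray(column_view)

print(column_view.flags['C_CONTIGUOUS'])
print(row_view.flags['C_CONTIGUOUS'])
print(column_copy.flags['C_CONTIGUOUS'])
print(np.shares_memory(column_copy, matrix))
```

The flags show that only the column view lacks a contiguous layout, which is why the column is copied before it is passed to `Sendrecv` while the row is passed directly.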
 
@@ -12,3 +12,6 @@ can be obtained without much effort.
 1. `Primes`: code to compute the first n prime numbers comparing a pure Python
     implementation with numba JIT and eager JIT.
 1. `Ufunc`: defining a numpy ufunc using numba.
+1. `numba_parallel.ipynb`: jupyter notebook experimenting with numba's
+   parallel capabilities.
+1. `computing_pi.ipynb`: jupyter notebook illustrating speedup by numba.
@@ -0,0 +1,176 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "dad1f0e2-bc13-4d67-b40e-e5f8946e7e6f",
+   "metadata": {},
+   "source": [
+    "# Requirements"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "id": "694413b5-380c-4239-8bce-09b90df7fe79",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from numba import njit\n",
+    "import numpy as np\n",
+    "import random"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "aee5369d-7b69-4fb4-8567-52bd8e92571b",
+   "metadata": {},
+   "source": [
+    "# Random $\\pi$"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "70202ae4-ad82-4c91-aea0-a3aeccfb7bdc",
+   "metadata": {},
+   "source": [
+    "Compute $\\pi$ by generating random points in a square and counting how many there are in the circle inscribed in the square."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "id": "b5f095c0-58f6-4098-829c-6e696ae2a2bd",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def compute_pi(nr_tries):\n",
+    "    hits = 0\n",
+    "    for _ in range(nr_tries):\n",
+    "        x = random.random()\n",
+    "        y = random.random()\n",
+    "        if x**2 + y**2 < 1.0:\n",
+    "            hits += 1\n",
+    "    return 4.0*hits/nr_tries"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "id": "805f4c9f-5d19-486a-988e-bf103683c37c",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "@njit\n",
+    "def compute_pi_jit(nr_tries):\n",
+    "    hits = 0\n",
+    "    for _ in range(nr_tries):\n",
+    "        x = random.random()\n",
+    "        y = random.random()\n",
+    "        if x**2 + y**2 < 1.0:\n",
+    "            hits += 1\n",
+    "    return 4.0*hits/nr_tries"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 32,
+   "id": "f7a7bb7e-6ad1-4b6d-bb5b-d99ebedf7991",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "@njit(['float64(int64)'])\n",
+    "def compute_pi_jit_sign(nr_tries):\n",
+    "    hits = 0\n",
+    "    for _ in range(nr_tries):\n",
+    "        x = random.random()\n",
+    "        y = random.random()\n",
+    "        if x**2 + y**2 < 1.0:\n",
+    "            hits += 1\n",
+    "    return 4.0*hits/nr_tries"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 9,
+   "id": "13f3c23d-674e-43b2-b503-a83c20cf5075",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "27.1 ms ± 277 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)\n"
+     ]
+    }
+   ],
+   "source": [
+    "%timeit compute_pi(100_000)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 10,
+   "id": "de965fa5-b3e3-4548-8d41-661baf6abe65",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "687 µs ± 9.53 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)\n"
+     ]
+    }
+   ],
+   "source": [
+    "%timeit compute_pi_jit(100_000)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 34,
+   "id": "f240d35d-2fdb-45db-9e59-d392887c9a16",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "685 µs ± 8.96 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)\n"
+     ]
+    }
+   ],
+   "source": [
+    "%timeit compute_pi_jit_sign(np.int64(100_000))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "da96c2f7-afc2-4122-ad80-62d7c57272e9",
+   "metadata": {},
+   "source": [
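+    "Using numba's just-in-time compiler speeds up the computation by roughly a factor of 40 (27.1 ms versus 687 µs for 100,000 tries).  Specifying the signature explicitly makes no measurable difference here (685 µs)."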
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.9.7"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
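
The timings in the notebook above rely on IPython's `%timeit` magic. Outside of a notebook, a comparable measurement can be sketched with the standard library's `timeit` module. The helper `time_per_call` and its parameters are hypothetical names introduced for this example; unlike `%timeit`, which reports mean and standard deviation, it reports the best of several repeats.

```python
import random
import timeit


def compute_pi(nr_tries):
    """Pure Python version, as in the notebook."""
    hits = 0
    for _ in range(nr_tries):
        x = random.random()
        y = random.random()
        if x**2 + y**2 < 1.0:
            hits += 1
    return 4.0*hits/nr_tries


def time_per_call(func, arg, number=5, repeat=3):
    """Best-of-`repeat` wall time in seconds for a single call of func(arg)."""
    # Warm-up call: for a lazily compiled @njit function this triggers
    # compilation, so that compile time is kept out of the measurement.
    func(arg)
    timings = timeit.repeat(lambda: func(arg), number=number, repeat=repeat)
    return min(timings) / number


if __name__ == "__main__":
    seconds = time_per_call(compute_pi, 100_000)
    print(f"compute_pi(100_000): {seconds*1e3:.1f} ms per call")
```

The same helper could be applied to `compute_pi_jit` from the notebook, provided numba is installed, to compare the two variants under identical conditions.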
@@ -0,0 +1,10 @@
+# Performance
+
+These are some resources about general performance considerations.
+
+
+## What is it?
+
+1. `number_puzzle.ipynb`: jupyter notebook solving a number puzzle
+   in a variety of ways, showing some aspects of performance
+   optimization.