71 changes: 26 additions & 45 deletions README.md
@@ -4,66 +4,47 @@
[![codecov](https://codecov.io/github/julialegate/cuNumeric.jl/branch/main/graph/badge.svg)](https://app.codecov.io/github/JuliaLegate/cuNumeric.jl)
[![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](https://opensource.org/licenses/MIT)

> [!WARNING]
> Legate.jl and cuNumeric.jl are under active development. This is a pre-release API and is subject to change. Stability is not guaranteed until the first official release. We are actively working to make the build experience more seamless and Julia-friendly. In parallel, we're developing a comprehensive testing framework to ensure reliability and robustness. Our public beta launch is targeted for Fall 2025.

The cuNumeric.jl package wraps the [cuPyNumeric](https://github.com/nv-legate/cupynumeric) C++ API from NVIDIA to bring simple distributed computing on GPUs and CPUs to Julia! We provide a simple array abstraction, the `NDArray`, which supports most of the operations you would expect from a normal Julia array.
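A quick sketch of what the `NDArray` abstraction looks like in practice (constructors follow the `cuNumeric.rand(Float32, N, M)` form used later in this README; the API is pre-release, so exact names may change):

```julia
using cuNumeric

# NDArrays behave much like ordinary Julia arrays, but the Legate
# runtime distributes the work across the available CPUs/GPUs.
A = cuNumeric.rand(Float32, 1000, 1000)
B = cuNumeric.rand(Float32, 1000, 1000)

C = A .+ 2 .* B   # broadcasting works as in Base Julia
s = sum(C)        # reductions too
```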

This project is in alpha and we make no guarantee that everything works as you would expect. The current build process requires several external dependencies which are not yet registered with BinaryBuilder.jl. The build instructions and minimum prerequisites are as follows:

### Minimum prereqs
- Ubuntu 20.04 or RHEL 8
- Julia 1.11
> [!WARNING]
> Legate.jl and cuNumeric.jl are under active development. This is a pre-release API and is subject to change. Stability is not guaranteed until the first official release. We are actively working to make the build experience more seamless and Julia-friendly. In parallel, we're developing a comprehensive testing framework to ensure reliability and robustness.

### 1. Install Julia through [JuliaUp](https://github.com/JuliaLang/juliaup)
### Quick Start
cuNumeric.jl can be installed with the Julia package manager. From the Julia REPL, type `]` to enter the Pkg REPL mode and run:
```julia
pkg> add cuNumeric
```
curl -fsSL https://install.julialang.org | sh -s -- --default-channel 1.11
Or, using the `Pkg` API:
```julia
using Pkg; Pkg.add(url = "https://github.com/JuliaLegate/cuNumeric.jl", rev = "main")
```
The first run might take a while as it has to install multiple large dependencies such as the CUDA SDK (if you have an NVIDIA GPU). For more install instructions, please visit our install guide in the documentation.

This will install version 1.11 by default since that is what we have tested against. To verify that 1.11 is the default, run either of the following (you may need to source your `.bashrc` first):
```bash
juliaup status
julia --version
```
To see information about your cuNumeric install, run the `versioninfo` function.

If 1.11 is not your default, please set it to be the default. Other versions of Julia are untested.
```bash
juliaup default 1.11
```

### 2. Download cuNumeric.jl (quick setup)
cuNumeric.jl is not on the general registry yet. To add cuNumeric.jl to your environment run:
```julia
using Pkg; Pkg.develop(url = "https://github.com/JuliaLegate/cuNumeric.jl")
cuNumeric.versioninfo()
```
By default, this will use [legate_jll](https://github.com/JuliaBinaryWrappers/legate_jll.jl/) and [cupynumeric_jll](https://github.com/JuliaBinaryWrappers/cupynumeric_jll.jl/).

For more build configurations and options, please visit our [installation guide](https://julialegate.github.io/cuNumeric.jl/dev/install).
### Monte-Carlo Example
```julia
using cuNumeric

#### 2b. Contributing to cuNumeric.jl
To contribute to cuNumeric.jl, we recommend cloning the repository and adding it to one of your existing environments with `Pkg.develop`.
```bash
git clone https://github.com/JuliaLegate/cuNumeric.jl.git
julia --project=. -e 'using Pkg; Pkg.develop(path = "cuNumeric.jl/lib/CNPreferences")'
julia --project=. -e 'using Pkg; Pkg.develop(path = "cuNumeric.jl")'
julia --project=. -e 'using CNPreferences; CNPreferences.use_developer_mode()'
julia --project=. -e 'using Pkg; Pkg.build()'
```
integrand = (x) -> exp.(-x.^2)

To learn more about contributing to Legate.jl, check out the [Legate.jl README.md](https://github.com/JuliaLegate/Legate.jl?tab=readme-ov-file#2-download-legatejl)
N = 1_000_000

### 3. Test the Julia Package
Run this command in the Julia environment where cuNumeric.jl is installed.
```julia
using Pkg; Pkg.test("cuNumeric")
```
With everything working, it's the perfect time to check out some of our [examples](https://julialegate.github.io/cuNumeric.jl/dev/examples)!
x_max = 10.0f0
domain = [-x_max, x_max]
Ω = domain[2] - domain[1]

samples = Ω*cuNumeric.rand(N) .- x_max
estimate = (Ω/N) * sum(integrand(samples))

## Contact
For technical questions, please contact
`krasow(at)u.northwestern.edu` or
`emeitz(at)andrew.cmu.edu`
println("Monte-Carlo Estimate: $(estimate)")
```

If the issue is building the package, please include the `build.log` and `.err` files found in `cuNumeric.jl/deps/`.
### Requirements

We require an x86 Linux platform and Julia 1.10 or 1.11. For GPU support we require an NVIDIA GPU and a CUDA driver which supports CUDA 13.0. ARM support is theoretically possible, but we do not make binaries or test on ARM. Please open an issue if ARM support is of interest.
8 changes: 8 additions & 0 deletions docs/Project.toml
@@ -1,7 +1,15 @@
[deps]
cuNumeric = "0fd9ffd4-7e84-4cd0-b8f8-645bd8c73620"
CNPreferences = "3e078157-ea10-49d5-bf32-908f777cd46f"
Documenter = "e30172f5-a6a5-5a46-863b-614d45cd2de4"
DocumenterVitepress = "4710194d-e776-4893-9690-8d956a29c365"
LiveServer = "16fef848-5104-11e9-1b77-fb7a48bbb589"

[compat]
Documenter = "1.5"
cuNumeric = "0.1"
CNPreferences = "0.1.2"

[sources]
cuNumeric = {path = ".."}
CNPreferences = {path = "../lib/CNPreferences"}
24 changes: 14 additions & 10 deletions docs/make.jl
@@ -2,6 +2,8 @@ using Documenter, DocumenterVitepress
using cuNumeric
using CNPreferences

ci = get(ENV, "CI", "") == "true"

makedocs(;
sitename="cuNumeric.jl",
authors="Ethan Meitz and David Krasowska",
@@ -12,20 +14,22 @@ makedocs(;
),
pages=[
"Home" => "index.md",
"Build Options" => "install.md",
"Install Guide" => "install.md",
"Examples" => "examples.md",
"Performance Tips" => "perf.md",
"Back End Details" => "usage.md",
"Benchmarks" => "benchmark_results.md",
"How to Benchmark" => "benchmark.md",
"Benchmarks" => "benchmark.md",
"Public API" => "api.md",
],
)

DocumenterVitepress.deploydocs(;
repo="github.com/JuliaLegate/cuNumeric.jl",
target=joinpath(@__DIR__, "build"),
branch="gh-pages",
devbranch="main",
push_preview=true,
)
if ci
@info "Deploying Docs to GitHub Pages"
DocumenterVitepress.deploydocs(;
repo="github.com/JuliaLegate/cuNumeric.jl",
target=joinpath(@__DIR__, "build"),
branch="gh-pages",
devbranch="main",
push_preview=true,
)
end
52 changes: 48 additions & 4 deletions docs/src/benchmark.md
@@ -1,3 +1,46 @@
# Benchmark Results

For JuliaCon 2025 we benchmarked cuNumeric.jl on 8 A100 GPUs (single-node) and compared it to the Python library cuPyNumeric and, depending on the problem, other relevant libraries. All results shown are weak scaling. We hope to have multi-node benchmarks soon!


```@contents
Pages = ["benchmark.md"]
Depth = 2:2
```

## SGEMM

Code Outline:
```julia
mul!(C, A, B)
```
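For reference, the operation count behind these plots is the standard 2·N·M·N FLOPs for an N×M by M×N multiply. A minimal timing sketch in plain Julia (CPU, base `LinearAlgebra`, not cuNumeric) looks like:

```julia
using LinearAlgebra

N = 2048
A = rand(Float32, N, N)
B = rand(Float32, N, N)
C = zeros(Float32, N, N)

mul!(C, A, B)                # warm-up run to exclude compilation time
t = @elapsed mul!(C, A, B)   # seconds for one in-place multiply

gflops = 2 * N^3 / t / 1e9   # 2*N*M*N ops with N == M
println("GFLOPS: ", gflops)
```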

GEMM Efficiency | GEMM GFLOPS
:-------------------------:|:-------------------------:
![GEMM Efficiency](images/gemm_efficiency.svg) | ![GEMM GFLOPS](images/gemm_gflops.svg)

## Monte-Carlo Integration

Monte-Carlo integration is embarrassingly parallel and should scale perfectly. We do not know the exact number of operations in `exp`, so the GFLOPS figures are off by a constant factor.

Code Outline:
```julia
integrand = (x) -> exp.(-x.^2)
val = (V/N) * sum(integrand(x))
```
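The same estimator can be sanity-checked in plain Julia (no cuNumeric) against the analytic value of the integral, ∫exp(−x²)dx = √π ≈ 1.7725:

```julia
# Monte-Carlo estimate of the integral of exp(-x^2) over [-x_max, x_max].
integrand = x -> exp.(-x .^ 2)

N = 1_000_000
x_max = 10.0
V = 2 * x_max                 # width of the sampling domain

x = V .* rand(N) .- x_max     # uniform samples in [-x_max, x_max]
estimate = (V / N) * sum(integrand(x))
println(estimate)             # close to sqrt(pi)
```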

MC Efficiency | MC GFLOPS
:-------------------------:|:-------------------------:
![MC Efficiency](images/mc_eff.svg) | ![MC GFLOPS](images/mc_ops.svg)


## Gray-Scott (2D)

Solving a PDE requires halo exchanges and lots of data movement. In this benchmark we fall an order of magnitude short of `ImplicitGlobalGrid.jl`, a library that specifically targets multi-node, multi-GPU halo exchanges. We attribute this gap to the lack of kernel fusion in cuNumeric.jl.

![GS GFLOPS](images/gs_gflops_diffeq.svg)


# Benchmarking cuNumeric.jl Programs

Since there is no programmatic way to set the hardware configuration (as of 24.11), benchmarking cuNumeric.jl code is a bit tedious. As an introduction, we walk through a benchmark of matrix multiplication (SGEMM). All the code for this benchmark can be found in the `cuNumeric.jl/pkg/benchmark` directory.
@@ -14,10 +57,10 @@ In this benchmark we will try to understand the weak scaling behavior of the SGE
using cuNumeric

function initialize_cunumeric(N, M)
A = cuNumeric.as_type(cuNumeric.rand(NDArray, N, M), Float32)
B = cuNumeric.as_type(cuNumeric.rand(NDArray, M, N), Float32)
A = cuNumeric.rand(Float32, N, M)
B = cuNumeric.rand(Float32, M, N)
C = cuNumeric.zeros(Float32, N, N)
GC.gc() # remove the intermediate FP64 arrays
GC.gc() # remove any intermediate arrays
return A, B, C
end

@@ -58,10 +101,11 @@ function gemm_cunumeric(N, M, n_samples, n_warmup)
return mean_time_ms, gflops
end

N = 100
n_samples = 10
n_warmup = 2

mean_time_ms, gflops = gemm_cunumeric(N, n_samples, n_warmup)
mean_time_ms, gflops = gemm_cunumeric(N, N, n_samples, n_warmup)
```

Since there is no programmatic way to set the hardware configuration, we must manipulate the environment variables described in [Setting Hardware Configuration](@ref) through shell scripts to make a weak scaling plot. These variables must be set before we launch the Julia runtime where we will run our benchmark. Therefore, we do not recommend generating scaling plots from the REPL, because you would have to restart the REPL each time to re-configure the hardware settings. To make benchmarking easier, we provide a small shell script, `run_benchmark.sh`, located in `cuNumeric.jl/pkg/benchmark`. This script will automatically set `LEGATE_CONFIG` according to the passed flags and run the specified benchmark file.
41 changes: 0 additions & 41 deletions docs/src/benchmark_results.md

This file was deleted.

12 changes: 8 additions & 4 deletions docs/src/dev.md
@@ -1,6 +1,10 @@
# Developing cuNumeric.jl

There are two primary ways to develop `cuNumeric.jl`:
- Clone the git repo and only work with `cuNumeric.jl`
- Add `cuNumeric.jl` to another environment to test functionality with other packages

To contribute to cuNumeric.jl, we recommend cloning the repository and adding it to one of your existing environments with `Pkg.develop`.
```bash
git clone https://github.com/JuliaLegate/cuNumeric.jl.git
julia --project=. -e 'using Pkg; Pkg.develop(path = "cuNumeric.jl/lib/CNPreferences")'
julia --project=. -e 'using Pkg; Pkg.develop(path = "cuNumeric.jl")'
julia --project=. -e 'using CNPreferences; CNPreferences.use_developer_mode()'
julia --project=. -e 'using Pkg; Pkg.build()'
```
19 changes: 2 additions & 17 deletions docs/src/errors.md
@@ -1,19 +1,4 @@
# Common Errors
### [1] ERROR: LoadError: JULIA_LEGATE_XXXX_PATH not found via environment or JLL.
This can occur for several reasons; in all cases, it means the JLL is not available.
For the library that failed, you can set an environment variable to use a custom install.
```bash
export JULIA_LEGATE_XXXX_PATH="/path/to/library/failing"
```

However, if you want to make the JLL available, you need the CUDA driver `libcuda.so` and the CUDA runtime `libcudart.so` on your path. You can use JLLs to achieve this:

```bash
echo "LD_LIBRARY_PATH=$(julia --project=[yourenv] -e 'using Pkg; \
Pkg.add(name = "CUDA_Driver_jll", version = "0.12.1"); \
using CUDA_Driver_jll; \
print(joinpath(CUDA_Driver_jll.artifact_dir, "lib"))' \
):$LD_LIBRARY_PATH"
```

Note: You may use a different compatible driver version, but ensure it works with our supported CUDA toolkit/runtime versions (12.2 – 12.9). CUDA runtime 13.0 is untested and will break this package.
## OOM on Startup
If you have other processes using GPU RAM (e.g. another instance of cuNumeric.jl), then cuNumeric.jl will fail to start and will segfault. The first symbol in the backtrace is typically something like `_ZN5Realm4CudaL22allocate_device_memoryEPNS0_3GPUEm`. You can fix this by killing the other jobs or by modifying the amount of GPU RAM requested in `LEGATE_CONFIG`. See the [usage](./usage.md) documentation for examples of how to set the `LEGATE_CONFIG` environment variable.