71 changes: 26 additions & 45 deletions README.md
@@ -4,66 +4,47 @@
[![codecov](https://codecov.io/github/julialegate/cuNumeric.jl/branch/main/graph/badge.svg)](https://app.codecov.io/github/JuliaLegate/cuNumeric.jl)
[![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](https://opensource.org/licenses/MIT)

> [!WARNING]
> Legate.jl and cuNumeric.jl are under active development. This is a pre-release API and is subject to change. Stability is not guaranteed until the first official release. We are actively working to make the build experience more seamless and Julia-friendly. In parallel, we're developing a comprehensive testing framework to ensure reliability and robustness. Our public beta launch is targeted for Fall 2025.

The cuNumeric.jl package wraps the [cuPyNumeric](https://github.com/nv-legate/cupynumeric) C++ API from NVIDIA to bring simple distributed computing on GPUs and CPUs to Julia! We provide a simple array abstraction, the `NDArray`, which supports most of the operations you would expect from a normal Julia array.
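A quick sketch of what the `NDArray` abstraction looks like in practice (constructors follow the `cuNumeric.rand(Float32, N, M)` form used later in this README; the API is pre-release, so exact names may change):

```julia
using cuNumeric

# NDArrays behave much like ordinary Julia arrays, but the Legate
# runtime distributes the work across the available CPUs/GPUs.
A = cuNumeric.rand(Float32, 1000, 1000)
B = cuNumeric.rand(Float32, 1000, 1000)

C = A .+ 2 .* B   # broadcasting works as in Base Julia
s = sum(C)        # reductions too
```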

This project is in alpha and we make no guarantee that everything works as you would expect. The current build process requires several external dependencies which are not yet registered with BinaryBuilder.jl. The build instructions and minimum prerequisites are as follows:

### Minimum prereqs
- Ubuntu 20.04 or RHEL 8
- Julia 1.11
> [!WARNING]
> Legate.jl and cuNumeric.jl are under active development. This is a pre-release API and is subject to change. Stability is not guaranteed until the first official release. We are actively working to make the build experience more seamless and Julia-friendly. In parallel, we're developing a comprehensive testing framework to ensure reliability and robustness.

### 1. Install Julia through [JuliaUp](https://github.com/JuliaLang/juliaup)
### Quick Start
cuNumeric.jl can be installed with the Julia package manager. From the Julia REPL, type `]` to enter the Pkg REPL mode and run:
```julia
pkg> add cuNumeric
```
curl -fsSL https://install.julialang.org | sh -s -- --default-channel 1.11
Or, using the `Pkg` API:
```julia
using Pkg; Pkg.add(url = "https://github.com/JuliaLegate/cuNumeric.jl", rev = "main")
```
The first run might take a while as it has to install multiple large dependencies such as the CUDA SDK (if you have an NVIDIA GPU). For more install instructions, please visit our install guide in the documentation.

This will install version 1.11 by default since that is what we have tested against. To verify that 1.11 is the default, run either of the following (you may need to source your `.bashrc` first):
```bash
juliaup status
julia --version
```
To see information about your cuNumeric install, run the `versioninfo` function.

If 1.11 is not your default, please set it to be the default. Other versions of Julia are untested.
```bash
juliaup default 1.11
```

### 2. Download cuNumeric.jl (quick setup)
cuNumeric.jl is not on the general registry yet. To add cuNumeric.jl to your environment run:
```julia
using Pkg; Pkg.develop(url = "https://github.com/JuliaLegate/cuNumeric.jl")
cuNumeric.versioninfo()
```
By default, this will use [legate_jll](https://github.com/JuliaBinaryWrappers/legate_jll.jl/) and [cupynumeric_jll](https://github.com/JuliaBinaryWrappers/cupynumeric_jll.jl/).

For more build configurations and options, please visit our [installation guide](https://julialegate.github.io/cuNumeric.jl/dev/install).
### Monte-Carlo Example
```julia
using cuNumeric

#### 2b. Contributing to cuNumeric.jl
To contribute to cuNumeric.jl, we recommend cloning the repository and adding it to one of your existing environments with `Pkg.develop`.
```bash
git clone https://github.com/JuliaLegate/cuNumeric.jl.git
julia --project=. -e 'using Pkg; Pkg.develop(path = "cuNumeric.jl/lib/CNPreferences")'
julia --project=. -e 'using Pkg; Pkg.develop(path = "cuNumeric.jl")'
julia --project=. -e 'using CNPreferences; CNPreferences.use_developer_mode()'
julia --project=. -e 'using Pkg; Pkg.build()'
```
integrand = (x) -> exp.(-x.^2)

To learn more about contributing to Legate.jl, check out the [Legate.jl README.md](https://github.com/JuliaLegate/Legate.jl?tab=readme-ov-file#2-download-legatejl)
N = 1_000_000

### 3. Test the Julia Package
Run this command in the Julia environment where cuNumeric.jl is installed.
```julia
using Pkg; Pkg.test("cuNumeric")
```
With everything working, it's the perfect time to check out some of our [examples](https://julialegate.github.io/cuNumeric.jl/dev/examples)!
x_max = 10.0f0
domain = [-x_max, x_max]
Ω = domain[2] - domain[1]

samples = Ω*cuNumeric.rand(N) .- x_max
estimate = (Ω/N) * sum(integrand(samples))

## Contact
For technical questions, please contact
`krasow(at)u.northwestern.edu` or
`emeitz(at)andrew.cmu.edu`
println("Monte-Carlo Estimate: $(estimate)")
```

If the issue is building the package, please include the `build.log` and `.err` files found in `cuNumeric.jl/deps/`.
### Requirements

We require an x86 Linux platform and Julia 1.10 or 1.11. For GPU support we require an NVIDIA GPU and a CUDA driver which supports CUDA 13.0. ARM support is theoretically possible, but we do not make binaries or test on ARM. Please open an issue if ARM support is of interest.
8 changes: 8 additions & 0 deletions docs/Project.toml
@@ -1,7 +1,15 @@
[deps]
cuNumeric = "0fd9ffd4-7e84-4cd0-b8f8-645bd8c73620"
CNPreferences = "3e078157-ea10-49d5-bf32-908f777cd46f"
Documenter = "e30172f5-a6a5-5a46-863b-614d45cd2de4"
DocumenterVitepress = "4710194d-e776-4893-9690-8d956a29c365"
LiveServer = "16fef848-5104-11e9-1b77-fb7a48bbb589"

[compat]
Documenter = "1.5"
cuNumeric = "0.1"
CNPreferences = "0.1.2"

[sources]
cuNumeric = {path = ".."}
CNPreferences = {path = "../lib/CNPreferences"}
24 changes: 14 additions & 10 deletions docs/make.jl
@@ -2,6 +2,8 @@ using Documenter, DocumenterVitepress
using cuNumeric
using CNPreferences

ci = get(ENV, "CI", "") == "true"

makedocs(;
sitename="cuNumeric.jl",
authors="Ethan Meitz and David Krasowska",
@@ -12,20 +14,22 @@ makedocs(;
),
pages=[
"Home" => "index.md",
"Build Options" => "install.md",
"Install Guide" => "install.md",
"Examples" => "examples.md",
"Performance Tips" => "perf.md",
"Back End Details" => "usage.md",
"Benchmarks" => "benchmark_results.md",
"How to Benchmark" => "benchmark.md",
"Benchmarks" => "benchmark.md",
"Public API" => "api.md",
],
)

DocumenterVitepress.deploydocs(;
repo="github.com/JuliaLegate/cuNumeric.jl",
target=joinpath(@__DIR__, "build"),
branch="gh-pages",
devbranch="main",
push_preview=true,
)
if ci
@info "Deploying Docs to GitHub Pages"
DocumenterVitepress.deploydocs(;
repo="github.com/JuliaLegate/cuNumeric.jl",
target=joinpath(@__DIR__, "build"),
branch="gh-pages",
devbranch="main",
push_preview=true,
)
end
52 changes: 48 additions & 4 deletions docs/src/benchmark.md
@@ -1,3 +1,46 @@
# Benchmark Results

For JuliaCon 2025 we benchmarked cuNumeric.jl on 8 A100 GPUs (single-node) and compared it to the Python library cuPyNumeric and, depending on the problem, other relevant libraries. All results shown are weak scaling. We hope to have multi-node benchmarks soon!


```@contents
Pages = ["benchmark.md"]
Depth = 2:2
```

## SGEMM

Code Outline:
```julia
mul!(C, A, B)
```
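For reference, the operation count behind these plots is the standard 2·N·M·N FLOPs for an N×M by M×N multiply. A minimal timing sketch in plain Julia (CPU, base `LinearAlgebra`, not cuNumeric) looks like:

```julia
using LinearAlgebra

N = 2048
A = rand(Float32, N, N)
B = rand(Float32, N, N)
C = zeros(Float32, N, N)

mul!(C, A, B)                # warm-up run to exclude compilation time
t = @elapsed mul!(C, A, B)   # seconds for one in-place multiply

gflops = 2 * N^3 / t / 1e9   # 2*N*M*N ops with N == M
println("GFLOPS: ", gflops)
```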

GEMM Efficiency | GEMM GFLOPS
:-------------------------:|:-------------------------:
![GEMM Efficiency](images/gemm_efficiency.svg) | ![GEMM GFLOPS](images/gemm_gflops.svg)

## Monte-Carlo Integration

Monte-Carlo integration is embarrassingly parallel and should scale perfectly. We do not know the exact number of operations in `exp`, so the GFLOPS figures are off by a constant factor.

Code Outline:
```julia
integrand = (x) -> exp.(-x.^2)
val = (V/N) * sum(integrand(x))
```
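The same estimator can be sanity-checked in plain Julia (no cuNumeric) against the analytic value of the integral, ∫exp(−x²)dx = √π ≈ 1.7725:

```julia
# Monte-Carlo estimate of the integral of exp(-x^2) over [-x_max, x_max].
integrand = x -> exp.(-x .^ 2)

N = 1_000_000
x_max = 10.0
V = 2 * x_max                 # width of the sampling domain

x = V .* rand(N) .- x_max     # uniform samples in [-x_max, x_max]
estimate = (V / N) * sum(integrand(x))
println(estimate)             # close to sqrt(pi)
```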

MC Efficiency | MC GFLOPS
:-------------------------:|:-------------------------:
![MC Efficiency](images/mc_eff.svg) | ![MC GFLOPS](images/mc_ops.svg)


## Gray-Scott (2D)

Solving a PDE requires halo exchanges and lots of data movement. In this benchmark we fall an order of magnitude short of `ImplicitGlobalGrid.jl`, a library that specifically targets multi-node, multi-GPU halo exchanges. We attribute this gap to the lack of kernel fusion in cuNumeric.jl.

![GS GFLOPS](images/gs_gflops_diffeq.svg)


# Benchmarking cuNumeric.jl Programs

Since there is no programmatic way to set the hardware configuration (as of 24.11), benchmarking cuNumeric.jl code is a bit tedious. As an introduction, we walk through a benchmark of matrix multiplication (SGEMM). All the code for this benchmark can be found in the `cuNumeric.jl/pkg/benchmark` directory.
@@ -14,10 +57,10 @@ In this benchmark we will try to understand the weak scaling behavior of the SGE
using cuNumeric

function initialize_cunumeric(N, M)
A = cuNumeric.as_type(cuNumeric.rand(NDArray, N, M), Float32)
B = cuNumeric.as_type(cuNumeric.rand(NDArray, M, N), Float32)
A = cuNumeric.rand(Float32, N, M)
B = cuNumeric.rand(Float32, M, N)
C = cuNumeric.zeros(Float32, N, N)
GC.gc() # remove the intermediate FP64 arrays
GC.gc() # remove any intermediate arrays
return A, B, C
end

@@ -58,10 +101,11 @@ function gemm_cunumeric(N, M, n_samples, n_warmup)
return mean_time_ms, gflops
end

N = 100
n_samples = 10
n_warmup = 2

mean_time_ms, gflops = gemm_cunumeric(N, n_samples, n_warmup)
mean_time_ms, gflops = gemm_cunumeric(N, N, n_samples, n_warmup)
```

Since there is no programmatic way to set the hardware configuration, we must manipulate the environment variables described in [Setting Hardware Configuration](@ref) through shell scripts to make a weak scaling plot. These variables must be set before we launch the Julia runtime where we will run our benchmark. Therefore, we do not recommend generating scaling plots from the REPL, because you would have to restart the REPL each time to re-configure the hardware settings. To make benchmarking easier, we provide a small shell script, `run_benchmark.sh`, located in `cuNumeric.jl/pkg/benchmark`. This script will automatically set `LEGATE_CONFIG` according to the passed flags and run the specified benchmark file.
41 changes: 0 additions & 41 deletions docs/src/benchmark_results.md

This file was deleted.

12 changes: 8 additions & 4 deletions docs/src/dev.md
@@ -1,6 +1,10 @@
# Developing cuNumeric.jl

There are two primary ways to develop `cuNumeric.jl`:
- Clone the git repo and only work with `cuNumeric.jl`
- Add `cuNumeric.jl` to another environment to test functionality with other packages

To contribute to cuNumeric.jl, we recommend cloning the repository and adding it to one of your existing environments with `Pkg.develop`.
```bash
git clone https://github.com/JuliaLegate/cuNumeric.jl.git
julia --project=. -e 'using Pkg; Pkg.develop(path = "cuNumeric.jl/lib/CNPreferences")'
julia --project=. -e 'using Pkg; Pkg.develop(path = "cuNumeric.jl")'
julia --project=. -e 'using CNPreferences; CNPreferences.use_developer_mode()'
julia --project=. -e 'using Pkg; Pkg.build()'
```
19 changes: 2 additions & 17 deletions docs/src/errors.md
@@ -1,19 +1,4 @@
# Common Errors
### [1] ERROR: LoadError: JULIA_LEGATE_XXXX_PATH not found via environment or JLL.
This can occur for several reasons; in all cases, it means the JLL is not available.
For the library that failed, you can set an environment variable to use a custom install.
```bash
export JULIA_LEGATE_XXXX_PATH="/path/to/library/failing"
```

However, if you want to make the JLL available, you need the CUDA driver `libcuda.so` and the CUDA runtime `libcudart.so` on your path. You can use JLLs to achieve this:

```bash
echo "LD_LIBRARY_PATH=$(julia --project=[yourenv] -e 'using Pkg; \
Pkg.add(name = "CUDA_Driver_jll", version = "0.12.1"); \
using CUDA_Driver_jll; \
print(joinpath(CUDA_Driver_jll.artifact_dir, "lib"))' \
):$LD_LIBRARY_PATH"
```

Note: You may use a different compatible driver version, but ensure it works with our supported CUDA toolkit/runtime versions (12.2 – 12.9). CUDA runtime 13.0 is untested and will break this package.
## OOM on Startup
If you have other processes using GPU RAM (e.g. another instance of cuNumeric.jl), then cuNumeric.jl will fail to start and will segfault. The first symbol in the backtrace is typically something like `_ZN5Realm4CudaL22allocate_device_memoryEPNS0_3GPUEm`. You can fix this by killing the other jobs or by modifying the amount of GPU RAM requested in `LEGATE_CONFIG`. See the [usage](./usage.md) documentation for examples of how to set the `LEGATE_CONFIG` environment variable.