diff --git a/README.md b/README.md index f23c3de0..f59782d4 100644 --- a/README.md +++ b/README.md @@ -4,66 +4,47 @@ [![codecov](https://codecov.io/github/julialegate/cuNumeric.jl/branch/main/graph/badge.svg)](https://app.codecov.io/github/JuliaLegate/cuNumeric.jl) [![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](https://opensource.org/licenses/MIT) -> [!WARNING] -> Leagte.jl and cuNumeric.jl are under active development at the moment. This is a pre-release API and is subject to change. Stability is not guaranteed until the first official release. We are actively working to improve the build experience to be more seamless and Julia-friendly. In parallel, we're developing a comprehensive testing framework to ensure reliability and robustness. Our public beta launch is targeted for Fall 2025. The cuNumeric.jl package wraps the [cuPyNumeric](https://github.com/nv-legate/cupynumeric) C++ API from NVIDIA to bring simple distributed computing on GPUs and CPUs to Julia! We provide a simple array abstraction, the `NDArray`, which supports most of the operations you would expect from a normal Julia array. -This project is in alpha and we do not commit to anything necessarily working as you would expect. The current build process requires several external dependencies which are not registered on BinaryBuilder.jl yet. The build instructions and minimum pre-requesites are as follows: - -### Minimum prereqs -- Ubuntu 20.04 or RHEL 8 -- Julia 1.11 +> [!WARNING] +> Legate.jl and cuNumeric.jl are under active development. This is a pre-release API and is subject to change. Stability is not guaranteed until the first official release. We are actively working to improve the build experience to be more seamless and Julia-friendly. In parallel, we're developing a comprehensive testing framework to ensure reliability and robustness. -### 1. 
Install Julia through [JuliaUp](https://github.com/JuliaLang/juliaup) +### Quick Start +cuNumeric.jl can be installed with the Julia package manager. From the Julia REPL, type `]` to enter the Pkg REPL mode and run: +```julia +pkg> add cuNumeric ``` -curl -fsSL https://install.julialang.org | sh -s -- --default-channel 1.11 +``` +Or, using the `Pkg` API: +```julia +using Pkg; Pkg.add(url = "https://github.com/JuliaLegate/cuNumeric.jl", rev = "main") ``` +The first run might take a while as it has to install multiple large dependencies such as the CUDA SDK (if you have an NVIDIA GPU). For more install instructions, please visit our install guide in the documentation. -This will install version 1.11 by default since that is what we have tested against. To verify 1.11 is the default run either of the following (you may need to source bashrc): -```bash -juliaup status -julia --version -``` +To see information about your cuNumeric install run the `versioninfo` function. -If 1.11 is not your default, please set it to be the default. Other versions of Julia are untested. -```bash -juliaup default 1.11 -``` - -### 2. Download cuNumeric.jl (quick setup) -cuNumeric.jl is not on the general registry yet. To add cuNumeric.jl to your environment run: ```julia -using Pkg; Pkg.develop(url = "https://github.com/JuliaLegate/cuNumeric.jl") +cuNumeric.versioninfo() ``` -By default, this will use [legate_jll](https://github.com/JuliaBinaryWrappers/legate_jll.jl/) and [cupynumeric_jll](https://github.com/JuliaBinaryWrappers/cupynumeric_jll.jl/). -For more build configurations and options, please visit our [installation guide](https://julialegate.github.io/cuNumeric.jl/dev/install). +### Monte-Carlo Example +```julia +using cuNumeric -#### 2b. Contributing to cuNumeric.jl -To contribute to cuNumeric.jl, we recommend cloning the repository and adding it to one of your existing environments with `Pkg.develop`. 
-```bash -git clone https://github.com/JuliaLegate/cuNumeric.jl.git -julia --project=. -e 'using Pkg; Pkg.develop(path = "cuNumeric.jl/lib/CNPreferences")' -julia --project=. -e 'using Pkg; Pkg.develop(path = "cuNumeric.jl")' -julia --project=. -e 'using CNPreferences; CNPreferences.use_developer_mode()' -julia --project=. -e 'using Pkg; Pkg.build()' -``` +integrand = (x) -> exp.(-x.^2) -To learn more about contributing to Legate.jl, check out the [Legate.jl README.md](https://github.com/JuliaLegate/Legate.jl?tab=readme-ov-file#2-download-legatejl) +N = 1_000_000 -### 3. Test the Julia Package -Run this command in the Julia environment where cuNumeric.jl is installed. -```julia -using Pkg; Pkg.test("cuNumeric") -``` -With everything working, its the perfect time to checkout some of our [examples](https://julialegate.github.io/cuNumeric.jl/dev/examples)! +x_max = 10.0f0 +domain = [-x_max, x_max] +Ω = domain[2] - domain[1] +samples = Ω*cuNumeric.rand(N) .- x_max +estimate = (Ω/N) * sum(integrand(samples)) -## Contact -For technical questions, please either contact -`krasow(at)u.northwestern.edu` OR -`emeitz(at)andrew.cmu.edu` +println("Monte-Carlo Estimate: $(estimate)") +``` -If the issue is building the package, please include the `build.log` and `.err` files found in `cuNumeric.jl/deps/` +### Requirements +We require an x86 Linux platform and Julia 1.10 or 1.11. For GPU support we require an NVIDIA GPU and a CUDA driver which supports CUDA 13.0. ARM support is theoretically possible, but we do not make binaries or test on ARM. Please open an issue if ARM support is of interest. 
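Since cuPyNumeric mirrors the NumPy API, the Monte-Carlo example above can be sanity-checked against plain NumPy in Python. This is an illustrative sketch, not part of the package; the fixed seed is only there to make the run reproducible:

```python
import numpy as np

# Same estimator as the cuNumeric.jl Monte-Carlo example:
# integrate exp(-x^2) over [-10, 10]; the exact value is sqrt(pi).
rng = np.random.default_rng(0)  # fixed seed, illustrative only

N = 1_000_000
x_max = 10.0
omega = 2 * x_max                        # width of the sampling domain

samples = omega * rng.random(N) - x_max  # uniform on [-x_max, x_max]
estimate = (omega / N) * np.sum(np.exp(-samples**2))

print(f"Monte-Carlo Estimate: {estimate:.4f}")
print(f"Analytical:           {np.sqrt(np.pi):.4f}")
```

With 10^6 samples the estimate typically lands within a few 10^-3 of √π, which is a quick way to confirm the cuNumeric version is producing sensible numbers.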
diff --git a/docs/Project.toml b/docs/Project.toml index fdebd8be..92966fdf 100644 --- a/docs/Project.toml +++ b/docs/Project.toml @@ -1,7 +1,15 @@ [deps] +cuNumeric = "0fd9ffd4-7e84-4cd0-b8f8-645bd8c73620" +CNPreferences = "3e078157-ea10-49d5-bf32-908f777cd46f" Documenter = "e30172f5-a6a5-5a46-863b-614d45cd2de4" DocumenterVitepress = "4710194d-e776-4893-9690-8d956a29c365" LiveServer = "16fef848-5104-11e9-1b77-fb7a48bbb589" [compat] Documenter = "1.5" +cuNumeric = "0.1" +CNPreferences = "0.1.2" + +[sources] +cuNumeric = {path = ".."} +CNPreferences = {path = "../lib/CNPreferences"} diff --git a/docs/make.jl b/docs/make.jl index 008972da..a5bd8f3b 100644 --- a/docs/make.jl +++ b/docs/make.jl @@ -2,6 +2,8 @@ using Documenter, DocumenterVitepress using cuNumeric using CNPreferences +ci = get(ENV, "CI", "") == "true" + makedocs(; sitename="cuNumeric.jl", authors="Ethan Meitz and David Krasowska", @@ -12,20 +14,22 @@ makedocs(; ), pages=[ "Home" => "index.md", - "Build Options" => "install.md", + "Install Guide" => "install.md", "Examples" => "examples.md", "Performance Tips" => "perf.md", "Back End Details" => "usage.md", - "Benchmarks" => "benchmark_results.md", - "How to Benchmark" => "benchmark.md", + "Benchmarks" => "benchmark.md", "Public API" => "api.md", ], ) -DocumenterVitepress.deploydocs(; - repo="github.com/JuliaLegate/cuNumeric.jl", - target=joinpath(@__DIR__, "build"), - branch="gh-pages", - devbranch="main", - push_preview=true, -) +if ci + @info "Deploying Docs to GitHub Pages" + DocumenterVitepress.deploydocs(; + repo="github.com/JuliaLegate/cuNumeric.jl", + target=joinpath(@__DIR__, "build"), + branch="gh-pages", + devbranch="main", + push_preview=true, + ) +end diff --git a/docs/src/benchmark.md b/docs/src/benchmark.md index ce7223b9..6da5b589 100644 --- a/docs/src/benchmark.md +++ b/docs/src/benchmark.md @@ -1,3 +1,46 @@ +# Benchmark Results + +For JuliaCon2025 we benchmarked cuNumeric.jl on 8 A100 GPUs (single-node) and compared it to the Python 
library cuPyNumeric and other relevant benchmarks depending on the problem. All results shown are weak scaling. We hope to have multi-node benchmarks soon! + + +```@contents +Pages = ["benchmark.md"] +Depth = 2:2 +``` + +## SGEMM + +Code Outline: +```julia +mul!(C, A, B) +``` + +GEMM Efficiency | GEMM GFLOPS +:-------------------------:|:-------------------------: +![GEMM Efficiency](images/gemm_efficiency.svg) | ![GEMM GFLOPS](images/gemm_gflops.svg) + +## Monte-Carlo Integration + +Monte-Carlo integration is embarrassingly parallel and should scale perfectly. We do not know the exact number of operations in `exp` so the GFLOPs is off by a constant factor. + +Code Outline: +```julia +integrand = (x) -> exp.(-x.^2) +val = (V/N) * sum(integrand(x)) +``` + +MC Efficiency | MC GFLOPS +:-------------------------:|:-------------------------: +![MC Efficiency](images/mc_eff.svg) | ![MC GFLOPS](images/mc_ops.svg) + + +## Gray-Scott (2D) + +Solving a PDE requires halo-exchanges and lots of data movement. In this benchmark we fall an order of magnitude short of the `ImplicitGlobalGrid.jl` library which specifically targets multi-node, multi-GPU halo exchanges. We attribute this to the lack of kernel fusion in cuNumeric.jl. + + # Benchmarking cuNumeric.jl Programs Since there is no programatic way to set the hardware configuration (as of 24.11) benchmarking cuNumeric.jl code is a bit tedious. As an introduction, we walk through a benchmark of matrix multiplication (SGEMM). All the code for this benchmark can be found in the `cuNumeric.jl/pkg/benchmark` directory. 
@@ -14,10 +57,10 @@ In this benchmark we will try to understand the weak scaling behavior of the SGE using cuNumeric function initialize_cunumeric(N, M) - A = cuNumeric.as_type(cuNumeric.rand(NDArray, N, M), Float32) - B = cuNumeric.as_type(cuNumeric.rand(NDArray, M, N), Float32) + A = cuNumeric.rand(Float32, N, M) + B = cuNumeric.rand(Float32, M, N) C = cuNumeric.zeros(Float32, N, N) - GC.gc() # remove the intermediate FP64 arrays + GC.gc() # remove any intermediate arrays return A, B, C end @@ -58,10 +101,11 @@ function gemm_cunumeric(N, M, n_samples, n_warmup) return mean_time_ms, gflops end +N = 100 n_samples = 10 n_warmup = 2 -mean_time_ms, gflops = gemm_cunumeric(N, n_samples, n_warmup) +mean_time_ms, gflops = gemm_cunumeric(N, N, n_samples, n_warmup) ``` Since there is no programatic way to set the hardware configuration we must manipulate the environment variables described in [Setting Hardware Configuration](@ref) through shell scripts to make a weak scaling plot. These variables must be set before we launch the Julia runtime where we will run our benchmark. Therefore, I do not recommend generating scaling plots from the REPL because you would have to start and stop the REPL each time to re-configure the hardware settings. To make benchmarking easier, we provide a small shell script, `run_benchmark.sh`, located in `cuNumeric.jl/pkg/benchmark`. This script will automatically set the `LEGATE_CONFIG` according to the passed flags and run the specified benchmark file. diff --git a/docs/src/benchmark_results.md b/docs/src/benchmark_results.md deleted file mode 100644 index 22e38f22..00000000 --- a/docs/src/benchmark_results.md +++ /dev/null @@ -1,41 +0,0 @@ -# Benchmark Results - -For JuliaCon2025 we benchmarks cuNumeric.jl on 8 A100 GPUs (single-node) and compared it to the Python library cuPyNumeric and other relevant benchmarks depending on the problem. All results shown are weak scaling. We hope to have multi-node benchmarks soon! 
- - -```@contents -Pages = ["benchmark_results.md"] -Depth = 2:2 -``` - -## SGEMM - -Code Outline: -```julia -mul!(C, A, B) -``` - -GEMM Efficiency | GEMM GFLOPS -:-------------------------:|:-------------------------: -![GEMM Efficiency](images/gemm_efficiency.svg) | ![GEMM GFLOPS](images/gemm_gflops.svg) - -## Monte-Carlo Integration - -Monte-Carlo integration is embaressingly parallel and should scale perfectly. We do not know the exact number of operations in `exp` so the GFLOPs is off by a constant factor. - -Code Outline: -```julia -integrand = (x) -> exp(-square(x)) -val = (V/N) * sum(integrand(x)) -``` - -MC Efficiency | MC GFLOPS -:-------------------------:|:-------------------------: -![MC Efficiency](images/mc_eff.svg) | ![MC GFLOPS](images/mc_ops.svg) - - -## Gray-Scott (2D) - -Solving a PDE requires halo-exchanges and lots of data movement. In this benchmark we fall an order of magnitude short of the `ImplicitGlobalGrid.jl` library which specifically targets multi-node, multi-GPU halo exchanges. We attribute this to the lack of kernel fusion in cuNumeric.jl - -![GS GFLOPS](images/gs_gflops_diffeq.svg) \ No newline at end of file diff --git a/docs/src/dev.md b/docs/src/dev.md index 5756c970..1bc332fd 100644 --- a/docs/src/dev.md +++ b/docs/src/dev.md @@ -1,6 +1,10 @@ # Developing cuNumeric.jl -There are two primary ways to develop `cuNumeric.jl`: -- Clone the git repo and only work with `cuNumeric.jl` -- Add `cuNumeric.jl` to another environment to test functionality with other packages - +To contribute to cuNumeric.jl, we recommend cloning the repository and adding it to one of your existing environments with `Pkg.develop`. +```bash +git clone https://github.com/JuliaLegate/cuNumeric.jl.git +julia --project=. -e 'using Pkg; Pkg.develop(path = "cuNumeric.jl/lib/CNPreferences")' +julia --project=. -e 'using Pkg; Pkg.develop(path = "cuNumeric.jl")' +julia --project=. -e 'using CNPreferences; CNPreferences.use_developer_mode()' +julia --project=. 
-e 'using Pkg; Pkg.build()' +``` \ No newline at end of file diff --git a/docs/src/errors.md b/docs/src/errors.md index e1ac9441..41f0059e 100644 --- a/docs/src/errors.md +++ b/docs/src/errors.md @@ -1,19 +1,4 @@ # Common Errors -### [1] ERROR: LoadError: JULIA_LEGATE_XXXX_PATH not found via environment or JLL. -This can occur for several reasons; however, this means the JLL is not available. -For the library that failed, you can overwrite an ENV to use a custom install. -```bash -export JULIA_LEGATE_XXXX_PATH="/path/to/library/failing" -``` -However, if you want to solve the JLL being available- you need the cuda driver `libcuda.so` on your path and cuda runtime `libcudart.so` on your path. You can use JLLs to achieve this: - -```bash -echo "LD_LIBRARY_PATH=$(julia --project=[yourenv] -e 'using Pkg; \ - Pkg.add(name = "CUDA_Driver_jll", version = "0.12.1"); \ - using CUDA_Driver_jll; \ - print(joinpath(CUDA_Driver_jll.artifact_dir, "lib"))' \ -):$LD_LIBRARY_PATH" -``` - -Note: You may use a different compatible driver version, but ensure it works with our supported CUDA toolkit/runtime versions (12.2 – 12.9). CUDA runtime 13.0 is untested and will break this package. +## OOM on Startup +If you have other processes using GPU RAM (e.g., another instance of cuNumeric.jl), then cuNumeric.jl will fail to start and will segfault. The first symbol in the crash backtrace is typically something like `_ZN5Realm4CudaL22allocate_device_memoryEPNS0_3GPUEm`. You can fix this by killing the other jobs or modifying the amount of GPU RAM requested in `LEGATE_CONFIG`. See the [usage](./usage.md) documentation for examples on how to set the `LEGATE_CONFIG` environment variable. 
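A minimal sketch of that fix, assuming the usual Legate flags (`--gpus` and `--fbmem` in MiB; confirm the exact names and units against the Legate configuration docs), is to lower the framebuffer request before launching Julia:

```shell
# Ask Legate for 1 GPU and only 4000 MiB of framebuffer memory,
# leaving headroom for other processes on the device.
export LEGATE_CONFIG="--gpus 1 --fbmem 4000"
echo "LEGATE_CONFIG=$LEGATE_CONFIG"

# Then launch Julia as usual (commented out here):
# julia --project my_script.jl
```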
diff --git a/docs/src/examples.md b/docs/src/examples.md index 9c3d2c06..18dcaf0e 100644 --- a/docs/src/examples.md +++ b/docs/src/examples.md @@ -6,14 +6,12 @@ # found in examples/daxpy.jl using cuNumeric -arr = cuNumeric.rand(NDArray, 20) +arr = cuNumeric.rand(20) -α = 1.32 -b = 2.0 +α = 1.32f0 +b = 2.0f0 -arr2 = α*arr + b - -arr2[:] # disp array +arr2 = α .* arr .+ b ``` ## Monte-Carlo Integration @@ -34,18 +32,24 @@ Since we cannot uniformly sample form negative to positive infinity, we truncate # found in examples/integrate.jl using cuNumeric -integrand = (x) -> exp(-square(x)) +# Note that we do not yet support broadcasting +# custom functions, so the broadcasting MUST +# be done inside the function +integrand = (x) -> exp.(-x.^2) N = 1_000_000 -x_max = 5.0 +x_max = 10.0f0 domain = [-x_max, x_max] Ω = domain[2] - domain[1] -samples = Ω*cuNumeric.rand(NDArray, N) - x_max +samples = Ω*cuNumeric.rand(N) .- x_max + +# Reductions return 0D NDArrays instead +# of a scalar to avoid blocking runtime estimate = (Ω/N) * sum(integrand(samples)) -println("Monte-Carlo Estimate: $(estimate[1])") +println("Monte-Carlo Estimate: $(estimate)") println("Analytical: $(sqrt(pi))") ``` ## Gray Scott Reaction Diffusion @@ -54,26 +58,36 @@ println("Analytical: $(sqrt(pi))") using cuNumeric using Plots -struct Params - dx::Float64 - dt::Float64 - c_u::Float64 - c_v::Float64 - f::Float64 - k::Float64 +struct Params{T} + dx::T + dt::T + c_u::T + c_v::T + f::T + k::T - function Params(dx=1, c_u=1.0, c_v=0.3, f=0.03, k=0.06) - new(dx, dx/5, c_u, c_v, f, k) + function Params(dx=1.0f0, c_u=1.0f0, c_v=0.3f0, f=0.03f0, k=0.06f0) + new{Float32}(dx, dx/5, c_u, c_v, f, k) end end -function step(u, v, u_new, v_new, args::Params) +function bc!(u_new, v_new, u, v) + u_new[:,1] = u[:,end-1] + u_new[:,end] = u[:,2] + u_new[1,:] = u[end-1,:] + u_new[end,:] = u[2,:] + v_new[:,1] = v[:,end-1] + v_new[:,end] = v[:,2] + v_new[1,:] = v[end-1,:] + v_new[end,:] = v[2,:] +end + +function step!(u, v, u_new, 
v_new, args::Params) # calculate F_u and F_v functions - # currently we don't have NDArray^x working yet. - F_u = ((-u[2:end-1, 2:end-1].*(v[2:end-1, 2:end-1] .* v[2:end-1, 2:end-1])) + - args.f*(1 .- u[2:end-1, 2:end-1])) - F_v = ((u[2:end-1, 2:end-1].*(v[2:end-1, 2:end-1] .* v[2:end-1, 2:end-1])) - - (args.f+args.k)*v[2:end-1, 2:end-1]) + F_u = ((-u[2:end-1, 2:end-1].*(v[2:end-1, 2:end-1] .^ 2)) .+ + args.f*(1.0f0 .- u[2:end-1, 2:end-1])) + F_v = ((u[2:end-1, 2:end-1].*(v[2:end-1, 2:end-1] .^ 2)) .- + (args.f+args.k).*v[2:end-1, 2:end-1]) # 2-D Laplacian of f using array slicing, excluding boundaries # For an N x N array f, f_lap is the Nend x Nend array in the "middle" u_lap = ((u[3:end, 2:end-1] - 2*u[2:end-1, 2:end-1] + u[1:end-2, 2:end-1]) ./ args.dx^2 @@ -86,23 +100,15 @@ function step(u, v, u_new, v_new, args::Params) v_new[2:end-1, 2:end-1] = ((args.c_v * v_lap) + F_v) * args.dt + v[2:end-1, 2:end-1] # Apply periodic boundary conditions - u_new[:,1] = u[:,end-1] - u_new[:,end] = u[:,2] - u_new[1,:] = u[end-1,:] - u_new[end,:] = u[2,:] - v_new[:,1] = v[:,end-1] - v_new[:,end] = v[:,2] - v_new[1,:] = v[end-1,:] - v_new[end,:] = v[2,:] + bc!(u_new, v_new, u, v) end function gray_scott() - anim = Animation() + anim = Animation() N = 100 dims = (N, N) - FT = Float64 args = Params() n_steps = 2000 # number of steps to take @@ -113,11 +119,11 @@ function gray_scott() u_new = cuNumeric.zeros(dims) v_new = cuNumeric.zeros(dims) - u[1:15,1:15] = cuNumeric.random(FT, (15,15)) - v[1:15,1:15] = cuNumeric.random(FT, (15,15)) + u[1:15,1:15] = cuNumeric.rand(15,15) + v[1:15,1:15] = cuNumeric.rand(15,15) for n in 1:n_steps - step(u, v, u_new, v_new, args) + step!(u, v, u_new, v_new, args) # update u and v # this doesn't copy, this switching references u, u_new = u_new, u @@ -130,9 +136,10 @@ end end gif(anim, "gray-scott.gif", fps=10) + return u, v end -gray_scott() +u, v = gray_scott() ``` ![Simulation Output](./gray-scott.gif) diff --git 
a/docs/src/install.md b/docs/src/install.md index 02adf07f..b9e0451b 100644 --- a/docs/src/install.md +++ b/docs/src/install.md @@ -2,70 +2,81 @@ To make customization of the build options easier we have the `CNPreferences.jl` package to generate the `LocalPreferences.toml` which is read by the build script to determine which build option to use. CNPreferences.jl will also enforce that Julia is restarted for changes to take effect. -## Default Build (jlls) -cuNumeric.jl is not registered yet. The easiest way to install is using `Pkg.develop`. cuNumeric.jl leverages [Binary Builder](https://github.com/JuliaPackaging/Yggdrasil) for many of its dependencies. +## Julia Installation + +cuNumeric supports Julia 1.10 and 1.11. We recommend installing Julia with [juliaup](https://github.com/JuliaLang/juliaup): + +``` +curl -fsSL https://install.julialang.org | sh -s -- --default-channel 1.11 +``` + +This will install version 1.11 by default since that is what we have tested against. To verify 1.11 is the default run either of the following (you may need to source bashrc): +```bash +juliaup status +julia --version +``` + +If 1.11 is not your default, please set it to be the default. Other versions of Julia are untested. +```bash +juliaup default 1.11 +``` + +## Default Build (jlls) ```julia -using Pkg; Pkg.develop(url = "https://github.com/JuliaLegate/cuNumeric.jl") +pkg> add cuNumeric ``` If you previously used a custom build or conda build and would like to revert back to using prebuilt JLLs, run the following command in the directory containing the Project.toml of your environment. ```julia -julia --project -e 'using CNPreferences; CNPreferences.use_jll_binary()' +using CNPreferences; CNPreferences.use_jll_binary() ``` `CNPreferences` is a separate module so that it can be used to configure the build settings before `cuNumeric.jl` is added to your environment. 
To install it separately run ```julia -using Pkg; Pkg.add(url = "https://github.com/JuliaLegate/cuNumeric.jl", subdir="lib/CNPreferences") +pkg> add CNPreferences ``` -By default, this will also revert any LegatePreferences you have set. It will revert Legate.jl to use JLLs. You can disable this behavior with `transitive = false` in the `use_jll_binary()` function. - ## Developer mode -> [!WARNING] -> This gives the most flexibility in installs. It is meant for developing on cuNumeric.jl. By default, this does not set any LegatePreferences. +> [!TIP] +> This gives the most flexibility in installs. It is meant for developing on cuNumeric.jl. -We support using a custom install version of cuPyNumeric. See https://docs.nvidia.com/cupynumeric/latest/installation.html for details about different install configurations, or building cuPyNumeric from source. +We support using a custom install of cupynumeric. See https://docs.nvidia.com/cupynumeric/latest/installation.html for details about different install configurations, or building cupynumeric from source. -We require that you have the cuda driver `libcuda.so` on your path, cuda runtime `libcudart.so`, g++ capable compiler of C++ 20, and a recent version CMake >= 3.26. +We require a compiler supporting C++20 (e.g., g++) and CMake >= 3.26. To use developer mode, ```julia -julia --project -e 'using CNPreferences; CNPreferences.use_developer_mode(; wrapper_branch="main", use_cupynumeric_jll=true, cupynumeric_path=nothing)' +using CNPreferences; CNPreferences.use_developer_mode(; use_cunumeric_jll=true, cunumeric_path=nothing) ``` -This will clone [cunumeric_jl_wrapper](https://github.com/JuliaLegate/cunumeric_jl_wrapper) into cuNumeric.jl/deps and build from src. By default `use_cupynumeric_jll` will be set to true and `wrapper_branch` will be set to "main". However, you can set a custom branch and/or use a custom path of cupynumeric. 
By using disabling `use_cupynumeric_jll`, you can set `cupynumeric_path` to your custom install. - -cuNumeric.jl depends on [Legate.jl](https://github.com/JuliaLegate/Legate.jl). Developer mode by default is not transitive to LegatePreferences. This means setting cuNumeric.jl to devloper mode has no impact on Legate.jl. To have both libraries on developer mode, you need to set Legate preferences manually. - +By default `use_cunumeric_jll` is set to true. Alternatively, by setting `use_cunumeric_jll=false`, you can point `cunumeric_path` at a custom cupynumeric install. ```julia -julia --project -e 'using LegatePreferences; LegatePreferences.use_developer_mode(; wrapper_branch="main", use_legate_jll=true, legate_path=nothing)' +using CNPreferences; CNPreferences.use_developer_mode(;use_cunumeric_jll=false, cunumeric_path="/path/to/cupynumeric/root") + ``` -LegatePreferences has similar kwargs and behavior as CNPreferences. This will clone [legate_jl_wrapper](https://github.com/JuliaLegate/legate_jl_wrapper). By default `use_legate_jll` is set to true and `wrapper branch` is set to "main" You can disable the jll and set a custom legate install with `legate_path`. ## Link Against Existing Conda Environment > [!WARNING] -> This feature is not passing our CI currently. Please use with caution. We are failing to currently match proper versions of .so libraries together. Our hope is to get this functional for users already using cuPyNumeric within conda. +> This feature is not passing our CI currently. Please use with caution. We are currently failing to match compatible versions of .so libraries. Our hope is to get this functional for users already using Legate within conda. Note, you need conda >= 24.1 to install the conda package. More installation details are found [here](https://docs.nvidia.com/cupynumeric/latest/installation.html). 
```bash # with a new environment -conda create -n myenv -c conda-forge -c legate cupynumeric +conda create -n myenv -c conda-forge cupynumeric # into an existing environment -conda install -c conda-forge -c legate cupynumeric +conda install -c conda-forge cupynumeric ``` Once you have the conda package installed, you can activate here. ```bash conda activate [conda-env-with-cupynumeric] ``` -To update `LocalPreferences.toml` so that a local conda environment is used as the binary provider for cupynumeric run the following command. `conda_env` should be the absolute path to the conda environment (e.g., the value of CONDA_PREFIX when your environment is active). For example, this path is: `/home/JuliaLegate/.conda/envs/cunumeric-gpu`. - +To update `LocalPreferences.toml` so that a local conda environment is used as the binary provider for cupynumeric run the following command. `conda_env` should be the absolute path to the conda environment (e.g., the value of CONDA_PREFIX when your environment is active). For example, this path is: `/home/JuliaLegate/.conda/envs/cupynumeric-gpu`. ```julia -julia --project -e 'using CNPreferences; CNPreferences.use_conda("")' -``` - -By default, this will also revert any LegatePreferences you have set. It will revert Legate.jl to use JLLs. You can disable this behavior with `transitive = false` in the `use_conda()` function. +using Pkg, CNPreferences; CNPreferences.use_conda("/path/to/conda-env-with-cupynumeric"); +Pkg.build() +``` \ No newline at end of file diff --git a/docs/src/perf.md b/docs/src/perf.md index 51346375..d3a71e18 100644 --- a/docs/src/perf.md +++ b/docs/src/perf.md @@ -1,8 +1,10 @@ # Performance Tips - ## Avoid Scalar Indexing -Accessing elements of an NDArray one at a time (e.g., `arr[5]`) is slow and should be avoided. Indexing like this requires data to be trasfered between device and host and maybe even communicated across nodes. In the future, scalar indexing will emit a warning which can be opted out of. 
Several functions in the existing API invoke scalar indexing and are intended for testing (e.g., the `==` operator). +Accessing elements of an NDArray one at a time (e.g., `arr[5]`) is slow and should be avoided. Indexing like this requires data to be transferred between device and host and maybe even communicated across nodes. Scalar indexing will emit an error which can be opted out of with `@allowscalar` or `allowscalar() do ... end`. Several functions in the existing API invoke scalar indexing and are intended for testing (e.g., the `==` operator). + +## Avoid Implicit Promotion +Mixing numeric types of different precision (e.g., `Float64` and `Float32`) will result in implicit promotion of the smaller type to the larger type. This creates a copy of the data and hurts performance. Implicit promotion from a smaller type to a larger type will emit an error which can be opted out of with `@allowpromotion` or `allowpromotion() do ... end`. This error is common when mixing literals with `NDArrays`. By default a floating point literal (e.g., `1.0`) is `Float64` but the default type of an `NDArray` is `Float32`. ## Kernel Fusion -cuPyNumeric does not fuse independent operations automatically. This is a priority for the beta release. \ No newline at end of file +cuPyNumeric does not fuse independent operations automatically, even in broadcast expressions. This is a priority for a future release. \ No newline at end of file diff --git a/docs/src/usage.md b/docs/src/usage.md index 862d542e..f8c48754 100644 --- a/docs/src/usage.md +++ b/docs/src/usage.md @@ -1,10 +1,4 @@ - -## About NDArrays - - - - ## Setting Hardware Configuration There is no programatic way to set the hardware configuration used by CuPyNumeric (as of 24.11). By default, the hardware configuration is set automatically by Legate. 
This configuration can be manipulated through the following environment variables: diff --git a/examples/daxpy.jl b/examples/daxpy.jl index e870c230..db2617f9 100644 --- a/examples/daxpy.jl +++ b/examples/daxpy.jl @@ -1,11 +1,11 @@ # found in examples/daxpy.jl using cuNumeric -arr = cuNumeric.rand(NDArray, 20) +arr = cuNumeric.rand(20) -α = 1.32 -b = 2.0 +α = 1.32f0 +b = 2.0f0 -arr2 = α*arr + b +arr2 = α .* arr .+ b -arr2[:] # disp array +println(arr2) diff --git a/examples/gray-scott.jl b/examples/gray-scott.jl index 401709e6..319e9fd2 100644 --- a/examples/gray-scott.jl +++ b/examples/gray-scott.jl @@ -1,83 +1,60 @@ -# found in examples/gray-scott.jl using cuNumeric # using Plots -struct Params - dx::Float64 - dt::Float64 - c_u::Float64 - c_v::Float64 - f::Float64 - k::Float64 +struct Params{T} + dx::T + dt::T + c_u::T + c_v::T + f::T + k::T - function Params(dx=1, c_u=1.0, c_v=0.3, f=0.03, k=0.06) - new(dx, dx/5, c_u, c_v, f, k) + function Params(dx=1.0f0, c_u=1.0f0, c_v=0.3f0, f=0.03f0, k=0.06f0) + new{Float32}(dx, dx/5, c_u, c_v, f, k) end end -function step(u, v, u_new, v_new, args::Params) +function bc!(u_new, v_new, u, v) + u_new[:,1] = u[:,end-1] + u_new[:,end] = u[:,2] + u_new[1,:] = u[end-1,:] + u_new[end,:] = u[2,:] + v_new[:,1] = v[:,end-1] + v_new[:,end] = v[:,2] + v_new[1,:] = v[end-1,:] + v_new[end,:] = v[2,:] +end + +function step!(u, v, u_new, v_new, args::Params) # calculate F_u and F_v functions - # currently we don't have NDArray^x working yet. 
- F_u = ( - ( - -u[2:(end - 1), 2:(end - 1)] .* - (v[2:(end - 1), 2:(end - 1)] .* v[2:(end - 1), 2:(end - 1)]) - ) + args.f*(1 .- u[2:(end - 1), 2:(end - 1)]) - ) - F_v = ( - ( - u[2:(end - 1), 2:(end - 1)] .* - (v[2:(end - 1), 2:(end - 1)] .* v[2:(end - 1), 2:(end - 1)]) - ) - (args.f+args.k)*v[2:(end - 1), 2:(end - 1)] - ) + F_u = ((-u[2:end-1, 2:end-1].*(v[2:end-1, 2:end-1] .^ 2)) .+ + args.f*(1.0f0 .- u[2:end-1, 2:end-1])) + F_v = ((u[2:end-1, 2:end-1].*(v[2:end-1, 2:end-1] .^ 2)) .- + (args.f+args.k).*v[2:end-1, 2:end-1]) # 2-D Laplacian of f using array slicing, excluding boundaries # For an N x N array f, f_lap is the Nend x Nend array in the "middle" - u_lap = ( - ( - u[3:end, 2:(end - 1)] - 2*u[2:(end - 1), 2:(end - 1)] + - u[1:(end - 2), 2:(end - 1)] - ) ./ args.dx^2 + - ( - u[2:(end - 1), 3:end] - 2*u[2:(end - 1), 2:(end - 1)] + - u[2:(end - 1), 1:(end - 2)] - ) ./ args.dx^2 - ) - v_lap = ( - ( - v[3:end, 2:(end - 1)] - 2*v[2:(end - 1), 2:(end - 1)] + - v[1:(end - 2), 2:(end - 1)] - ) ./ args.dx^2 + - ( - v[2:(end - 1), 3:end] - 2*v[2:(end - 1), 2:(end - 1)] + - v[2:(end - 1), 1:(end - 2)] - ) ./ args.dx^2 - ) + u_lap = ((u[3:end, 2:end-1] - 2*u[2:end-1, 2:end-1] + u[1:end-2, 2:end-1]) ./ args.dx^2 + + (u[2:end-1, 3:end] - 2*u[2:end-1, 2:end-1] + u[2:end-1, 1:end-2]) ./ args.dx^2) + v_lap = ((v[3:end, 2:end-1] - 2*v[2:end-1, 2:end-1] + v[1:end-2, 2:end-1]) ./ args.dx^2 + + (v[2:end-1, 3:end] - 2*v[2:end-1, 2:end-1] + v[2:end-1, 1:end-2]) ./ args.dx^2) # Forward-Euler time step for all points except the boundaries - u_new[2:(end - 1), 2:(end - 1)] = - ((args.c_u * u_lap) + F_u) * args.dt + u[2:(end - 1), 2:(end - 1)] - v_new[2:(end - 1), 2:(end - 1)] = - ((args.c_v * v_lap) + F_v) * args.dt + v[2:(end - 1), 2:(end - 1)] + u_new[2:end-1, 2:end-1] = ((args.c_u * u_lap) + F_u) * args.dt + u[2:end-1, 2:end-1] + v_new[2:end-1, 2:end-1] = ((args.c_v * v_lap) + F_v) * args.dt + v[2:end-1, 2:end-1] # Apply periodic boundary conditions - u_new[:, 1] = u[:, end - 
1] - u_new[:, end] = u[:, 2] - u_new[1, :] = u[end - 1, :] - u_new[end, :] = u[2, :] - v_new[:, 1] = v[:, end - 1] - v_new[:, end] = v[:, 2] - v_new[1, :] = v[end - 1, :] - v_new[end, :] = v[2, :] + bc!(u_new, v_new, u, v) end function gray_scott() - # anim = Animation() - N = 2000 + #anim = Animation() + + N = 100 dims = (N, N) - FT = Float64 + args = Params() - n_steps = 1000 # number of steps to take + n_steps = 2000 # number of steps to take frame_interval = 200 # steps to take between making plots u = cuNumeric.ones(dims) @@ -85,23 +62,25 @@ function gray_scott() u_new = cuNumeric.zeros(dims) v_new = cuNumeric.zeros(dims) - u[1:150, 1:150] = cuNumeric.random(FT, (150, 150)) - v[1:150, 1:150] = cuNumeric.random(FT, (150, 150)) + u[1:15,1:15] = cuNumeric.rand(15,15) + v[1:15,1:15] = cuNumeric.rand(15,15) for n in 1:n_steps - step(u, v, u_new, v_new, args) + step!(u, v, u_new, v_new, args) # update u and v # this doesn't copy, this switching references u, u_new = u_new, u v, v_new = v_new, v - # if n%frame_interval == 0g + # if n%frame_interval == 0 # u_cpu = u[:, :] # heatmap(u_cpu, clims=(0, 1)) # frame(anim) # end end # gif(anim, "gray-scott.gif", fps=10) + return u, v + end -gray_scott() +u, v = gray_scott() \ No newline at end of file diff --git a/examples/integrate.jl b/examples/integrate.jl index 946502c0..0c941de4 100644 --- a/examples/integrate.jl +++ b/examples/integrate.jl @@ -1,16 +1,21 @@ -# found in examples/integrate.jl using cuNumeric -integrand = (x) -> exp(-square(x)) +# Note that we do not yet support broadcasting +# custom functions, so the broadcasting MUST +# be done inside the function +integrand = (x) -> exp.(-x.^2) N = 1_000_000 -x_max = 5.0 +x_max = 10.0f0 domain = [-x_max, x_max] Ω = domain[2] - domain[1] -samples = Ω*cuNumeric.rand(NDArray, N) - x_max +samples = Ω*cuNumeric.rand(N) .- x_max + +# Reductions return 0D NDArrays instead +# of a scalar to avoid blocking runtime estimate = (Ω/N) * sum(integrand(samples)) 
-println("Monte-Carlo Estimate: $(estimate[1])") -println("Analytical: $(sqrt(pi))") +println("Monte-Carlo Estimate: $(estimate)") +println("Analytical: $(sqrt(pi))") \ No newline at end of file diff --git a/src/ndarray/ndarray.jl b/src/ndarray/ndarray.jl index 02ec68e5..87c5cd66 100644 --- a/src/ndarray/ndarray.jl +++ b/src/ndarray/ndarray.jl @@ -188,9 +188,6 @@ Base.eltype(arr::NDArray{T}) where {T} = T Return the number of dimensions of the `NDArray`. -Both functions query the underlying cuNumeric API to get -the dimensionality of the array. - # Examples ```@repl arr = cuNumeric.rand(2, 3, 4); @@ -210,9 +207,6 @@ Return the size of the given `NDArray`. - `Base.size(arr)` returns a tuple of dimensions of the array. - `Base.size(arr, dim)` returns the size of the array along the specified dimension `dim`. -These override Base's size methods for the `NDArray` type, -using the underlying cuNumeric API to query array shape. - # Examples ```@repl arr = cuNumeric.rand(3, 4, 5); @@ -230,10 +224,6 @@ Base.size(arr::NDArray, dim::Int) = Base.size(arr)[dim] Provide the first and last valid indices along a given dimension `dim` for `NDArray`. -- `firstindex` always returns 1, since Julia arrays are 1-indexed. -- `lastindex` returns the size of the array along the specified dimension. -- `lastindex(arr)` returns the size along the first dimension. - # Examples ```@repl arr = cuNumeric.rand(4, 5); @@ -253,12 +243,16 @@ Base.IndexStyle(::NDArray) = IndexCartesian() function Base.show(io::IO, arr::NDArray{T,0}) where {T} println(io, "0-dimensional NDArray{$(T),0}") - print(io, arr[]) #! should I assert scalar?? + allowscalar() do + print(io, arr[]) + end end function Base.show(io::IO, ::MIME"text/plain", arr::NDArray{T,0}) where {T} println(io, "0-dimensional NDArray{$(T),0}") - print(io, arr[]) #! should I assert scalar?? + allowscalar() do + print(io, arr[]) + end end function Base.show(io::IO, arr::NDArray{T,D}) where {T,D} @@ -519,7 +513,7 @@ falses(dims::Int...) 
= cuNumeric.full(dims, false) cuNumeric.zeros([T=Float32,] dims::Tuple) Create an NDArray with element type `T`, of all zeros with size specified by `dims`. -This function mirrors the signature of `Base.zeros`, and defaults to `Float32` when the type is omitted. +The default type is Float32 if not specified. # Examples ```@repl @@ -562,7 +556,7 @@ end cuNumeric.ones([T=Float32,] dims::Tuple) Create an NDArray with element type `T`, of all ones with size specified by `dims`. -This function has the same signature as `Base.ones`, so be sure to call it as `cuNuermic.ones`. +The default type is Float32 if not specified. # Examples ```@repl @@ -605,22 +599,19 @@ Fills `arr` with AbstractFloats uniformly at random. Create a new `NDArray` of element type Float64, filled with uniform random values. -This function uses the same signature as `Base.rand` with a custom backend, -and currently supports only `Float64` with uniform distribution (`code = 0`). +The backend currently supports only `Float64` with uniform distribution. In order to support other float types, we convert the element type for the user automatically. +This can create extra allocations. # Examples ```@repl -cuNumeric.rand(NDArray, 2, 2) -cuNumeric.rand(NDArray, (4, 1)) +cuNumeric.rand(2, 2) +cuNumeric.rand((4, 1)) A = cuNumeric.zeros(2, 2); cuNumeric.rand!(A) ``` """ Random.rand!(arr::NDArray{Float64}) = cuNumeric.nda_random(arr, 0) -rand(::Type{NDArray}, dims::Dims) = cuNumeric.nda_random_array(UInt64.(collect(dims))) -rand(::Type{NDArray}, dims::Int...) = cuNumeric.rand(NDArray, dims) -rand(dims::Dims) = cuNumeric.rand(NDArray, dims) -rand(dims::Int...) = cuNumeric.rand(NDArray, dims) +Random.rand!(arr::NDArray{T}) where T = error("rand! only supports NDArray{Float64} for now. 
Cast with cuNumeric.as_type.") function rand(::Type{T}, dims::Dims) where {T<:AbstractFloat} arrfp64 = cuNumeric.nda_random_array(UInt64.(collect(dims))) @@ -629,6 +620,8 @@ function rand(::Type{T}, dims::Dims) where {T<:AbstractFloat} end rand(::Type{T}, dims::Int...) where {T<:AbstractFloat} = cuNumeric.rand(T, dims) +rand(dims::Dims) = cuNumeric.rand(DEFAULT_FLOAT, dims) +rand(dims::Int...) = cuNumeric.rand(DEFAULT_FLOAT, dims) #### OPERATIONS #### @doc""" @@ -672,7 +665,6 @@ Currently supports arrays up to 3 dimensions. For higher dimensions, returns `fa This function uses scalar indexing and should not be used in production code. This is meant for testing. - # Examples ```@repl a = cuNumeric.ones(2, 2) @@ -704,7 +696,6 @@ Returns `false` otherwise (including if sizes differ, with a warning). This function uses scalar indexing and should not be used in production code. This is meant for testing. - # Examples ```@repl arr = cuNumeric.ones(2, 2) @@ -739,7 +730,6 @@ a common comparison function. This function uses scalar indexing and should not be used in production code. This is meant for testing. - # Examples ```@repl arr1 = cuNumeric.ones(2, 2) diff --git a/src/util.jl b/src/util.jl index d2b64f16..a47f3ad5 100644 --- a/src/util.jl +++ b/src/util.jl @@ -5,7 +5,7 @@ Returns the timestamp in microseconds. Blocks on all Legate operations preceding the call to this function. """ function get_time_microseconds() - return Legate.value(Legate.time_microseconds()) + return Legate.time_microseconds() end @doc""" @@ -13,5 +13,5 @@ Returns the timestamp in nanoseconds. Blocks on all Legate operations preceding the call to this function. 
""" function get_time_nanoseconds() - return Legate.value(Legate.time_nanoseconds()) + return Legate.time_nanoseconds() end diff --git a/test/runtests.jl b/test/runtests.jl index 60137c4c..a256d3b8 100644 --- a/test/runtests.jl +++ b/test/runtests.jl @@ -390,8 +390,8 @@ end @test cuNumeric.compare(c_base, c_scoped, atol(T), rtol(T)) end - u_rand = cuNumeric.as_type(cuNumeric.rand(NDArray, (15, 15)), T) - v_rand = cuNumeric.as_type(cuNumeric.rand(NDArray, (15, 15)), T) + u_rand = cuNumeric.rand(T, (15, 15)) + v_rand = cuNumeric.rand(T, (15, 15)) u, v = gray_scott_base(T, N, u_rand, v_rand) u_scoped, v_scoped = gray_scott(T, N, u_rand, v_rand)