BiocMetal

Nicholas Cooley 2025-12-22

Infrastructure for calling Apple’s Metal framework from R

This package is still under construction

Overview

BiocMetal provides minimal and robust infrastructure for integrating Apple’s Metal framework into R code bases. Metal is Apple’s GPU computing framework, similar to CUDA (NVIDIA) or OpenCL (cross-platform). It is optimized specifically for Apple Silicon and is (allegedly) compatible with older Intel-based Macs.

This vignette introduces the basic workflow for using BiocMetal to integrate simple (and slightly-more-than-simple) Metal acceleration into R functions. The package also contains a small library of simple kernels. Because Metal’s API is in Obj-C, development here can be glacial in pace at times, but the long-term goal is for this package to serve as a lightweight and robust dependency for developing and deploying code in Bioconductor that can reliably leverage GPUs (or other non-CPU devices) to accelerate complex compute tasks.

System Requirements

  • macOS 10.13 or later
  • Metal-compatible GPU (all modern Macs and some older ones)
  • R 4.5.0 or later

General package philosophy and workflow

The BiocMetal workflow follows a paradigm similar to Simon Urbanek’s OpenCL R package, and user-exposed R and C functions follow the same thematic format. But because Metal’s API is in Obj-C, this requires an extra layer of abstraction under the hood. One up-front difference is that OpenCL performs runtime compilation of kernels, while Metal seems to expect, but does not require, buildtime compilation. Because users have access to either paradigm and both have their place in development and deployment, the functions for kernel compilation are standalone and can be used flexibly at buildtime or runtime. BiocMetal does not compile its shipped kernels upon installation or loading into the namespace, nor does it ship compiled kernels.

Non-comprehensive list of package responsibilities

Formally, BiocMetal needs to perform all items in the list below to serve as infrastructure for either on-the-fly analysis and testing or formal/informal integration into R packages and analysis pipelines. At the R layer, many of these are represented by simple wrapper functions, but all of these tasks are exposed at the C layer should users need them to perform tasks outside BiocMetal’s portfolio of user-facing functions.

It can be difficult to write robust code that hides steps from the user, particularly when devices and environments can be heterogeneous. So instead we’re choosing to ask the user to perform a few extra steps, under the agreement that those steps should always be robust for them.

  1. Query devices - Find available Metal-compatible devices
  2. Create context - Initialize Metal context on a device
  3. Compile kernel - Compile Metal source code to executable
  4. Load library - Load compiled kernel into context
  5. Load kernel - Extract specific kernel function
  6. Manage device buffers - Appropriate buffer creation, transfer and management from device memory
  7. Execution - Run a kernel with appropriate object passing through compute layers, and return intelligible results to the user.
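Condensed into code, the responsibilities above map onto the wrapper functions demonstrated in the sections that follow. This is only a sketch, not a runnable example everywhere: it assumes a Metal-capable Mac with BiocMetal installed, and the argument values (kernel path, argument types) are the ones used later in this vignette.

```r
library(BiocMetal)

devices <- metalDevices()                          # 1. query devices
ctx     <- metalContext(device = devices[[1]])     # 2. create context

lib_path <- tempfile()
notes <- compile_simple_metallib(infiles = system.file("kernels/vector_add.metal",
                                                       package = "BiocMetal"),
                                 outfile = lib_path,
                                 compiler_directive = "single")  # 3. compile kernel
lib <- metal_library_load(context = ctx,
                          filepath = lib_path,
                          compilation_notes = notes)             # 4. load library
krn <- metal_kernel_load(lib = lib,
                         kernel_name = "vector_add")             # 5. load kernel

# 6 and 7: buffer management and execution are handled inside metal_run
n <- 10L
a <- runif(n)
b <- runif(n)
out <- metal_run(kernel = krn,
                 arg_types = c("float", "uint", "float", "float"),
                 numeric(n), n, a, b)
```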

Step 1: Start up

BiocMetal uses an .onAttach hook to return a message indicating whether or not a Metal-capable device is present; if none is found, the message says so.

library(BiocMetal)
# BiocMetal 0.5.2 - Metal GPU acceleration is available
#   Device: Apple M4
# Supported precisions: FP16, FP32, FP64
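If the attach-time message is unwanted (for example in scripts), it can be silenced with base R’s usual mechanism. This assumes the hook emits its message via packageStartupMessage(), which is the only message type that suppressPackageStartupMessages() intercepts.

```r
# silence the attach-time banner; regular library() errors still surface
suppressPackageStartupMessages(library(BiocMetal))
```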

BiocMetal Kernels

There’s no magic here: kernels are just more code, a mix of general math and process optimization. There is, however, a fair amount of housekeeping in BiocMetal intended to make usage straightforward. The argument structures for metal_run and C_metal_run have a combination of rigidity and flexibility to them, to fit within the abstraction path we have to pass through. A header file containing some housekeeping tools is packaged in the same directory as the kernels. Ultimately, metal_run should be able to run most any relatively simple kernel that a user can supply, as long as it respects that overhead argument structure.

# a simple kernel
cat(paste(readLines(system.file("kernels",
                                "vector_add.metal",
                                package = "BiocMetal")),
          "\n"))
# // BiocMetal's type management and other helpers 
#  #include "biocmetal_metal_header.h" 
#   
#  // kernels shipped with BiocMetal use a placeholder such as 'FloatType' 
#  // that is managed by the preprocessor directive set up in "biocmetal_metal_header.h" 
#   
#  // Vector addition: out[i] = a[i] + b[i] 
#  // Buffer layout for metal_execute_kernel: 
#  // buffer(0) = output 
#  // buffer(1) = output_length 
#  // buffer(2+) = inputs 
#   
#  kernel void vector_add( 
#      device FloatType* out [[buffer(0)]], 
#      constant uint& n [[buffer(1)]], 
#      device const FloatType* a [[buffer(2)]], 
#      device const FloatType* b [[buffer(3)]], 
#      uint id [[thread_position_in_grid]]) 
#  { 
#      if (id < n) { 
#          out[id] = a[id] + b[id]; 
#      } 
#  }

A basic plotting routine has been functionalized for this document because all the operational examples here have roughly the same form. You can click below to view the function.

Plotting function for this document

err_and_perf_plotting <- function(benchmarking_result,
                                  val_range) {
  # plots a side-by-side, to make this not look silly, make sure to set
  # plot size to 3.5 x 7 -- plot by default will fill a *square* graphics device
  # unless otherwise specified
  
  # collect the things we care about
  plot_vals01 <- vapply(X = benchmarking_result,
                        FUN = function(x) {
                          mean(x$expr1_elapsed)
                        },
                        FUN.VALUE = vector(mode = "numeric",
                                           length = 1L))
  plot_vals02 <- vapply(X = benchmarking_result,
                        FUN = function(x) {
                          mean(x$expr2_elapsed)
                        },
                        FUN.VALUE = vector(mode = "numeric",
                                           length = 1L))
  plot_vals03 <- vapply(X = benchmarking_result,
                        FUN = function(x) {
                          x$err_vals$rmse
                        },
                        FUN.VALUE = vector(mode = "numeric",
                                           length = 1L))
  
  # set the layout and plot params
  par(mar = c(4,3,3,1),
      mgp = c(2.25, .75, 0))
  layout(mat = matrix(data = c(1,2),
                      nrow = 1,
                      ncol = 2))
  
  # call plot
  # # only log y axis in cases where it makes sense
  if (any(plot_vals01 < 1e-5) | any(plot_vals02 < 1e-5)) {
    log_val <- "x"
  } else {
    log_val <- "xy"
  }
  plot(x = 10^val_range,
       y = plot_vals01,
       xlab = "dim / vec size",
       ylab = "elapsed time",
       ylim = range(c(plot_vals01,
                      plot_vals02)),
       log = log_val,
       pch = 16,
       col = rgb(250, 50, 0, 150, maxColorValue = 255),
       main = "performance by time")
  points(x = 10^val_range,
         y = plot_vals02,
         pch = 16,
         col = rgb(0, 50, 250, 150, maxColorValue = 255))
  legend("topleft",
         cex = 0.75,
         pch = 16,
         col = c(rgb(250, 50, 0, 150, maxColorValue = 255),
                 rgb(0, 50, 250, 150, maxColorValue = 255)),
         legend = c("expr1",
                    "expr2"))
  
  plot(x = 10^val_range,
       y = plot_vals03,
       xlab = "dim / vec size",
       ylab = "rmse",
       ylim = range(plot_vals03),
       log = "x",
       pch = 16,
       col = rgb(50, 250, 50, 200, maxColorValue = 255),
       main = "rmse vs data size")
}

Step 2: Device information

Modern Apple Silicon hardware currently only comes with a single Metal-capable device, but legacy Intel hardware can have multiple Metal-capable devices, and the capability to select devices is prudent for future compatibility. BiocMetal contains several helper functions for querying information about devices:

# return the default metal device
# on modern silicon hardware this will just be unified CPU / GPU identifier
# on legacy hardware this will likely just be the CPU
metal_device_name()
# [1] "Apple M4"

# list all Metal-compatible devices
devices <- metalDevices()
print(devices)
# [[1]]
# <pointer: 0x11a852800>
# attr(,"class")
# [1] "metalDevice"

# devices will be a list, we can select from that list and query
dv01 <- devices[[1]]

# get detailed info about first device
device_info <- metalDeviceInfo(device = dv01)
str(device_info)
# List of 13
#  $ name                       : chr "Apple M4"
#  $ registry.id                : num 4.29e+09
#  $ unified.memory             : logi TRUE
#  $ recommended.max.working.set: num 1.72e+10
#  $ current.allocated          : num 81920
#  $ max.buffer.length          : num 1.29e+10
#  $ max.threads.per.threadgroup: int [1:3] 1024 1024 1024
#  $ max.threadgroup.memory     : int 32768
#  $ supports.raytracing        : logi TRUE
#  $ supports.function.pointers : logi TRUE
#  $ supports.fp16              : logi TRUE
#  $ supports.fp64              : logi TRUE
#  $ supports.int64             : logi TRUE

# show precision levels supported by the specified device
metal_precision_support(device = dv01)
# fp16 fp32 fp64 
# TRUE TRUE TRUE

# show the capabilities of a device
metal_capabilities(device = dv01)
# Metal Device Capabilities
# =========================
# Device:           Apple M4 
# Unified Memory:   Yes 
# Max Buffer Size:  12.88  GB
# Max Threads:      1024 x 1024 x 1024 
# 
# Precision Support:
#   FP16 (half):    Yes 
#   FP32 (float):   Yes (always) 
#   FP64 (double):  Yes

# set the device context
ctx <- metalContext(device = dv01)

Step 3: Get kernels

Using the explicit kernel and library loading functions in BiocMetal looks a little something like this. Some of the arguments here are counterintuitive, partially in the hopes of building in forward compatibility and flexibility. Given a filepath to a .metal kernel file, we can compile that kernel into a .metallib. compile_simple_metallib returns a vector of the compiler_directive flag with one element per compiled kernel, which in the case of this code chunk is just 1. For now this vector is kind of boring, but in a scenario where a library contains kernels with different precision specifications it will hopefully be useful.

metal_library_load loads the library onto the memory of the selected device context, while metal_kernel_load then specifically loads in the kernel and its work pipeline.

single_kernel_path <- system.file("kernels/vector_add.metal",
                                  package = "BiocMetal")
tmp01 <- tempfile()
nts01 <- compile_simple_metallib(infiles = single_kernel_path,
                                 outfile = tmp01,
                                 compiler_directive = "single")
lib <- metal_library_load(context = ctx,
                          filepath = tmp01,
                          compilation_notes = nts01)
# kernel is a pointer
kernel_single <- metal_kernel_load(lib = lib,
                                   kernel_name = "vector_add")

There is a convenience wrapper function in BiocMetal that takes care of those three explicit steps and returns a list of pointers to kernels that have been loaded onto the memory of the selected device context.

# a list of pointers
kernel_list <- adhoc_compilation_and_loading(context = ctx,
                                             kernels = list.files(path = system.file("kernels",
                                                                                     package = "BiocMetal"),
                                                                  pattern = "^vector_"),
                                             compiler_directive = "single",
                                             NOTE = FALSE)

When calling xcrun explicitly, we still need to tie into BiocMetal’s wrappers at the library and kernel loading steps.

files01 <- system.file("kernels",
                       "vector_add.metal",
                       package = "BiocMetal")
tmp02 <- tempfile()

system2(command = "xcrun",
        args = paste("-sdk macosx metal",
                     files01,
                     "-o",
                     tmp02,
                     "-DPRECISION_SINGLE"))


nts02 <- "float"
names(nts02) <- "vector_add"
lib02 <- metal_library_load(context = ctx,
                            filepath = tmp02,
                            compilation_notes = nts02)

# a pointer
kernel_manual <- metal_kernel_load(lib = lib02,
                                   kernel_name = "vector_add")

krn_notes <- adhoc_kernel_type_inference(kernel_file = files01)

# FloatType here is a placeholder used by a compiler directive to adjust kernels
# for half, float, or double precision
krn_notes
#   add_space arg_pos  arg_type read_only
# 1    device       0 FloatType     FALSE
# 2  constant       1      uint      TRUE
# 3    device       2 FloatType      TRUE
# 4    device       3 FloatType      TRUE

krn_types <- krn_notes$arg_type
krn_types[krn_types == "FloatType"] <- attr(x = kernel_manual,
                                            which = "metalMode")

Step 4: Running kernels

This package currently contains only a single metal_run function, which is designed to execute a single kernel operation on a device context and return a single output buffer into R’s memory. Future execution schemes will hopefully be added to expand the utility and flexibility of the package. metal_run’s first two arguments have strict requirements: the first, kernel, must be a kernel object (the pointer returned by metal_kernel_load), and the second must be a character vector detailing the Metal types that the following arguments will be converted to as they pass through the compute layers. After that, the R objects representing the arguments detailed in the kernel’s function declaration must be supplied in matching order.

vec_size <- 10^6
vals01 <- runif(vec_size)
vals02 <- runif(vec_size)

res1 <- vals01 + vals02
res2 <- res3 <- res4 <- vector(mode = "numeric",
                               length = vec_size)
res2 <- metal_run(kernel = kernel_single,
                  arg_types = krn_types,
                  res2,
                  vec_size,
                  vals01,
                  vals02)
res3 <- metal_run(kernel = kernel_list$vector_add,
                  arg_types = krn_types,
                  res3,
                  vec_size,
                  vals01,
                  vals02)
res4 <- metal_run(kernel = kernel_manual,
                  arg_types = krn_types,
                  res4,
                  vec_size,
                  vals01,
                  vals02)

all.equal(res1, res2)
# [1] "Mean relative difference: 2.709564e-08"
all.equal(res1, res3)
# [1] "Mean relative difference: 2.709564e-08"
all.equal(res1, res4)
# [1] "Mean relative difference: 2.709564e-08"
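The ~1e-8 mean relative differences above are what single precision predicts: the kernels were compiled with compiler_directive = "single", so each value is rounded to a 32-bit float on its way through the GPU, while R’s builtin addition works in doubles. A base-R round trip through 4-byte storage (writeBin/readBin with size = 4 store doubles as floats) produces a relative error of the same order:

```r
set.seed(123)
x <- runif(1e5)
# write the doubles as 32-bit floats, then read them back as doubles
x32 <- readBin(writeBin(x, raw(), size = 4L),
               what = "double", n = length(x), size = 4L)
mean(abs(x32 - x) / abs(x))  # on the order of 1e-8, like the results above
```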

Benchmarking things

Benchmarking vector addition shouldn’t contain any surprises: a naive GPU implementation is going to be outperformed by R’s builtin implementation at reasonable vector sizes. Most R builtins are highly optimized, and the likely culprit for our GPU implementation’s underperformance is the overhead of handing the kernel and buffers over to the GPU memory space.

# past 1e8 memory allocations on laptops can begin to struggle, and there's
# no point pushing the margins in a vignette
val_range <- 3:7
pBar <- txtProgressBar(style = 1)
PBAR <- length(val_range)
res1 <- vector(mode = "list",
               length = PBAR)
tstart <- Sys.time()
for (a1 in seq_along(val_range)) {
  vec_size <- 10^val_range[a1]
  vals01 <- runif(vec_size)
  vals02 <- runif(vec_size)
  dummy_vector <- vector(mode = "numeric",
                         length = vec_size)
  
  res1[[a1]] <- adhoc_benchmarking_w_numerics(expr1 = vals01 + vals02,
                                              expr2 = metal_run(kernel = kernel_single,
                                                                arg_types = krn_types,
                                                                dummy_vector,
                                                                vec_size,
                                                                vals01,
                                                                vals02),
                                              refresh_vars = list("vals01" = function() runif(vec_size),
                                                                  "vals02" = function() runif(vec_size)),
                                              verbose = TRUE)
  setTxtProgressBar(pb = pBar,
                    value = a1 / PBAR)
}
# ================================================================================
tend <- Sys.time()
close(pBar)
print(tend - tstart)
# Time difference of 8.216414 secs

err_and_perf_plotting(benchmarking_result = res1,
                      val_range = val_range)

We can begin to see where our GPU implementations gain an advantage with benchmarking of matrix multiplication implementations, even naive ones.
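For reference, the “naive” algorithm that the kernel’s name suggests (the kernel source itself is not reproduced here) is the textbook triple loop: each output element is a dot product of a row of A and a column of B. A base-R version, checked against the builtin %*%:

```r
# textbook O(n^3) matrix multiply, for comparison against %*%
naive_mm <- function(A, B) {
  stopifnot(ncol(A) == nrow(B))
  out <- matrix(0, nrow = nrow(A), ncol = ncol(B))
  for (i in seq_len(nrow(A))) {
    for (j in seq_len(ncol(B))) {
      acc <- 0
      for (k in seq_len(ncol(A))) {
        acc <- acc + A[i, k] * B[k, j]
      }
      out[i, j] <- acc
    }
  }
  out
}

A <- matrix(runif(9), nrow = 3, ncol = 3)
B <- matrix(runif(9), nrow = 3, ncol = 3)
all.equal(naive_mm(A, B), A %*% B)
# [1] TRUE
```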

# Get kernel
kernel_mm <- adhoc_compilation_and_loading(context = ctx,
                                           kernels = "matrix_ops_naive_mm.metal",
                                           compiler_directive = "single",
                                           NOTE = FALSE)

krn_notes <- adhoc_kernel_type_inference(kernel_file = system.file("kernels/matrix_ops_naive_mm.metal",
                                                                   package = "BiocMetal"))

krn_types <- krn_notes$arg_type
krn_types[krn_types == "FloatType"] <- attr(x = kernel_mm[[1]],
                                            which = "metalMode")

# Note: We use square matrices for simplicity
# For testing general rectangular matrices, modify dimensions accordingly
val_range <- 2:4  # Testing with smaller matrices due to O(n^3) complexity
pBar <- txtProgressBar(style = 1)
PBAR <- length(val_range)
res1 <- vector(mode = "list",
               length = PBAR)

tstart <- Sys.time()
for (a1 in seq_along(val_range)) {
  dims01 <- 9^val_range[a1]  # Square matrices n x n
  mat_vec <- dims01 * dims01
  
  # Create test matrices
  mat_A <- matrix(rnorm(mat_vec,
                        mean = 0,
                        sd = 1),
                  nrow = dims01,
                  ncol = dims01)
  mat_B <- matrix(rnorm(mat_vec,
                        mean = 0,
                        sd = 1),
                  nrow = dims01,
                  ncol = dims01)
  
  # flatten matrices for Metal; note as.vector() uses R's column-major layout
  vals01 <- as.vector(mat_A)
  vals02 <- as.vector(mat_B)
  dummy_vector <- vector(mode = "numeric",
                         length = mat_vec)
  
  res1[[a1]] <- adhoc_benchmarking_w_numerics(expr1 = mat_A %*% mat_B,
                                              expr2 = {
                                                int_res <- metal_run(kernel = kernel_mm[[1]],
                                                                     arg_types = krn_types,
                                                                     dummy_vector,
                                                                     mat_vec,
                                                                     vals01,
                                                                     vals02,
                                                                     c(dims01, dims01, dims01))
                                                mat_res <- matrix(data = int_res,
                                                                  nrow = dims01,
                                                                  ncol = dims01)
                                              },
                                              refresh_vars = list("mat_A" = function() {
                                                matrix(rnorm(mat_vec,
                                                             mean = 0,
                                                             sd = 1),
                                                       nrow = dims01,
                                                       ncol = dims01)
                                              },
                                              "mat_B" = function() {
                                                matrix(rnorm(mat_vec,
                                                             mean = 0,
                                                             sd = 1),
                                                       nrow = dims01,
                                                       ncol = dims01)
                                              },
                                              "vals01" = function() {
                                                as.vector(mat_A)
                                              },
                                              "vals02" = function() {
                                                as.vector(mat_B)
                                              }),
                                              verbose = FALSE)
  
  setTxtProgressBar(pb = pBar,
                    value = a1 / PBAR)
}
# ================================================================================
tend <- Sys.time()
close(pBar)
print(tend - tstart)
# Time difference of 2.157735 mins

err_and_perf_plotting(benchmarking_result = res1,
                      val_range = val_range)

Other notes

Under construction

This package is not currently a direct responsibility of mine, though I am making a considerable effort to put time into it when I can. The long-term goal here is to create robust infrastructure that folks can rely on to incorporate Metal acceleration into their own code bases. Maybe that’s a fool’s errand, maybe it’s not, but I got tired of the seeming inability to do GPU- and hardware-accelerated optimization natively in R.

There is an extensive to-do list for this package, including, but not limited to:

  • tools for loading .safetensors formatted files as well as some open-source model formats
  • implementing kernel -> kernel compute pipelines
  • expansion of the current kernel library
  • some function naming convention adjustments under the hood
  • at least one more runner function for kernel execution that allows for more kernel complexity

Half precision

C does not natively support half precision (to the best of my knowledge). There are a few ways to handle this complication:

Roll my own

  • Pro: I would have control
  • Con: I would be responsible

Import a header library

  • Pro: such a library exists as an NSF-funded academic project
  • Con: the project still appears to be evolving (and that’s fine!), but this creates a potential point of maintenance and management complications
  • Pro: implementing fp16 in packages for other GPU frameworks should be relatively similar

Let Metal handle it

  • Pro: Metal has baked-in tools to do this
  • Con: creates some extra complications and challenges in how type management occurs across the different compute layers (R -> C -> ObjC -> Metal)

Current choice: Import a header library.

If this choice changes, an attempt will be made to reflect that in the documentation. Currently the fp16 headers are included from a static collection (December 2025). As this package matures, an attempt will be made to switch to a dynamic collection of the most recent stable headers around Bioconductor releases.

Double precision

This is supposed to work out of the box, but it currently doesn’t, so only half and float are implemented.

Session info

sessionInfo()
# R version 4.5.1 (2025-06-13)
# Platform: aarch64-apple-darwin20
# Running under: macOS Sequoia 15.6
# 
# Matrix products: default
# BLAS:   /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRblas.0.dylib 
# LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1
# 
# locale:
# [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
# 
# time zone: Europe/Dublin
# tzcode source: internal
# 
# attached base packages:
# [1] stats     graphics  grDevices utils     datasets  methods   base     
# 
# other attached packages:
# [1] BiocMetal_0.5.2
# 
# loaded via a namespace (and not attached):
#  [1] compiler_4.5.1  fastmap_1.2.0   cli_3.6.5       tools_4.5.1    
#  [5] htmltools_0.5.9 yaml_2.3.11     rmarkdown_2.30  knitr_1.50     
#  [9] xfun_0.54       digest_0.6.39   rlang_1.1.6     evaluate_1.0.5
