BiocMetal

Nicholas Cooley 2025-12-22

Infrastructure for calling Apple’s Metal framework from R

This package is still under construction

Overview

BiocMetal provides minimal and robust infrastructure for integrating Apple’s Metal framework into R code bases. Metal is Apple’s GPU computing framework, similar to CUDA (NVIDIA) or OpenCL (cross-platform). It is optimized specifically for Apple Silicon and is (allegedly) compatible with older Intel-based Macs.

This vignette introduces the basic workflow for using BiocMetal to integrate simple (and slightly-more-than-simple) Metal acceleration into R functions. The package also contains a small library of simple kernels. Because Metal’s API is in Obj-C, development here can be glacial in pace at times, but the long-term goal is for this package to serve as a lightweight and robust dependency for developing and deploying code in Bioconductor that can reliably leverage GPUs (or other non-CPU devices) to accelerate complex compute tasks.

System Requirements

  • macOS 10.13 or later
  • Metal-compatible GPU (all modern Macs and some older ones)
  • R 4.5.0 or later

General package philosophy and workflow

The BiocMetal workflow follows a paradigm similar to Simon Urbanek’s OpenCL R package, and user-exposed R and C functions follow the same thematic format. But because Metal’s API is in Obj-C, this requires an extra layer of abstraction under the hood. One up-front difference is that OpenCL performs runtime compilation of kernels, while Metal seems to expect, but does not require, buildtime compilation. Because users have access to either paradigm and both have their place in development and deployment, the functions for kernel compilation are standalone and can be used flexibly at buildtime or runtime. BiocMetal does not compile its shipped kernels upon installation or loading into the namespace, nor does it ship compiled kernels.

Non-comprehensive list of package responsibilities

Formally, BiocMetal needs to perform all items in the list below to serve as infrastructure for either on-the-fly analysis and testing or formal/informal integration into R packages and analysis pipelines. At the R layer, many of these are represented by simple wrapper functions, but all of these tasks are exposed at the C layer should users need them to perform tasks outside BiocMetal’s portfolio of user-facing functions.

It can be difficult to write robust code that hides steps from the user, particularly when devices and environments can be heterogeneous. So instead we’re choosing to ask the user to perform a few extra steps, under the agreement that those steps should always be robust for them.

  1. Query devices - Find available Metal-compatible devices
  2. Create context - Initialize Metal context on a device
  3. Compile kernel - Compile Metal source code to executable
  4. Load library - Load compiled kernel into context
  5. Load kernel - Extract specific kernel function
  6. Manage device buffers - Appropriate buffer creation, transfer and management from device memory
  7. Execution - Run a kernel with appropriate object passing through compute layers, and return intelligible results to the user.
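Condensed into code, the responsibilities above map onto the wrapper functions demonstrated in the sections that follow. This is only a sketch, not a runnable example everywhere: it assumes a Metal-capable Mac with BiocMetal installed, and the argument values (kernel path, argument types) are the ones used later in this vignette.

```r
library(BiocMetal)

devices <- metalDevices()                          # 1. query devices
ctx     <- metalContext(device = devices[[1]])     # 2. create context

lib_path <- tempfile()
notes <- compile_simple_metallib(infiles = system.file("kernels/vector_add.metal",
                                                       package = "BiocMetal"),
                                 outfile = lib_path,
                                 compiler_directive = "single")  # 3. compile kernel
lib <- metal_library_load(context = ctx,
                          filepath = lib_path,
                          compilation_notes = notes)             # 4. load library
krn <- metal_kernel_load(lib = lib,
                         kernel_name = "vector_add")             # 5. load kernel

# 6 and 7: buffer management and execution are handled inside metal_run
n <- 10L
a <- runif(n)
b <- runif(n)
out <- metal_run(kernel = krn,
                 arg_types = c("float", "uint", "float", "float"),
                 numeric(n), n, a, b)
```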

Step 1: Start up

BiocMetal uses an .onAttach hook to return a message indicating whether or not a Metal-capable device is present; if none is found, the message says so.

library(BiocMetal)
# BiocMetal 0.5.2 - Metal GPU acceleration is available
#   Device: Apple M4
# Supported precisions: FP16, FP32, FP64
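If the attach-time message is unwanted (for example in scripts), it can be silenced with base R’s usual mechanism. This assumes the hook emits its message via packageStartupMessage(), which is the only message type that suppressPackageStartupMessages() intercepts.

```r
# silence the attach-time banner; regular library() errors still surface
suppressPackageStartupMessages(library(BiocMetal))
```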

BiocMetal Kernels

There’s no magic here: kernels are just more code, a mix of general math and process optimization. There is, however, a fair amount of housekeeping in BiocMetal intended to make usage straightforward. The argument structures for metal_run and C_metal_run have a combination of rigidity and flexibility to them, to fit within the abstraction path we have to pass through. A header file containing some housekeeping tools is packaged in the same directory as the kernels. Ultimately, metal_run should be able to run most any relatively simple kernel that a user can supply, as long as it respects that overhead argument structure.

# a simple kernel
cat(paste(readLines(system.file("kernels",
                                "vector_add.metal",
                                package = "BiocMetal")),
          "\n"))
# // BiocMetal's type management and other helpers 
#  #include "biocmetal_metal_header.h" 
#   
#  // kernels shipped with BiocMetal use a placeholder such as 'FloatType' 
#  // that is managed by the preprocessor directive set up in "biocmetal_metal_header.h" 
#   
#  // Vector addition: out[i] = a[i] + b[i] 
#  // Buffer layout for metal_execute_kernel: 
#  // buffer(0) = output 
#  // buffer(1) = output_length 
#  // buffer(2+) = inputs 
#   
#  kernel void vector_add( 
#      device FloatType* out [[buffer(0)]], 
#      constant uint& n [[buffer(1)]], 
#      device const FloatType* a [[buffer(2)]], 
#      device const FloatType* b [[buffer(3)]], 
#      uint id [[thread_position_in_grid]]) 
#  { 
#      if (id < n) { 
#          out[id] = a[id] + b[id]; 
#      } 
#  }

A basic plotting routine has been functionalized for this document because all the operational examples here have roughly the same form. You can click below to view the function.

Plotting function for this document

err_and_perf_plotting <- function(benchmarking_result,
                                  val_range) {
  # plots a side-by-side, to make this not look silly, make sure to set
  # plot size to 3.5 x 7 -- plot by default will fill a *square* graphics device
  # unless otherwise specified
  
  # collect the things we care about
  plot_vals01 <- vapply(X = benchmarking_result,
                        FUN = function(x) {
                          mean(x$expr1_elapsed)
                        },
                        FUN.VALUE = vector(mode = "numeric",
                                           length = 1L))
  plot_vals02 <- vapply(X = benchmarking_result,
                        FUN = function(x) {
                          mean(x$expr2_elapsed)
                        },
                        FUN.VALUE = vector(mode = "numeric",
                                           length = 1L))
  plot_vals03 <- vapply(X = benchmarking_result,
                        FUN = function(x) {
                          x$err_vals$rmse
                        },
                        FUN.VALUE = vector(mode = "numeric",
                                           length = 1L))
  
  # set the layout and plot params
  par(mar = c(4,3,3,1),
      mgp = c(2.25, .75, 0))
  layout(mat = matrix(data = c(1,2),
                      nrow = 1,
                      ncol = 2))
  
  # call plot
  # # only log y axis in cases where it makes sense
  if (any(plot_vals01 < 1e-5) | any(plot_vals02 < 1e-5)) {
    log_val <- "x"
  } else {
    log_val <- "xy"
  }
  plot(x = 10^val_range,
       y = plot_vals01,
       xlab = "dim / vec size",
       ylab = "elapsed time",
       ylim = range(c(plot_vals01,
                      plot_vals02)),
       log = log_val,
       pch = 16,
       col = rgb(250, 50, 0, 150, maxColorValue = 255),
       main = "performance by time")
  points(x = 10^val_range,
         y = plot_vals02,
         pch = 16,
         col = rgb(0, 50, 250, 150, maxColorValue = 255))
  legend("topleft",
         cex = 0.75,
         pch = 16,
         col = c(rgb(250, 50, 0, 150, maxColorValue = 255),
                 rgb(0, 50, 250, 150, maxColorValue = 255)),
         legend = c("expr1",
                    "expr2"))
  
  plot(x = 10^val_range,
       y = plot_vals03,
       xlab = "dim / vec size",
       ylab = "rmse",
       ylim = range(plot_vals03),
       log = "x",
       pch = 16,
       col = rgb(50, 250, 50, 200, maxColorValue = 255),
       main = "rmse vs data size")
}

Step 2: Device information

Modern Apple Silicon hardware currently only comes with a single Metal-capable device, but legacy Intel hardware can have multiple Metal-capable devices, and the capability to select devices is prudent for future compatibility. BiocMetal contains several helper functions for querying information about devices:

# return the default metal device
# on modern silicon hardware this will just be unified CPU / GPU identifier
# on legacy hardware this will likely just be the CPU
metal_device_name()
# [1] "Apple M4"

# list all Metal-compatible devices
devices <- metalDevices()
print(devices)
# [[1]]
# <pointer: 0x11a852800>
# attr(,"class")
# [1] "metalDevice"

# devices will be a list, we can select from that list and query
dv01 <- devices[[1]]

# get detailed info about first device
device_info <- metalDeviceInfo(device = dv01)
str(device_info)
# List of 13
#  $ name                       : chr "Apple M4"
#  $ registry.id                : num 4.29e+09
#  $ unified.memory             : logi TRUE
#  $ recommended.max.working.set: num 1.72e+10
#  $ current.allocated          : num 81920
#  $ max.buffer.length          : num 1.29e+10
#  $ max.threads.per.threadgroup: int [1:3] 1024 1024 1024
#  $ max.threadgroup.memory     : int 32768
#  $ supports.raytracing        : logi TRUE
#  $ supports.function.pointers : logi TRUE
#  $ supports.fp16              : logi TRUE
#  $ supports.fp64              : logi TRUE
#  $ supports.int64             : logi TRUE

# show precision levels supported by the specified device
metal_precision_support(device = dv01)
# fp16 fp32 fp64 
# TRUE TRUE TRUE

# show the capabilities of a device
metal_capabilities(device = dv01)
# Metal Device Capabilities
# =========================
# Device:           Apple M4 
# Unified Memory:   Yes 
# Max Buffer Size:  12.88  GB
# Max Threads:      1024 x 1024 x 1024 
# 
# Precision Support:
#   FP16 (half):    Yes 
#   FP32 (float):   Yes (always) 
#   FP64 (double):  Yes

# set the device context
ctx <- metalContext(device = dv01)

Step 3: Get kernels

Using the explicit kernel and library loading functions in BiocMetal looks a little something like this. Some of the arguments here are counterintuitive, partially in the hopes of building in forward compatibility and flexibility. Given a filepath to a .metal kernel file, we can compile that kernel into a .metallib. compile_simple_metallib returns a vector of the compiler_directive flag with one element per compiled kernel, which in the case of this code chunk is just 1. For now this vector is kind of boring, but in a scenario where a library contains kernels with different precision specifications it will hopefully be useful.

metal_library_load loads the library onto the memory of the selected device context, while metal_kernel_load then specifically loads in the kernel and its work pipeline.

single_kernel_path <- system.file("kernels/vector_add.metal",
                                  package = "BiocMetal")
tmp01 <- tempfile()
nts01 <- compile_simple_metallib(infiles = single_kernel_path,
                                 outfile = tmp01,
                                 compiler_directive = "single")
lib <- metal_library_load(context = ctx,
                          filepath = tmp01,
                          compilation_notes = nts01)
# kernel is a pointer
kernel_single <- metal_kernel_load(lib = lib,
                                   kernel_name = "vector_add")

There is a convenience wrapper function in BiocMetal that takes care of those three explicit steps and returns a list of pointers to kernels that have been loaded onto the memory of the selected device context.

# a list of pointers
kernel_list <- adhoc_compilation_and_loading(context = ctx,
                                             kernels = list.files(path = system.file("kernels",
                                                                                     package = "BiocMetal"),
                                                                  pattern = "^vector_"),
                                             compiler_directive = "single",
                                             NOTE = FALSE)

When calling xcrun explicitly, we still need to tie into BiocMetal’s wrappers at the library and kernel loading steps.

files01 <- system.file("kernels",
                       "vector_add.metal",
                       package = "BiocMetal")
tmp02 <- tempfile()

system2(command = "xcrun",
        args = paste("-sdk macosx metal",
                     files01,
                     "-o",
                     tmp02,
                     "-DPRECISION_SINGLE"))


nts02 <- "float"
names(nts02) <- "vector_add"
lib02 <- metal_library_load(context = ctx,
                            filepath = tmp02,
                            compilation_notes = nts02)

# a pointer
kernel_manual <- metal_kernel_load(lib = lib02,
                                   kernel_name = "vector_add")

krn_notes <- adhoc_kernel_type_inference(kernel_file = files01)

# FloatType here is a placeholder used by a compiler directive to adjust kernels
# for half, float, or double precision
krn_notes
#   add_space arg_pos  arg_type read_only
# 1    device       0 FloatType     FALSE
# 2  constant       1      uint      TRUE
# 3    device       2 FloatType      TRUE
# 4    device       3 FloatType      TRUE

krn_types <- krn_notes$arg_type
krn_types[krn_types == "FloatType"] <- attr(x = kernel_manual,
                                            which = "metalMode")

Step 4: Running kernels

This package currently contains only a single metal_run function, which is designed to execute a single kernel operation on a device context and return a single output buffer into R’s memory. Future execution schemes will hopefully be added to expand the utility and flexibility of the package. metal_run’s first two arguments have strict requirements: the first, kernel, must be a kernel object (the pointer returned by metal_kernel_load), and the second must be a character vector detailing the Metal types that the following arguments will be converted to as they pass through the compute layers. After that, the R objects representing the arguments detailed in the kernel’s function declaration must be supplied in matching order.

vec_size <- 10^6
vals01 <- runif(vec_size)
vals02 <- runif(vec_size)

res1 <- vals01 + vals02
res2 <- res3 <- res4 <- vector(mode = "numeric",
                               length = vec_size)
res2 <- metal_run(kernel = kernel_single,
                  arg_types = krn_types,
                  res2,
                  vec_size,
                  vals01,
                  vals02)
res3 <- metal_run(kernel = kernel_list$vector_add,
                  arg_types = krn_types,
                  res3,
                  vec_size,
                  vals01,
                  vals02)
res4 <- metal_run(kernel = kernel_manual,
                  arg_types = krn_types,
                  res4,
                  vec_size,
                  vals01,
                  vals02)

all.equal(res1, res2)
# [1] "Mean relative difference: 2.709564e-08"
all.equal(res1, res3)
# [1] "Mean relative difference: 2.709564e-08"
all.equal(res1, res4)
# [1] "Mean relative difference: 2.709564e-08"
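The ~1e-8 mean relative differences above are what single precision predicts: the kernels were compiled with compiler_directive = "single", so each value is rounded to a 32-bit float on its way through the GPU, while R’s builtin addition works in doubles. A base-R round trip through 4-byte storage (writeBin/readBin with size = 4 store doubles as floats) produces a relative error of the same order:

```r
set.seed(123)
x <- runif(1e5)
# write the doubles as 32-bit floats, then read them back as doubles
x32 <- readBin(writeBin(x, raw(), size = 4L),
               what = "double", n = length(x), size = 4L)
mean(abs(x32 - x) / abs(x))  # on the order of 1e-8, like the results above
```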

Benchmarking things

Benchmarking vector addition shouldn’t contain any surprises: a naive GPU implementation is going to be outperformed by R’s builtin implementation at reasonable vector sizes. Most R builtins are highly optimized, and the likely culprit for our GPU implementation’s underperformance is the overhead of handing the kernel and buffers over to the GPU memory space.

# past 1e8 memory allocations on laptops can begin to struggle, and there's
# no point pushing the margins in a vignette
val_range <- 3:7
pBar <- txtProgressBar(style = 1)
PBAR <- length(val_range)
res1 <- vector(mode = "list",
               length = PBAR)
tstart <- Sys.time()
for (a1 in seq_along(val_range)) {
  vec_size <- 10^val_range[a1]
  vals01 <- runif(vec_size)
  vals02 <- runif(vec_size)
  dummy_vector <- vector(mode = "numeric",
                         length = vec_size)
  
  res1[[a1]] <- adhoc_benchmarking_w_numerics(expr1 = vals01 + vals02,
                                              expr2 = metal_run(kernel = kernel_single,
                                                                arg_types = krn_types,
                                                                dummy_vector,
                                                                vec_size,
                                                                vals01,
                                                                vals02),
                                              refresh_vars = list("vals01" = function() runif(vec_size),
                                                                  "vals02" = function() runif(vec_size)),
                                              verbose = TRUE)
  setTxtProgressBar(pb = pBar,
                    value = a1 / PBAR)
}
# ================================================================================
tend <- Sys.time()
close(pBar)
print(tend - tstart)
# Time difference of 8.216414 secs

err_and_perf_plotting(benchmarking_result = res1,
                      val_range = val_range)

We can begin to see where our GPU implementations gain an advantage with benchmarking of matrix multiplication implementations, even naive ones.
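For reference, the “naive” algorithm that the kernel’s name suggests (the kernel source itself is not reproduced here) is the textbook triple loop: each output element is a dot product of a row of A and a column of B. A base-R version, checked against the builtin %*%:

```r
# textbook O(n^3) matrix multiply, for comparison against %*%
naive_mm <- function(A, B) {
  stopifnot(ncol(A) == nrow(B))
  out <- matrix(0, nrow = nrow(A), ncol = ncol(B))
  for (i in seq_len(nrow(A))) {
    for (j in seq_len(ncol(B))) {
      acc <- 0
      for (k in seq_len(ncol(A))) {
        acc <- acc + A[i, k] * B[k, j]
      }
      out[i, j] <- acc
    }
  }
  out
}

A <- matrix(runif(9), nrow = 3, ncol = 3)
B <- matrix(runif(9), nrow = 3, ncol = 3)
all.equal(naive_mm(A, B), A %*% B)
# [1] TRUE
```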

# Get kernel
kernel_mm <- adhoc_compilation_and_loading(context = ctx,
                                           kernels = "matrix_ops_naive_mm.metal",
                                           compiler_directive = "single",
                                           NOTE = FALSE)

krn_notes <- adhoc_kernel_type_inference(kernel_file = system.file("kernels/matrix_ops_naive_mm.metal",
                                                                   package = "BiocMetal"))

krn_types <- krn_notes$arg_type
krn_types[krn_types == "FloatType"] <- attr(x = kernel_mm[[1]],
                                            which = "metalMode")

# Note: We use square matrices for simplicity
# For testing general rectangular matrices, modify dimensions accordingly
val_range <- 2:4  # Testing with smaller matrices due to O(n^3) complexity
pBar <- txtProgressBar(style = 1)
PBAR <- length(val_range)
res1 <- vector(mode = "list",
               length = PBAR)

tstart <- Sys.time()
for (a1 in seq_along(val_range)) {
  dims01 <- 9^val_range[a1]  # Square matrices n x n
  mat_vec <- dims01 * dims01
  
  # Create test matrices
  mat_A <- matrix(rnorm(mat_vec,
                        mean = 0,
                        sd = 1),
                  nrow = dims01,
                  ncol = dims01)
  mat_B <- matrix(rnorm(mat_vec,
                        mean = 0,
                        sd = 1),
                  nrow = dims01,
                  ncol = dims01)
  
  # flatten matrices for Metal; note as.vector() uses R's column-major layout
  vals01 <- as.vector(mat_A)
  vals02 <- as.vector(mat_B)
  dummy_vector <- vector(mode = "numeric",
                         length = mat_vec)
  
  res1[[a1]] <- adhoc_benchmarking_w_numerics(expr1 = mat_A %*% mat_B,
                                              expr2 = {
                                                int_res <- metal_run(kernel = kernel_mm[[1]],
                                                                     arg_types = krn_types,
                                                                     dummy_vector,
                                                                     mat_vec,
                                                                     vals01,
                                                                     vals02,
                                                                     c(dims01, dims01, dims01))
                                                mat_res <- matrix(data = int_res,
                                                                  nrow = dims01,
                                                                  ncol = dims01)
                                              },
                                              refresh_vars = list("mat_A" = function() {
                                                matrix(rnorm(mat_vec,
                                                             mean = 0,
                                                             sd = 1),
                                                       nrow = dims01,
                                                       ncol = dims01)
                                              },
                                              "mat_B" = function() {
                                                matrix(rnorm(mat_vec,
                                                             mean = 0,
                                                             sd = 1),
                                                       nrow = dims01,
                                                       ncol = dims01)
                                              },
                                              "vals01" = function() {
                                                as.vector(mat_A)
                                              },
                                              "vals02" = function() {
                                                as.vector(mat_B)
                                              }),
                                              verbose = FALSE)
  
  setTxtProgressBar(pb = pBar,
                    value = a1 / PBAR)
}
# ================================================================================
tend <- Sys.time()
close(pBar)
print(tend - tstart)
# Time difference of 2.157735 mins

err_and_perf_plotting(benchmarking_result = res1,
                      val_range = val_range)

Other notes

Under construction

This package is not currently a direct responsibility of mine, though I am making a considerable effort to put time into it when I can. The long-term goal here is to create robust infrastructure that folks can rely on to incorporate Metal acceleration into their own code bases. Maybe that’s a fool’s errand, maybe it’s not, but I got tired of the seeming inability to do GPU- and hardware-accelerated optimization natively in R.

There is an extensive to-do list for this package, including, but not limited to:

  • tools for loading .safetensors formatted files as well as some open-source model formats
  • implementing kernel -> kernel compute pipelines
  • expansion of the current kernel library
  • some function naming convention adjustments under the hood
  • at least one more runner function for kernel execution that allows for more kernel complexity

Half precision

C does not natively support half precision (to the best of my knowledge). There are a few ways to handle this complication:

Roll my own

  • Pro: I would have control
  • Con: I would be responsible

Import a header library

  • Pro: such a library exists as an NSF-funded academic project
  • Con: the project still appears to be evolving (and that’s fine!), but this creates a potential point of maintenance and management complications
  • Pro: implementing fp16 in packages for other GPU frameworks should be relatively similar

Let Metal handle it

  • Pro: Metal has baked-in tools to do this
  • Con: creates some extra complications and challenges in how type management occurs across the different compute layers (R -> C -> ObjC -> Metal)

Current choice: Import a header library.

If this choice changes, an attempt will be made to reflect that in the documentation. Currently the fp16 headers are included from a static collection (December 2025). As this package matures, an attempt will be made to switch to a dynamic collection of the most recent stable headers around Bioconductor releases.

Double precision

This is supposed to work out of the box, but it currently doesn’t, so only half and float are implemented.

Session info

sessionInfo()
# R version 4.5.1 (2025-06-13)
# Platform: aarch64-apple-darwin20
# Running under: macOS Sequoia 15.6
# 
# Matrix products: default
# BLAS:   /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRblas.0.dylib 
# LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1
# 
# locale:
# [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
# 
# time zone: Europe/Dublin
# tzcode source: internal
# 
# attached base packages:
# [1] stats     graphics  grDevices utils     datasets  methods   base     
# 
# other attached packages:
# [1] BiocMetal_0.5.2
# 
# loaded via a namespace (and not attached):
#  [1] compiler_4.5.1  fastmap_1.2.0   cli_3.6.5       tools_4.5.1    
#  [5] htmltools_0.5.9 yaml_2.3.11     rmarkdown_2.30  knitr_1.50     
#  [9] xfun_0.54       digest_0.6.39   rlang_1.1.6     evaluate_1.0.5
