feat: add initial ARM64 (aarch64) architecture support #1875

Draft
tomassrnka wants to merge 60 commits into main from arm64-support

Conversation

@tomassrnka (Member) commented Feb 10, 2026

Summary

Adds ARM64/aarch64 architecture support to the E2B infrastructure, enabling builds and sandbox execution on Apple Silicon and other ARM64 hosts (via Lima VM + nested KVM).

Changes by commit:

  1. Makefiles — Replace hardcoded GOARCH=amd64 and --platform linux/amd64 with $(shell go env GOARCH) across all 4 service Makefiles
  2. Go runtime detection — Disable SMT on ARM64 (not supported), use runtime.GOARCH for OCI image platform, add ARM64 fallback for CPU detection (gopsutil doesn't populate Family/Model on ARM)
  3. Provision script — Make chattr calls non-fatal (|| true) for busybox versions that lack it
  4. create-build — Arch-aware Firecracker and kernel download URLs (tries arm64/ subdirectory first, falls back to generic), E2B_BASE_IMAGE env var for base image override
  5. fetch-busybox — Makefile target to swap the embedded x86 busybox binary with the system's ARM64 busybox-static before compilation
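The "arch-specific first, then generic" lookup in item 4 can be sketched as an ordered list of candidate locations. This is a minimal illustration, not the PR's code: the function name `artifactCandidates`, the path layout, and the example host and version strings are all assumptions.

```go
package main

import "fmt"

// artifactCandidates returns the locations to try, in order, for a versioned
// artifact: the arch-specific subdirectory first, then the legacy flat path.
// The layout <base>/<version>/<arch>/<name> is assumed for illustration and
// is not taken from the repository.
func artifactCandidates(base, version, arch, name string) []string {
	return []string{
		fmt.Sprintf("%s/%s/%s/%s", base, version, arch, name),
		fmt.Sprintf("%s/%s/%s", base, version, name),
	}
}

func main() {
	// Hypothetical base URL and version, used only to show the ordering.
	for _, u := range artifactCandidates("https://example.invalid/kernels", "vmlinux-6.1", "arm64", "vmlinux.bin") {
		fmt.Println(u)
	}
}
```

Keeping the ordering in one place means the Firecracker and kernel lookups can share it rather than each encoding its own fallback.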

Related PRs:

Test plan

  • Build orchestrator, envd, API, and client-proxy natively on ARM64 Linux
  • Run make fetch-busybox on ARM64 host to swap busybox binary
  • Template build succeeds with create-build on ARM64
  • Sandbox create/exec/delete works on ARM64 (Lima VM + KVM)
  • Verify uname -m in sandbox returns aarch64
  • Confirm no regression on x86_64 builds (all changes are backwards compatible)

🤖 Generated with Claude Code


Note

High Risk
High risk because it changes core sandbox/Firecracker startup and snapshot handling (including disabling seccomp on ARM64) and alters kernel/Firecracker path resolution and OCI platform selection, which can impact both security posture and runtime stability across architectures.

Overview
Adds initial ARM64 support end-to-end by introducing an ARM64 PR workflow (cross-compile + native arm runners) and runner setup script, making service build/publish Makefiles architecture-aware, and teaching the orchestrator to resolve Firecracker/kernel artifacts and OCI pulls by TARGET_ARCH with legacy fallbacks. It also adjusts runtime behavior for ARM64 (disable SMT, tweak UFFD write-protect usage, pass --no-seccomp for Firecracker on ARM64), hardens/deflakes several concurrency and hugepage-related tests, and updates template/rootfs provisioning to better handle clock-skewed APT, missing chattr, static network setup, and ext4 repair retries.

Written by Cursor Bugbot for commit 5491da7. This will update automatically on new commits.

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 589a0596cb

Comment on lines +28 to +37
// On ARM64, gopsutil doesn't populate Family/Model from /proc/cpuinfo.
// Provide fallback values so callers don't get an error.
if (family == "" || model == "") && runtime.GOARCH == "arm64" {
	if family == "" {
		family = "arm64"
	}
	if model == "" {
		model = "0"
	}
} else if family == "" || model == "" {

let's make it cleaner

Suggested change
-// On ARM64, gopsutil doesn't populate Family/Model from /proc/cpuinfo.
-// Provide fallback values so callers don't get an error.
-if (family == "" || model == "") && runtime.GOARCH == "arm64" {
-	if family == "" {
-		family = "arm64"
-	}
-	if model == "" {
-		model = "0"
-	}
-} else if family == "" || model == "" {
+// On ARM64, gopsutil doesn't populate Family/Model from /proc/cpuinfo.
+// Provide fallback values so callers don't get an error.
+if runtime.GOARCH == "arm64" {
+	if family == "" {
+		family = "arm64"
+	}
+	if model == "" {
+		model = "0"
+	}
+}
+if family == "" || model == "" {
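For reference, the suggested shape (fill in the ARM64 fallback first, then run a single validity check for every architecture) might look like this as a standalone function. It is a sketch only: `cpuIdentity`, its signature, and the error text are hypothetical, and the architecture is passed in as a parameter so both paths can be exercised on any host.

```go
package main

import (
	"errors"
	"fmt"
)

const archARM64 = "arm64"

// cpuIdentity fills in placeholder values on ARM64, where the detected family
// and model may be empty, then validates the result on every architecture.
// The name and signature are hypothetical.
func cpuIdentity(family, model, goarch string) (string, string, error) {
	if goarch == archARM64 {
		if family == "" {
			family = archARM64
		}
		if model == "" {
			model = "0"
		}
	}
	// Single check shared by all architectures.
	if family == "" || model == "" {
		return "", "", errors.New("unable to determine CPU family or model")
	}
	return family, model, nil
}

func main() {
	fmt.Println(cpuIdentity("", "", "arm64")) // fallback values are applied
	fmt.Println(cpuIdentity("", "", "amd64")) // still reported as an error
}
```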

@tomassrnka tomassrnka marked this pull request as draft February 10, 2026 19:41
@tomassrnka (Member Author)

@cursoragent bugbot run

cursor bot commented Feb 10, 2026

Unable to authenticate your request. Please make sure to connect your GitHub account to Cursor.

@tomassrnka (Member Author)

@cursoragent bugbot run

@tomassrnka (Member Author)

@claude review this draft PR


.PHONY: fetch-busybox
fetch-busybox:
	@ARCH=$$(dpkg --print-architecture 2>/dev/null || echo "amd64"); \

The fetch-busybox target uses dpkg --print-architecture to detect the arch, but TargetArch() uses TARGET_ARCH env var. If a user sets TARGET_ARCH=arm64 on an amd64 host, this target will still use the amd64 busybox (since dpkg returns amd64), but the build will be for arm64, causing a binary mismatch.

if err := download(ctx, archURL, dstPath, 0o644); err == nil {
	return nil
} else if !errors.Is(err, errNotFound) {
	return fmt.Errorf("failed to download arm64 kernel: %w", err)

For arm64, if the arch-specific download (line 424) fails with a non-404 error (network issue, permissions, etc.), the code returns immediately without trying the legacy fallback. This differs from the Firecracker logic below which falls through to legacy on any 404. Consider consistent fallback behavior.
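One way to keep the kernel and Firecracker paths consistent is to route both through a shared helper with a single, explicit fallback policy. The sketch below picks "fall back only when the candidate is missing, surface every other error"; whether the PR adopts that policy or the opposite one is not stated here. `fetchFunc` and `fetchWithFallback` are hypothetical names, and the download call is injected so the example runs without network access.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound mirrors the sentinel used in the snippet under review.
var errNotFound = errors.New("not found")

// fetchFunc stands in for the real download call (hypothetical seam).
type fetchFunc func(url string) error

// fetchWithFallback tries each URL in order. It moves on to the next
// candidate only when the current one is missing; any other failure is
// returned immediately so transient errors are not masked by the fallback.
func fetchWithFallback(fetch fetchFunc, urls []string) (string, error) {
	for _, u := range urls {
		err := fetch(u)
		if err == nil {
			return u, nil
		}
		if !errors.Is(err, errNotFound) {
			return "", fmt.Errorf("download %s: %w", u, err)
		}
	}
	return "", fmt.Errorf("no candidate available: %w", errNotFound)
}

func main() {
	// Fake downloader: only the legacy path exists.
	fetch := func(url string) error {
		if url == "legacy/vmlinux.bin" {
			return nil
		}
		return fmt.Errorf("GET %s: %w", url, errNotFound)
	}
	got, err := fetchWithFallback(fetch, []string{"arm64/vmlinux.bin", "legacy/vmlinux.bin"})
	fmt.Println(got, err)
}
```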

 	hugePages bool,
 ) error {
-	smt := true
+	smt := runtime.GOARCH != "arm64"

SMT is disabled based on runtime.GOARCH (the architecture the binary was compiled for), not TARGET_ARCH. If the two differ (for example, an amd64 build of the orchestrator running with TARGET_ARCH=arm64), the binary will incorrectly enable SMT for ARM64 VMs. The check should use the resolved target arch or defer to runtime detection.
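A small sketch of the alternative: derive the SMT flag from the resolved target architecture and pass that value in, rather than reading `runtime.GOARCH` at the call site. `smtEnabled` is a hypothetical helper; the caller would supply whatever `TargetArch()` returns.

```go
package main

import "fmt"

const archARM64 = "arm64"

// smtEnabled reports whether SMT should be requested for a guest of the
// given target architecture. It is disabled for ARM64 guests, as in the PR.
func smtEnabled(targetArch string) bool {
	return targetArch != archARM64
}

func main() {
	for _, arch := range []string{"amd64", "arm64"} {
		fmt.Printf("%s smt=%v\n", arch, smtEnabled(arch))
	}
}
```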


 echo "Making configuration immutable"
-$BUSYBOX chattr +i /etc/resolv.conf
+$BUSYBOX chattr +i /etc/resolv.conf 2>/dev/null || true

Silencing chattr failures could mask real issues beyond busybox compatibility. If chattr fails on a full-featured system due to filesystem type, permissions, or corrupted inodes, the script will continue silently. Consider checking if chattr exists first, or logging the failure reason.

@tomassrnka (Member Author)

@claude review this draft PR

tomassrnka and others added 29 commits March 18, 2026 11:14
1100 hugepages (2.2GB) + -race detector overhead OOMs the 16GB ARM64
runner (exit code 143/SIGTERM). Reduce to 256 (512MB) and make
NewPageMmap skip the test with t.Skip() when hugepage mmap fails with
ENOMEM, so tests that need more hugepages than available skip gracefully
instead of hard-failing or crashing the runner.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
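The skip-on-ENOMEM behaviour described in this commit can be sketched as an error classification step. This is an illustration under assumptions: `isOutOfHugepages` and `simulatedMmapError` are hypothetical names, and the mmap failure is simulated so the example runs on hosts without hugepages configured.

```go
package main

import (
	"errors"
	"fmt"
	"syscall"
)

// isOutOfHugepages reports whether err indicates that the kernel could not
// back the mapping with memory, which is how an exhausted hugepage pool
// surfaces from mmap.
func isOutOfHugepages(err error) bool {
	return errors.Is(err, syscall.ENOMEM)
}

// simulatedMmapError wraps ENOMEM the way a failed mmap call might be
// reported; it exists only to make the sketch runnable anywhere.
func simulatedMmapError() error {
	return fmt.Errorf("mmap hugepages: %w", syscall.ENOMEM)
}

func main() {
	err := simulatedMmapError()
	if isOutOfHugepages(err) {
		// In a test helper this branch would call t.Skipf instead of printing.
		fmt.Println("skip:", err)
	}
}
```

Any error other than ENOMEM would still fail the test, so genuine mmap bugs are not hidden by the skip.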
1. oci_test.go: hardcoded "amd64" image architecture fails on ARM64
   runners. Use runtime.GOARCH so the test image matches the platform.

2. path_direct.go: deviceIndex captured by reference in a goroutine
   closure races with the outer loop reassigning it on retry. Capture
   into local variables (devIdx, sockIdx) before the goroutine.

Both are pre-existing bugs that only surface on ARM64 due to the weaker
memory model making race windows more likely to hit.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
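The closure fix in item 2 follows a common pattern: copy the shared variable into a local before starting the goroutine. The sketch below reproduces the shape of the problem with a loop that reassigns `deviceIndex`; `releaseAll` and its bookkeeping are hypothetical and only `devIdx` and `deviceIndex` are names from the commit message.

```go
package main

import (
	"fmt"
	"sync"
)

// releaseAll simulates a retry loop that reassigns deviceIndex on each
// attempt while goroutines started by earlier attempts may still be running.
func releaseAll(indexes []uint32) []uint32 {
	var (
		wg       sync.WaitGroup
		mu       sync.Mutex
		released []uint32
	)
	var deviceIndex uint32
	for _, next := range indexes {
		deviceIndex = next // reassigned on every "retry"

		// Copy the current value so the goroutine does not read the shared
		// variable after the loop has moved on.
		devIdx := deviceIndex
		wg.Add(1)
		go func() {
			defer wg.Done()
			mu.Lock()
			defer mu.Unlock()
			released = append(released, devIdx)
		}()
	}
	wg.Wait()
	return released
}

func main() {
	fmt.Println(len(releaseAll([]uint32{3, 7, 9})))
}
```

Reading `deviceIndex` directly inside the goroutine would be a data race with the loop's next assignment, which is the kind of bug `go test -race` reports.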
Change `deviceIndex := uint32(math.MaxUint32)` to `var deviceIndex uint32`
since the initial value is always overwritten by GetDevice before any read.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…nd Statter

Replace *os.File with directory path string in statReq and statInDir to
avoid sharing a file handle between Scanner (which defers Close) and
Statter goroutines (which call Fd()). On context cancellation, the
deferred Close could fire while a Statter was still reading the fd.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The branch protection rules require checks named "unit-tests / Run tests
for packages/..." so reverting the amd64-tests rename to avoid 8 pending
checks that never resolve.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Ensures that merging the arm64-support branch without setting
TARGET_ARCH produces identical behavior to main (amd64). ARM64
is opt-in via TARGET_ARCH=arm64. Also normalizes common aliases
(x86_64 → amd64, aarch64 → arm64) and warns instead of panicking
on unrecognized values.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
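The alias handling described in this commit could look roughly like the following. It is a sketch, not the PR's `TargetArch()`: the helper names are hypothetical, the case-insensitive matching and whitespace trimming are additions for illustration, and returning the fallback for unrecognized values is an assumption (the commit only says it warns instead of panicking). The default is a parameter because the PR changes it later from "amd64" to `runtime.GOARCH`.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// normalizeArch maps common aliases to Go-style architecture names. The
// second result is false when the value is not recognized.
func normalizeArch(raw string) (string, bool) {
	switch strings.ToLower(strings.TrimSpace(raw)) {
	case "amd64", "x86_64":
		return "amd64", true
	case "arm64", "aarch64":
		return "arm64", true
	default:
		return "", false
	}
}

// targetArch resolves the architecture from a raw TARGET_ARCH value, using
// fallback when the value is unset or unrecognized (assumed behaviour).
func targetArch(raw, fallback string) string {
	if raw == "" {
		return fallback
	}
	arch, ok := normalizeArch(raw)
	if !ok {
		fmt.Fprintf(os.Stderr, "warning: unrecognized TARGET_ARCH %q, using %s\n", raw, fallback)
		return fallback
	}
	return arch
}

func main() {
	fmt.Println(targetArch(os.Getenv("TARGET_ARCH"), "amd64"))
}
```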
Address PR review feedback:
- SMT: add constant for "arm64", add Firecracker docs reference
  explaining why SMT must be false on ARM processors
- nfs cleaner: document why dirPath is used instead of *os.File
  to avoid race between df.Close() and df.Fd()
- hugepage test: document why ENOMEM skip is needed on CI runners

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The 4-core/16GB ubuntu-24.04-arm runner was getting killed by
the GitHub infrastructure during orchestrator tests with -race
(5-10x memory overhead). Upgrade to ubuntu-24.04-arm-8 for
8 vCPU / 32 GB RAM.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The ubuntu-24.04-arm-8 runner is not provisioned in the org
(jobs stuck in queue). Revert to ubuntu-24.04-arm and add
timeout-minutes: 30 to prevent indefinite hangs when the
runner loses communication.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

TestGetPublicImageWithGeneralAuth was creating test images with
runtime.GOARCH but GetPublicImage validates against TargetArch()
(which now defaults to "amd64"). On ARM64 runners this caused
architecture mismatch. Use TargetArch() in the test to match
the production code path.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The mock's Return() shares one connect.Response across parallel
subtests, causing a race on lazily-initialized header/trailer maps.
Use RunAndReturn() to create a fresh response per invocation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

After merging main, baseImage was hardcoded as a const, losing the
env var override needed for ARM64 (Docker Hub e2bdev/base:latest has
no ARM64 manifest, so a local registry image is used instead).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Add BUILD_ARCH variable (defaults to `go env GOARCH`) to all service
  Makefiles so devs can cross-build for remote clusters with a different
  architecture (e.g., BUILD_ARCH=amd64 make build-and-upload).
- Replace dpkg-based arch detection in fetch-busybox with `go env GOARCH`
  fallback for better cross-platform support (non-Debian hosts).
- Add comments explaining why chattr is non-fatal in provision.sh
  (ARM64 busybox-static packages may omit chattr).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

fetch-busybox was detecting arch independently via go env GOARCH,
which would mismatch when cross-compiling with BUILD_ARCH (e.g.,
BUILD_ARCH=amd64 on an ARM64 host would embed an ARM64 busybox
into an amd64 orchestrator binary).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Add BUILD_PLATFORM variable (defaults to linux/$BUILD_ARCH) to all
service Makefiles. This allows building for multiple architectures
in a single Docker buildx invocation:

  BUILD_PLATFORM=linux/amd64,linux/arm64 make build-and-upload

Go builds still use BUILD_ARCH for single-arch compilation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- TargetArch() defaults to runtime.GOARCH instead of hardcoded "amd64",
  so ARM64 hosts auto-detect without needing TARGET_ARCH env var
- fetch-busybox tries multiple methods (existing binary check, host
  busybox copy, apt download) instead of requiring apt/dpkg-deb
- setup-arm64-runner.sh is now idempotent — uses > instead of >> for
  config files, guards fstab append with grep

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Avoids leaving /tmp/chattr_err behind after provisioning.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- machineinfo: separate ARM64 fallback from error check (jakubno suggestion)
- fetch-busybox: verify host busybox is statically linked before copying

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The multi-arch approach (building e2bdev/base:latest for both amd64
and arm64) is preferred over a runtime env var override.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The non-fatal chattr handling is unnecessary: the fetch-busybox
target ensures a proper busybox binary with chattr support.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Without this, build-debug on ARM64 embeds the wrong-architecture
(amd64) busybox binary, causing silent runtime failures in sandboxes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

ARM64-specific fixes discovered during local template build testing:

- fc/script_builder: disable seccomp on ARM64 (FC aarch64 seccomp filter
  lacks userfaultfd syscall, causing snapshot restore failure)
- uffd/userfaultfd: skip UFFDIO_COPY_MODE_WP on ARM64 (kernels < 6.10
  lack CONFIG_HAVE_ARCH_USERFAULTFD_WP)
- rcS.sh.tpl: add userspace network config fallback when kernel lacks
  CONFIG_IP_PNP (stock distro kernels don't have it)
- static-network.tpl: new systemd service for network setup during
  layer creation phase (same IP_PNP workaround for systemd boot)
- rootfs.go: enable setup-network.service in multi-user.target
- provision.sh: non-fatal chattr (busybox may lack it), APT date
  validation bypass (FC VM clock starts at epoch)
- builder.go + ext4.go: e2fsck force-fix fallback when preen mode
  fails (orphan inode lists after provision.sh self-deletion)
- layer/interfaces.go: increase envd wait timeout to 5min (ARM64
  boot is slower)
- envd.go: reduce retry log spam, add detailed init logging

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
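The e2fsck item in the list above (try preen mode, escalate to a forced fix when it fails) can be sketched as a two-step repair. This is a sketch under assumptions: the `runner` type and `repair` function are hypothetical, the exact flags used by the PR are not shown in this conversation (`-p` and `-f -y` are standard e2fsck options chosen for illustration), and the command is injected so the example runs without a filesystem image.

```go
package main

import "fmt"

// runner executes e2fsck with the given arguments and returns its exit code.
// A real implementation would wrap os/exec; it is injected here so the
// sketch runs without a filesystem image.
type runner func(args ...string) (exitCode int, err error)

// e2fsck exit codes form a bit mask; values below 4 mean the filesystem is
// clean or that all detected errors were corrected.
const fsckUncorrected = 4

// repair tries preen mode first and escalates to a forced, answer-yes pass
// only when preen leaves errors behind.
func repair(run runner, image string) error {
	code, err := run("-p", image)
	if err != nil {
		return fmt.Errorf("e2fsck -p: %w", err)
	}
	if code < fsckUncorrected {
		return nil
	}
	code, err = run("-f", "-y", image)
	if err != nil {
		return fmt.Errorf("e2fsck -f -y: %w", err)
	}
	if code >= fsckUncorrected {
		return fmt.Errorf("e2fsck could not repair %s (exit code %d)", image, code)
	}
	return nil
}

func main() {
	// Fake runner: preen reports uncorrected errors, the forced pass fixes them.
	fake := func(args ...string) (int, error) {
		if args[0] == "-p" {
			return 4, nil
		}
		return 1, nil
	}
	fmt.Println(repair(fake, "rootfs.ext4"))
}
```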
5 participants