feat: add initial ARM64 (aarch64) architecture support #1875
tomassrnka wants to merge 60 commits into main
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 589a0596cb
    // On ARM64, gopsutil doesn't populate Family/Model from /proc/cpuinfo.
    // Provide fallback values so callers don't get an error.
    if (family == "" || model == "") && runtime.GOARCH == "arm64" {
        if family == "" {
            family = "arm64"
        }
        if model == "" {
            model = "0"
        }
    } else if family == "" || model == "" {
let's make it cleaner
Suggested change:

    - // On ARM64, gopsutil doesn't populate Family/Model from /proc/cpuinfo.
    - // Provide fallback values so callers don't get an error.
    - if (family == "" || model == "") && runtime.GOARCH == "arm64" {
    -     if family == "" {
    -         family = "arm64"
    -     }
    -     if model == "" {
    -         model = "0"
    -     }
    - } else if family == "" || model == "" {
    + // On ARM64, gopsutil doesn't populate Family/Model from /proc/cpuinfo.
    + // Provide fallback values so callers don't get an error.
    + if runtime.GOARCH == "arm64" {
    +     if family == "" {
    +         family = "arm64"
    +     }
    +     if model == "" {
    +         model = "0"
    +     }
    + }
    + if family == "" || model == "" {
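The suggested shape (apply the ARM64 fallback first, then run the error check unconditionally) can be exercised in isolation. The following is a minimal sketch, not the PR's actual code: `fillCPUInfo` and its `goarch` parameter are hypothetical names, introduced so the logic can be tested without depending on `runtime.GOARCH` or gopsutil, and the error message is illustrative.

```go
package main

import (
	"errors"
	"fmt"
)

// fillCPUInfo is a hypothetical helper mirroring the suggested control flow.
// goarch stands in for runtime.GOARCH so both branches can be tested.
func fillCPUInfo(family, model, goarch string) (string, string, error) {
	// ARM64 fallback: fill in whichever value is missing.
	if goarch == "arm64" {
		if family == "" {
			family = "arm64"
		}
		if model == "" {
			model = "0"
		}
	}

	// The error check is now independent of the architecture branch.
	if family == "" || model == "" {
		return "", "", errors.New("unable to determine CPU family or model")
	}

	return family, model, nil
}

func main() {
	family, model, err := fillCPUInfo("", "", "arm64")
	fmt.Println(family, model, err)

	family, model, err = fillCPUInfo("", "", "amd64")
	fmt.Println(family, model, err)
}
```

Separating the two checks keeps the fallback and the validation readable on their own, which is the point of the review suggestion.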
@cursoragent bugbot run

Unable to authenticate your request. Please make sure to connect your GitHub account to Cursor.

@cursoragent bugbot run

@claude review this draft PR
packages/orchestrator/Makefile (outdated)

    .PHONY: fetch-busybox
    fetch-busybox:
        @ARCH=$$(dpkg --print-architecture 2>/dev/null || echo "amd64"); \
The fetch-busybox target detects the architecture with `dpkg --print-architecture`, but TargetArch() reads the TARGET_ARCH env var. If a user sets TARGET_ARCH=arm64 on an amd64 host, this target still fetches the amd64 busybox (since dpkg reports amd64) while the build targets arm64, causing a binary mismatch.
    if err := download(ctx, archURL, dstPath, 0o644); err == nil {
        return nil
    } else if !errors.Is(err, errNotFound) {
        return fmt.Errorf("failed to download arm64 kernel: %w", err)
For arm64, if the arch-specific download (line 424) fails with a non-404 error (network issue, permissions, etc.), the code returns immediately without trying the legacy fallback. This differs from the Firecracker logic below which falls through to legacy on any 404. Consider consistent fallback behavior.
      hugePages bool,
      ) error {
    -     smt := true
    +     smt := runtime.GOARCH != "arm64"
SMT is disabled based on runtime.GOARCH (the architecture the binary was compiled for), not TARGET_ARCH. If the two differ (e.g., an amd64 binary run with TARGET_ARCH=arm64), the binary will incorrectly enable SMT for ARM64 VMs. Should check the actual target arch or defer this to runtime detection.
      echo "Making configuration immutable"
    - $BUSYBOX chattr +i /etc/resolv.conf
    + $BUSYBOX chattr +i /etc/resolv.conf 2>/dev/null || true
Silencing chattr failures could mask real issues beyond busybox compatibility. If chattr fails on a full-featured system due to filesystem type, permissions, or corrupted inodes, the script will continue silently. Consider checking if chattr exists first, or logging the failure reason.
@claude review this draft PR
1100 hugepages (2.2GB) + -race detector overhead OOMs the 16GB ARM64 runner (exit code 143/SIGTERM). Reduce to 256 (512MB) and make NewPageMmap skip the test with t.Skip() when hugepage mmap fails with ENOMEM, so tests that need more hugepages than available skip gracefully instead of hard-failing or crashing the runner.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

1. oci_test.go: hardcoded "amd64" image architecture fails on ARM64 runners. Use runtime.GOARCH so the test image matches the platform.
2. path_direct.go: deviceIndex captured by reference in a goroutine closure races with the outer loop reassigning it on retry. Capture into local variables (devIdx, sockIdx) before the goroutine.
Both are pre-existing bugs that only surface on ARM64 due to the weaker memory model making race windows more likely to hit.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

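The path_direct.go fix follows a common Go pattern: copy a variable that the outer loop keeps reassigning into a local before starting the goroutine. A minimal sketch of that pattern, assuming a simplified retry loop; `collectIndices` is a hypothetical function, and only the names `deviceIndex` and `devIdx` come from the commit message.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// collectIndices starts one goroutine per attempt and returns the index each
// goroutine observed, sorted. deviceIndex is declared outside the loop and
// reassigned on every retry, as in the commit description.
func collectIndices(attempts int) []uint32 {
	var (
		wg          sync.WaitGroup
		mu          sync.Mutex
		seen        []uint32
		deviceIndex uint32
	)

	for i := 0; i < attempts; i++ {
		deviceIndex = uint32(i) // stands in for acquiring a new device on retry

		// Copy before starting the goroutine. Reading deviceIndex inside the
		// closure would race with the reassignment on the next iteration.
		devIdx := deviceIndex

		wg.Add(1)
		go func() {
			defer wg.Done()
			mu.Lock()
			defer mu.Unlock()
			seen = append(seen, devIdx)
		}()
	}

	wg.Wait()
	sort.Slice(seen, func(a, b int) bool { return seen[a] < seen[b] })

	return seen
}

func main() {
	fmt.Println(collectIndices(4))
}
```

Because `deviceIndex` lives outside the loop body, per-iteration loop variable semantics do not help here; the explicit copy is what removes the race.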
Change `deviceIndex := uint32(math.MaxUint32)` to `var deviceIndex uint32` since the initial value is always overwritten by GetDevice before any read.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…nd Statter
Replace *os.File with directory path string in statReq and statInDir to avoid sharing a file handle between Scanner (which defers Close) and Statter goroutines (which call Fd()). On context cancellation, the deferred Close could fire while a Statter was still reading the fd.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The branch protection rules require checks named "unit-tests / Run tests for packages/...", so revert the amd64-tests rename to avoid 8 pending checks that never resolve.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Ensures that merging the arm64-support branch without setting TARGET_ARCH produces identical behavior to main (amd64). ARM64 is opt-in via TARGET_ARCH=arm64. Also normalizes common aliases (x86_64 → amd64, aarch64 → arm64) and warns instead of panicking on unrecognized values.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

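The alias normalization and warn-instead-of-panic behavior described in this commit could look roughly like the sketch below. It is an illustration under assumptions: `normalizeArch`, `targetArch`, and `defaultArch` are hypothetical names, and returning the default for an unrecognized value is a guess, since the commit message only says that a warning replaces the panic.

```go
package main

import (
	"log"
	"os"
)

// defaultArch matches the commit's stated default when TARGET_ARCH is unset.
const defaultArch = "amd64"

// normalizeArch maps common aliases to Go-style architecture names.
// The boolean is false when the input is not recognized; in that case the
// default is returned (an assumption, see the lead-in).
func normalizeArch(raw string) (string, bool) {
	switch raw {
	case "":
		return defaultArch, true
	case "amd64", "x86_64":
		return "amd64", true
	case "arm64", "aarch64":
		return "arm64", true
	default:
		return defaultArch, false
	}
}

// targetArch reads TARGET_ARCH and warns, rather than panics, on bad input.
func targetArch() string {
	raw := os.Getenv("TARGET_ARCH")
	arch, ok := normalizeArch(raw)
	if !ok {
		log.Printf("unrecognized TARGET_ARCH %q, falling back to %s", raw, arch)
	}

	return arch
}

func main() {
	log.Printf("target arch: %s", targetArch())
}
```

Keeping the mapping in a pure function makes the alias table easy to test without touching the process environment.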
Address PR review feedback:
- SMT: add constant for "arm64", add Firecracker docs reference explaining why SMT must be false on ARM processors
- nfs cleaner: document why dirPath is used instead of *os.File to avoid race between df.Close() and df.Fd()
- hugepage test: document why ENOMEM skip is needed on CI runners
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The 4-core/16GB ubuntu-24.04-arm runner was getting killed by the GitHub infrastructure during orchestrator tests with -race (5-10x memory overhead). Upgrade to ubuntu-24.04-arm-8 for 8 vCPU / 32 GB RAM.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The ubuntu-24.04-arm-8 runner is not provisioned in the org (jobs stuck in queue). Revert to ubuntu-24.04-arm and add timeout-minutes: 30 to prevent indefinite hangs when the runner loses communication.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

TestGetPublicImageWithGeneralAuth was creating test images with runtime.GOARCH but GetPublicImage validates against TargetArch() (which now defaults to "amd64"). On ARM64 runners this caused architecture mismatch. Use TargetArch() in the test to match the production code path.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The mock's Return() shares one connect.Response across parallel subtests, causing a race on lazily-initialized header/trailer maps. Use RunAndReturn() to create a fresh response per invocation.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

After merging main, baseImage was hardcoded as a const, losing the env var override needed for ARM64 (Docker Hub e2bdev/base:latest has no ARM64 manifest, so a local registry image is used instead).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Add BUILD_ARCH variable (defaults to `go env GOARCH`) to all service Makefiles so devs can cross-build for remote clusters with a different architecture (e.g., BUILD_ARCH=amd64 make build-and-upload).
- Replace dpkg-based arch detection in fetch-busybox with `go env GOARCH` fallback for better cross-platform support (non-Debian hosts).
- Add comments explaining why chattr is non-fatal in provision.sh (ARM64 busybox-static packages may omit chattr).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

fetch-busybox was detecting arch independently via go env GOARCH, which would mismatch when cross-compiling with BUILD_ARCH (e.g., BUILD_ARCH=amd64 on an ARM64 host would embed an ARM64 busybox into an amd64 orchestrator binary).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Add BUILD_PLATFORM variable (defaults to linux/$BUILD_ARCH) to all service Makefiles. This allows building for multiple architectures in a single Docker buildx invocation:
    BUILD_PLATFORM=linux/amd64,linux/arm64 make build-and-upload
Go builds still use BUILD_ARCH for single-arch compilation.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- TargetArch() defaults to runtime.GOARCH instead of hardcoded "amd64", so ARM64 hosts auto-detect without needing TARGET_ARCH env var
- fetch-busybox tries multiple methods (existing binary check, host busybox copy, apt download) instead of requiring apt/dpkg-deb
- setup-arm64-runner.sh is now idempotent: uses > instead of >> for config files, guards fstab append with grep
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Avoids leaving /tmp/chattr_err behind after provisioning.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- machineinfo: separate ARM64 fallback from error check (jakubno suggestion)
- fetch-busybox: verify host busybox is statically linked before copying
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The multi-arch approach (building e2bdev/base:latest for both amd64 and arm64) is preferred over a runtime env var override.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The non-fatal chattr handling is unnecessary: the fetch-busybox target ensures a proper busybox binary with chattr support.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Without this, build-debug on ARM64 embeds the wrong-architecture (amd64) busybox binary, causing silent runtime failures in sandboxes.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

ARM64-specific fixes discovered during local template build testing:
- fc/script_builder: disable seccomp on ARM64 (FC aarch64 seccomp filter lacks userfaultfd syscall, causing snapshot restore failure)
- uffd/userfaultfd: skip UFFDIO_COPY_MODE_WP on ARM64 (kernels < 6.10 lack CONFIG_HAVE_ARCH_USERFAULTFD_WP)
- rcS.sh.tpl: add userspace network config fallback when kernel lacks CONFIG_IP_PNP (stock distro kernels don't have it)
- static-network.tpl: new systemd service for network setup during layer creation phase (same IP_PNP workaround for systemd boot)
- rootfs.go: enable setup-network.service in multi-user.target
- provision.sh: non-fatal chattr (busybox may lack it), APT date validation bypass (FC VM clock starts at epoch)
- builder.go + ext4.go: e2fsck force-fix fallback when preen mode fails (orphan inode lists after provision.sh self-deletion)
- layer/interfaces.go: increase envd wait timeout to 5min (ARM64 boot is slower)
- envd.go: reduce retry log spam, add detailed init logging
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
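The seccomp change in this commit amounts to an architecture-conditional Firecracker flag. A minimal sketch of that conditional, assuming the arguments are assembled as a slice: `firecrackerArgs` is a hypothetical helper (the PR builds a start script in fc/script_builder instead), while `--api-sock` and `--no-seccomp` are existing Firecracker command-line flags.

```go
package main

import "fmt"

// firecrackerArgs is a hypothetical helper that builds Firecracker CLI
// arguments for the given target architecture.
func firecrackerArgs(arch, apiSock string) []string {
	args := []string{"--api-sock", apiSock}

	// Per the commit message, the aarch64 seccomp filter lacks the
	// userfaultfd syscall, which breaks snapshot restore, so the filter is
	// turned off on ARM64 only.
	if arch == "arm64" {
		args = append(args, "--no-seccomp")
	}

	return args
}

func main() {
	fmt.Println(firecrackerArgs("arm64", "/tmp/fc.sock"))
	fmt.Println(firecrackerArgs("amd64", "/tmp/fc.sock"))
}
```

Scoping the flag to ARM64 leaves the amd64 seccomp filter in place, which limits the security trade-off to the architecture that needs the workaround.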
Summary
Adds ARM64/aarch64 architecture support to the E2B infrastructure, enabling builds and sandbox execution on Apple Silicon and other ARM64 hosts (via Lima VM + nested KVM).
Changes by commit:
- … `GOARCH=amd64` and `--platform linux/amd64` with `$(shell go env GOARCH)` across all 4 service Makefiles
- … `runtime.GOARCH` for OCI image platform, add ARM64 fallback for CPU detection (gopsutil doesn't populate Family/Model on ARM)
- … `chattr` calls non-fatal (`|| true`) for busybox versions that lack it
- … `arm64/` subdirectory first, falls back to generic), `E2B_BASE_IMAGE` env var for base image override

Related PRs:
Test plan
- `make fetch-busybox` on ARM64 host to swap busybox binary
- `create-build` on ARM64
- `uname -m` in sandbox returns `aarch64`

🤖 Generated with Claude Code
Note
High Risk
High risk because it changes core sandbox/Firecracker startup and snapshot handling (including disabling seccomp on ARM64) and alters kernel/Firecracker path resolution and OCI platform selection, which can impact both security posture and runtime stability across architectures.
Overview
Adds initial ARM64 support end-to-end by introducing an ARM64 PR workflow (cross-compile + native arm runners) and runner setup script, making service build/publish Makefiles architecture-aware, and teaching the orchestrator to resolve Firecracker/kernel artifacts and OCI pulls by `TARGET_ARCH` with legacy fallbacks. It also adjusts runtime behavior for ARM64 (disable SMT, tweak UFFD write-protect usage, pass `--no-seccomp` for Firecracker on ARM64), hardens/deflakes several concurrency and hugepage-related tests, and updates template/rootfs provisioning to better handle clock-skewed APT, missing `chattr`, static network setup, and ext4 repair retries.

Written by Cursor Bugbot for commit 5491da7. This will update automatically on new commits.