Build deploy binaries on Blacksmith with a Nix sticky disk by synoet · Pull Request #3768 · macro-inc/macro

synoet · 2026-06-04T16:19:40Z

What

Move the deploy service-binary builds onto Blacksmith runners and cache the Nix store on a Blacksmith sticky disk instead of relying solely on Cachix.

Design: a single warm-deps job builds the shared release dependency closure (deployCargoArtifacts) and commits it to the /nix sticky disk. The per-service build matrix then fans out, each job cloning the warm disk so the expensive shared deps are already in /nix/store and only the service's own crate compiles. This maps onto crane's existing layering — deployCargoArtifacts (deps-only) is already separate from the per-service buildPackage.

Changes

Production cutover (deploy-all-services.yml):

- New warm-deps job: builds and commits shared deps to the sticky disk.
- build-service-binaries matrix now runs on Blacksmith, mounts the same sticky disk (cloning the warm snapshot), and builds per service. The build/upload step, including the nix-store closure copy the deploy consumes, is unchanged.
- warm-deps failures are surfaced in the deployment summary.
- build-lambda-artifacts is left on its existing runner for now (different cargo-lambda build path); it can follow in a separate change.

Supporting:

- flake.nix: expose deployCargoArtifacts as a package output (nix build .#deployCargoArtifacts).
- .github/actions/setup-nix/action.yml: install Nix on a runner that doesn't ship it (Blacksmith), or re-initialise just the daemon/config/nixbld users when /nix is restored warm from a sticky disk.
- .github/workflows/deploy-binaries-blacksmith-poc.yml: a standalone, non-deploying workflow_dispatch harness to validate the build/cache path in isolation, with a cold-disk toggle to A/B cold vs. warm builds. Safe to delete once confident.

Cachix kept as fallback

The sticky disk is the primary (L1) cache; Cachix stays wired as the fallback substituter (via setup-cachix), so a cold or evicted disk pulls prebuilt artifacts instead of compiling from source. warm-deps also still pushes to Cachix during migration. Dropping Cachix can be a later step.

⚠️ Requires Blacksmith provisioning — validate before merge

I could not execute this (no Nix/Blacksmith in my environment). Before this deploy path can run green:

1. The Blacksmith app is installed on macro-inc/macro, and the runner label blacksmith-8vcpu-ubuntu-2404 matches a provisioned pool (adjust if not).
2. useblacksmith/stickydisk is allowed by the org Actions policy.
3. The Nix-on-warm-disk path in setup-nix is the fiddly bit: the store persists on the disk but the systemd unit, /etc/nix, and nixbld users don't, so the action recreates them. This is the step most likely to need a tweak after the first real run.

Suggested rollout: run the standalone PoC workflow first to shake out 1–3 above without risking a deploy; then exercise deploy-all-services against dev. Because this is a direct cutover, if Blacksmith isn't fully wired up the dev/prod binary builds will fail until reverted — happy to add a toggle to fall back to the old linux-extra-beefy + Cachix path if you'd prefer a safety valve.

How to test

1. [PoC] Deploy Binaries on Blacksmith (workflow_dispatch): the first run is a cold disk, a re-run is warm; the cold-disk input forces a comparison. Watch per-service build times to confirm deps aren't recompiling.
2. Then run Deploy All Services against dev.

https://claude.ai/code/session_01R2zCM4cvNDRHPkN93Fw3DJ

Introduce a workflow_dispatch-only proof of concept that builds the deploy service binaries on Blacksmith runners, caching the Nix store on a sticky disk instead of relying solely on Cachix.

- Expose deployCargoArtifacts as a flake package output so the shared release dependency closure can be built directly.
- Add a setup-nix composite action that installs Nix (or re-initialises the daemon/config when /nix is restored warm from a sticky disk).
- Add the PoC workflow: a single warm-deps job builds and commits the shared deps to the /nix sticky disk, then a per-service build matrix fans out, cloning the warm disk so only each service's own crate compiles.

Cachix stays wired as a fallback substituter so a cold/evicted disk pulls prebuilt artifacts instead of compiling from source. Additive only; the production deploy path is untouched.

coderabbitai · 2026-06-04T16:19:47Z

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 93a5aedc-91a4-42ec-8407-e56d7ea1b46a


Add a warm-deps job that builds the shared release dependency closure (deployCargoArtifacts) and commits it to the /nix sticky disk, then route the build-service-binaries matrix through Blacksmith so each parallel job clones the warm store and only compiles its own service crate. The build/upload step (including the nix-store closure copy the deploy consumes) is unchanged. Cachix stays wired as a fallback substituter, and warm-deps keeps pushing to it during migration. build-lambda-artifacts is left on its existing runner for now. Surfaces warm-deps failures in the deployment summary. Requires the Blacksmith app + runner pool to be provisioned; the runner label may need adjusting to the org's available labels.

Lambdas compile via cargo-lambda inside the dev shell (not crane), so they cannot reuse deployCargoArtifacts. Warm what they actually consume instead:

- warm-deps now also realises the dev shell into the same /nix sticky disk (single committer, so no last-write-wins race), giving the lambda matrix an instant nix develop.
- build-lambda-artifacts runs on Blacksmith, clones the warm /nix disk, and mounts a per-service cargo target sticky disk so compiled lambda deps stay warm across runs. Gated behind warm-deps.

Timing: setup now emits filtered binaries/lambdas matrices so neither build job spins up Nix + a sticky disk for services that produce no such artifact (14 binary services, 12 lambda services vs all 20+). Both matrices fan out in parallel after the single warm-deps gate.

workflow_dispatch only works once a workflow is on the default branch, so add a push trigger scoped to the feature branch so the PoC runs directly from the branch. To be removed before merge.

build-cloud-storage-lambdas.sh ran the recipe via 'nix develop -c bash -lc'. The login shell re-sources /etc/profile and resets PATH, dropping the dev-shell tools (just, cargo-lambda) on runners without a system-wide install such as fresh Blacksmith images, causing 'just: command not found'. Use a non-login 'bash -c' so the nix develop environment carries through.
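A minimal sketch of the login vs. non-login difference, for illustration only. The temp directory is a hypothetical stand-in for the bin directory that nix develop puts first on PATH (not the real store path), and whether the login shell actually drops it depends on the runner image's /etc/profile, so that result is only printed.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for the dev-shell bin dir that `nix develop` prepends to PATH.
devshell_bin="$(mktemp -d)"

# Non-login shell: inherits the caller's PATH untouched.
non_login_path="$(PATH="$devshell_bin:$PATH" bash -c 'printf %s "$PATH"')"

# Login shell: sources /etc/profile first, which may rebuild PATH from scratch
# (image-dependent, so nothing here relies on the outcome).
login_path="$(PATH="$devshell_bin:$PATH" bash -lc 'printf %s "$PATH"' 2>/dev/null || true)"

echo "non-login first PATH entry: ${non_login_path%%:*}"
echo "login first PATH entry:     ${login_path%%:*}"
```

If the two printed entries differ on a given runner, tools that exist only in the dev shell (just, cargo-lambda) would not be found under the login shell.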

The warm-deps run built the deps + dev shell fine, but the Blacksmith sticky-disk commit failed with 'umount: /nix: target is busy': the Determinate Nix daemon runs out of /nix and holds the mount open, so the warmed store was never persisted (every run stayed cold). Add a teardown-nix composite action that stops nix-daemon and determinate-nixd (plus a fuser backstop) and run it as the final always() step of every Blacksmith job that mounts the /nix sticky disk, so the disk can unmount and commit the warmed store.
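A rough sketch of what such a teardown step could look like, not the action's actual contents. The systemd unit names are assumptions (they vary by installer and version), and DRY_RUN is a hypothetical switch added here so the sequence can be inspected without root.

```shell
#!/usr/bin/env bash
# Run a command, or just print it when DRY_RUN=1 (hypothetical helper for this sketch).
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@" || true   # best effort: a unit that is already stopped is not an error
  fi
}

# Stop everything that may hold /nix open, then kill stragglers on the mount.
teardown_nix() {
  local mount_point="${1:-/nix}"
  local unit
  # Unit names below are assumptions, not taken from the PR.
  for unit in nix-daemon.socket nix-daemon.service \
              determinate-nixd.socket determinate-nixd.service; do
    run sudo systemctl stop "$unit"
  done
  # Backstop: kill any process still using the mounted filesystem.
  run sudo fuser -km "$mount_point"
}

DRY_RUN=1 teardown_nix /nix
```

Stopping the socket units before the services avoids socket activation restarting a daemon between the stop and the unmount.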

setup-cachix installed the cachix CLI via 'nix profile add nixpkgs#cachix', which Determinate resolves to an unpinned nixpkgs-weekly and re-fetches + re-evaluates (~20s) on every job, even when /nix is warm on the sticky disk. Install from the repo flake's pinned nixpkgs (already on the sticky disk) via --inputs-from, with a fallback to the registry path if that fails. Keeps all substituter/auth/push semantics identical; just removes the redundant nixpkgs-weekly fetch.

The PoC was triggered on push to the feature branch for validation. Drop it so the workflow is workflow_dispatch-only again and won't auto-run.

…deploys

1) deploy-all-services: drop max-parallel:8 on build-service-binaries so all binary services build concurrently instead of batching 8 (the lower half was waiting on the first 8). Each job only clones the warm /nix disk and compiles its own crate, so full fan-out is fine.

2) serviceLoadBalancer: the shared target group only set health-check path + protocol, inheriting AWS ALB defaults (interval 30s, healthyThreshold 5) so a new ECS task needed ~5x30s = 120-150s to register healthy - the dominant cost in the ~136s rollout. Tune to interval 10s / healthyThreshold 2 / timeout 5 / matcher 200 (~20s registration), parametrized with these as defaults so a service with an expensive /health endpoint can override them.
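The registration estimate above is just interval times healthyThreshold. A small sketch of that arithmetic; the function name is made up for illustration, and the result is an upper bound since the first check may land early in its interval.

```shell
#!/usr/bin/env bash
# Upper bound on time-to-healthy: healthyThreshold consecutive passes, one per interval.
registration_seconds() {
  local interval="$1" healthy_threshold="$2"
  echo $(( interval * healthy_threshold ))
}

echo "ALB defaults: $(registration_seconds 30 5)s"   # prints "ALB defaults: 150s"
echo "tuned:        $(registration_seconds 10 2)s"   # prints "tuned:        20s"
```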

Blacksmith autoscales runners, so remove the max-parallel cap on the lambda build matrix and let all lambda services build concurrently, like the binary matrix.

deploy-services authenticates to AWS via explicit static keys (aws-actions/configure-aws-credentials with secrets.AWS_*), not the RunsOn EC2 instance role, so it has no ambient-credential dependency to lose. Swap the RunsOn runs-on array (runner/spot/hdd/run-id) for a Blacksmith label; Pulumi token, Datadog keys and AWS creds all come from GitHub secrets and are runner-independent. Keeps the max-parallel:20 deploy concurrency cap.

…son-LXyqN

Dockerfiles: replace 'COPY <bin> /app/svc' + 'RUN chmod +x' with a single 'COPY --chmod=755 …'. The old form wrote the binary into two layers (one without +x, one with), doubling its on-disk footprint in the image; --chmod sets perms during the copy in one layer. Applied to the prebuilt deploy images and the builder-stage production images (Dockerfile, Dockerfile.convert_service, Dockerfile.search_processing_service + their .prebuilt variants).

Lambdas: set CARGO_PROFILE_RELEASE_OPT_LEVEL=2 (vs release default 3) for the cargo-lambda build only, trimming leaf-crate codegen time. Service binaries build via crane and are unaffected.

Let all services deploy concurrently like the build matrices.

Replace the GitHub Actions artifact upload/download between the build and deploy matrices with a Blacksmith sticky-disk handoff. The ~96MB prebuilt closure was moving at ~1.3MB/s over the GitHub artifact API (~70s per deploy job); routing it over Blacksmith's co-located NVMe snapshot fabric drops that to seconds.

- build-service-binaries / build-lambda-artifacts: mount a per-service, per-SHA handoff disk and write the tar straight onto it; drop upload-artifact.
- deploy-services: clone the matching handoff disk (gated on the artifact flags) and feed the on-disk tar to the deploy action.
- deploy-cloud-storage-pulumi: add optional prebuilt-binaries-tar / lambda-artifacts-tar inputs that take precedence over (and skip) the artifact download. The artifact-name inputs are untouched, so the other callers (deploy-cloud-storage-on-push, reusable-deploy-service, deploy-pulumi-stack) keep working unchanged.

Keys are <repo>-handoff-{binaries,lambdas}-<service>-<sha>: the Nix build is deploy-env-independent so same-SHA dev/prod runs are byte-identical (safe last-write-wins), different SHAs get distinct keys, and unused snapshots auto-evict after 7 days. A guarded chown makes the fresh ext4 mount writable on non-root runners.

https://claude.ai/code/session_01R2zCM4cvNDRHPkN93Fw3DJ
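To make the key scheme concrete, here is a small sketch that assembles a handoff key in the <repo>-handoff-{binaries,lambdas}-<service>-<sha> shape. The function name and the sample repo, service, and SHA values are hypothetical; the workflow itself presumably builds the key from expression contexts rather than a helper like this.

```shell
#!/usr/bin/env bash
# Build a sticky-disk handoff key: <repo>-handoff-<kind>-<service>-<sha>.
handoff_key() {
  local repo="$1" kind="$2" service="$3" sha="$4"
  case "$kind" in
    binaries|lambdas) ;;                                  # the only two kinds named in the PR
    *) echo "unknown handoff kind: $kind" >&2; return 1 ;;
  esac
  printf '%s-handoff-%s-%s-%s\n' "$repo" "$kind" "$service" "$sha"
}

# Sample values are made up for illustration.
handoff_key macro binaries email-service abc1234
# prints: macro-handoff-binaries-email-service-abc1234
```

Because the SHA is part of the key, two runs at the same commit resolve to the same disk while different commits never share one.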

Each deploy job re-downloaded the AWS/Docker provider plugins into $PULUMI_HOME/plugins (~45s). Pin PULUMI_HOME to /pulumi (outside the workspace the deploy action's checkout cleans) and back its plugins subdir with a single stable-keyed sticky disk shared by every deploy job: first run downloads + commits, later runs clone it warm and skip the pull. Plugins are version-pinned by infra/ and identical across services, so the shared key with last-write-wins is safe. Kept in this workflow (not the shared composite) so non-Blacksmith callers are unaffected. https://claude.ai/code/session_01R2zCM4cvNDRHPkN93Fw3DJ

Multi-handler services (document-storage-service has 3; email-service, bulk-upload, organization-retention have 2) were building their handlers serially -- the script looped `just <lambda>/build`, each a separate `cargo lambda build --bin <name>`. Backgrounding them wouldn't help: cargo holds an exclusive target-dir lock, so concurrent invocations just serialize. Build all of a service's handlers in a single `cargo lambda build --bin a --bin b ...` so cargo compiles the shared workspace deps once and parallelizes the leaf handler crates across the runner's cores. Single-handler services are unchanged (one --bin flag). Falls back to the per-lambda `just` recipe if the combined build fails, so it can only improve build time, never break a deploy. https://claude.ai/code/session_01R2zCM4cvNDRHPkN93Fw3DJ
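A sketch of the combined-build-with-fallback shape described above, not the script's actual code. RUNNER is a hypothetical hook added for this example (eval executes, echo only prints), and the handler names in the usage line are placeholders.

```shell
#!/usr/bin/env bash
# Compose one cargo-lambda invocation covering every handler of a service.
combined_build_cmd() {
  local cmd="cargo lambda build" handler
  for handler in "$@"; do
    cmd="$cmd --bin $handler"
  done
  printf '%s\n' "$cmd"
}

# Try the combined build; if it fails, fall back to the per-handler `just` recipes.
build_service_lambdas() {
  local runner="${RUNNER:-eval}" handler
  if ! $runner "$(combined_build_cmd "$@")"; then
    for handler in "$@"; do
      $runner "just $handler/build"
    done
  fi
}

# Dry run with placeholder handler names: prints the single combined command.
RUNNER=echo build_service_lambdas handler_a handler_b
# prints: cargo lambda build --bin handler_a --bin handler_b
```

A single-handler service goes through the same path and simply ends up with one --bin flag.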

The check/test jobs run sccache (S3-backed) + rust-cache and monitor `sccache --show-stats`; the deploy lambda build wires the same sccache bucket and RUSTC_WRAPPER but never reports hit/miss, so we can't tell if it's caching the cargo-lambda/zigbuild compiles. Add a show-stats step (querying the same dev-shell sccache server, with AWS creds) so the next deploy reveals whether sccache is actually doing its job for lambdas. https://claude.ai/code/session_01R2zCM4cvNDRHPkN93Fw3DJ

First step toward giving lambdas the same content-addressed nix cache the service binaries already get (instead of cargo-in-a-fresh-checkout, which recompiles the workspace every run because cargo keys path crates by mtime). Adds, modeled on deployServiceBinaryPackage:

- lambdaCommonArgs: cargo-zigbuild as the builder, zig as the linker (drops the host-only mold arg), opt-level 2, and a preBuild that points zig's cache at $TMPDIR so it works in the read-only-$HOME sandbox.
- lambdaDeployCargoArtifacts: a cached dep closure for the Lambda target, scoped with --package so the C-heavy service deps (pdfium, libreoffice) stay out. glibc is pinned purely via the target suffix (x86_64-unknown-linux-gnu.2.26, AL2; forward-compatible with al2023) -- host triple == lambda triple, so no extra rust-std.
- deployLambdaPackage: builds one handler and emits the custom-runtime bootstrap.zip (zip whose single entry is `bootstrap`), mirroring cargo-lambda.

Scoped to one lean handler (user_link_cleanup_handler) to validate the zig-in-sandbox + crane interaction before rolling out to all handlers and wiring CI to `nix build` it.

https://claude.ai/code/session_01R2zCM4cvNDRHPkN93Fw3DJ

workflow_dispatch-only job that runs `nix build .#deploy-lambda-<handler>` on a Blacksmith runner (x86_64 Linux, so the lambda triple == host triple — no cross-std needed, which is why this can't run on a macOS dev box). Mounts the warm /nix sticky disk, keeps Cachix as the fallback substituter, inspects the bootstrap.zip (contents + max glibc symbol), and uploads it. Additive; does not touch the deploy path. https://claude.ai/code/session_01R2zCM4cvNDRHPkN93Fw3DJ
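One way the "max glibc symbol" inspection could be done, sketched under assumptions: the function names are invented here, and it relies on GNU sort -V plus objdump and unzip being available on the runner. The workflow's real check may differ.

```shell
#!/usr/bin/env bash
# Read symbol-table text on stdin and print the highest GLIBC_x.y version referenced.
max_glibc_version() {
  grep -o 'GLIBC_[0-9][0-9.]*' | sed 's/^GLIBC_//' | sort -uV | tail -n 1
}

# Succeed when version $1 is less than or equal to the pinned version $2.
glibc_at_most() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | tail -n 1)" = "$2" ]
}

# Example with canned input in place of real objdump output.
sample='GLIBC_2.17 memcpy
GLIBC_2.4 __stack_chk_fail
GLIBC_2.26 reallocarray'
printf '%s\n' "$sample" | max_glibc_version   # prints 2.26
```

Against a real artifact the input would come from something like `unzip -p bootstrap.zip bootstrap > bootstrap && objdump -T bootstrap | max_glibc_version`, with the result then fed to glibc_at_most alongside 2.26.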

Temporary push trigger (scoped to this branch + the relevant paths) so the dispatch-only spike can actually run without first living on the default branch. Remove once we have a green run. https://claude.ai/code/session_01R2zCM4cvNDRHPkN93Fw3DJ

… at link

The cold run failed because crane's buildDepsOnly runs plain `cargo check`, and it was handed `--target x86_64-unknown-linux-gnu.2.26`. The `.2.26` glibc suffix is a cargo-zigbuild-only concept, so rustc rejected it ("could not find specification for target"). Correct split: the glibc pin is a link-time concern. The dep closure now builds for the plain triple with ordinary cargo; only the final binary link uses `cargo zigbuild --target x86_64-unknown-linux-gnu.2.26`. cargo sees the same plain triple in both (zigbuild strips the suffix), so the closure's rlibs are reused and only the leaf crate compiles + zig-links.

https://claude.ai/code/session_01R2zCM4cvNDRHPkN93Fw3DJ
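For illustration, the split between the plain triple and the glibc pin can be shown with shell parameter expansion. This is only a sketch of the naming convention, not cargo-zigbuild's own parsing, and the function names are made up.

```shell
#!/usr/bin/env bash
# Everything before the first dot is the triple that cargo and rustc understand.
plain_triple() {
  printf '%s\n' "${1%%.*}"
}

# Everything after the first dot is the glibc version zigbuild links against.
glibc_pin() {
  printf '%s\n' "${1#*.}"
}

zig_target="x86_64-unknown-linux-gnu.2.26"
plain_triple "$zig_target"   # prints x86_64-unknown-linux-gnu
glibc_pin "$zig_target"      # prints 2.26
```

The dep closure would be built with the first value and the final link with the full suffixed target, which is why both stages land in the same target directory layout.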

zig works in the sandbox and the dep closure cached, but the final zigbuild link hit aws-lc-sys's cc-builder guard ("COMPILER BUG DETECTED ... zigcc not supported"). aws-lc-rs is rustls's default crypto provider (pulled via aws-sdk + sqlx). Force aws-lc-sys onto its cmake builder (AWS_LC_SYS_CMAKE_BUILDER=1, + cmake/nasm), which lacks that guard and still compiles the C against the zig-pinned glibc. If this chains into more build deps, the cleaner alternative is dropping aws-lc-rs for the ring backend. https://claude.ai/code/session_01R2zCM4cvNDRHPkN93Fw3DJ

flake.nix: derive the lambda package set from services-config.json (crate == dir == deploy_lambdas entry for all 17 handlers), so the flake never drifts from the deploy config. One shared dep closure now spans every lambda package; deployLambdaPackages exposes deploy-lambda-<name> for each. Spike workflow becomes an all-handlers validation matrix on a DEDICATED lambda sticky disk (${repo}-nix-lambdas, separate from the binaries' /nix-store disk): setup -> warm-lambdas (shared closure) -> per-handler build matrix with a glibc check. This proves every handler compiles + links against the pinned Lambda glibc before we flip the production deploy path, and prototypes the two-disk lambda topology to be lifted into deploy-all-services. https://claude.ai/code/session_01R2zCM4cvNDRHPkN93Fw3DJ

Split the single warm-deps job into parallel warm-binaries (deployCargoArtifacts on <repo>-nix-store) and warm-lambdas (lambdaDeployCargoArtifacts on <repo>-nix-lambdas) — separate sticky disks so the two chains never collide on a last-write-wins commit, and neither warm blocks the other's build matrix. Dropped the now-obsolete dev-shell warm (nothing uses `nix develop` anymore). build-service-binaries now needs warm-binaries; build-lambda-artifacts needs warm-lambdas, mounts the lambda /nix disk (not the binary one), drops the per-service cargo-target disk and the sccache-stats step, and builds via `nix build .#deploy-lambda-<name>` (new build-cloud-storage-lambdas-nix.sh) instead of cargo-lambda-in-a-checkout. Same target/lambda/<name>/bootstrap.zip layout, so the handoff + deploy action are unchanged. Result: unchanged handlers are pure nix cache hits (no mtime recompile), and a service's handlers build in parallel within one nix invocation. The old cargo-lambda script stays for the inline/other-workflow paths. https://claude.ai/code/session_01R2zCM4cvNDRHPkN93Fw3DJ

… workflow

The lambda crane build is validated (all 17 handlers green) and the deploy path is wired, so the feature-branch push trigger has served its purpose. Back to workflow_dispatch-only.

https://claude.ai/code/session_01R2zCM4cvNDRHPkN93Fw3DJ

The ~15s/job was the `cachix` CLI install (nix profile add), needed only to push via watch-store. Remove the setup-cachix step from warm-binaries, warm-lambdas, build-service-binaries and build-lambda-artifacts so they rely on the /nix sticky disks alone. setup-nix still puts nix on PATH, and nixpkgs deps still substitute from cache.nixos.org; only our own artifacts depend on the sticky disk now. Tradeoff (accepted for now): a cold/evicted sticky disk has no Cachix fallback, so it rebuilds from source. The leftover `cachix watch-store` guards are no-ops without the CLI. Re-enabling a cheap pull-only fallback later is just an extra-substituters line in setup-nix (no CLI install). https://claude.ai/code/session_01R2zCM4cvNDRHPkN93Fw3DJ

With setup-cachix gone, the cachix CLI isn't installed, so the watch-store push guards were permanent no-ops (and build-service-binaries logged a cosmetic "Cachix is unavailable" warning every run). Strip them from the warm jobs, the binary build step, and the lambda nix build script. The CACHIX_AUTH_TOKEN workflow_call secret is kept declared so existing callers don't error. https://claude.ai/code/session_01R2zCM4cvNDRHPkN93Fw3DJ

…son-LXyqN

The deploy build used a fresh docker/setup-buildx-action builder each run, so heavy base layers (convert-service's LibreOffice + Collabora, ~780MB) were re-pulled from ECR every deploy (~130s at ~4MB/s on a cold builder). Add an opt-in `use-blacksmith-builder` input to the shared deploy composite that swaps in useblacksmith/setup-docker-builder -- a buildkitd builder whose /var/lib/buildkit cache lives on a per-Dockerfile sticky disk, set as the default builder that Pulumi's docker-build provider uses. deploy-all-services opts in; other callers (which may run on non-Blacksmith runners) keep the stock buildx builder. https://claude.ai/code/session_01R2zCM4cvNDRHPkN93Fw3DJ

…son-LXyqN

…son-LXyqN # Conflicts: # .github/scripts/build-cloud-storage-lambdas.sh # infra/packages/resources/src/resources/load_balancer.ts

…into claude/gracious-thompson-LXyqN

…s-thompson-LXyqN

github-actions Bot assigned synoet Jun 4, 2026

synoet changed the title ~~[PoC] Build deploy binaries on Blacksmith with a Nix sticky disk~~ Build deploy binaries on Blacksmith with a Nix sticky disk Jun 4, 2026

claude added 7 commits June 4, 2026 17:04

Add branch-scoped push trigger to Blacksmith PoC for branch testing

d8d8937

workflow_dispatch only works once a workflow is on the default branch, so add a push trigger scoped to the feature branch so the PoC runs directly from the branch. To be removed before merge.

Remove branch-scoped push trigger from Blacksmith PoC

1341148

The PoC was triggered on push to the feature branch for validation. Drop it so the workflow is workflow_dispatch-only again and won't auto-run.

github-actions Bot added the infra label Jun 4, 2026

claude added 4 commits June 4, 2026 19:25

Remove max-parallel cap on lambda builds too

f8ac243

Blacksmith autoscales runners, so remove the max-parallel cap on the lambda build matrix and let all lambda services build concurrently, like the binary matrix.

Merge remote-tracking branch 'origin/main' into claude/gracious-thomp…

2a20215

…son-LXyqN

github-actions Bot added the cloud-storage label Jun 4, 2026

claude added 12 commits June 4, 2026 20:01

Remove max-parallel cap on deploy-services

dd166ec

Let all services deploy concurrently like the build matrices.

claude and others added 11 commits June 5, 2026 14:44

Merge remote-tracking branch 'origin/main' into claude/gracious-thomp…

ce05de4

…son-LXyqN

Merge remote-tracking branch 'origin/main' into claude/gracious-thomp…

d717eb2

…son-LXyqN

Merge remote-tracking branch 'origin/main' into claude/gracious-thomp…

dc56c51

…son-LXyqN # Conflicts: # .github/scripts/build-cloud-storage-lambdas.sh # infra/packages/resources/src/resources/load_balancer.ts

Merge remote-tracking branch 'origin/claude/gracious-thompson-LXyqN' …

8138c60

…into claude/gracious-thompson-LXyqN

fix(ffmpeg): ffmpeg lambda layer nix

1305738

fixes

4a2e3a7

Merge branch 'main' of github.com:macro-inc/macro into claude/graciou…

afc068b

…s-thompson-LXyqN

synoet mentioned this pull request Jun 10, 2026

feat(ci): faster build and deploy: blacksmith runners, hakari, sticky disks, crane lambdas #3948

Merged

synoet closed this Jun 10, 2026

synoet wants to merge 36 commits into main from claude/gracious-thompson-LXyqN
