feat: process injection via LiveServerlessMixin by deanq · Pull Request #260 · runpod/flash

deanq · 2026-03-06T20:21:29Z

Summary

Add injection.py with build_injection_cmd() for dockerArgs generation
Add base image constants (FLASH_GPU_BASE_IMAGE, FLASH_CPU_BASE_IMAGE)
Update LiveServerlessMixin to configure dockerArgs on templates for tarball injection
Add _default_base_image and _legacy_image properties to all Live* resource classes
Unify CPU base image to python:3.11-slim (matching GPU PyTorch runtime)
Worker tarball URL configurable via FLASH_WORKER_TARBALL_URL env var

Test plan

make quality-check passes
LiveServerless template includes dockerArgs with bootstrap command
FLASH_WORKER_TARBALL_URL=<url> flash deploy provisions endpoint with injection
Preview mode still works via legacy Docker images

Depends on: runpod-workers/flash#75 (tarball build pipeline)

Copilot

Pull request overview

Adds “process injection” support for Live* serverless resources by switching defaults to base images and generating template.dockerArgs that downloads/extracts/execs a flash-worker tarball at container start.

Changes:

Introduces build_injection_cmd() and new constants for base images + worker tarball configuration.
Updates LiveServerlessMixin / Live* resources to set default base images and populate template.dockerArgs for injection.
Updates unit/integration tests to validate new base-image defaults and presence of injection dockerArgs.

Reviewed changes

Copilot reviewed 8 out of 9 changed files in this pull request and generated 10 comments.

Show a summary per file

File	Description
`src/runpod_flash/core/resources/injection.py`	New helper to generate the injection `dockerArgs` shell command.
`src/runpod_flash/core/resources/constants.py`	Adds base image constants and worker tarball URL/version configuration.
`src/runpod_flash/core/resources/live_serverless.py`	Live* resources now default to base images and set `template.dockerArgs` for injection.
`tests/unit/resources/test_injection.py`	New unit tests validating injection command format.
`tests/unit/resources/test_live_serverless.py`	Updates LiveServerless/CpuLiveServerless expectations + dockerArgs assertions.
`tests/unit/resources/test_live_load_balancer.py`	Updates LiveLoadBalancer expectations + dockerArgs assertions.
`tests/integration/test_lb_remote_execution.py`	Updates integration expectations for new image defaults and BYOI behavior.
`tests/integration/test_cpu_disk_sizing.py`	Updates integration expectations for base-image defaults and BYOI behavior.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

src/runpod_flash/core/resources/injection.py

tests/integration/test_cpu_disk_sizing.py

Copilot · 2026-03-06T21:39:20Z

tests/unit/resources/test_injection.py

+    def test_default_remote_url(self):
+        """Test default remote URL generation."""
+        cmd = build_injection_cmd(worker_version="1.1.1")
+
+        assert cmd.startswith("bash -c '")
+        assert "FW_VER=1.1.1" in cmd
+        assert "flash-worker/releases/download/v1.1.1/" in cmd
+        assert "bootstrap.sh'" in cmd


This test assumes the default tarball URL template is the GitHub releases URL (asserting flash-worker/releases/download/...). Since FLASH_WORKER_TARBALL_URL is configurable via environment variable, the default template may differ in some test environments, causing a false failure. Consider asserting against FLASH_WORKER_TARBALL_URL_TEMPLATE.format(version=...) or passing an explicit tarball_url in the test.

src/runpod_flash/core/resources/injection.py

src/runpod_flash/core/resources/live_serverless.py

Copilot · 2026-03-06T21:39:21Z

src/runpod_flash/core/resources/live_serverless.py

+    """Configures process injection via dockerArgs for any base image.
+
+    Sets a default base image (user can override via imageName) and generates
+    dockerArgs to download, extract, and run the flash-worker tarball at container
+    start time. QB vs LB mode is determined by FLASH_ENDPOINT_TYPE env var at
+    runtime, not by the Docker image.
+    """

    @property
-    def _live_image(self) -> str:
-        """Override in subclasses to specify the locked image."""
-        raise NotImplementedError("Subclasses must define _live_image")
+    def _default_base_image(self) -> str:
+        raise NotImplementedError("Subclasses must define _default_base_image")

    @property
-    def imageName(self):
-        # Lock imageName to specific image
-        return self._live_image
+    def _legacy_image(self) -> str:
+        """Legacy Docker Hub image for preview mode."""
+        raise NotImplementedError("Subclasses must define _legacy_image")
+
+    def _create_new_template(self) -> PodTemplate:
+        """Create template with dockerArgs for process injection."""
+        template = super()._create_new_template()  # type: ignore[misc]
+        template.dockerArgs = build_injection_cmd()
+        return template

-    @imageName.setter
-    def imageName(self, value):
-        # Prevent manual setting of imageName
-        pass
+    def _configure_existing_template(self) -> None:
+        """Configure existing template, adding dockerArgs for injection if not user-set."""
+        super()._configure_existing_template()  # type: ignore[misc]
+        if self.template is not None and not self.template.dockerArgs:  # type: ignore[attr-defined]
+            self.template.dockerArgs = build_injection_cmd()  # type: ignore[attr-defined]


These Live* resources now rely on template.dockerArgs to perform injection at container start, but local preview (flash deploy --preview) starts containers via docker run <image> and does not apply template dockerArgs. With the new default base images, preview containers may not start the flash worker at all unless the preview path explicitly uses the legacy images or executes the injection command. Please ensure preview mode uses _legacy_image (or otherwise applies the injection command) before switching Live* defaults to base images.

tests/unit/resources/test_live_serverless.py

tests/integration/test_lb_remote_execution.py

runpod-Henrik

Bug: Network volume cache never activates — dotfile skipped by glob

The NV cache write path uses cp -r "$FW_DIR"/* "$NV_CACHE/". The * glob does not match dotfiles in bash, so .version is never written to the NV cache. The read gate checks [ -f "$NV_CACHE/.version" ], which is always false. Every cold start re-downloads the tarball regardless of whether a warm NV cache exists.

We can confirm this directly: create $FW_DIR with both bootstrap.sh and .version, run the exact glob command from the generated script, and .version is absent from $NV_CACHE while bootstrap.sh is present. The gate check then fails.

Fix: cp -r "$FW_DIR"/. "$NV_CACHE/" — the trailing /. copies directory contents including dotfiles.

Issue: NV cache read creates a subdirectory when `$FW_DIR` already exists

The NV cache read path uses cp -r "$NV_CACHE" "$FW_DIR". When $FW_DIR already exists (version upgrade scenario: old version in place, new NV cache available), cp -r copies the cache directory into $FW_DIR as a subdirectory. bootstrap.sh ends up at $FW_DIR/v1.1.1/bootstrap.sh instead of $FW_DIR/bootstrap.sh. The following exec $FW_DIR/bootstrap.sh fails with "No such file or directory".

Tested with real shell execution: mkdir fw_dir && cp -r nv_cache fw_dir → fw_dir/nv_cache/bootstrap.sh exists, fw_dir/bootstrap.sh does not. Without bug A this would only surface during version upgrades. With bug A fixed (.version now present in NV cache), this would become the primary failure.

Fix: mkdir -p "$FW_DIR" && cp -r "$NV_CACHE"/. "$FW_DIR"/ — consistent with the fix for bug A, and safe whether or not $FW_DIR exists.

🤖 Reviewed by Henrik's AI-Powered Bug Finder

runpod-Henrik

Test plan results — PR #260 Process Injection

Branch tested: pr-260
Python versions: 3.10.19, 3.11.14, 3.12.12, 3.13.11
Results: identical across all four versions

Result	Count
PASS	53
FAIL	0

All scenarios in the test plan pass cleanly.

Scenarios covered:

Cold start — injection command structure (download + exec, set -e, FW_DIR creation, --strip-components=1)
Warm restart — local cache check and branch logic
NV cache write — dotfile inclusion via /. glob
NV cache read — cp -r src/. places files at root on warm containers
Custom tarball URL (env var override)
file:// local path (direct extraction, bootstrap exec)
Deployment configuration — GPU/CPU worker templates carry injection
LB endpoints — GPU LB and CPU LB templates carry injection
Custom image — imageName preserved, injection in dockerArgs
NV cache write/read bug fixes verified
Version upgrade sequence end-to-end

🤖 Reviewed by Henrik's AI-Powered Bug Finder

runpod-Henrik

Code Review — PR #260: Process Injection via LiveServerlessMixin

Reviewer: Henrik's AI-Powered Bug Finder
Testing: Full unit suite (2521 passed, 0 new failures), 53/53 plan tests on pr-260 branch, real deploy on RunPod confirmed workers boot and receive jobs.

1. Context

This PR is the architectural solution to the Python version mismatch problem (build machine Python ≠ deployed runtime Python). Rather than maintaining a matrix of 12 versioned images, injection allows any base image to be used — flash-worker is downloaded and bootstrapped at container startup via dockerArgs. PR #261 (version-aware image selection) is intended to merge first as a short-term guardrail; this PR supersedes the image-picker once injection is the default.

2. NV cache bugs — fixed correctly

Write bug (cp -r "$FW_DIR"/* "$NV_CACHE/" → cp -r "$FW_DIR"/. "$NV_CACHE/"): glob * skips the .version dotfile, so the cache gate always fails and workers re-download on every cold start. Fixed.

Read bug (cp -r "$NV_CACHE" "$FW_DIR" on a warm container): when $FW_DIR already exists, cp -r src dest places the source inside the destination. bootstrap.sh ends up at $FW_DIR/v1.x.x/bootstrap.sh instead of $FW_DIR/bootstrap.sh. Worker crashes after any version upgrade on a container that has run before. Fixed with mkdir -p "$FW_DIR" && cp -r "$NV_CACHE"/. "$FW_DIR/".

Both fixes verified by the test suite.

Issue: Test has wrong expected URL

test_injection_cmd_embeds_download_url asserts github.com/runpod/flash-worker but the actual URL in constants.py is github.com/runpod-workers/flash. The test passes on main (import fails before the assertion) but will fail as soon as the injection module is importable. Needs a one-line fix.

Question: HTTPS not enforced on custom tarball URL

FLASH_WORKER_TARBALL_URL env var can override the download URL. If a user or deployment sets an http:// URL, the bootstrap command downloads and executes an unsigned binary over plaintext — susceptible to MITM on the bootstrap download. Worth either validating the scheme in build_injection_cmd or documenting that only HTTPS URLs are supported.

Note: Download overhead on first cold start

Injection adds a download-and-extract step before bootstrap.sh executes. On GPU workers that import torch, the combined time (download + CUDA init + first import) can approach or exceed the flash-worker's default job execution timeout. Verified in a real deploy: jobs received workerIds and ran but timed out at ~40s. Not a blocker — the NV cache path eliminates this overhead on subsequent starts — but worth documenting or raising the default execution timeout budget for GPU workers.

Note: Image-picker from PR #261 becomes dead code

Once this PR is the default, the version-aware image selection logic added by PR #261 is no longer the primary path. The python_version manifest field from #261 remains useful as an audit trail. Worth a comment in the relevant code noting the intended deprecation.

Nits

FLASH_WORKER_VERSION = "1.1.1" is a hardcoded string constant. Worth a comment explaining the versioning policy (manual bump per flash-worker release? tied to flash version?).
The injection command is a long bash one-liner. A brief comment explaining the three branches (NV cache hit / local cache hit / download) would help future readers.

Verdict: PASS WITH NITS

Core mechanism is sound, NV cache bugs are correctly fixed, real deploy confirmed. Fix the test URL before merge; the HTTPS question is worth a quick decision either way.

🤖 Reviewed by Henrik's AI-Powered Bug Finder

KAJdev · 2026-03-07T23:34:35Z

Honestly I think this is better implemented via host. This PR's implementation would mean we are re-downloading 100MB on every cold start at least, and I think there's an issue with the copy that means its downloading every time anyways, so its adding a decent amount of time to every single flash request.

We can't write to the shadow volume for the archive, and so we are betting on users using a network volume themselves, not a good solution in my opinion. We should handle this in host similar to how we manage sls-local-server.

deanq · 2026-03-08T02:45:54Z

Honestly I think this is better implemented via host. This PR's implementation would mean we are re-downloading 100MB on every cold start at least, and I think there's an issue with the copy that means its downloading every time anyways, so its adding a decent amount of time to every single flash request.

We can't write to the shadow volume for the archive, and so we are betting on users using a network volume themselves, not a good solution in my opinion. We should handle this in host similar to how we manage sls-local-server.

Agreed. Let's work towards a solution at the host level. I'm going to use this branch as an experiment and POC of the solution. The entire 100MB injectable worker is draft. That can be improved and get leaner. The goal remains the same: injectable worker through the start command.

KAJdev · 2026-03-08T02:47:15Z

if we go with host though, no need to use the start command as an inject method either

Replace pre-built Docker images with runtime tarball injection. The LiveServerlessMixin now generates dockerArgs that download, extract, and bootstrap the flash-worker tarball at container start time. - Add injection.py with build_injection_cmd() for dockerArgs generation - Add base image constants (FLASH_GPU_BASE_IMAGE, FLASH_CPU_BASE_IMAGE) - Update LiveServerlessMixin to configure dockerArgs on templates - Add _default_base_image and _legacy_image properties to all Live* classes - Update tests for injection-based template configuration - Revert InjectableWorkerMixin rename back to LiveServerlessMixin

Match test expectation with unified Python 3.11 base image change in FLASH_CPU_BASE_IMAGE constant.

- Fix tarball URL from runpod/flash-worker to runpod-workers/flash (matching the actual GitHub repo path) - Add python3 urllib.request fallback in download chain for base images without curl or wget (e.g. pytorch/pytorch runtime images) - Update test assertions for URL and fallback chain

Model validators were unconditionally overwriting imageName, and the mixin property was ignoring the stored value. Now validators only set defaults when imageName is not provided, and the no-op property is removed. Tests updated for new image naming scheme.

runpod-Henrik · 2026-03-10T19:45:48Z

QA Review — Process injection via LiveServerlessMixin

What changed: Live serverless and load-balancer resources now use generic base images instead of flash-specific Docker images. The flash-worker runtime is downloaded and extracted at container start via a dockerArgs bootstrap command. Users can now supply their own base image; previously the image was locked and user-provided values were silently dropped.

What was tested:

Bootstrap command is present on all four Live* resource templates on construction
User-provided imageName is preserved — not overwritten by the default base image
User-provided dockerArgs on an existing template are not overwritten by the injection command
FLASH_WORKER_TARBALL_URL env var correctly overrides the tarball URL in the generated command
Preview mode (flash deploy --preview) is unaffected — it does not go through the updated code path

What was NOT tested:

E2E deploy + cold start with injection — blocked until flash-worker chore: fix readme link #75 merges

Verdict: Incomplete — the core scenario (worker boots and runs via injected tarball) cannot be validated until flash-worker #75 merges. Worth adding a test for user-provided dockerArgs preservation before merge.

deanq requested a review from Copilot March 6, 2026 21:32

Copilot started reviewing on behalf of deanq March 6, 2026 21:33 View session

Copilot AI reviewed Mar 6, 2026

View reviewed changes

runpod-Henrik reviewed Mar 6, 2026

View reviewed changes

runpod-Henrik reviewed Mar 7, 2026

View reviewed changes

runpod-Henrik mentioned this pull request Mar 7, 2026

feat: Python version-aware Docker image selection #261

Merged

5 tasks

deanq added 3 commits March 9, 2026 19:50

fix(test): update CPU base image assertion to python:3.11-slim

5e3128d

Match test expectation with unified Python 3.11 base image change in FLASH_CPU_BASE_IMAGE constant.

deanq force-pushed the deanq/ae-2390-process-injectable-worker branch from 857aa51 to 31c81cd Compare March 10, 2026 02:53

Conversation

deanq commented Mar 6, 2026

Summary

Test plan

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI Mar 6, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Copilot AI Mar 6, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

runpod-Henrik left a comment

Choose a reason for hiding this comment

Bug: Network volume cache never activates — dotfile skipped by glob

Issue: NV cache read creates a subdirectory when $FW_DIR already exists

Uh oh!

runpod-Henrik left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Test plan results — PR #260 Process Injection

Uh oh!

runpod-Henrik left a comment

Choose a reason for hiding this comment

Code Review — PR #260: Process Injection via LiveServerlessMixin

1. Context

2. NV cache bugs — fixed correctly

Issue: Test has wrong expected URL

Question: HTTPS not enforced on custom tarball URL

Note: Download overhead on first cold start

Note: Image-picker from PR #261 becomes dead code

Nits

Uh oh!

KAJdev commented Mar 7, 2026

Uh oh!

deanq commented Mar 8, 2026

Uh oh!

KAJdev commented Mar 8, 2026

Uh oh!

runpod-Henrik commented Mar 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Issue: NV cache read creates a subdirectory when `$FW_DIR` already exists

runpod-Henrik left a comment •

edited

Loading