NO-MERGE: AGX Branch #1
Draft
ShivanshVij wants to merge 37 commits into main
Adds the libkrun-fork surface AGX needs to drive snapshot, finalization-barrier pause, and the cross-process control socket from outside the libkrun process:
- krun_pause / krun_resume: pause/resume every vCPU thread, cross-thread safe, blocks until ack. Preconditions documented.
- krun_get_guest_memory_range: expose host-virtual base + size of guest RAM so the streamer can read /proc/<pid>/mem.
- krun_snapshot: serialize vCPU + KVM/VM state to a binary artifact. Caller must pause first; double-pause was the failure mode discovered during integration.
- VcpuEvent::SaveState: new state-machine arm in the paused() state that returns a VcpuState through a one-shot mpsc channel, mirroring how Pause/Resume are implemented.
- serialize_full_state: magic + version + per-arch byte layout; 10884 B for a 1-vCPU guest in the integration test.
x86_64-only for v1; the aarch64 build paths are still vacuous.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds the full restore path that the AGX side (krun_resume_from C API) needs to bring a snapshotted guest back to life:
- vmm::snapshot module: serialize/deserialize/read_artifact for the vCPU + VM state binary. Single source of truth, shared by the writer (krun_snapshot) and the reader (build_microvm restore branch). Wire format documented in module doc.
- vstate::Vcpu: snapshot_restore_state public wrapper, mirror of snapshot_save_state. Calls KVM_SET_* in the documented order required after KVM_CREATE_VCPU.
- Vmm: restore_vm_state wrapper for VmState (PIT/CLOCK/IRQCHIP), guest_memory_mut accessor for write-side restore.
- VmResources::restore_from + RestoreContext: opt-in flag that switches build_microvm to the restore path.
- builder::build_microvm: when restore_from is set, allocate guest memory from the snapshot's per-region layout, splat memory.bin into the host-virtual mapping, skip choose_payload / load_cmdline / configure_system, build vCPUs with kernel_boot=false, then apply VcpuStates + VmState before start_vcpus.
- builder::load_snapshotted_memory + read_memory_layout: parse the AGXMEM01 layout header (region count + (gpa,len) tuples) and stream region bytes into guest memory.
- libkrun::krun_snapshot_memory: dump full guest RAM to a file. Layout header followed by concatenated region bytes; the restore path reads the same format back.
- libkrun::krun_resume_from: store paths on the context's VmResources::restore_from; the next krun_start_enter takes the restore path. krun_start_enter additionally skips libkrunfw load when restore_from is set.
Out of scope (deferred): per-device state save/restore for virtio-blk/net/console/vsock. Devices come up fresh on restore; whether the guest tolerates that depends on the kernel inside. For sandbox VMs running stock Linux >= 6.7, fresh device re-init at the host edge is treated like a hardware reset and the kernel re-discovers them.
Verified end-to-end: pause + krun_snapshot + krun_snapshot_memory on guest A, kill A, krun_resume_from on a fresh ctx; the control socket binds and pause is acked on the restored VM. Roundtrip test in agx-vmm/tests/snapshot_resume_roundtrip.rs.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
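To make the AGXMEM01 layout header (region count + (gpa,len) tuples) concrete, here is a minimal Rust sketch of a writer and reader. The commit message does not give field widths or byte order, so the 8-byte ASCII magic, the big-endian `u32` count and the big-endian `u64` pairs are assumptions. `encode_layout` and `decode_layout` are hypothetical names, not libkrun functions.

```rust
// Assumed magic: the 8 ASCII bytes of the format name.
const MAGIC: &[u8; 8] = b"AGXMEM01";

/// Read a big-endian u64 from the first 8 bytes of `b`.
fn be_u64(b: &[u8]) -> u64 {
    let mut a = [0u8; 8];
    a.copy_from_slice(&b[..8]);
    u64::from_be_bytes(a)
}

/// Encode the header: magic, region count, then one (gpa, len) pair per region.
fn encode_layout(regions: &[(u64, u64)]) -> Vec<u8> {
    let mut out = Vec::with_capacity(12 + regions.len() * 16);
    out.extend_from_slice(MAGIC);
    out.extend_from_slice(&(regions.len() as u32).to_be_bytes());
    for &(gpa, len) in regions {
        out.extend_from_slice(&gpa.to_be_bytes());
        out.extend_from_slice(&len.to_be_bytes());
    }
    out
}

/// Decode the header. Returns the regions and the offset at which the
/// concatenated region bytes (the body) would start.
fn decode_layout(buf: &[u8]) -> Option<(Vec<(u64, u64)>, usize)> {
    if buf.len() < 12 || buf[..8] != MAGIC[..] {
        return None;
    }
    let count = u32::from_be_bytes([buf[8], buf[9], buf[10], buf[11]]) as usize;
    let end = 12usize.checked_add(count.checked_mul(16)?)?;
    if buf.len() < end {
        return None; // truncated header
    }
    let regions = buf[12..end]
        .chunks(16)
        .map(|c| (be_u64(&c[..8]), be_u64(&c[8..])))
        .collect();
    Some((regions, end))
}
```

Returning the body offset alongside the regions lets a restore path seek straight to the first region's bytes after validating the header.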
Wires the FFI surface for the CIDR egress policy AGX needs (plan §5.17). The TSI-side enforcement isn't here yet; the function returns -ENOSYS until that patch lands. AGX-side wrapper at agx_vmm::VmConfig::set_egress_policy is wired so the integration is one line on the day the TSI patch lands. Doc-string in lib.rs spells out the patch shape (parse JSON → store on VmResources → consult in muxer's outbound connect). Reference: docs/egress-policy.md in the AGX tree. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Returns per-region (gpa_start, length, host_addr) triples. The gpa/len pairs match the AGXMEM01 layout-header that krun_snapshot_memory writes; host_addr lets the source's live-migration code read each region from /proc/<pid>/mem at the correct host VA (regions are NOT contiguous in host VA; a typical libkrun setup has 4 regions at scattered addresses).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
krun_set_egress_policy now parses the JSON policy and stores it on the ctx; the policy threads through VsockDeviceConfig to the muxer and is attached to every TsiStreamProxy at creation time. The TSI proxy's connect path consults the policy before issuing the host kernel connect(); Deny verdicts return ECONNREFUSED to the guest without touching the host network. Default verdict is Deny. The JSON parser is hand-rolled (no serde dep added to libkrun) and accepts the wire format produced by agx_net::EgressPolicy. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
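The connect-time check described above can be sketched as a CIDR match with a default-Deny fallback. This is an illustrative Rust sketch only: the `Rule` shape, first-match-wins ordering and the function names are assumptions, not the fields of libkrun's `EgressPolicy`.

```rust
use std::net::Ipv4Addr;

#[derive(Debug, Clone, Copy, PartialEq)]
enum Verdict {
    Allow,
    Deny,
}

/// Hypothetical rule shape: one IPv4 CIDR block and the verdict it carries.
struct Rule {
    network: Ipv4Addr,
    prefix_len: u8,
    verdict: Verdict,
}

/// True when `addr` falls inside `network/prefix_len`.
fn in_cidr(addr: Ipv4Addr, network: Ipv4Addr, prefix_len: u8) -> bool {
    let bits = u32::from(prefix_len.min(32));
    // A /0 prefix matches everything; shifting a u32 by 32 would overflow.
    let mask = if bits == 0 { 0 } else { u32::MAX << (32 - bits) };
    (u32::from(addr) & mask) == (u32::from(network) & mask)
}

/// First matching rule wins (an assumption); no match yields the default Deny.
fn evaluate(rules: &[Rule], dst: Ipv4Addr) -> Verdict {
    rules
        .iter()
        .find(|r| in_cidr(dst, r.network, r.prefix_len))
        .map(|r| r.verdict)
        .unwrap_or(Verdict::Deny)
}
```

A Deny result here corresponds to the proxy returning ECONNREFUSED to the guest before any host `connect()` is issued.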
Adds userfaultfd-WP (async) registration on guest memory regions, required by PAGEMAP_SCAN's PM_SCAN_WP_MATCHING flag, which is the proper per-range atomic find-dirty-and-rearm primitive (the kernel silently skips VMAs that aren't userfaultfd_wp_async; see fs/proc/task_mmu.c::pagemap_scan_test_walk).
Two C APIs:
- krun_set_uffd_wp_enabled(ctx, enabled): sets a flag on the ctx config; libkrun auto-registers right after the VMM is built and goes into RUNNING_VMMS. Call this BEFORE krun_start_enter.
- krun_register_uffd_wp(ctx): manual registration on a running VMM (idempotent). Available in case auto-registration isn't an option.
Implementation opens userfaultfd with WP_ASYNC | WP_UNPOPULATED, issues UFFDIO_REGISTER + UFFDIO_WRITEPROTECT per region. The fd is stashed in a static map keyed by ctx_id; closing it would unregister the VMAs.
Caveat: requires unprivileged userfaultfd (vm.unprivileged_userfaultfd=1) on the host.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a hook in build_microvm's restore path that runs right
before vCPUs start. Triggered by env
AGX_BOOT_UFFD_MISSING_LAZY=<lazy_offsets_file>:<send_sock>:
1. Open /dev/userfaultfd (Linux 6.1+ — falls back to syscall)
with UFFD_FEATURE_THREAD_ID + UFFD_FEATURE_EXACT_ADDRESS.
2. UFFDIO_REGISTER_MODE_MISSING on every guest memory region.
3. Read the lazy-offsets file (binary u64-BE absolute offsets
in the AGXMEM01 body — same format as the wire's
LazyPageList).
4. madvise(MADV_DONTNEED) each lazy page so its physical page
is freed; subsequent guest access faults via uffd-MISSING.
5. Connect to send_sock (a unix socket the dest controller is
listening on) and send the uffd fd via SCM_RIGHTS.
6. Leak the local fd handle so the dup'd fd on the controller
side stays valid.
The dest controller dups the uffd fd, runs poll(uffd) in one
thread + a bg fetcher in another. Faulting vCPUs' wake-up
flows through UFFDIO_COPY by the bg fetcher (after a priority-
push from the fault handler).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
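Step 3 above reads a file of binary u64-BE offsets. The sketch below, in Rust, parses such a buffer and maps an offset to a region. It assumes the offsets are measured from the start of the AGXMEM01 body (the concatenated region bytes), which is one reading of "absolute offsets". `parse_lazy_offsets` and `locate` are hypothetical helper names.

```rust
/// Parse a buffer of back-to-back big-endian u64 offsets.
/// Returns None if the length is not a multiple of 8.
fn parse_lazy_offsets(bytes: &[u8]) -> Option<Vec<u64>> {
    if bytes.len() % 8 != 0 {
        return None;
    }
    Some(
        bytes
            .chunks(8)
            .map(|c| {
                let mut a = [0u8; 8];
                a.copy_from_slice(c);
                u64::from_be_bytes(a)
            })
            .collect(),
    )
}

/// Map a body offset to (region index, offset inside that region),
/// given the region lengths in layout order.
fn locate(offset: u64, region_lens: &[u64]) -> Option<(usize, u64)> {
    let mut start = 0u64;
    for (i, &len) in region_lens.iter().enumerate() {
        let end = start.checked_add(len)?;
        if offset < end {
            return Some((i, offset - start));
        }
        start = end;
    }
    None // offset lies past the last region
}
```

The pair returned by `locate` is what a caller would add to the region's host address before issuing `madvise(MADV_DONTNEED)` on that page.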
Extends the egress policy's `EgressVerdict` enum with `Redirect`, plus a `redirect_target: Option<SocketAddrV4>` field on `EgressRule` carrying the host-side L4 proxy endpoint when the verdict is Redirect.
The TSI muxer (separate commit) will honor the verdict by rewriting the connect destination to `redirect_target` and writing the 12-byte AGX-L4-redirect header `[magic_be32][orig_addr_be32][orig_port_be16][pad_be16]` on the host socket before bridging guest bytes. The host-side L4 proxy reads that header to recover the original destination, so the redirect is transparent to the guest.
JSON parser updated to recognize:
- `verdict: "redirect"` (alongside existing "allow" / "deny")
- `redirect_target: "<ip>:<port>"` field on the rule
Mirror of `agx_net::EgressVerdict` + `EgressRule` host-side shape (in the AGX workspace's separate commit). Constants `AGX_L4_REDIRECT_MAGIC` / `AGX_L4_REDIRECT_HEADER_LEN` are duplicated between the two crates because libkrun's devices crate can't depend on agx_net.
Tests:
- `parses_redirect_verdict`: JSON round-trip for the new verdict + target field.
- `redirect_header_layout_is_stable`: byte-exact wire layout of the 12-byte header. Anchors the contract between this crate and the host-side proxy.
Stage A of AGX task containers#127 (L4 TCP+UDP MITM proxy). Stage B (TSI muxer wiring) is the next commit.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When the AGX egress policy returns the new `Redirect`
verdict for an outbound connect:
1. The TSI muxer rewrites the connect destination to the
rule's `redirect_target` (an IPv4 host:port pointing
at the host-side L4 proxy).
2. Builds the 12-byte AGX-L4-redirect header from the
guest's ORIGINAL `(addr, port)`:
[magic_be32][orig_addr_be32][orig_port_be16][pad_be16]
3. Stashes it in the new `pending_redirect_header` field
on `TsiStreamProxy`.
4. `switch_to_connected` (the convergence point for both
sync `connect()=Ok` and async EINPROGRESS→EPOLLOUT
completion) calls `flush_redirect_header`, which writes
the 12 bytes blocking on the freshly-connected host
socket BEFORE any guest-side bytes are bridged.
The host-side L4 proxy reads exactly 12 bytes first to
recover the original destination — the redirect is
transparent to the guest.
v6-destination redirects are not yet supported (the
proxy listens on v4 loopback). Such cases fall through
to ECONNREFUSED with a warning. UDP redirects are not
supported yet either: the muxer layer is TCP-only for
now; the host side already has UDP forwarder machinery
(in agx-l4-proxy) ready for when this lands.
Stage B of AGX task containers#127. Exposes `build_redirect_header`
+ `EgressPolicy::evaluate_full` from the egress_policy
module so the muxer can build headers and read
redirect_targets in one step.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
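The 12-byte layout `[magic_be32][orig_addr_be32][orig_port_be16][pad_be16]` can be sketched in Rust as follows. The magic value below is a placeholder (the real `AGX_L4_REDIRECT_MAGIC` is not given in this message), the zero pad is an assumption, and the signatures are illustrative rather than the actual `build_redirect_header` exported by the egress_policy module.

```rust
use std::net::Ipv4Addr;

// Placeholder value; the real AGX_L4_REDIRECT_MAGIC is not stated here.
const HYPOTHETICAL_MAGIC: u32 = 0x4147_5834;
const HEADER_LEN: usize = 12;

/// Build the header from the guest's ORIGINAL destination.
fn build_redirect_header(orig_addr: Ipv4Addr, orig_port: u16) -> [u8; HEADER_LEN] {
    let mut h = [0u8; HEADER_LEN];
    h[0..4].copy_from_slice(&HYPOTHETICAL_MAGIC.to_be_bytes());
    h[4..8].copy_from_slice(&orig_addr.octets()); // octets are already big-endian
    h[8..10].copy_from_slice(&orig_port.to_be_bytes());
    // h[10..12] is the pad; left as zero in this sketch.
    h
}

/// What a host-side proxy would do after reading exactly 12 bytes.
fn parse_redirect_header(h: &[u8; HEADER_LEN]) -> Option<(Ipv4Addr, u16)> {
    if h[0..4] != HYPOTHETICAL_MAGIC.to_be_bytes()[..] {
        return None;
    }
    let addr = Ipv4Addr::new(h[4], h[5], h[6], h[7]);
    let port = u16::from_be_bytes([h[8], h[9]]);
    Some((addr, port))
}
```

Because the header has a fixed length, the proxy can read it with a single exact-length read before treating the rest of the stream as guest bytes.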
New libkrun device path that bypasses the imago
file/format-driver stack and dispatches every virtio-blk
read/write/flush to a C callback bundle (`AgxBlkCallbacks`).
Lets external callers — notably AGX's silo Provider chain
in `agx-storage` — back a virtio-blk disk without writing
to a host file or attaching an NBD device.
Wire shape (matches `agx_vmm::ffi::AgxBlkCallbacks` on the
host side):
```c
struct agx_blk_callbacks {
void *user;
uint64_t (*size)(void *user);
int32_t (*read) (void *user, uint64_t off, uint8_t *buf, uint32_t len);
int32_t (*write)(void *user, uint64_t off, const uint8_t *buf, uint32_t len);
int32_t (*flush)(void *user);
};
```
Read/write/flush return 0 on success or -errno on failure.
The `user` pointer is opaque; libkrun stashes it in the
device and passes it as the first argument of every
callback. Caller MUST keep `user` alive for the lifetime
of the libkrun ctx.
New entry points:
- `CallbackStorage` (devices/virtio/block/callback_storage.rs)
— implements imago's `DynStorage` over the callbacks.
All async methods box a `Future` that just calls the
sync C ABI inside an `async` block; no real suspension.
- `Block::new_with_callbacks` — constructs a Block whose
`disk_image` is a `SyncFormatAccess<CallbackStorage>`
wrapping the callbacks via the Raw format driver.
- `BlockBuilder::insert_callback`,
`VmResources::add_callback_block_device` — same plumbing
as the file path, separate vec.
- `krun_add_disk_callbacks(ctx, block_id, callbacks, ro,
sync_mode)` — C API.
5 unit tests in callback_storage.rs cover size round-trip,
write→read, flush invocation count, EOF zero-fill, and
errno translation. Workspace `make all` green.
NOT yet wired in this commit (deferred to follow-up
sub-tasks of AGX task containers#128):
- Snapshot/restore: the callback binding gets lost on
resume_from. The host needs a per-snapshot rebind hook
to re-attach the Provider after restore.
- NBD/sparse-file removal in agx-storage's `expose/`
module — switch all consumers to the callback path
before deleting the old code.
Stage 1-2 of task containers#128. The host-side bridge
(`agx_vmm::blk_callback`) + end-to-end test land in the
parent-repo commit referencing this submodule.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
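As an illustration of the callback contract (0 on success, -errno on failure, `user` passed back as the first argument), here is a Rust sketch of an in-memory backend behind a `#[repr(C)]` mirror of the struct. `MemDisk`, `callbacks_for` and the choice of -EINVAL for out-of-range access are assumptions made for the example; they are not taken from `agx_vmm::ffi`.

```rust
use std::ops::Range;
use std::os::raw::c_void;
use std::ptr;

const EINVAL: i32 = 22; // Linux errno value, hard-coded to stay std-only

/// Rust mirror of the C `struct agx_blk_callbacks` shown above.
#[repr(C)]
struct AgxBlkCallbacks {
    user: *mut c_void,
    size: unsafe extern "C" fn(*mut c_void) -> u64,
    read: unsafe extern "C" fn(*mut c_void, u64, *mut u8, u32) -> i32,
    write: unsafe extern "C" fn(*mut c_void, u64, *const u8, u32) -> i32,
    flush: unsafe extern "C" fn(*mut c_void) -> i32,
}

/// Hypothetical backend: a fixed-size disk held in memory.
struct MemDisk {
    data: Vec<u8>,
    flushes: u32,
}

/// Byte range for an access, or None if it runs past the end of the disk.
fn checked_range(size: usize, off: u64, len: u32) -> Option<Range<usize>> {
    let end = off.checked_add(u64::from(len))?;
    if end > size as u64 {
        return None;
    }
    Some(off as usize..end as usize)
}

unsafe extern "C" fn mem_size(user: *mut c_void) -> u64 {
    unsafe { (*(user as *mut MemDisk)).data.len() as u64 }
}

unsafe extern "C" fn mem_read(user: *mut c_void, off: u64, buf: *mut u8, len: u32) -> i32 {
    let disk = unsafe { &*(user as *mut MemDisk) };
    match checked_range(disk.data.len(), off, len) {
        Some(r) => {
            unsafe { ptr::copy_nonoverlapping(disk.data[r].as_ptr(), buf, len as usize) };
            0
        }
        None => -EINVAL,
    }
}

unsafe extern "C" fn mem_write(user: *mut c_void, off: u64, buf: *const u8, len: u32) -> i32 {
    let disk = unsafe { &mut *(user as *mut MemDisk) };
    match checked_range(disk.data.len(), off, len) {
        Some(r) => {
            unsafe { ptr::copy_nonoverlapping(buf, disk.data[r].as_mut_ptr(), len as usize) };
            0
        }
        None => -EINVAL,
    }
}

unsafe extern "C" fn mem_flush(user: *mut c_void) -> i32 {
    unsafe { (*(user as *mut MemDisk)).flushes += 1 };
    0
}

/// Bundle the callbacks. The caller must keep `disk` alive while they are in use.
fn callbacks_for(disk: &mut MemDisk) -> AgxBlkCallbacks {
    AgxBlkCallbacks {
        user: disk as *mut MemDisk as *mut c_void,
        size: mem_size,
        read: mem_read,
        write: mem_write,
        flush: mem_flush,
    }
}
```

The raw `user` pointer carries no lifetime, which is why the message insists that the caller keep it alive for the lifetime of the libkrun ctx.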
When epoll returns EPOLLIN and EPOLLHUP on the same wakeup, process_event() was checking HUP first and returning early without ever calling recv_pkt(). For a request/response upstream that writes then immediately closes (HTTP Connection: close, busybox-nc, etc.), the response bytes are still pending on the host socket but the guest gets RST without seeing them.
Drain the data first when HANG_UP fires on a Connected proxy, then push the RST. The drained bytes land in the RX ring ahead of the RST so the guest receives data before the close. Fix ported from smolvm.
The pre-existing else-branch also had a latent bug: the remove_proxy = (status == Listening) check ran AFTER status was already overwritten to Closed, so the result was always Deferred. Capture was_listening before mutation.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
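The ordering fix can be modelled as a pure function from the epoll event mask to an ordered action list. This Rust sketch is a simplified model, not the proxy code: `Action` and `plan_actions` are hypothetical names, and the two constants are the Linux values of EPOLLIN and EPOLLHUP.

```rust
const EPOLLIN: u32 = 0x001;
const EPOLLHUP: u32 = 0x010;

#[derive(Debug, PartialEq)]
enum Action {
    /// Read pending host bytes into the guest's RX ring.
    DrainRx,
    /// Tell the guest the connection is gone.
    SendRst,
}

/// Decide what to do for one wakeup. On hang-up of a connected proxy the
/// drain is queued BEFORE the reset, so data reaches the guest first.
fn plan_actions(events: u32, connected: bool) -> Vec<Action> {
    let mut actions = Vec::new();
    if events & EPOLLHUP != 0 {
        if connected {
            actions.push(Action::DrainRx);
        }
        actions.push(Action::SendRst);
    } else if events & EPOLLIN != 0 {
        actions.push(Action::DrainRx);
    }
    actions
}
```

The pre-fix behaviour corresponds to returning only `SendRst` whenever the hang-up bit is set, which is what dropped the pending response bytes.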
For W2-4.1 of agx's S3-aware live-migration destination. Different shape from the existing AGX_BOOT_UFFD_MISSING_LAZY:
- **Skip the bulk read of memory.bin into guest memory.** When AGX_BOOT_UFFD_MISSING_ALL is set, the restore path bypasses `load_snapshotted_memory`. memory.bin still has to exist and have a valid AGXMEM01 layout header (so `read_memory_layout` succeeds), but its body is not consumed; guest pages start as fresh anonymous mappings. The controller will populate them via UFFDIO_COPY as faults arrive (synchronous fetch from peer P2P stream or S3) and as the bg fetcher proactively pulls.
- **Wait for a controller ready-byte before starting vCPUs.** After sending the uffd fd via SCM_RIGHTS, register_uffd_missing_all blocks on a one-byte read on the same socket. The controller sends the byte AFTER it has spawned its fault handler thread. Without this gate, the vCPU would start before the handler is wired up, fault immediately, and stall the kernel.
Env value: `AGX_BOOT_UFFD_MISSING_ALL=<send_sock_path>` (no lazy file; every page is treated as needing controller-side population).
MADV_DONTNEED is still issued per region as defense-in-depth in case `load_snapshotted_memory` ran (it shouldn't when MISSING_ALL is set, but skipping the load is gated on env-var presence, which a future caller could forget to set; DONTNEED makes the post-state idempotent).
Both register_uffd_missing_all (ALL) and register_uffd_missing_lazy (LAZY) now write the guest-memory region table on the same Unix stream they sent the uffd fd over, BEFORE blocking on the controller's ack-byte.
Wire format (after the SCM_RIGHTS fd):
  u32 BE region_count
  per region: u64 BE gpa, u64 BE host_addr, u64 BE len
Lets the controller side build its PostCopyRegion table without querying GuestMemoryLayout, which previously couldn't be queried until AFTER the ack-byte (because the VMM enters RUNNING_VMMS post-ack). With regions in hand pre-ack, the controller can run PR #20a's prefill phase (S3 / P2P pulls) while libkrun is still blocked at read(ack-byte).
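The region-table wire format above is fully specified, so it can be sketched directly in Rust. The `Region` struct and the two function names are hypothetical; only the field order, widths and big-endian encoding come from the message.

```rust
#[derive(Debug, Clone, PartialEq)]
struct Region {
    gpa: u64,
    host_addr: u64,
    len: u64,
}

/// Read a big-endian u64 from the first 8 bytes of `b`.
fn be_u64(b: &[u8]) -> u64 {
    let mut a = [0u8; 8];
    a.copy_from_slice(&b[..8]);
    u64::from_be_bytes(a)
}

/// Sender side: u32 BE count, then gpa, host_addr, len (u64 BE each) per region.
fn encode_region_table(regions: &[Region]) -> Vec<u8> {
    let mut out = Vec::with_capacity(4 + regions.len() * 24);
    out.extend_from_slice(&(regions.len() as u32).to_be_bytes());
    for r in regions {
        out.extend_from_slice(&r.gpa.to_be_bytes());
        out.extend_from_slice(&r.host_addr.to_be_bytes());
        out.extend_from_slice(&r.len.to_be_bytes());
    }
    out
}

/// Controller side: rejects buffers whose length disagrees with the count.
fn decode_region_table(buf: &[u8]) -> Option<Vec<Region>> {
    if buf.len() < 4 {
        return None;
    }
    let count = u32::from_be_bytes([buf[0], buf[1], buf[2], buf[3]]) as usize;
    let body = &buf[4..];
    if body.len() != count.checked_mul(24)? {
        return None;
    }
    Some(
        body.chunks(24)
            .map(|c| Region {
                gpa: be_u64(&c[0..8]),
                host_addr: be_u64(&c[8..16]),
                len: be_u64(&c[16..24]),
            })
            .collect(),
    )
}
```

On a real stream the controller would first read the 4-byte count, then read exactly `count * 24` more bytes before sending the ack-byte.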
Adds a public C API knob for setting the CPU template that filter_cpuid applies at vCPU configure time. Maps:
  "T2" → CpuFeaturesTemplate::T2 (Cascade-Lake-equivalent)
  "C3" → CpuFeaturesTemplate::C3 (Cascade-Lake minus AVX)
  "host", NULL → None (host CPU passthrough)
Stored on VmResources via a new public method `set_cpu_template(Option<CpuFeaturesTemplate>)`. Unknown template names → -EINVAL. Unknown ctx_id → -ENOENT.
Required by AGX PR containers#22's --cpu-template flag, which lets the sandbox + harness VMs share a stable CPU view across hosts for migration safety. Header declaration added to include/libkrun.h.
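The name-to-template mapping reads naturally as a single match. In this Rust sketch the enum is a local stand-in for `CpuFeaturesTemplate`, `parse_cpu_template` is a hypothetical helper, case-sensitive matching is an assumption, and EINVAL is hard-coded to its Linux value.

```rust
const EINVAL: i32 = 22; // Linux errno value

/// Local stand-in for libkrun's CpuFeaturesTemplate.
#[derive(Debug, PartialEq)]
enum CpuFeaturesTemplate {
    T2,
    C3,
}

/// `None` models a NULL name from the C side.
/// Ok(None) means host CPU passthrough; Err carries a negative errno.
fn parse_cpu_template(name: Option<&str>) -> Result<Option<CpuFeaturesTemplate>, i32> {
    match name {
        None | Some("host") => Ok(None),
        Some("T2") => Ok(Some(CpuFeaturesTemplate::T2)),
        Some("C3") => Ok(Some(CpuFeaturesTemplate::C3)),
        Some(_) => Err(-EINVAL),
    }
}
```

The `Ok` payload has the same `Option<CpuFeaturesTemplate>` shape that `set_cpu_template` is described as accepting.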
The TCP path consults the egress policy at connect-time: Deny → ECONNREFUSED, Redirect → rewrite to redirect_target + queue the 12-byte AGX-L4-redirect header for the post-connect flush, Allow → fall through. UDP had no equivalent: the host-side proxy in agx-l4-proxy was already implemented but every redirected datagram fell through libkrun's UDP path verbatim, never reaching it.
This commit adds the parallel UDP path:
- TsiDgramProxy gains `egress_policy` and `pending_redirect_header` fields plus `set_egress_policy`, matching tsi_stream.rs.
- `sendto_addr` consults the policy. Deny drops the address silently (UDP has no error channel back to the guest like TCP's ECONNREFUSED; the closest fit is "subsequent sendto_data calls become no-ops"). Redirect rewrites the destination to the rule's redirect_target and queues the 12-byte header recording the original dest.
- `sendto_data` flushes the pending header as a standalone datagram to the rewritten target before sending the guest's payload. The host-side L4 proxy parses exactly 12 bytes from the first datagram of each new flow to recover the original destination; this matches what agx-l4-proxy/src/udp.rs already expected.
- Muxer hands the policy Arc to TsiDgramProxy at create time (mirrors line 328's TCP wiring at line 352-368).
- tsi_stream::sockaddr_to_addr_port goes from private to pub(super) so tsi_dgram can reuse it.
Only IPv4 redirects are supported on the UDP path (same as TCP); v6 dst with redirect verdict drops with a warn!.
The prior commit covered the sendto(addr) explicit-dest UDP
path. nc -u (and most UDP clients) actually use connect()
to record the default peer, then send() — which the af_tsi
guest module dispatches via vsock as TSI_CONNECT then
DGRAM_RW (no per-datagram TSI_SENDTO_ADDR).
Mirrors the connect-time policy enforcement from
tsi_stream.rs into TsiDgramProxy::connect:
- Deny → push ConnResponse(-ECONNREFUSED) back to the
guest, no host-side connect.
- Redirect → build the 12-byte AGX header from the
original (orig_v4, port), queue
pending_redirect_header, replace the host-side
connect target with the rule's redirect_target.
- Allow → fall through.
sendmsg (the connected-UDP send path) flushes the pending
header as a standalone datagram before the guest's payload
— same shape as the existing sendto_data + tsi_stream
flush_redirect_header logic.
With this, both sendto(udp_fd, buf, dst) and
connect(udp_fd, dst) + send(udp_fd, buf) patterns exercise
the redirect.
V1-M1 of the AGX product reshape drops the `agx-` prefix on internal AGX library crates (per `references/product.md` §3.1). The AGX-side type that this libkrun-local mirror tracks is now `net::EgressPolicy` (was `agx_net::EgressPolicy`); update the "// Mirror of …" comments here so the cross-submodule reference stays accurate. No behaviour change — comments only. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
libkrun's set_exit_code() reports the entrypoint's exit
status to the host via an ioctl on the virtiofs root. The
ioctl is only meaningful when `/` IS virtiofs — and after
the krun_set_root_disk_remount path pivots from virtiofs to
the configured block device, `/` becomes ext4 (or whatever
the disk is) and is_virtiofs("/") returns 0. The function
silently no-ops; the host process sees exit code 0
regardless of what the entrypoint returned.
This is fine when the rootfs is virtiofs (every existing
libkrunfw consumer until now). It breaks when the rootfs is
a sparse ext4 disk image — exactly the path AGX V1-M3.5
takes for production VMs (the disk being the output of
oci-converter).
Fix: save an fd to the original virtiofs `/` BEFORE the
block-device pivot. The underlying virtiofs mount stays
alive as long as the fd is open, even though the mountpoint
moves out from under it; ioctl on the saved fd works
identically post-pivot. set_exit_code() prefers the saved fd
when present, else falls back to the original
"open `/` if virtiofs" path so the directory-rootfs
behaviour is unchanged.
The patch is in three small places:
1. Module-level `int agx_exit_code_fd = -1` and a helper
`agx_save_virtiofs_root_fd()` that opens `/` once and
stashes the fd.
2. The remount path calls the helper after `mkdir
/newroot` and before the actual `try_mount` of
/dev/vda — at this point `/` is still virtiofs.
3. set_exit_code() prefers `agx_exit_code_fd` over its old
"is_virtiofs check + fresh open" path; both branches
end with the same ioctl call.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
krun_start_enter previously processed regular block devices first and callback-backed disks second, regardless of host-side add order. Tests that exercise the Provider→callback path expect their test-injected callback disk on /dev/vda; with the previous order, any regular block device (e.g. a rootfs disk via krun_set_root_disk_remount) would steal /dev/vda and push the callback disk to /dev/vdb. Swapping the loops makes callback disks land on lower /dev/vd<N> slots, restoring the test contract. boot_alpine's rootfs-last disk-attach pattern relies on this to compute the rootfs's final device path correctly.
agxd wraps the rootfs in a FileStorage→DirtyTracker→callback chain so snapshot finalize can harvest the per-snap disk dirty bitmap (used for diff chains). The remount API previously refused this layout because the empty-block-cfgs check looked only at regular `add_disk` entries. Loosen it to also allow `add_disk_callbacks` rootfs. One-line guard change. No behavior change for the regular-disk path (block_cfgs still populated → check passes as before).
The streaming-restore path registers uffd-MISSING on every guest VMA in register_uffd_missing_all. Kernel uffd is one-fd-per-VMA: the follow-up auto-registration for uffd-WP returned EBUSY because the same VMA was already owned by the missing-mode fd. Skip the WP registration on that path with a logged warn. The stream-restored VM is not snapshot-able until cold-restarted (snapshot needs uffd-WP for PAGEMAP_SCAN); operator can stop+start to clear. Pairs with agxd's new uffd_pager module that owns the missing-mode fd.
The prior order was vcpu_states + restore_vm_state → register_uffd_missing → start_vcpus. With AGX_BOOT_UFFD_MISSING_ALL set, restore_vm_state's reads of guest memory (virtio queue heads, IRQCHIP descriptors) hit post-DONTNEED zero pages because uffd-missing wasn't wired yet. Move register_uffd_missing_all to BEFORE restore_vm_state when the env var is set. The agxd-side controller has the fault thread serving by the time it sends the ready-byte, so libkrun's read on the same stream unblocks only after the pager is responsive — restore_vm_state can fault-in pages on demand from the chain. The lazy variant (AGX_BOOT_UFFD_MISSING_LAZY) keeps its post-restore position because it relies on read_full_memory having loaded the full image first (only the lazy subset gets DONTNEED'd). Pairs with agxd's StartArgs.console_file plumbing so the post-resume guest kernel output is recoverable for further debugging.
The agxd continuous-checkpoint streamer's PAGEMAP_SCAN with PM_SCAN_WP_MATCHING requires the VMM's guest-memory VMAs to have userfaultfd-WP armed in WP_ASYNC mode. If the VMM is visible via RUNNING_VMMS (which the control-sock RPCs lookup through) before uffd-wp is registered, the streamer can race ahead and PAGEMAP_SCAN returns EACCES against unarmed VMAs on Yama-restricted kernels. Fix the ordering: extract the registration body into register_uffd_wp_with_vmm(ctx_id, &vmm) so krun_start_enter can call it BEFORE RUNNING_VMMS.lock().insert(...). The external krun_register_uffd_wp(ctx_id) entry point is kept unchanged.
AGX-side investigation revealed two separate symptoms tracked at agxd containers#287 and containers#268:
1. Resumed VM exits cleanly within ~250ms with KVM_EXIT_SHUTDOWN and zero console output (triple-fault on first instruction).
2. Long hot-pause (krun_pause for 80+ seconds) trips the guest's RCU watchdog on resume.
Both have the same root cause: the guest's pvclock thinks no time has passed (because KVM faithfully restores tsc_to_system_mul + system_time_at_snapshot) BUT the wall clock has advanced. On the next clock-update, the guest's RCU/softlockup watchdogs see "I haven't ticked for X seconds" → panic → triple-fault → SHUTDOWN.
Fix: call KVM_KVMCLOCK_CTRL per-vcpu at the end of restore_state and on the Pause→paused transition. The ioctl sets PVCLOCK_GUEST_STOPPED in the next-emitted pvclock structure; Linux's guest pvclock driver clears its time-since-last-tick counters when it sees the flag. Same fix firecracker applies in arch/x86_64/vcpu.rs::restore_state (line 715). Best-effort: EINVAL when the guest didn't activate kvmclock is ignored.
Also pin krun-devices's rand back to 0.9.2: the dependabot bump to 0.10.1 (commit b82735c on main) didn't update call sites, breaking the build with two API errors. The agx branch already had source code working with rand 0.9.2 from a prior commit.
Co-author: K. Pivklock (kvmclock_ctrl analysis)
Adds VcpuState::tsc_khz (Option<u32>) and bumps the snapshot wire format to v2:
- save_state: KVM_GET_TSC_KHZ best-effort. Hosts that don't support it leave the field None.
- restore_state: when both saved and host TSC frequencies are known and differ by >250 ppm (firecracker's tolerance), call KVM_SET_TSC_KHZ to scale the vcpu. On same-host restore (V1 default) the values match and this is a no-op.
- snapshot::serialize: writes the 1-byte present flag + u32 LE value after the cpuid block per vcpu.
- snapshot::deserialize: backwards-compatible; accepts both format v1 (no tsc_khz) and v2 (with). v1 maps to None.
Required for V1.1 cross-host live migration where source and destination hosts may have different TSC frequencies. Mirror of firecracker's arch/x86_64/vcpu.rs::restore_state TSC handling. Closes the latent gap behind agx-side task containers#287.
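The ">250 ppm" decision in restore_state can be sketched with integer arithmetic only. This Rust sketch measures the tolerance relative to the saved frequency, which is an assumption; `needs_tsc_scaling` is a hypothetical name and does not issue any KVM ioctl.

```rust
/// Tolerance in parts per million, as quoted in the commit message.
const TSC_KHZ_TOL_PPM: u64 = 250;

/// True when both frequencies are known and differ by more than the
/// tolerance. Unknown on either side means "leave the vCPU alone".
fn needs_tsc_scaling(saved_khz: Option<u32>, host_khz: Option<u32>) -> bool {
    match (saved_khz, host_khz) {
        (Some(saved), Some(host)) => {
            let diff = (i64::from(saved) - i64::from(host)).abs() as u64;
            // diff / saved > 250 / 1_000_000, cross-multiplied to avoid floats.
            diff * 1_000_000 > u64::from(saved) * TSC_KHZ_TOL_PPM
        }
        _ => false,
    }
}
```

For a 3 GHz guest (3,000,000 kHz) the threshold works out to 750 kHz, so a same-host restore with matching values never triggers scaling.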
Removed `quiet` from DEFAULT_KERNEL_CMDLINE and added `loglevel=8 earlyprintk=keep panic_print=0xff` so guest kernel printk reaches the host's `console_output` file. AGX surfaces this via per-VM `console.log` for `agx daemon logs <vm>` style tooling. Caveat: x86_64 libkrun has no 8250 serial emulation, so `earlyprintk` has nowhere to go before virtio-console initializes — early panics during init.krun are still silent. Tracked as a separate AGX-side todo (8250 emulation or libkrun-internal tracing). Upstream keeps `quiet` because libkrun's CLI users typically don't route console output anywhere, so the chatter is wasted.
…el_path
Three changes in this commit:
1. DEFAULT_KERNEL_CMDLINE adds:
- earlycon=uart8250,io,0x3f8,115200 — register an early console
against COM1 from the kernel's first printk
- earlyprintk=ttyS0,115200,keep — older x86 driver fallback
- console=ttyS0,115200 — keep ttyS0 active alongside hvc0
- ignore_loglevel — never filter messages out of the early window
2. builder.rs autoconfigure_console_ports + serial fallback:
- When console_output is set and no explicit serial is configured
and disable_implicit_console is false, attach a sink-input +
append-write-to-file output Serial at COM1. Pairs with the
earlycon= cmdline so kernel boot output reaches console.log
before virtio-console probes.
- Replace File::create with OpenOptions append-only so the
virtio-console handle does NOT truncate the early-boot bytes
written by the COM1 serial earlier in build_microvm.
3. krun_set_kernel_path C API + KrunfwBindings::from_path:
- Take an absolute .kernel-file path and dlopen it instead of
relying on the SONAME-resolved libkrunfw.so.5. Lets agxd
ship multiple kernel artifacts (e.g. linux-6.12.76.kernel)
and pick per-VM via krun_set_kernel_path.
- init/init.c gains a krun_trace() helper that writes
"[init.krun] msg" to /dev/console; an opening trace marker
fires at main() entry to surface the silent window between
kernel start and config-parsing.
Signed-off-by: Shivansh Vij <shivanshvij@loopholelabs.io>
Earlier commit 1dfba26 unconditionally appended:
  earlycon=uart8250,io,0x3f8,115200 console=ttyS0,115200 earlyprintk=ttyS0,115200,keep loglevel=8 ignore_loglevel
That broke the integration tests that don't set console_output: with `earlycon=` in the cmdline but no matching 8250 emulation behind 0x3f8 (the AGX implicit COM1 fallback only runs when console_output is set), the kernel poked the I/O port repeatedly with no response, and the shutdown path no longer cleanly drove KVM_EXIT_SHUTDOWN, so vm-launch hung instead of exiting after `reboot: machine restart`.
Fix: revert DEFAULT_KERNEL_CMDLINE to the pristine pre-1dfba26 form (just `panic_print=0xff` over upstream), and append `earlycon=uart8250,io,0x3f8,115200` from build_microvm() only when an implicit COM1 is actually about to be wired (console_output is Some).
agxd / vmm wire extra virtiofs shares (e.g. <vm-dir>/host-share for the per-session MITM CA cert) and pass tag→mountpoint pairs via the AGX_VIRTIOFS_MOUNTS env var. init.krun parses it and mounts each share before exec'ing the entrypoint, so the guest entrypoint never has to know about them — systemd or busybox-sh both see /.agx-host/ca.pem as just a regular file on a pre-existing mount. Format: AGX_VIRTIOFS_MOUNTS=<tag>:<mountpoint>[,<tag>:<mountpoint>…]. Empty / missing env: no shares, no behavior change. Best-effort error handling: bad entries log via krun_trace + perror but never abort init — a missing share is preferable to a wedged guest. This replaces docs/TODO.md §1's env-based CA delivery, which overflowed Linux's hardcoded `COMMAND_LINE_SIZE=2048` (the base64-encoded cert plus the rest of the cmdline tripped __fortify_panic in the kernel before init even ran).
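The `AGX_VIRTIOFS_MOUNTS` format and its fail-soft handling can be illustrated with a small parser. The real parsing lives in init.krun (C); this Rust sketch only models the format, and `parse_virtiofs_mounts` plus the rule that empty tags or mountpoints are skipped are assumptions.

```rust
/// Parse `<tag>:<mountpoint>[,<tag>:<mountpoint>...]` into pairs.
/// Malformed entries are skipped instead of failing the whole list,
/// mirroring the "never abort init" policy described above.
fn parse_virtiofs_mounts(value: &str) -> Vec<(String, String)> {
    value
        .split(',')
        .filter_map(|entry| {
            // Split on the first ':' only, so the mountpoint may contain ':'.
            let mut parts = entry.splitn(2, ':');
            let tag = parts.next()?;
            let mountpoint = parts.next()?;
            if tag.is_empty() || mountpoint.is_empty() {
                return None;
            }
            Some((tag.to_string(), mountpoint.to_string()))
        })
        .collect()
}
```

An empty or missing variable yields an empty list, which matches the "no shares, no behavior change" case.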
Closes the docs/TODO §1 follow-up: supervisor's install_mitm_ca only runs in supervisor mode, so systemd-as-PID-1 images had the cert mounted at /.agx-host/ca.pem but never installed at the system trust store. Doing the install in init.krun before exec'ing the entrypoint covers every init mode (systemd, agx-supervisor, /bin/sh) uniformly.
Steps (all best-effort, all fail-soft):
1. If AGX_MITM_CA_SOURCE (default /.agx-host/ca.pem) is readable, copy it to AGX_MITM_CA_PATH (default /usr/local/share/ca-certificates/agx.crt). mkdir-p the parent.
2. If /etc/ssl/certs/ca-certificates.crt exists and doesn't already contain "AGX Session CA" (the AGX cert's CN), append the cert bytes to that bundle. This is what `update-ca-certificates` does, minimised: every libssl/curl/openssl that reads the bundle now trusts the AGX CA without us having to invoke a shell script from PID 1.
3. Symlink /etc/ssl/certs/agx.pem -> the install path so OpenSSL's X509_LOOKUP_hash_dir lookups (the hash-indexed trust dir) also find the cert.
When AGX_MITM_CA_SOURCE doesn't exist (e.g. `agx vm start` bare-spawn with no MITM session), the whole block silently skips: zero behavior change for non-MITM boots.
Before exec'ing the configured entrypoint, init.krun checks for `AGX_NETSTACK_STATIC_IP=<ip>/<prefix>:<gw>[:<dns>]` and configures eth0 via raw ioctls (SIOCSIFADDR, SIOCSIFNETMASK, SIOCSIFFLAGS, SIOCADDRT). Also writes /etc/resolv.conf with the DNS server (defaults to <gw> since the AGX in-process netstack runs the DNS forwarder on the gateway IP). Lets OCI-converted images that don't ship a DHCP client or `ip` binary boot with full network connectivity. No guest-side prelude script needed — the agent's entrypoint sees a fully-configured eth0 from instruction zero.
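A parser for the `AGX_NETSTACK_STATIC_IP=<ip>/<prefix>:<gw>[:<dns>]` value might look like the Rust sketch below. The real code is in init.krun (C) and drives ioctls; this only models the string format. `StaticIp` and `parse_static_ip` are hypothetical names and IPv4-only handling is an assumption.

```rust
use std::net::Ipv4Addr;

#[derive(Debug, PartialEq)]
struct StaticIp {
    addr: Ipv4Addr,
    netmask: Ipv4Addr,
    gateway: Ipv4Addr,
    dns: Ipv4Addr,
}

/// Parse `<ip>/<prefix>:<gw>[:<dns>]`. DNS defaults to the gateway.
fn parse_static_ip(value: &str) -> Option<StaticIp> {
    let mut fields = value.split(':');
    let cidr = fields.next()?;
    let gateway: Ipv4Addr = fields.next()?.parse().ok()?;
    let dns: Ipv4Addr = match fields.next() {
        Some(d) => d.parse().ok()?,
        None => gateway,
    };
    if fields.next().is_some() {
        return None; // trailing fields are rejected
    }

    let mut parts = cidr.splitn(2, '/');
    let addr: Ipv4Addr = parts.next()?.parse().ok()?;
    let prefix: u32 = parts.next()?.parse().ok()?;
    if prefix > 32 {
        return None;
    }
    // Convert the prefix length to the dotted netmask SIOCSIFNETMASK expects.
    let mask = if prefix == 0 { 0 } else { u32::MAX << (32 - prefix) };

    Some(StaticIp { addr, netmask: Ipv4Addr::from(mask), gateway, dns })
}
```

The derived `netmask` is the value a caller would hand to SIOCSIFNETMASK, and `dns` is what would be written to /etc/resolv.conf.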
The HIJACK_INET / HIJACK_UNIX flags do two things at once: gate explicit-AF_TSI socket(46,...) proxy creation in the host-side muxer AND inject `tsi_hijack` into the guest kernel cmdline (so libc routes ALL PF_INET/PF_UNIX through TSI). AGX wants the first behavior but not the second: the in-guest binary needs to keep its TPROXY listener on a normal PF_INET socket while a separate `socket(AF_TSI)` path handles non-mesh egress, so a global hijack would break the divert. Add EXPLICIT_INET (1 << 2) and EXPLICIT_UNIX (1 << 3) bits that allow proxy creation when the guest explicitly requests AF_TSI but DO NOT cause builder.rs to add `tsi_hijack` to the cmdline. With these flags, AGX's MITM VM can run path-mode virtio-net (the pipeline bridge) and AF_TSI side-by-side.
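The split between "allow proxy creation" and "inject `tsi_hijack`" can be expressed as two predicates over the flag word. In this Rust sketch the EXPLICIT bit positions come from the message, while the HIJACK values (1 << 0 and 1 << 1) and all function names are assumptions.

```rust
// Assumed existing bits.
const HIJACK_INET: u32 = 1 << 0;
const HIJACK_UNIX: u32 = 1 << 1;
// New bits, as given in the commit message.
const EXPLICIT_INET: u32 = 1 << 2;
const EXPLICIT_UNIX: u32 = 1 << 3;

/// The host-side muxer may create INET proxies under either flag.
fn allows_inet_proxy(flags: u32) -> bool {
    flags & (HIJACK_INET | EXPLICIT_INET) != 0
}

/// The host-side muxer may create UNIX proxies under either flag.
fn allows_unix_proxy(flags: u32) -> bool {
    flags & (HIJACK_UNIX | EXPLICIT_UNIX) != 0
}

/// Only the HIJACK bits put `tsi_hijack` on the guest kernel cmdline.
fn adds_tsi_hijack(flags: u32) -> bool {
    flags & (HIJACK_INET | HIJACK_UNIX) != 0
}
```

With only `EXPLICIT_INET` set, an in-guest `socket(AF_TSI)` still gets a proxy while ordinary PF_INET sockets, such as a TPROXY listener, stay on the normal network path.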
…INFO
UnixProxy::process_event used to early-return on HANG_UP without reading any data still pending in the kernel buffer. When a host peer does write-then-shutdown (e.g. agxd's per-VM CA-server writes the cert PEM and immediately closes), epoll_wait can return IN | HANG_UP in a single notification; the early-return then sent OP_RST to the guest and discarded the bytes. Guest read() returned EOF with 0 bytes; supervisor's `install_mitm_ca` retried 23+ times and gave up.
Reorder so recv_pkt drains IN first, then HANG_UP handles the close; any bytes still pending land in the guest's RX queue before the RST.
PortOutputLog also drops kernel printk + early-userspace stderr from log::Level::Error to Level::Info. It was never error output; the prior level polluted host stderr at error level on every boot under a default tracing-subscriber EnvFilter.
No description provided.