NO-MERGE: AGX Branch #1

Draft
ShivanshVij wants to merge 37 commits into main from agx

Conversation

@ShivanshVij
Member

No description provided.

ShivanshVij and others added 30 commits May 1, 2026 07:09
Adds the libkrun-fork surface AGX needs to drive snapshot,
finalization-barrier pause, and the cross-process control
socket from outside the libkrun process:

- krun_pause / krun_resume — pause/resume every vCPU thread,
  cross-thread safe, blocks until ack. Preconditions documented.
- krun_get_guest_memory_range — expose host-virtual base + size
  of guest RAM so the streamer can read /proc/<pid>/mem.
- krun_snapshot — serialize vCPU + KVM/VM state to a binary
  artifact. Caller must pause first; double-pause was the
  failure mode discovered during integration.
- VcpuEvent::SaveState — new state-machine arm in the paused()
  state that returns a VcpuState through a one-shot mpsc
  channel, mirroring how Pause/Resume are implemented.
- serialize_full_state — magic + version + per-arch byte layout;
  10884 B for a 1-vCPU guest in the integration test.

x86_64-only for v1; the aarch64 build paths are still vacuous.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds the full restore path that the AGX side (krun_resume_from C
API) needs to bring a snapshotted guest back to life:

- vmm::snapshot module: serialize/deserialize/read_artifact for
  the vCPU + VM state binary. Single source of truth, shared by
  the writer (krun_snapshot) and the reader (build_microvm
  restore branch). Wire format documented in module doc.
- vstate::Vcpu: snapshot_restore_state public wrapper, mirror of
  snapshot_save_state. Calls KVM_SET_* in the documented order
  required after KVM_CREATE_VCPU.
- Vmm: restore_vm_state wrapper for VmState (PIT/CLOCK/IRQCHIP),
  guest_memory_mut accessor for write-side restore.
- VmResources::restore_from + RestoreContext: opt-in flag that
  switches build_microvm to the restore path.
- builder::build_microvm: when restore_from is set, allocate
  guest memory from the snapshot's per-region layout, splat
  memory.bin into the host-virtual mapping, skip choose_payload /
  load_cmdline / configure_system, build vCPUs with kernel_boot
  =false, then apply VcpuStates + VmState before start_vcpus.
- builder::load_snapshotted_memory + read_memory_layout: parse
  the AGXMEM01 layout header (region count + (gpa,len) tuples)
  and stream region bytes into guest memory.
- libkrun::krun_snapshot_memory: dump full guest RAM to a file.
  Layout header followed by concatenated region bytes; the
  restore path reads the same format back.
- libkrun::krun_resume_from: store paths on the context's
  VmResources::restore_from; the next krun_start_enter takes
  the restore path. krun_start_enter additionally skips
  libkrunfw load when restore_from is set.
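
A minimal sketch of reading an AGXMEM01-style layout header (region count followed by `(gpa, len)` tuples). The commit does not state the header encoding, so the 8-byte ASCII magic, the `u64` little-endian fields, and the names `Region` and `parse_layout` are all assumptions made for illustration; the real format lives in `builder::read_memory_layout`.

```rust
/// Assumed 8-byte ASCII magic; the source names the format "AGXMEM01"
/// but does not spell out how the header is encoded.
const MAGIC: &[u8; 8] = b"AGXMEM01";

/// One guest-memory region as described by the layout header.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Region {
    pub gpa: u64,
    pub len: u64,
}

/// Read a little-endian u64 at `off`, or None if the buffer is too short.
fn read_u64_le(buf: &[u8], off: usize) -> Option<u64> {
    let bytes = buf.get(off..off.checked_add(8)?)?;
    let mut raw = [0u8; 8];
    raw.copy_from_slice(bytes);
    Some(u64::from_le_bytes(raw))
}

/// Parse the assumed layout:
/// magic (8 bytes) | region_count (u64 LE) | count x (gpa u64 LE, len u64 LE).
/// Returns the regions and the offset at which the region bodies begin.
pub fn parse_layout(buf: &[u8]) -> Result<(Vec<Region>, usize), String> {
    if buf.len() < 16 || buf[..8] != MAGIC[..] {
        return Err("bad magic or short header".to_string());
    }
    let count = read_u64_le(buf, 8).ok_or("short header")?;
    let mut regions = Vec::new();
    let mut off = 16usize;
    for _ in 0..count {
        let gpa = read_u64_le(buf, off).ok_or("truncated region table")?;
        let len = read_u64_le(buf, off + 8).ok_or("truncated region table")?;
        regions.push(Region { gpa, len });
        off += 16;
    }
    Ok((regions, off))
}
```

The returned offset is where a restore path would start streaming region bytes into guest memory, one region after another.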

Out of scope (deferred): per-device state save/restore for
virtio-blk/net/console/vsock. Devices come up fresh on restore;
whether the guest tolerates that depends on the kernel inside.
For sandbox VMs running stock Linux >= 6.7, fresh device
re-init at the host edge is treated like a hardware reset and
the kernel re-discovers them.

Verified end-to-end: pause + krun_snapshot + krun_snapshot_memory
on guest A, kill A, krun_resume_from on a fresh ctx; the control
socket binds and pause is acked on the restored VM. Roundtrip test
in agx-vmm/tests/snapshot_resume_roundtrip.rs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wires the FFI surface for the CIDR egress policy AGX needs
(plan §5.17). The TSI-side enforcement isn't here yet; the
function returns -ENOSYS until that patch lands. AGX-side
wrapper at agx_vmm::VmConfig::set_egress_policy is wired so
the integration is one line on the day the TSI patch lands.

Doc-string in lib.rs spells out the patch shape (parse JSON
→ store on VmResources → consult in muxer's outbound connect).
Reference: docs/egress-policy.md in the AGX tree.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Returns per-region (gpa_start, length, host_addr) triples. The
gpa/len pairs match the AGXMEM01 layout-header that
krun_snapshot_memory writes; host_addr lets the source's live-
migration code read each region from /proc/<pid>/mem at the
correct host VA (regions are NOT contiguous in host VA — a
typical libkrun setup has 4 regions at scattered addresses).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
krun_set_egress_policy now parses the JSON policy and stores it on
the ctx; the policy threads through VsockDeviceConfig to the muxer
and is attached to every TsiStreamProxy at creation time. The TSI
proxy's connect path consults the policy before issuing the host
kernel connect(); Deny verdicts return ECONNREFUSED to the guest
without touching the host network. Default verdict is Deny.

The JSON parser is hand-rolled (no serde dep added to libkrun)
and accepts the wire format produced by agx_net::EgressPolicy.
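
To make the connect-time check concrete, here is a small sketch of CIDR-based evaluation with a default-deny fallback. The `Rule` and `Policy` shapes and the first-match-wins ordering are assumptions for illustration; only the Allow/Deny verdicts and the default of Deny come from the commit message.

```rust
use std::net::Ipv4Addr;

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Verdict {
    Allow,
    Deny,
}

/// Hypothetical rule shape: an IPv4 network in CIDR form plus a verdict.
pub struct Rule {
    pub network: Ipv4Addr,
    pub prefix_len: u8,
    pub verdict: Verdict,
}

pub struct Policy {
    pub rules: Vec<Rule>,
}

/// True when `addr` falls inside `network/prefix_len`.
fn cidr_contains(network: Ipv4Addr, prefix_len: u8, addr: Ipv4Addr) -> bool {
    let bits = u32::from(prefix_len.min(32));
    // A /0 matches everything; shifting a u32 by 32 would overflow.
    let mask = if bits == 0 { 0 } else { u32::MAX << (32 - bits) };
    (u32::from(network) & mask) == (u32::from(addr) & mask)
}

impl Policy {
    /// First matching rule wins (an assumption); no match yields Deny.
    pub fn evaluate(&self, addr: Ipv4Addr) -> Verdict {
        self.rules
            .iter()
            .find(|rule| cidr_contains(rule.network, rule.prefix_len, addr))
            .map(|rule| rule.verdict)
            .unwrap_or(Verdict::Deny)
    }
}
```

A proxy consulting such a policy would map `Verdict::Deny` to ECONNREFUSED for the guest before any host `connect()` is attempted.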

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds userfaultfd-WP (async) registration on guest memory regions —
required by PAGEMAP_SCAN's PM_SCAN_WP_MATCHING flag, which is the
proper per-range atomic find-dirty-and-rearm primitive (the kernel
silently skips VMAs that aren't userfaultfd_wp_async; see
fs/proc/task_mmu.c::pagemap_scan_test_walk).

Two C APIs:
- krun_set_uffd_wp_enabled(ctx, enabled): sets a flag on the ctx
  config; libkrun auto-registers right after the VMM is built and
  goes into RUNNING_VMMS. Call this BEFORE krun_start_enter.
- krun_register_uffd_wp(ctx): manual registration on a running VMM
  (idempotent). Available in case auto-registration isn't an option.

Implementation opens userfaultfd with WP_ASYNC | WP_UNPOPULATED,
issues UFFDIO_REGISTER + UFFDIO_WRITEPROTECT per region. The fd is
stashed in a static map keyed by ctx_id; closing it would
unregister the VMAs.

Caveat: requires unprivileged userfaultfd
(vm.unprivileged_userfaultfd=1) on the host.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a hook in build_microvm's restore path that runs right
before vCPUs start. Triggered by env
AGX_BOOT_UFFD_MISSING_LAZY=<lazy_offsets_file>:<send_sock>:

  1. Open /dev/userfaultfd (Linux 6.1+ — falls back to syscall)
     with UFFD_FEATURE_THREAD_ID + UFFD_FEATURE_EXACT_ADDRESS.
  2. UFFDIO_REGISTER_MODE_MISSING on every guest memory region.
  3. Read the lazy-offsets file (binary u64-BE absolute offsets
     in the AGXMEM01 body — same format as the wire's
     LazyPageList).
  4. madvise(MADV_DONTNEED) each lazy page so its physical page
     is freed; subsequent guest access faults via uffd-MISSING.
  5. Connect to send_sock (a unix socket the dest controller is
     listening on) and send the uffd fd via SCM_RIGHTS.
  6. Leak the local fd handle so the dup'd fd on the controller
     side stays valid.

The dest controller dups the uffd fd, runs poll(uffd) in one
thread + a bg fetcher in another. A faulting vCPU is woken
when the bg fetcher completes UFFDIO_COPY for its page (after
the fault handler priority-pushes that page to the fetcher).
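
The two inputs to this hook can be sketched as follows: splitting the env value into its two paths and decoding the lazy-offsets file as big-endian `u64` values. The function names are hypothetical, and splitting on the first `:` is an assumption about how the env value is parsed.

```rust
/// Split `<lazy_offsets_file>:<send_sock>` at the first ':' (assumed rule).
pub fn split_env_value(value: &str) -> Option<(&str, &str)> {
    let (file, sock) = value.split_once(':')?;
    if file.is_empty() || sock.is_empty() {
        None
    } else {
        Some((file, sock))
    }
}

/// Decode a lazy-offsets file: a flat sequence of u64 big-endian offsets.
pub fn parse_lazy_offsets(bytes: &[u8]) -> Result<Vec<u64>, String> {
    if bytes.len() % 8 != 0 {
        return Err(format!("length {} is not a multiple of 8", bytes.len()));
    }
    Ok(bytes
        .chunks_exact(8)
        .map(|chunk| {
            let mut raw = [0u8; 8];
            raw.copy_from_slice(chunk);
            u64::from_be_bytes(raw)
        })
        .collect())
}
```

Each decoded offset identifies one page to drop with `madvise(MADV_DONTNEED)` so that the next guest access faults through uffd-MISSING.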

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Extends the egress policy's `EgressVerdict` enum with
`Redirect`, plus a `redirect_target: Option<SocketAddrV4>`
field on `EgressRule` carrying the host-side L4 proxy
endpoint when the verdict is Redirect.

The TSI muxer (separate commit) will honor the verdict by
rewriting the connect destination to `redirect_target` and
writing the 12-byte AGX-L4-redirect header
`[magic_be32][orig_addr_be32][orig_port_be16][pad_be16]`
on the host socket before bridging guest bytes. The
host-side L4 proxy reads that header to recover the
original destination — transparent to the guest.
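
The 12-byte layout can be sketched as a pair of build/parse helpers. The magic value below is a placeholder (the commit does not give the real constant), the function signatures are assumptions, and the padding is assumed to be zero.

```rust
use std::net::Ipv4Addr;

pub const AGX_L4_REDIRECT_HEADER_LEN: usize = 12;
/// Placeholder only: the real magic value is not stated in the source.
pub const AGX_L4_REDIRECT_MAGIC: u32 = 0x4147_5800;

/// Layout: [magic_be32][orig_addr_be32][orig_port_be16][pad_be16].
pub fn build_redirect_header(addr: Ipv4Addr, port: u16) -> [u8; AGX_L4_REDIRECT_HEADER_LEN] {
    let mut header = [0u8; AGX_L4_REDIRECT_HEADER_LEN];
    header[0..4].copy_from_slice(&AGX_L4_REDIRECT_MAGIC.to_be_bytes());
    // `octets()` is already in network (big-endian) order.
    header[4..8].copy_from_slice(&addr.octets());
    header[8..10].copy_from_slice(&port.to_be_bytes());
    // header[10..12] is padding, left as zero (assumed).
    header
}

/// Recover the original destination, rejecting headers with the wrong magic.
pub fn parse_redirect_header(header: &[u8; AGX_L4_REDIRECT_HEADER_LEN]) -> Option<(Ipv4Addr, u16)> {
    let magic = u32::from_be_bytes([header[0], header[1], header[2], header[3]]);
    if magic != AGX_L4_REDIRECT_MAGIC {
        return None;
    }
    let addr = Ipv4Addr::new(header[4], header[5], header[6], header[7]);
    let port = u16::from_be_bytes([header[8], header[9]]);
    Some((addr, port))
}
```

Because the constants are duplicated across two crates, a byte-exact layout test on each side is what keeps the two copies from drifting.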

JSON parser updated to recognize:
- `verdict: "redirect"` (alongside existing "allow" / "deny")
- `redirect_target: "<ip>:<port>"` field on the rule

Mirror of `agx_net::EgressVerdict` + `EgressRule` host-side
shape (in the AGX workspace's separate commit). Constants
`AGX_L4_REDIRECT_MAGIC` / `AGX_L4_REDIRECT_HEADER_LEN` are
duplicated between the two crates because libkrun's devices
crate can't depend on agx_net.

Tests:
- `parses_redirect_verdict`: JSON round-trip for the new
  verdict + target field.
- `redirect_header_layout_is_stable`: byte-exact wire layout
  of the 12-byte header. Anchors the contract between this
  crate and the host-side proxy.

Stage A of AGX task containers#127 (L4 TCP+UDP MITM proxy). Stage B
(TSI muxer wiring) is the next commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When the AGX egress policy returns the new `Redirect`
verdict for an outbound connect:

1. The TSI muxer rewrites the connect destination to the
   rule's `redirect_target` (an IPv4 host:port pointing
   at the host-side L4 proxy).
2. Builds the 12-byte AGX-L4-redirect header from the
   guest's ORIGINAL `(addr, port)`:
       [magic_be32][orig_addr_be32][orig_port_be16][pad_be16]
3. Stashes it in the new `pending_redirect_header` field
   on `TsiStreamProxy`.
4. `switch_to_connected` (the convergence point for both
   sync `connect()=Ok` and async EINPROGRESS→EPOLLOUT
   completion) calls `flush_redirect_header`, which writes
   the 12 bytes blocking on the freshly-connected host
   socket BEFORE any guest-side bytes are bridged.

The host-side L4 proxy reads exactly 12 bytes first to
recover the original destination — the redirect is
transparent to the guest.

v6-destination redirects are not yet supported (the
proxy listens on v4 loopback). Such cases fall through
to ECONNREFUSED with a warning. Redirects are also
TCP-only at the muxer layer for now; the host side
already has UDP forwarder machinery (in agx-l4-proxy)
ready for when UDP support lands.

Stage B of AGX task containers#127. Exposes `build_redirect_header`
+ `EgressPolicy::evaluate_full` from the egress_policy
module so the muxer can build headers and read
redirect_targets in one step.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New libkrun device path that bypasses the imago
file/format-driver stack and dispatches every virtio-blk
read/write/flush to a C callback bundle (`AgxBlkCallbacks`).
Lets external callers — notably AGX's silo Provider chain
in `agx-storage` — back a virtio-blk disk without writing
to a host file or attaching an NBD device.

Wire shape (matches `agx_vmm::ffi::AgxBlkCallbacks` on the
host side):

```c
struct agx_blk_callbacks {
    void *user;
    uint64_t (*size)(void *user);
    int32_t  (*read) (void *user, uint64_t off, uint8_t *buf,       uint32_t len);
    int32_t  (*write)(void *user, uint64_t off, const uint8_t *buf, uint32_t len);
    int32_t  (*flush)(void *user);
};
```

Read/write/flush return 0 on success or -errno on failure.
The `user` pointer is opaque; libkrun stashes it in the
device and passes it as the first argument of every
callback. Caller MUST keep `user` alive for the lifetime
of the libkrun ctx.
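
As an illustration of the contract, the sketch below mirrors the C struct in Rust and backs it with an in-memory `Vec<u8>`. The Rust field layout, the `memory_disk` helper, and the choice to reject out-of-range requests with `-EINVAL` are assumptions; the real host-side type is `agx_vmm::ffi::AgxBlkCallbacks`.

```rust
use std::os::raw::c_void;

/// Rust mirror of `struct agx_blk_callbacks` (assumed field-for-field).
#[repr(C)]
pub struct AgxBlkCallbacks {
    pub user: *mut c_void,
    pub size: extern "C" fn(*mut c_void) -> u64,
    pub read: extern "C" fn(*mut c_void, u64, *mut u8, u32) -> i32,
    pub write: extern "C" fn(*mut c_void, u64, *const u8, u32) -> i32,
    pub flush: extern "C" fn(*mut c_void) -> i32,
}

const EINVAL: i32 = 22;

/// Validate `[off, off + len)` against the disk size.
fn checked_range(disk_len: usize, off: u64, len: u32) -> Option<(usize, usize)> {
    let start = off as usize;
    let end = start.checked_add(len as usize)?;
    if end <= disk_len { Some((start, end)) } else { None }
}

extern "C" fn mem_size(user: *mut c_void) -> u64 {
    // SAFETY: `user` was created from a live `&mut Vec<u8>` in `memory_disk`.
    let disk = unsafe { &*(user as *const Vec<u8>) };
    disk.len() as u64
}

extern "C" fn mem_read(user: *mut c_void, off: u64, buf: *mut u8, len: u32) -> i32 {
    let disk = unsafe { &*(user as *const Vec<u8>) };
    match checked_range(disk.len(), off, len) {
        Some((start, end)) => {
            // SAFETY: the caller promises `buf` is valid for `len` bytes.
            unsafe { std::ptr::copy_nonoverlapping(disk[start..end].as_ptr(), buf, end - start) };
            0
        }
        None => -EINVAL,
    }
}

extern "C" fn mem_write(user: *mut c_void, off: u64, buf: *const u8, len: u32) -> i32 {
    let disk = unsafe { &mut *(user as *mut Vec<u8>) };
    match checked_range(disk.len(), off, len) {
        Some((start, end)) => {
            // SAFETY: the caller promises `buf` is valid for `len` bytes.
            let src = unsafe { std::slice::from_raw_parts(buf, end - start) };
            disk[start..end].copy_from_slice(src);
            0
        }
        None => -EINVAL,
    }
}

extern "C" fn mem_flush(_user: *mut c_void) -> i32 {
    0 // nothing to persist for an in-memory disk
}

/// Build a callback bundle over `disk`; `disk` must outlive every use of it.
pub fn memory_disk(disk: &mut Vec<u8>) -> AgxBlkCallbacks {
    AgxBlkCallbacks {
        user: disk as *mut Vec<u8> as *mut c_void,
        size: mem_size,
        read: mem_read,
        write: mem_write,
        flush: mem_flush,
    }
}
```

The lifetime requirement on `user` shows up here as the raw pointer into `disk`: nothing in the type system keeps the `Vec` alive, so the caller has to.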

New entry points:
- `CallbackStorage` (devices/virtio/block/callback_storage.rs)
  — implements imago's `DynStorage` over the callbacks.
  All async methods box a `Future` that just calls the
  sync C ABI inside an `async` block; no real suspension.
- `Block::new_with_callbacks` — constructs a Block whose
  `disk_image` is a `SyncFormatAccess<CallbackStorage>`
  wrapping the callbacks via the Raw format driver.
- `BlockBuilder::insert_callback`,
  `VmResources::add_callback_block_device` — same plumbing
  as the file path, separate vec.
- `krun_add_disk_callbacks(ctx, block_id, callbacks, ro,
  sync_mode)` — C API.

5 unit tests in callback_storage.rs cover size round-trip,
write→read, flush invocation count, EOF zero-fill, and
errno translation. Workspace `make all` green.

NOT yet wired in this commit (deferred to follow-up
sub-tasks of AGX task containers#128):
- Snapshot/restore: the callback binding gets lost on
  resume_from. The host needs a per-snapshot rebind hook
  to re-attach the Provider after restore.
- NBD/sparse-file removal in agx-storage's `expose/`
  module — switch all consumers to the callback path
  before deleting the old code.

Stage 1-2 of task containers#128. The host-side bridge
(`agx_vmm::blk_callback`) + end-to-end test land in the
parent-repo commit referencing this submodule.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When epoll returns EPOLLIN and EPOLLHUP on the same wakeup,
process_event() was checking HUP first and returning early
without ever calling recv_pkt(). For a request/response
upstream that writes then immediately closes (HTTP
Connection: close, busybox-nc, etc.), the response bytes
are still pending on the host socket but the guest gets
RST without seeing them.

Drain the data first when HANG_UP fires on a Connected
proxy, then push the RST. The drained bytes land in the
RX ring ahead of the RST so the guest receives data
before the close. Fix ported from smolvm.

The pre-existing else-branch also had a latent bug: the
remove_proxy = (status == Listening) check ran AFTER
status was already overwritten to Closed, so the result
was always Deferred. Capture was_listening before mutation.
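
The ordering fix can be modelled without any real sockets. The sketch below is a simplified stand-in for `process_event()`: the `Proxy`, `Status`, and `Action` types are hypothetical, and only the two behaviours from this commit are modelled (drain before RST on a Connected proxy, and capturing `was_listening` before the status is overwritten).

```rust
pub const EPOLLIN: u32 = 0x001;
pub const EPOLLHUP: u32 = 0x010;

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Status {
    Listening,
    Connected,
    Closed,
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Action {
    RecvPkt,
    PushRst,
}

pub struct Proxy {
    pub status: Status,
}

impl Proxy {
    /// Returns the ordered actions and whether the proxy should be removed.
    pub fn process_event(&mut self, events: u32) -> (Vec<Action>, bool) {
        let mut actions = Vec::new();
        if (events & EPOLLHUP) != 0 {
            // Capture BEFORE the status is overwritten to Closed below.
            let was_listening = self.status == Status::Listening;
            if self.status == Status::Connected {
                // Drain pending bytes first so they precede the RST.
                actions.push(Action::RecvPkt);
                actions.push(Action::PushRst);
            }
            self.status = Status::Closed;
            return (actions, was_listening);
        }
        if (events & EPOLLIN) != 0 {
            actions.push(Action::RecvPkt);
        }
        (actions, false)
    }
}
```

Checking the status after the assignment to `Closed` would make `was_listening` permanently false, which is the latent bug described above.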

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
For W2-4.1 of agx's S3-aware live-migration destination.
Different shape from the existing AGX_BOOT_UFFD_MISSING_LAZY:

- **Skip the bulk read of memory.bin into guest memory**.
  When AGX_BOOT_UFFD_MISSING_ALL is set, the restore path
  bypasses `load_snapshotted_memory`. memory.bin still has
  to exist and have a valid AGXMEM01 layout header (so
  `read_memory_layout` succeeds), but its body is not
  consumed — guest pages start as fresh anonymous mappings.
  The controller will populate them via UFFDIO_COPY as
  faults arrive (synchronous fetch from peer P2P stream or
  S3) and as the bg fetcher proactively pulls.

- **Wait for a controller ready-byte before starting vCPUs.**
  After sending the uffd fd via SCM_RIGHTS,
  register_uffd_missing_all blocks on a one-byte read on the
  same socket. The controller sends the byte AFTER it has
  spawned its fault handler thread. Without this gate, the
  vCPU would start before the handler is wired up, fault
  immediately, and stall the kernel.

Env value: `AGX_BOOT_UFFD_MISSING_ALL=<send_sock_path>` (no
lazy file — every page is treated as needing controller-side
population). MADV_DONTNEED is still issued per region as
defense-in-depth in case `load_snapshotted_memory` ran (it
shouldn't when MISSING_ALL is set, but skipping the load is
gated on env-var presence, which a future caller could
forget to set; DONTNEED makes the post-state idempotent).
Both register_uffd_missing_all (ALL) and register_uffd_missing_lazy
(LAZY) now write the guest-memory region table on the same Unix
stream they sent the uffd fd over, BEFORE blocking on the
controller's ack-byte.

Wire format (after the SCM_RIGHTS fd):
  u32 BE region_count
  per region: u64 BE gpa, u64 BE host_addr, u64 BE len

Lets the controller side build its PostCopyRegion table without
querying GuestMemoryLayout, which previously couldn't be queried
until AFTER the ack-byte (because the VMM enters RUNNING_VMMS
post-ack). With regions in hand pre-ack, the controller can run
PR #20a's prefill phase (S3 / P2P pulls) while libkrun is still
blocked at read(ack-byte).
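
A sketch of the region-table wire format described above (`u32` BE count, then `gpa`, `host_addr`, `len` as `u64` BE per region). The `RegionEntry` type and function names are hypothetical, and the decoder works on an in-memory buffer rather than reading from the Unix stream.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct RegionEntry {
    pub gpa: u64,
    pub host_addr: u64,
    pub len: u64,
}

const ENTRY_LEN: usize = 24; // three u64 fields

/// Serialize: u32 BE region_count, then gpa/host_addr/len as u64 BE each.
pub fn encode_region_table(regions: &[RegionEntry]) -> Vec<u8> {
    let mut out = Vec::with_capacity(4 + regions.len() * ENTRY_LEN);
    out.extend_from_slice(&(regions.len() as u32).to_be_bytes());
    for region in regions {
        out.extend_from_slice(&region.gpa.to_be_bytes());
        out.extend_from_slice(&region.host_addr.to_be_bytes());
        out.extend_from_slice(&region.len.to_be_bytes());
    }
    out
}

fn be_u64(bytes: &[u8]) -> u64 {
    let mut raw = [0u8; 8];
    raw.copy_from_slice(bytes);
    u64::from_be_bytes(raw)
}

/// Decode a complete table; None if the length disagrees with the count.
pub fn decode_region_table(buf: &[u8]) -> Option<Vec<RegionEntry>> {
    if buf.len() < 4 {
        return None;
    }
    let count = u32::from_be_bytes([buf[0], buf[1], buf[2], buf[3]]) as usize;
    let body = &buf[4..];
    if body.len() != count.checked_mul(ENTRY_LEN)? {
        return None;
    }
    Some(
        body.chunks_exact(ENTRY_LEN)
            .map(|entry| RegionEntry {
                gpa: be_u64(&entry[0..8]),
                host_addr: be_u64(&entry[8..16]),
                len: be_u64(&entry[16..24]),
            })
            .collect(),
    )
}
```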
Adds a public C API knob for setting the CPU template that
filter_cpuid applies at vCPU configure time. Maps:

  "T2"        → CpuFeaturesTemplate::T2 (Cascade-Lake-equivalent)
  "C3"        → CpuFeaturesTemplate::C3 (Cascade-Lake minus AVX)
  "host", NULL → None (host CPU passthrough)

Stored on VmResources via a new public method
`set_cpu_template(Option<CpuFeaturesTemplate>)`. Unknown
template names → -EINVAL. Unknown ctx_id → -ENOENT.
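
The name-to-template mapping can be sketched as a single match. The function name and the use of `Option<&str>` to stand in for a possibly-NULL C string are assumptions; the `-ENOENT` path for an unknown ctx_id is outside this sketch.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum CpuFeaturesTemplate {
    T2,
    C3,
}

pub const EINVAL: i32 = 22;

/// Map a template name to the stored setting.
/// `None` models a NULL pointer passed through the C API.
pub fn parse_cpu_template(name: Option<&str>) -> Result<Option<CpuFeaturesTemplate>, i32> {
    match name {
        None | Some("host") => Ok(None), // host CPU passthrough
        Some("T2") => Ok(Some(CpuFeaturesTemplate::T2)),
        Some("C3") => Ok(Some(CpuFeaturesTemplate::C3)),
        Some(_) => Err(-EINVAL),
    }
}
```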

Required by AGX PR containers#22's --cpu-template flag, which lets the
sandbox + harness VMs share a stable CPU view across hosts
for migration safety.

Header declaration added to include/libkrun.h.
The TCP path consults the egress policy at connect-time:
Deny → ECONNREFUSED, Redirect → rewrite to redirect_target +
queue the 12-byte AGX-L4-redirect header for the post-connect
flush, Allow → fall through.

UDP had no equivalent — the host-side proxy in agx-l4-proxy
was already implemented but every redirected datagram fell
through libkrun's UDP path verbatim, never reaching it.

This commit adds the parallel UDP path:

- TsiDgramProxy gains `egress_policy` and
  `pending_redirect_header` fields plus `set_egress_policy`,
  matching tsi_stream.rs.
- `sendto_addr` consults the policy. Deny drops the address
  silently (UDP has no error channel back to the guest like
  TCP's ECONNREFUSED — the closest fit is "subsequent
  sendto_data calls become no-ops"). Redirect rewrites the
  destination to the rule's redirect_target and queues the
  12-byte header recording the original dest.
- `sendto_data` flushes the pending header as a standalone
  datagram to the rewritten target before sending the
  guest's payload. The host-side L4 proxy parses exactly 12
  bytes from the first datagram of each new flow to recover
  the original destination — this matches what
  agx-l4-proxy/src/udp.rs already expected.
- Muxer hands the policy Arc to TsiDgramProxy at create
  time (mirrors line 328's TCP wiring at line 352-368).
- tsi_stream::sockaddr_to_addr_port goes from private to
  pub(super) so tsi_dgram can reuse it.

Only IPv4 redirects are supported on the UDP path (same as
TCP); v6 dst with redirect verdict drops with a warn!.
The prior commit covered the sendto(addr) explicit-dest UDP
path. nc -u (and most UDP clients) actually use connect()
to record the default peer, then send() — which the af_tsi
guest module dispatches via vsock as TSI_CONNECT then
DGRAM_RW (no per-datagram TSI_SENDTO_ADDR).

Mirrors the connect-time policy enforcement from
tsi_stream.rs into TsiDgramProxy::connect:

- Deny  → push ConnResponse(-ECONNREFUSED) back to the
          guest, no host-side connect.
- Redirect → build the 12-byte AGX header from the
             original (orig_v4, port), queue
             pending_redirect_header, replace the host-side
             connect target with the rule's redirect_target.
- Allow → fall through.

sendmsg (the connected-UDP send path) flushes the pending
header as a standalone datagram before the guest's payload
— same shape as the existing sendto_data + tsi_stream
flush_redirect_header logic.

With this, both sendto(udp_fd, buf, dst) and
connect(udp_fd, dst) + send(udp_fd, buf) patterns exercise
the redirect.
V1-M1 of the AGX product reshape drops the `agx-` prefix on
internal AGX library crates (per `references/product.md` §3.1).
The AGX-side type that this libkrun-local mirror tracks is now
`net::EgressPolicy` (was `agx_net::EgressPolicy`); update the
"// Mirror of …" comments here so the cross-submodule reference
stays accurate. No behaviour change — comments only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
libkrun's set_exit_code() reports the entrypoint's exit
status to the host via an ioctl on the virtiofs root. The
ioctl is only meaningful when `/` IS virtiofs — and after
the krun_set_root_disk_remount path pivots from virtiofs to
the configured block device, `/` becomes ext4 (or whatever
the disk is) and is_virtiofs("/") returns 0. The function
silently no-ops; the host process sees exit code 0
regardless of what the entrypoint returned.

This is fine when the rootfs is virtiofs (every existing
libkrunfw consumer until now). It breaks when the rootfs is
a sparse ext4 disk image — exactly the path AGX V1-M3.5
takes for production VMs (the disk being the output of
oci-converter).

Fix: save a fd to the original virtiofs `/` BEFORE the
block-device pivot. The underlying virtiofs mount stays
alive as long as the fd is open, even though the mountpoint
moves out from under it; ioctl on the saved fd works
identically post-pivot. set_exit_code() prefers the saved fd
when present, else falls back to the original
"open `/` if virtiofs" path so the directory-rootfs
behaviour is unchanged.

The patch is in three small places:

1. Module-level `int agx_exit_code_fd = -1` and a helper
   `agx_save_virtiofs_root_fd()` that opens `/` once and
   stashes the fd.
2. The remount path calls the helper after `mkdir
   /newroot` and before the actual `try_mount` of
   /dev/vda — at this point `/` is still virtiofs.
3. set_exit_code() prefers `agx_exit_code_fd` over its old
   "is_virtiofs check + fresh open" path; both branches
   end with the same ioctl call.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
krun_start_enter previously processed regular block devices first
and callback-backed disks second, regardless of host-side add order.
Tests that exercise the Provider→callback path expect their
test-injected callback disk on /dev/vda; with the previous order,
any regular block device (e.g. a rootfs disk via
krun_set_root_disk_remount) would steal /dev/vda and push the
callback disk to /dev/vdb.

Swapping the loops makes callback disks land on lower /dev/vd<N>
slots, restoring the test contract. boot_alpine's rootfs-last
disk-attach pattern relies on this to compute the rootfs's final
device path correctly.
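
A toy model of the resulting ordering, assuming the guest names virtio-blk devices in attach order and that there are at most 26 of them. `assign_device_paths` is a hypothetical helper, not part of libkrun.

```rust
/// Callback-backed disks are attached first, then regular block devices.
/// Returns (block_id, guest device path) pairs in attach order.
pub fn assign_device_paths(callback_ids: &[&str], regular_ids: &[&str]) -> Vec<(String, String)> {
    callback_ids
        .iter()
        .chain(regular_ids.iter())
        .enumerate()
        .map(|(index, id)| {
            assert!(index < 26, "this sketch only names vda..vdz");
            let letter = (b'a' + index as u8) as char;
            (id.to_string(), format!("/dev/vd{}", letter))
        })
        .collect()
}
```

With one callback disk and one regular rootfs disk, the callback disk takes `/dev/vda` and the rootfs lands on `/dev/vdb`, which is the rootfs-last pattern mentioned above.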
agxd wraps the rootfs in a FileStorage→DirtyTracker→callback
chain so snapshot finalize can harvest the per-snap disk dirty
bitmap (used for diff chains). The remount API previously
refused this layout because the empty-block-cfgs check looked
only at regular `add_disk` entries. Loosen it to also allow
`add_disk_callbacks` rootfs.

One-line guard change. No behavior change for the regular-disk
path (block_cfgs still populated → check passes as before).
The streaming-restore path registers uffd-MISSING on every guest VMA in
register_uffd_missing_all. Kernel uffd is one-fd-per-VMA: the follow-up
auto-registration for uffd-WP returned EBUSY because the same VMA was
already owned by the missing-mode fd.

Skip the WP registration on that path with a logged warn. The
stream-restored VM is not snapshot-able until cold-restarted (snapshot
needs uffd-WP for PAGEMAP_SCAN); operator can stop+start to clear.

Pairs with agxd's new uffd_pager module that owns the missing-mode fd.
The prior order was vcpu_states + restore_vm_state → register_uffd_missing
→ start_vcpus. With AGX_BOOT_UFFD_MISSING_ALL set, restore_vm_state's
reads of guest memory (virtio queue heads, IRQCHIP descriptors) hit
post-DONTNEED zero pages because uffd-missing wasn't wired yet.

Move register_uffd_missing_all to BEFORE restore_vm_state when the env
var is set. The agxd-side controller has the fault thread serving by
the time it sends the ready-byte, so libkrun's read on the same stream
unblocks only after the pager is responsive — restore_vm_state can
fault-in pages on demand from the chain.

The lazy variant (AGX_BOOT_UFFD_MISSING_LAZY) keeps its post-restore
position because it relies on read_full_memory having loaded the full
image first (only the lazy subset gets DONTNEED'd).

Pairs with agxd's StartArgs.console_file plumbing so the post-resume
guest kernel output is recoverable for further debugging.
The agxd continuous-checkpoint streamer's PAGEMAP_SCAN with
PM_SCAN_WP_MATCHING requires the VMM's guest-memory VMAs to
have userfaultfd-WP armed in WP_ASYNC mode. If the VMM is
visible via RUNNING_VMMS (which the control-sock RPCs look up
through) before uffd-wp is registered, the streamer can race
ahead and PAGEMAP_SCAN returns EACCES against unarmed VMAs on
Yama-restricted kernels.

Fix the ordering: extract the registration body into
register_uffd_wp_with_vmm(ctx_id, &vmm) so krun_start_enter
can call it BEFORE RUNNING_VMMS.lock().insert(...). The
external krun_register_uffd_wp(ctx_id) entry point is kept
unchanged.
AGX-side investigation revealed two separate symptoms tracked at
agxd containers#287 and containers#268:

1. Resumed VM exits cleanly within ~250ms with KVM_EXIT_SHUTDOWN
   and zero console output (triple-fault on first instruction).
2. Long hot-pause (krun_pause for 80+ seconds) trips the guest's
   RCU watchdog on resume.

Both have the same root cause: the guest's pvclock thinks no time
has passed (because KVM faithfully restores tsc_to_system_mul +
system_time_at_snapshot) BUT the wall clock has advanced. On the
next clock-update, the guest's RCU/softlockup watchdogs see "I
haven't ticked for X seconds" → panic → triple-fault → SHUTDOWN.

Fix: call KVM_KVMCLOCK_CTRL per-vcpu at the end of restore_state
and on the Pause→paused transition. The ioctl sets
PVCLOCK_GUEST_STOPPED in the next-emitted pvclock structure;
Linux's guest pvclock driver clears its time-since-last-tick
counters when it sees the flag. Same fix firecracker applies in
arch/x86_64/vcpu.rs::restore_state (line 715). Best-effort —
EINVAL when the guest didn't activate kvmclock is ignored.

Also pin krun-devices's rand back to 0.9.2: the dependabot bump
to 0.10.1 (commit b82735c on main) didn't update call sites,
breaking the build with two API errors. The agx branch already
had source code working with rand 0.9.2 from a prior commit.

Co-author: K. Pivklock (kvmclock_ctrl analysis)
Adds VcpuState::tsc_khz (Option<u32>) and bumps the snapshot
wire format to v2:
- save_state: KVM_GET_TSC_KHZ best-effort. Hosts that don't
  support it leave the field None.
- restore_state: when both saved and host TSC frequencies
  are known and differ by >250 ppm (firecracker's tolerance),
  call KVM_SET_TSC_KHZ to scale the vcpu. On same-host
  restore (V1 default) the values match and this is a no-op.
- snapshot::serialize: writes the 1-byte present flag + u32
  LE value after the cpuid block per vcpu.
- snapshot::deserialize: backwards-compatible — accepts both
  format v1 (no tsc_khz) and v2 (with). v1 maps to None.
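
The tolerance check and the per-vCPU field encoding can be sketched as below. Measuring the deviation relative to the host frequency, and always emitting four value bytes (zeroed when absent), are assumptions; the commit only specifies the 250 ppm threshold and the "1-byte present flag + u32 LE" shape.

```rust
pub const TSC_TOLERANCE_PPM: u64 = 250;

/// True when both frequencies are known and differ by more than 250 ppm.
/// The deviation is measured relative to the host frequency (assumed).
pub fn tsc_needs_scaling(saved_khz: Option<u32>, host_khz: Option<u32>) -> bool {
    match (saved_khz, host_khz) {
        (Some(saved), Some(host)) if host != 0 => {
            let diff = (i64::from(saved) - i64::from(host)).unsigned_abs();
            // diff / host > 250 / 1_000_000, rearranged to stay in integers.
            diff * 1_000_000 > TSC_TOLERANCE_PPM * u64::from(host)
        }
        _ => false,
    }
}

/// Encode the optional field: present flag, then the value as u32 LE.
/// Writing zeroed value bytes when absent is an assumption of this sketch.
pub fn encode_tsc_khz(tsc_khz: Option<u32>) -> [u8; 5] {
    let mut out = [0u8; 5];
    if let Some(khz) = tsc_khz {
        out[0] = 1;
        out[1..5].copy_from_slice(&khz.to_le_bytes());
    }
    out
}
```

On a same-host restore both values match, the check returns false, and no `KVM_SET_TSC_KHZ` call would be made.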

Required for V1.1 cross-host live migration where source and
destination hosts may have different TSC frequencies. Mirror
of firecracker's arch/x86_64/vcpu.rs::restore_state TSC
handling. Closes the latent gap behind agx-side task containers#287.
Removed `quiet` from DEFAULT_KERNEL_CMDLINE and added
`loglevel=8 earlyprintk=keep panic_print=0xff` so guest kernel
printk reaches the host's `console_output` file. AGX surfaces
this via per-VM `console.log` for `agx daemon logs <vm>` style
tooling.

Caveat: x86_64 libkrun has no 8250 serial emulation, so
`earlyprintk` has nowhere to go before virtio-console
initializes — early panics during init.krun are still silent.
Tracked as a separate AGX-side todo (8250 emulation or
libkrun-internal tracing).

Upstream keeps `quiet` because libkrun's CLI users typically
don't route console output anywhere, so the chatter is wasted.
…el_path

Three changes in this commit:

1. DEFAULT_KERNEL_CMDLINE adds:
   - earlycon=uart8250,io,0x3f8,115200 — register an early console
     against COM1 from the kernel's first printk
   - earlyprintk=ttyS0,115200,keep — older x86 driver fallback
   - console=ttyS0,115200 — keep ttyS0 active alongside hvc0
   - ignore_loglevel — never filter messages out of the early window

2. builder.rs autoconfigure_console_ports + serial fallback:
   - When console_output is set and no explicit serial is configured
     and disable_implicit_console is false, attach a sink-input +
     append-write-to-file output Serial at COM1. Pairs with the
     earlycon= cmdline so kernel boot output reaches console.log
     before virtio-console probes.
   - Replace File::create with OpenOptions append-only so the
     virtio-console handle does NOT truncate the early-boot bytes
     written by the COM1 serial earlier in build_microvm.

3. krun_set_kernel_path C API + KrunfwBindings::from_path:
   - Take an absolute .kernel-file path and dlopen it instead of
     relying on the SONAME-resolved libkrunfw.so.5. Lets agxd
     ship multiple kernel artifacts (e.g. linux-6.12.76.kernel)
     and pick per-VM via krun_set_kernel_path.
   - init/init.c gains a krun_trace() helper that writes
     "[init.krun] msg" to /dev/console; an opening trace marker
     fires at main() entry to surface the silent window between
     kernel start and config-parsing.
Signed-off-by: Shivansh Vij <shivanshvij@loopholelabs.io>
Earlier commit 1dfba26 unconditionally appended:
  earlycon=uart8250,io,0x3f8,115200 console=ttyS0,115200
  earlyprintk=ttyS0,115200,keep loglevel=8 ignore_loglevel

That broke the integration tests that don't set
console_output: with `earlycon=` in the cmdline but no
matching 8250 emulation behind 0x3f8 (the AGX implicit COM1
fallback only runs when console_output is set), the kernel
poked the I/O port repeatedly with no response, and the
shutdown path no longer cleanly drove KVM_EXIT_SHUTDOWN —
vm-launch hung instead of exiting after `reboot: machine
restart`.

Fix: revert DEFAULT_KERNEL_CMDLINE to the pristine
pre-1dfba26 form (just `panic_print=0xff` over upstream),
and append `earlycon=uart8250,io,0x3f8,115200` from
build_microvm() only when an implicit COM1 is actually
about to be wired (console_output is Some).
agxd / vmm wire extra virtiofs shares (e.g. <vm-dir>/host-share
for the per-session MITM CA cert) and pass tag→mountpoint pairs
via the AGX_VIRTIOFS_MOUNTS env var. init.krun parses it and
mounts each share before exec'ing the entrypoint, so the
guest entrypoint never has to know about them — systemd or
busybox-sh both see /.agx-host/ca.pem as just a regular file
on a pre-existing mount.

Format: AGX_VIRTIOFS_MOUNTS=<tag>:<mountpoint>[,<tag>:<mountpoint>…].
Empty / missing env: no shares, no behavior change.

Best-effort error handling: bad entries log via krun_trace +
perror but never abort init — a missing share is preferable
to a wedged guest.
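
The env format and its fail-soft handling can be sketched as a parser that skips malformed entries instead of failing. The real parsing happens in C inside init.krun; this Rust version, the function name, and the rule that a mountpoint must be an absolute path are illustrative assumptions.

```rust
/// Parse `<tag>:<mountpoint>[,<tag>:<mountpoint>...]`.
/// Malformed entries are skipped rather than treated as fatal.
pub fn parse_virtiofs_mounts(value: &str) -> Vec<(String, String)> {
    value
        .split(',')
        .filter_map(|entry| {
            let (tag, mountpoint) = entry.split_once(':')?;
            // Assumed validity rule: non-empty tag, absolute mountpoint.
            if tag.is_empty() || !mountpoint.starts_with('/') {
                return None;
            }
            Some((tag.to_string(), mountpoint.to_string()))
        })
        .collect()
}
```

An empty value yields an empty list, which matches the "no shares, no behavior change" case.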

This replaces docs/TODO.md §1's env-based CA delivery, which
overflowed Linux's hardcoded `COMMAND_LINE_SIZE=2048` (the
base64-encoded cert plus the rest of the cmdline tripped
__fortify_panic in the kernel before init even ran).
Closes the docs/TODO §1 follow-up: supervisor's install_mitm_ca only
runs in supervisor mode, so systemd-as-PID-1 images had the cert
mounted at /.agx-host/ca.pem but never installed at the system trust
store. Doing the install in init.krun before exec'ing the entrypoint
covers every init mode (systemd, agx-supervisor, /bin/sh) uniformly.

Steps (all best-effort, all fail-soft):

1. If AGX_MITM_CA_SOURCE (default /.agx-host/ca.pem) is readable,
   copy it to AGX_MITM_CA_PATH (default
   /usr/local/share/ca-certificates/agx.crt). mkdir-p the parent.

2. If /etc/ssl/certs/ca-certificates.crt exists and doesn't already
   contain "AGX Session CA" (the AGX cert's CN), append the cert
   bytes to that bundle. This is what `update-ca-certificates`
   does, minimised — every libssl/curl/openssl that reads the
   bundle now trusts the AGX CA without us having to invoke a
   shell script from PID 1.

3. Symlink /etc/ssl/certs/agx.pem -> the install path so OpenSSL's
   X509_LOOKUP_hash_dir lookups (the hash-indexed trust dir) also
   find the cert.

When AGX_MITM_CA_SOURCE doesn't exist (e.g. `agx vm start` bare-
spawn with no MITM session), the whole block silently skips —
zero behavior change for non-MITM boots.
Before exec'ing the configured entrypoint, init.krun
checks for `AGX_NETSTACK_STATIC_IP=<ip>/<prefix>:<gw>[:<dns>]`
and configures eth0 via raw ioctls (SIOCSIFADDR,
SIOCSIFNETMASK, SIOCSIFFLAGS, SIOCADDRT). Also writes
/etc/resolv.conf with the DNS server (defaults to <gw>
since the AGX in-process netstack runs the DNS forwarder
on the gateway IP).
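
A sketch of parsing the `AGX_NETSTACK_STATIC_IP` value, including the DNS-defaults-to-gateway rule. The struct and function names are hypothetical, IPv4-only handling is an assumption, and the real code is C in init.krun driving the ioctls listed above.

```rust
use std::net::Ipv4Addr;

#[derive(Debug, PartialEq)]
pub struct StaticIpConfig {
    pub ip: Ipv4Addr,
    pub prefix_len: u8,
    pub netmask: Ipv4Addr,
    pub gateway: Ipv4Addr,
    pub dns: Ipv4Addr,
}

/// Parse `<ip>/<prefix>:<gw>[:<dns>]`; returns None on any malformed field.
pub fn parse_static_ip(value: &str) -> Option<StaticIpConfig> {
    let mut parts = value.split(':');
    let cidr = parts.next()?;
    let gateway = parts.next()?.parse::<Ipv4Addr>().ok()?;
    // DNS defaults to the gateway when omitted.
    let dns = match parts.next() {
        Some(text) => text.parse::<Ipv4Addr>().ok()?,
        None => gateway,
    };
    if parts.next().is_some() {
        return None; // trailing fields are rejected
    }
    let (ip_text, prefix_text) = cidr.split_once('/')?;
    let ip = ip_text.parse::<Ipv4Addr>().ok()?;
    let prefix_len = prefix_text.parse::<u8>().ok()?;
    if prefix_len > 32 {
        return None;
    }
    // Derive the netmask that SIOCSIFNETMASK would be given.
    let mask = if prefix_len == 0 { 0 } else { u32::MAX << (32 - u32::from(prefix_len)) };
    Some(StaticIpConfig { ip, prefix_len, netmask: Ipv4Addr::from(mask), gateway, dns })
}
```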

Lets OCI-converted images that don't ship a DHCP client
or `ip` binary boot with full network connectivity. No
guest-side prelude script needed — the agent's
entrypoint sees a fully-configured eth0 from instruction
zero.
The HIJACK_INET / HIJACK_UNIX flags do two things at once:
gate explicit-AF_TSI socket(46,...) proxy creation in the
host-side muxer AND inject `tsi_hijack` into the guest
kernel cmdline (so libc routes ALL PF_INET/PF_UNIX through
TSI). AGX wants the first behavior but not the second:
the in-guest binary needs to keep its TPROXY listener on a
normal PF_INET socket while a separate `socket(AF_TSI)`
path handles non-mesh egress, so a global hijack would
break the divert.

Add EXPLICIT_INET (1 << 2) and EXPLICIT_UNIX (1 << 3) bits
that allow proxy creation when the guest explicitly
requests AF_TSI but DO NOT cause builder.rs to add
`tsi_hijack` to the cmdline. With these flags, AGX's
MITM VM can run path-mode virtio-net (the pipeline
bridge) and AF_TSI side-by-side.
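
The split between the two behaviours can be sketched with plain bit tests. The `EXPLICIT_*` values come from the commit; the `HIJACK_*` values of `1 << 0` and `1 << 1` and the helper names are assumptions for illustration.

```rust
pub const HIJACK_INET: u32 = 1 << 0; // assumed value
pub const HIJACK_UNIX: u32 = 1 << 1; // assumed value
pub const EXPLICIT_INET: u32 = 1 << 2;
pub const EXPLICIT_UNIX: u32 = 1 << 3;

/// Host-side muxer: may an explicit AF_TSI inet proxy be created?
pub fn allows_inet_proxy(flags: u32) -> bool {
    (flags & (HIJACK_INET | EXPLICIT_INET)) != 0
}

/// Host-side muxer: may an explicit AF_TSI unix proxy be created?
pub fn allows_unix_proxy(flags: u32) -> bool {
    (flags & (HIJACK_UNIX | EXPLICIT_UNIX)) != 0
}

/// Builder: should `tsi_hijack` be added to the guest kernel cmdline?
/// Only the HIJACK_* bits trigger this; EXPLICIT_* bits never do.
pub fn injects_tsi_hijack(flags: u32) -> bool {
    (flags & (HIJACK_INET | HIJACK_UNIX)) != 0
}
```

With only `EXPLICIT_INET` set, the proxy is permitted but the cmdline is left alone, so ordinary PF_INET sockets in the guest keep their normal behaviour.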
…INFO

UnixProxy::process_event used to early-return on HANG_UP without
reading any data still pending in the kernel buffer. When a host
peer does write-then-shutdown (e.g. agxd's per-VM CA-server writes
the cert PEM and immediately closes), epoll_wait can return
IN | HANG_UP in a single notification — the early-return then sent
OP_RST to the guest and discarded the bytes. Guest read() returned
EOF with 0 bytes; supervisor's `install_mitm_ca` retried 23+ times
and gave up. Reorder so recv_pkt drains IN first, then HANG_UP
handles the close — any bytes still pending land in the guest's RX
queue before the RST.

PortOutputLog also drops kernel printk + early-userspace stderr
from log::Level::Error to Level::Info. It was never error output;
the prior level polluted host stderr at error level on every boot
under a default tracing-subscriber EnvFilter.