| 1 | +--- |
| 2 | +author: Dan McDonald <[email protected]> |
| 3 | +state: predraft |
| 4 | +discussion: https://github.com/TritonDataCenter/rfd/issues?q=%22RFD+185%22 |
| 5 | +--- |
| 6 | + |
| 7 | +<!-- |
| 8 | + This Source Code Form is subject to the terms of the Mozilla Public |
| 9 | + License, v. 2.0. If a copy of the MPL was not distributed with this |
| 10 | + file, You can obtain one at http://mozilla.org/MPL/2.0/. |
| 11 | +--> |
| 12 | + |
| 13 | +<!-- |
| 14 | + Copyright 2025 MNX Cloud, Inc. |
| 15 | +--> |
| 16 | + |
| 17 | +# RFD 185 VMM Memory Reservoir Use by SmartOS and Triton |
| 18 | + |
| 19 | + |
| 20 | +## Problem statement |
| 21 | + |
| 22 | +The VMM memory reservoir landed in illumos-gate in early 2021 with |
| 23 | +[illumos#13833](https://illumos.org/issues/13833), and has been enhanced with |
| 24 | +[illumos#15372 - want better sizing interface for vmm |
| 25 | +reservoir](https://www.illumos.org/issues/15372). It provides a reserved pool |
| 26 | +of kernel memory from which hardware virtual machines (e.g. BHYVE) can |
| 27 | +allocate without contending with other kernel memory consumers such as the |
| 28 | +ZFS ARC. |
| 29 | + |
| 30 | +Currently SmartOS does not attempt to use the memory reservoir, nor does it |
| 31 | +signal its availability in a way that can be exploited by Triton. Reservoir |
| 32 | +use is chosen per VM, at bhyve(8) invocation time. If we wish to use the |
| 33 | +reservoir, we need to signal at each bhyve(8) invocation whether or not to |
| 34 | +use it. That signaling will need to percolate up the SmartOS and Triton |
| 35 | +stack to one or more of: |
| 36 | + |
| 37 | +- vmadm(8) |
| 38 | + |
| 39 | +- VMAPI |
| 40 | + |
| 41 | +Also, the size of the reservoir for a physical machine is something to be |
| 42 | +abstracted up the SmartOS and Triton stack to one or more of: |
| 43 | + |
| 44 | +- GZ configuration (perhaps the `zones` service?). |
| 45 | + |
| 46 | +- CNAPI |
| 47 | + |
| 48 | +Beyond presentation of this new feature, there are several hazards and |
| 49 | +trade-offs that must be addressed before it is available to SmartOS or Triton |
| 50 | +users. |
| 51 | + |
| 52 | +## Hazards and Trade-offs |
| 53 | + |
| 54 | +### HAZARD: Available Machine Memory and Other Consumers |
| 55 | + |
| 56 | +The VMM reservoir design currently treats all physical memory, less 120% of |
| 57 | +the unlockable page count, as POSSIBLY available for the reservoir. The |
| 58 | +remainder MAY be insufficient for normal machine operations. |
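To make the arithmetic concrete, here is a minimal sketch of the limit as the sentence above describes it. It is a reading of this RFD's wording, not the kernel's actual code; the function names and the choice to round the discount up are assumptions made for illustration.

```python
def possible_reservoir_pages(physmem_pages: int, unlockable_pages: int) -> int:
    """Pages POSSIBLY available to the reservoir: physical memory less
    120% of the unlockable page count, never below zero.

    Hypothetical helper; rounding the discount up is an assumption.
    """
    discount = (unlockable_pages * 120 + 99) // 100  # 120%, rounded up
    return max(0, physmem_pages - discount)


def pages_left_for_others(physmem_pages: int, unlockable_pages: int) -> int:
    """Pages left for the ARC, driver DMA buffers, and every other consumer
    if the reservoir were sized all the way up to the limit."""
    limit = possible_reservoir_pages(physmem_pages, unlockable_pages)
    return physmem_pages - limit
```

With the reservoir sized to the full limit, everything else on the machine has to fit in the discounted remainder, which is why the consumers listed next matter.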
| 59 | + |
| 60 | +Examples of surprise memory consumers include: |
| 61 | + |
| 62 | +- ZFS Adaptive Replacement Cache (ARC) |
| 63 | + |
| 64 | +- Device driver preallocation of DMA memory, often at initialization |
| 65 | + time. (The `i40e`(4D) driver is a good example, where a lot of VNICs can |
| 66 | + eat a lot of memory.) |
| 67 | + |
| 68 | +Any design will need to consider this hazard. |
| 69 | + |
| 70 | +### HAZARD: Newly Created BHYVE VMs |
| 71 | + |
| 72 | +Today, SmartOS never has its BHYVE VMs use the reservoir. In the current |
| 73 | +BHYVE design, IF reservoir usage is specified in bhyve's configuration, all |
| 74 | +of its kernel memory allocations come from the reservoir. If the reservoir |
| 75 | +allocation fails, the bhyve process exits. |
| 76 | + |
| 77 | +Given dynamic machine assignments, we MAY wish to have newly-created VMs |
| 78 | +first check reservoir space before launching with the reservoir. If the |
| 79 | +check fails, we can either error out or proceed without reservoir use. Even |
| 80 | +if the check succeeds, a concurrent VM boot may race us (see below). In such |
| 81 | +a race, one VM may get the reservoir and the other may not; then either both |
| 82 | +continue (one without the reservoir), or one fails because both believed the |
| 83 | +reservoir was available to them. |
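The race described above is a check-then-act problem. The toy model below illustrates it with a fixed interleaving in which both boots run their check before either allocates. The `Reservoir` class and `racing_boots` function are hypothetical and do not model any real bhyve or vmm interface.

```python
from typing import Tuple

GIB = 1 << 30


class Reservoir:
    """Toy model of the reservoir as a pool of free bytes (hypothetical)."""

    def __init__(self, free_bytes: int) -> None:
        self.free_bytes = free_bytes

    def has_room(self, want: int) -> bool:
        # Pre-launch check: only a snapshot, which can go stale immediately.
        return self.free_bytes >= want

    def allocate(self, want: int) -> bool:
        # Allocation at launch; False stands in for the bhyve process exiting.
        if self.free_bytes < want:
            return False
        self.free_bytes -= want
        return True


def racing_boots(pool: Reservoir, vm_a: int, vm_b: int) -> Tuple[bool, bool]:
    """Interleave two boots so both checks run before either allocation."""
    a_checked = pool.has_room(vm_a)
    b_checked = pool.has_room(vm_b)
    a_ok = a_checked and pool.allocate(vm_a)
    b_ok = b_checked and pool.allocate(vm_b)
    return a_ok, b_ok


# Two 4 GiB VMs, 6 GiB free: both checks pass, only the first allocation fits.
print(racing_boots(Reservoir(6 * GIB), 4 * GIB, 4 * GIB))  # (True, False)
```

In this model the second VM passed its check and still failed at allocation time, which is the case where a policy decision (fail, or fall back to no reservoir) is needed.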
| 84 | + |
| 85 | +### HAZARD: Triton BHYVE VM Creation And Assignment |
| 86 | + |
| 87 | +For creating BHYVE VMs with Triton's VMAPI, we must make sure that we are not |
| 88 | +overprovisioning a compute node. VMAPI and CNAPI don't allow |
| 89 | +overprovisioning, but today they do not take into account the other-consumer |
| 90 | +hazards mentioned earlier. A few large BHYVE VMs with a large number of |
| 91 | +small native or LX zones on an i40e(4D) machine may cause memory exhaustion |
| 92 | +rather quickly. |
| 93 | + |
| 94 | +### TRADE-OFF: bhyve failure if using reservoir |
| 95 | + |
| 96 | +In SmartOS, the zhyve command launches bhyve as the zone's init(8) process. |
| 97 | +zhyve can enable/disable use of the reservoir on a bhyve invocation. The |
| 98 | +tricky part comes if bhyve fails to launch because, in spite of any zhyve |
| 99 | +checking, the reservoir has become unavailable. We need to determine whether |
| 100 | +bhyve failure should cause a relaunch of zhyve with an automatic downgrade |
| 101 | +to no-reservoir, or whether it should fail outright and let the |
| 102 | +administrator decide what to do. |
| 103 | + |
| 104 | +## Proposed solution |
| 105 | + |
| 106 | +Rough outline: |
| 107 | + |
| 108 | +- Boot-time tunable for setting the reservoir size (can be adjusted using |
| 109 | + /usr/lib/rsrvrctl). Probably stored in zones SMF configuration somewhere. |
| 110 | + Default to 75% of the maximum available-for-reservoir RAM reported by the |
| 111 | + vmm subsystem. |
| 112 | + |
| 113 | +- zhyve will do a propolis-style query of the reservoir, and set the |
| 114 | + use-reservoir configuration accordingly before launching bhyve. |
| 115 | + |
| 116 | +- If launching fails, zhyve will just outright fail, like any other time. |
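The outline above can be sketched as two small decisions. This is only an illustration: the function names are hypothetical, and the rule "use the reservoir only if its free space covers the VM's memory" is an assumed interpretation of "set the use-reservoir configuration accordingly", not a settled design.

```python
def default_reservoir_bytes(max_reported_bytes: int) -> int:
    """Default reservoir size: 75% of the maximum available-for-reservoir
    RAM reported by the vmm subsystem (integer math, rounds down)."""
    return max_reported_bytes * 75 // 100


def use_reservoir(reservoir_free_bytes: int, vm_memory_bytes: int) -> bool:
    """zhyve's pre-launch decision (assumed rule): enable the reservoir only
    if the queried free space covers the VM's memory. The answer can be
    stale by the time bhyve actually allocates."""
    return reservoir_free_bytes >= vm_memory_bytes
```

Because the query result can be stale, a `True` here does not guarantee a successful launch; per the last bullet, a failed launch simply fails.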
| 117 | + |