Commit ff852f7

WIP reservoir RFD 185
1 parent d9601ce commit ff852f7

2 files changed

+119
-0
lines changed

README.md

+2
@@ -6,6 +6,7 @@

 <!--
 Copyright 2020 Joyent, Inc.
+Copyright 2025 MNX Cloud, Inc.
 -->

 # Requests for Discussion
@@ -228,6 +229,7 @@ formal writing that it has come to represent.)
 | draft | [RFD 182 Altering system pool detection in SmartOS/Triton](./rfd/0182/README.md) |
 | predraft | [RFD 183 Triton Volume Replication and Backup](./rfd/0183/README.md) |
 | draft | [RFD 184 SmartOS BHYVE Image Builder Brand](./rfd/0184/README.md) |
+| predraft | [RFD 185 VMM Memory Reservoir Use by SmartOS and Triton](./rfd/0185/README.md) |

 ## Contents of an RFD

rfd/0185/README.md

+117
@@ -0,0 +1,117 @@
---
author: Dan McDonald <[email protected]>
state: predraft
discussion: https://github.com/TritonDataCenter/rfd/issues?q=%22RFD+185%22
---

<!--
This Source Code Form is subject to the terms of the Mozilla Public
License, v. 2.0. If a copy of the MPL was not distributed with this
file, You can obtain one at http://mozilla.org/MPL/2.0/.
-->

<!--
Copyright 2025 MNX Cloud, Inc.
-->

# RFD 185 VMM Memory Reservoir Use by SmartOS and Triton

## Problem statement

The VMM memory reservoir landed in illumos-gate in early 2021 with
[illumos#13833](https://illumos.org/issues/13833), and has been enhanced with
[illumos#15372 - want better sizing interface for vmm
reservoir](https://www.illumos.org/issues/15372). It provides a reserved pool
of kernel memory from which hardware virtual machines (e.g. BHYVE) can
allocate without needing to thrash and fight other kernel memory consumers
like the ZFS ARC.

Currently SmartOS does not attempt to use the memory reservoir, nor does it
signal its availability in a way that can be exploited by Triton. Use of the
VMM reservoir is decided per VM, at bhyve(8) invocation time. If we wish to
use the reservoir, we need to signal at bhyve(8) invocation whether or not to
use it. That signaling will need to percolate up the SmartOS and Triton stack
to one or more of:

- vmadm(8)

- VMAPI

Also, the size of the reservoir for a physical machine is something to be
abstracted up the SmartOS and Triton stack to one or more of:

- GZ configuration (perhaps the `zones` service?).

- CNAPI

Beyond presentation of this new feature, there are several hazards and
trade-offs that must be addressed before it is available to SmartOS or Triton
users.

## Hazards and Trade-offs

### HAZARD: Available Machine Memory and Other Consumers

The VMM reservoir design currently treats all physical memory, less 120% of
the unlockable pages count, as POSSIBLY available for the reservoir. The
memory that remains MAY be insufficient for normal machine operations.

Examples of surprise memory consumers include:

- ZFS Adaptive Replacement Cache (ARC)

- Device driver preallocation of DMA memory, often at initialization
  time. (The `i40e`(4D) driver is a good example, where a lot of VNICs can
  eat a lot of memory.)

Any design will need to consider this hazard.
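
To make the sizing hazard concrete, here is a small arithmetic sketch. It is not the kernel's code: the function names, the page counts, and the consumer figures are all hypothetical and chosen only for illustration. The 120% discount is the one described in this section.

```python
def reservoir_limit_pages(physmem_pages: int, unlockable_pages: int) -> int:
    """Pages POSSIBLY available to the reservoir: physical memory less
    120% of the unlockable pages count (clamped at zero)."""
    discount = unlockable_pages * 120 // 100
    return max(0, physmem_pages - discount)


def remaining_pages(physmem_pages: int, reservoir_pages: int,
                    other_consumers: dict) -> int:
    """Pages left for everything else once the reservoir and the listed
    consumers have taken their share. Negative means a shortfall."""
    return physmem_pages - reservoir_pages - sum(other_consumers.values())


# Hypothetical 64 GiB machine with 4 KiB pages.
PHYSMEM = 64 * (1 << 30) // 4096          # 16,777,216 pages
UNLOCKABLE = 1_000_000                    # made-up unlockable pages count

limit = reservoir_limit_pages(PHYSMEM, UNLOCKABLE)

# Made-up "surprise" consumers, in pages.
consumers = {"zfs_arc": 2_000_000, "i40e_dma": 500_000}

# If the reservoir were sized to the full limit, these consumers would not fit.
shortfall = remaining_pages(PHYSMEM, limit, consumers)
```

With these invented numbers the result is negative, which is the point of the hazard: sizing the reservoir to everything the vmm subsystem reports as possible leaves too little for the ARC and driver DMA.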

### HAZARD: Newly Created BHYVE VMs

Today, SmartOS never has its BHYVE VMs use the reservoir. In the current
design of BHYVE, if reservoir usage is specified in bhyve's configuration,
all of its kernel memory allocation uses the reservoir. If the reservoir
allocation fails, the bhyve process exits.

Given dynamic machine assignments, we MAY wish to have newly-created VMs
first check reservoir space before launching with the reservoir. If that
check fails, we can either error out or proceed without reservoir use. Even
if the check succeeds, there is a chance of a concurrent VM boot racing us
(see below). In such a race, one VM may get the reservoir and the other may
not; either both continue, or one fails because both believed the reservoir
was available to them.
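
The check-then-launch race can be shown with a toy model. This is only a sketch: `ToyReservoir` and its methods are hypothetical stand-ins, not an interface offered by the vmm subsystem, and the interleaving is written out sequentially so the outcome is deterministic.

```python
from dataclasses import dataclass

GIB = 1 << 30


@dataclass
class ToyReservoir:
    """Hypothetical model of a reservoir with some free space."""
    free_bytes: int

    def has_room(self, want: int) -> bool:
        # The pre-launch check: a snapshot that can go stale immediately.
        return self.free_bytes >= want

    def allocate(self, want: int) -> bool:
        # The allocation made at launch time: the only authoritative answer.
        if want > self.free_bytes:
            return False
        self.free_bytes -= want
        return True


pool = ToyReservoir(free_bytes=8 * GIB)

# Both VMs run their pre-launch check before either one allocates.
vm_a_check = pool.has_room(6 * GIB)
vm_b_check = pool.has_room(6 * GIB)

# Both then launch with reservoir use enabled; only the first allocation fits.
vm_a_started = vm_a_check and pool.allocate(6 * GIB)
vm_b_started = vm_b_check and pool.allocate(6 * GIB)
```

Both checks pass, yet the second launch fails. The check can reduce failures but cannot rule them out unless the check and the allocation are made atomic.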

### HAZARD: Triton BHYVE VM Creation And Assignment

For creating BHYVE VMs with Triton's VMAPI, we must make sure that we are not
overprovisioning a compute node. VMAPI and CNAPI don't allow
overprovisioning, but today they do not take into account the other-consumer
hazards mentioned earlier. A few large BHYVE VMs with a large number of
small native or LX zones on an i40e(4D) machine may cause memory exhaustion
rather quickly.

### TRADE-OFF: bhyve failure if using reservoir

In SmartOS, the zhyve command launches bhyve as the zone's init(8) process.
zhyve can enable or disable use of the reservoir on a bhyve invocation. The
tricky part comes if bhyve fails to launch because, in spite of any zhyve
checking, the reservoir becomes unavailable. We need to determine whether
bhyve failure should cause a relaunch of zhyve with an automatic downgrade to
no-reservoir, or whether it should fail outright and let the administrator
decide what to do.
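
The two options differ only in what happens after the first failed launch. The sketch below contrasts them; `launch_vm`, `start_bhyve`, and `reservoir_exhausted` are hypothetical names, and nothing here reflects how zhyve is actually structured.

```python
from typing import Callable


def launch_vm(start_bhyve: Callable[[bool], bool],
              downgrade_on_failure: bool) -> str:
    """Try to launch with the reservoir first.

    start_bhyve(use_reservoir) is a hypothetical callable that returns True
    when the launch succeeds. Returns "reservoir", "no-reservoir", or
    "failed".
    """
    if start_bhyve(True):
        return "reservoir"
    # Option 1: relaunch with an automatic downgrade to no-reservoir.
    if downgrade_on_failure and start_bhyve(False):
        return "no-reservoir"
    # Option 2: fail outright and leave the decision to the administrator.
    return "failed"


def reservoir_exhausted(use_reservoir: bool) -> bool:
    """Stand-in launcher: fails only when the reservoir is requested."""
    return not use_reservoir


with_downgrade = launch_vm(reservoir_exhausted, downgrade_on_failure=True)
without_downgrade = launch_vm(reservoir_exhausted, downgrade_on_failure=False)
```

One caveat the sketch glosses over: a launch can fail for reasons unrelated to the reservoir, so an automatic downgrade would presumably need a way to tell those cases apart before retrying.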

## Proposed solution

Rough outline:

- Boot-time tunable for setting the reservoir size (can be adjusted using
  /usr/lib/rsrvrctl). Probably stored in zones SMF configuration somewhere.
  Default to 75% of the maximum available-for-reservoir RAM reported by the
  vmm subsystem.

- zhyve will do a propolis-style query of the reservoir, and set the
  use-reservoir configuration accordingly before launching bhyve.

- If launching fails, zhyve will just outright fail, like any other time.
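
The outline above can be sketched end to end as follows. Every name here is hypothetical; in particular, how zhyve would obtain the free-space figure and how the tunable would be stored are not specified by this RFD, so both appear as plain arguments.

```python
GIB = 1 << 30


def default_reservoir_bytes(max_available_bytes: int) -> int:
    """Default reservoir size: 75% of the maximum the vmm subsystem
    reports as available for the reservoir."""
    return max_available_bytes * 75 // 100


def choose_use_reservoir(free_bytes: int, vm_bytes: int) -> bool:
    """Enable reservoir use only if the queried free space covers the VM."""
    return free_bytes >= vm_bytes


def zhyve_start(free_bytes: int, vm_bytes: int, start_bhyve) -> bool:
    """Query, pick the use-reservoir setting, launch once.

    start_bhyve(use_reservoir) is a hypothetical launcher returning True on
    success. There is no retry: a failed launch is simply a failure.
    """
    use_reservoir = choose_use_reservoir(free_bytes, vm_bytes)
    return start_bhyve(use_reservoir)


# Example: 64 GiB reported as the maximum gives a 48 GiB default reservoir.
default_size = default_reservoir_bytes(64 * GIB)

# Record what the launcher was asked to do for an 8 GiB VM with 4 GiB free.
requested = []
started = zhyve_start(4 * GIB, 8 * GIB,
                      lambda use: requested.append(use) or True)
```

Here the VM does not fit in the free space, so the launcher is invoked once with reservoir use disabled.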
