This is a pretty minor issue at the moment.
In a4x2, I went through the sled add process:
root@oxz_switch:~# omdb nexus sleds list-uninitialized
note: Nexus URL not specified. Will pick one from DNS.
note: using DNS server for subnet fd00:1122:3344::/48
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using Nexus URL http://[fd00:1122:3344:101::21]:12221
RACK_ID CUBBY SERIAL PART REVISION
d06667d7-7fa1-44b6-83f8-71b96a64ed4b 2 g2 i86pc 0
root@oxz_switch:~# omdb -w nexus sleds add g2 i86pc
note: Nexus URL not specified. Will pick one from DNS.
note: using DNS server for subnet fd00:1122:3344::/48
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using Nexus URL http://[fd00:1122:3344:101::21]:12221
added sled g2 (i86pc): 1952fef7-3115-4f3c-896c-8e934c2372db
root@oxz_switch:~# omdb -w nexus blueprints regenerate
note: Nexus URL not specified. Will pick one from DNS.
note: using DNS server for subnet fd00:1122:3344::/48
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using Nexus URL http://[fd00:1122:3344:101::21]:12221
Error: generating blueprint
Caused by:
Error Response: status: 500 Internal Server Error; headers: {"content-type": "application/json", "x-request-id": "1816f64d-d0e0-4f69-92ab-c4fc4b6a3d72", "content-length": "124", "date": "Fri, 14 Mar 2025 20:13:45 GMT"}; value: Error { error_code: Some("Internal"), message: "Internal Server Error", request_id: "1816f64d-d0e0-4f69-92ab-c4fc4b6a3d72" }
From the Nexus log, that 500:
20:13:45.555Z INFO 17799e56-3104-4b30-b8cb-b3943cf802e4 (dropshot_internal): request completed
error_message_external = Internal Server Error
error_message_internal = error generating blueprint: sled 1952fef7-3115-4f3c-896c-8e934c2372db: no available zpools for additional InternalNtp zones
file = /home/dap/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/dropshot-0.16.0/src/server.rs:855
latency_us = 7810134
local_addr = [fd00:1122:3344:101::21]:12221
method = POST
remote_addr = [fd00:1122:3344:101::2]:45641
req_id = 1816f64d-d0e0-4f69-92ab-c4fc4b6a3d72
response_code = 500
uri = /deployment/blueprints/regenerate
root@g0:~#
When I tried again a second later, it worked:
root@oxz_switch:~# omdb -w nexus blueprints regenerate
note: Nexus URL not specified. Will pick one from DNS.
note: using DNS server for subnet fd00:1122:3344::/48
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using Nexus URL http://[fd00:1122:3344:101::21]:12221
generated new blueprint b25a2e8e-84e2-4f9a-9d09-21fdd24c27af
I suspect this happens because the new sled's sled agent is still starting and hasn't reported its zpools to Nexus by the time we start the preparation phase for this planning run. As a result, the planner sees no pools on this sled and can't place the InternalNtp zone it wants to put there.
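Since the second attempt succeeded, retrying is a workable stopgap for anyone who hits this. Here is a minimal sketch of such a retry loop. It assumes `omdb` exits nonzero when it prints `Error: generating blueprint` (consistent with the transcript above, but not verified here). The helper names `retry` and `regenerate_blueprint`, the attempt count, and the delay are all hypothetical.

```python
import subprocess
import time


def retry(run, attempts=5, delay_s=2.0, sleep=time.sleep):
    """Call run() until it returns 0 or attempts run out.

    Returns the exit code of the last call (1 if attempts < 1).
    """
    code = 1
    for attempt in range(1, attempts + 1):
        code = run()
        if code == 0:
            return code
        # Give the new sled agent time to report its zpools to Nexus.
        if attempt < attempts:
            sleep(delay_s)
    return code


def regenerate_blueprint():
    # Same command as in the transcript above. Assumes omdb is on PATH
    # and signals failure through a nonzero exit status.
    cmd = ["omdb", "-w", "nexus", "blueprints", "regenerate"]
    return subprocess.run(cmd).returncode
```

Calling `retry(regenerate_blueprint)` from the switch zone would then re-run the command up to five times, two seconds apart. This only papers over the race; it does not distinguish the zpool error from other failures.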
It's not yet clear to me what we should do better.
- A 503 error would be marginally better than a 500. This reflects a potentially transient issue where we don't have the resources we need to do something, not a server bug. It wouldn't change the experience much though.
- We could simply skip this step and let the planner generate a blueprint that doesn't include it. This is probably more consistent with what we do elsewhere in the planner: if the preconditions for a step aren't met, we just don't take it. But from an omdb user's perspective this is worse, because you'll think you successfully generated a blueprint that adds the sled when you haven't. In the long run, when the planner is automated, this would be fine -- we'll wind up trying again eventually and adding the sled.
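To make the trade-off between the two options concrete, here is a toy model in Python. It is not the Nexus planner and does not use any of its types; `OnMissingZpools`, `PlanResult`, and `plan_ntp_zones` are hypothetical names invented for illustration, and the status codes only stand in for what the HTTP endpoint might return.

```python
from dataclasses import dataclass
from enum import Enum


class OnMissingZpools(Enum):
    FAIL_503 = "fail_503"  # option 1: report a transient failure
    SKIP = "skip"          # option 2: leave the step out of the blueprint


@dataclass
class PlanResult:
    status: int          # stand-in for the HTTP status of the request
    zones_added: dict    # sled id -> list of zone kinds placed
    skipped_sleds: list  # sled ids that had no zpools reported yet


def plan_ntp_zones(sled_zpools, policy):
    """sled_zpools maps sled id -> zpool ids reported to Nexus so far."""
    added, skipped = {}, []
    for sled_id, zpools in sled_zpools.items():
        if not zpools:
            if policy is OnMissingZpools.FAIL_503:
                # Caller sees an explicit, retryable failure.
                return PlanResult(503, {}, [sled_id])
            # Caller sees success, but this sled got nothing.
            skipped.append(sled_id)
            continue
        added[sled_id] = ["InternalNtp"]
    return PlanResult(200, added, skipped)
```

With a new sled that has no zpools yet, `FAIL_503` returns status 503 and no blueprint changes, while `SKIP` returns status 200 with the sled listed only in `skipped_sleds`. The second case is the one that could mislead an omdb user unless the skipped sleds are surfaced in the output.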
I'm not sure we should do anything right now but I wanted to record this in case other folks hit it. (I think maybe @leftwo also reported seeing something like this at some point?)