
transient 500 error regenerating blueprint after adding a sled #7800

Open
@davepacheco

This is a pretty minor issue at the moment.

In a4x2, I went through the sled add process:

root@oxz_switch:~# omdb nexus sleds list-uninitialized
note: Nexus URL not specified.  Will pick one from DNS.
note: using DNS server for subnet fd00:1122:3344::/48
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using Nexus URL http://[fd00:1122:3344:101::21]:12221
RACK_ID                              CUBBY SERIAL PART  REVISION 
d06667d7-7fa1-44b6-83f8-71b96a64ed4b 2     g2     i86pc 0        

root@oxz_switch:~# omdb -w nexus sleds add g2 i86pc
note: Nexus URL not specified.  Will pick one from DNS.
note: using DNS server for subnet fd00:1122:3344::/48
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using Nexus URL http://[fd00:1122:3344:101::21]:12221
added sled g2 (i86pc): 1952fef7-3115-4f3c-896c-8e934c2372db

root@oxz_switch:~# omdb -w nexus blueprints regenerate
note: Nexus URL not specified.  Will pick one from DNS.
note: using DNS server for subnet fd00:1122:3344::/48
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using Nexus URL http://[fd00:1122:3344:101::21]:12221
Error: generating blueprint

Caused by:
    Error Response: status: 500 Internal Server Error; headers: {"content-type": "application/json", "x-request-id": "1816f64d-d0e0-4f69-92ab-c4fc4b6a3d72", "content-length": "124", "date": "Fri, 14 Mar 2025 20:13:45 GMT"}; value: Error { error_code: Some("Internal"), message: "Internal Server Error", request_id: "1816f64d-d0e0-4f69-92ab-c4fc4b6a3d72" }

Here is the Nexus log entry for that 500:

20:13:45.555Z INFO 17799e56-3104-4b30-b8cb-b3943cf802e4 (dropshot_internal): request completed
    error_message_external = Internal Server Error
    error_message_internal = error generating blueprint: sled 1952fef7-3115-4f3c-896c-8e934c2372db: no available zpools for additional InternalNtp zones
    file = /home/dap/.cargo/registry/src/index.crates.io-1949cf8c6b5b557f/dropshot-0.16.0/src/server.rs:855
    latency_us = 7810134
    local_addr = [fd00:1122:3344:101::21]:12221
    method = POST
    remote_addr = [fd00:1122:3344:101::2]:45641
    req_id = 1816f64d-d0e0-4f69-92ab-c4fc4b6a3d72
    response_code = 500
    uri = /deployment/blueprints/regenerate
root@g0:~# 

When I tried again a second later, it worked:

root@oxz_switch:~# omdb -w nexus blueprints regenerate
note: Nexus URL not specified.  Will pick one from DNS.
note: using DNS server for subnet fd00:1122:3344::/48
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using Nexus URL http://[fd00:1122:3344:101::21]:12221
generated new blueprint b25a2e8e-84e2-4f9a-9d09-21fdd24c27af

I suspect this happens because the new sled's sled agent is still starting and hasn't reported its zpools to Nexus by the time we start the preparation phase for this planning run. As a result, we don't see any pools on this sled, so there is nowhere to place the InternalNtp zone that we want to add there.
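If that explanation is right, the failure is transient, which matches the second attempt succeeding a second later. A caller could paper over it with a bounded retry. The following is only a minimal sketch of that idea, not omdb or Nexus code: `retry` and its parameters are hypothetical names made up for illustration.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Hypothetical helper (not part of omdb or Nexus).
/// Calls `op` at least once and at most `max_attempts` times, sleeping
/// `delay` after each failed attempt that will be retried.
/// Returns the first success, or the last error if every attempt fails.
fn retry<T, E>(
    max_attempts: u32,
    delay: Duration,
    mut op: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    let mut attempt = 1;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            // Out of attempts: surface the last error to the caller.
            Err(err) if attempt >= max_attempts => return Err(err),
            // Otherwise wait a bit and try again.
            Err(_) => {
                sleep(delay);
                attempt += 1;
            }
        }
    }
}
```

This retries on every error for brevity. A real caller would presumably retry only on errors it knows to be transient, which is part of why the choice of status code discussed below matters.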

It's not yet clear to me what we should do better.

  • A 503 error would be marginally better than a 500. It would reflect a potentially transient condition where we don't have the resources we need to do something, rather than a server bug. It wouldn't change the experience much, though.
  • We could simply skip this step, allowing the planner to generate a blueprint that just doesn't include it. This is probably more consistent with what we do elsewhere in the planner: if we haven't met the preconditions for taking some step, we just don't take it. But from an omdb user's perspective, this is worse, because you'll think you successfully generated a blueprint to add the sled when you haven't. In the long run, when the planner is automated, this would be fine -- we'll wind up trying again eventually and adding the sled.
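To make the two options concrete, here is a rough sketch of each. It is purely illustrative and is not the planner's or Nexus's actual code: `PlanError`, `status_for`, and `plan_ntp_zones` are hypothetical names, and the real planner's types and inputs will differ.

```rust
use std::collections::BTreeMap;

/// Hypothetical planner error (not the real planner error type).
#[derive(Debug, PartialEq)]
enum PlanError {
    /// Inputs are not ready yet, e.g. the sled has reported no zpools.
    /// Retrying later may succeed.
    NoAvailableZpools { sled_id: String },
    /// Anything else is treated as a genuine server-side bug.
    Internal(String),
}

/// Option 1: report a transient resource shortfall as 503, not 500.
fn status_for(err: &PlanError) -> u16 {
    match err {
        PlanError::NoAvailableZpools { .. } => 503,
        PlanError::Internal(_) => 500,
    }
}

/// Option 2: skip sleds whose preconditions are not met instead of
/// failing the whole planning run.
/// `zpools_by_sled` maps a sled ID to the number of zpools known for it.
/// Returns `(planned, skipped)` sled IDs so a caller can tell the user
/// which sleds were left out of this blueprint.
fn plan_ntp_zones(
    zpools_by_sled: &BTreeMap<String, usize>,
) -> (Vec<String>, Vec<String>) {
    let mut planned = Vec::new();
    let mut skipped = Vec::new();
    for (sled_id, &zpool_count) in zpools_by_sled {
        if zpool_count == 0 {
            skipped.push(sled_id.clone());
        } else {
            planned.push(sled_id.clone());
        }
    }
    (planned, skipped)
}
```

Returning the skipped sleds alongside the planned ones is one way to soften the downside of the second option: omdb could print them, so the user doesn't assume the new sled made it into the blueprint.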

I'm not sure we should do anything right now but I wanted to record this in case other folks hit it. (I think maybe @leftwo also reported seeing something like this at some point?)
