@estolfo
Contributor

@estolfo estolfo commented Oct 28, 2025

This PR adds support for two query params on the root api: wait_for_status and timeout.
They mirror what the same query params do on the elasticsearch cluster health status endpoint.

wait_for_status: One of green, yellow or red. A timeout is required along with a status. The request waits, up to the provided timeout, until the status of the service changes to the one provided or better, i.e. green > yellow > red.
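
As a rough illustration of the "target or better" rule, the comparison can be sketched by ranking statuses from best to worst. The names `STATUS_ORDER` and `status_reached?` are hypothetical and are not taken from the PR:

```ruby
# Hypothetical ordering, best status first; mirrors "green > yellow > red".
STATUS_ORDER = %w[green yellow red].freeze

# Returns true when `current` is the target status or a better one.
# A lower index in STATUS_ORDER means a better status.
def status_reached?(current, target)
  current_rank = STATUS_ORDER.index(current)
  target_rank = STATUS_ORDER.index(target)
  raise ArgumentError, "unknown status" if current_rank.nil? || target_rank.nil?

  current_rank <= target_rank
end
```

Under this sketch, a request for yellow is satisfied by green or yellow, but not by red.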

timeout: Period to wait for the status to reach the requested target status. If the target status is not reached before the timeout expires, the request returns http status 408.

The status of the service will be checked with an exponential backoff until the timeout is reached.

Short description of the behavior:

  • valid timeout is provided with no status: return immediately
  • valid status is provided with no timeout: return http status 400 with an error response stating that timeout is required with status
  • invalid status is provided (i.e. not one of [green, yellow, red]): return error response and http status 400
  • invalid timeout is provided (it must be an integer followed by a unit): return error response and http status 400
  • valid status is provided with a valid timeout: wait for the given status or a better one (green > yellow > red). When target status or a better one is reached, return normal response
  • valid status is provided with a valid timeout: wait for the given status or a better one (green > yellow > red). When the timeout is reached and neither the target status nor a better one is reached: return error response and http status 408
  • neither status nor timeout provided: return normal response
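
The cases above can be read as a small decision function. The following is a minimal sketch under stated assumptions: `decide`, `VALID_STATUSES`, and `TIMEOUT_PATTERN` are hypothetical names, and the pattern only checks "integer followed by a unit" without restricting which units are accepted.

```ruby
VALID_STATUSES = %w[green yellow red].freeze
# Assumed shape of a valid timeout: an integer followed by a unit, e.g. "30s".
TIMEOUT_PATTERN = /\A\d+[a-z]+\z/.freeze

# Maps the two query params (nil when absent) to one of three outcomes:
#   :bad_request -> error response, http status 400
#   :respond_now -> normal response, no waiting
#   :wait        -> poll until the status is reached (normal response)
#                   or the timeout expires (http status 408)
def decide(status, timeout)
  return :bad_request if status && !VALID_STATUSES.include?(status)
  return :bad_request if timeout && !TIMEOUT_PATTERN.match?(timeout)
  return :bad_request if status && timeout.nil?
  return :respond_now if status.nil?

  :wait
end
```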

Open Questions/ToDo:

  • Right now, the implementation doesn't wait for a status that is "better, i.e. green > yellow > red", as the Elasticsearch implementation does. Do we want to adjust our implementation to also have this behavior or is that overkill? Update: the implementation will do the same as Elasticsearch: it will wait for a status that matches the target or "better".
  • If the timeout is expired on the Elasticsearch cluster health endpoint before the target status is reached (or a better one), the request fails and returns an error. This implementation currently just returns as normal. Do we want to have the same behavior as Elasticsearch? The request will return 503 if the request times out and the target status is not reached, as does Elasticsearch
  • What examples should be used for "host" and "name" in the openapi documentation? The guidelines suggest api.example.com but that doesn't seem to fit the example of calling logstash's root api. Update: used logstash-pipelines, logstash-pipelines.example.com
  • Confirm that the status code 503 should be used when the request times out. Define message based on what elasticsearch returns. Update: testing showed that status code 408 is returned when the request times out.
  • When an invalid timeout or status is provided, status 400 should be used with an error message.
  • Should a timeout unit be required, like for ES? i.e. "1s" for the timeout.
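
To make the unit question concrete, here is a minimal sketch of a parser that requires a unit. The unit table (`s`, `m`, `h`) and the name `parse_timeout_seconds` are assumptions for illustration; the PR does not list which units it accepts.

```ruby
# Hypothetical unit table: seconds per unit.
UNIT_SECONDS = { "s" => 1, "m" => 60, "h" => 3600 }.freeze

# Returns the timeout in seconds, or nil when the input is not an
# integer immediately followed by one of the assumed units.
def parse_timeout_seconds(input)
  match = /\A(\d+)(s|m|h)\z/.match(input.to_s)
  return nil unless match

  match[1].to_i * UNIT_SECONDS.fetch(match[2])
end
```

Returning nil for invalid input lets the caller turn it into an http status 400 response, in the same spirit as the `parse_timeout_s` call in the diff.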

Resolves #17457

@mergify
Contributor

mergify bot commented Oct 28, 2025

This pull request does not have a backport label. Could you fix it @estolfo? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-8./d is the label to automatically backport to the 8./d branch. /d is the digit.
  • If no backport is necessary, please add the backport-skip label

Member

@yaauie yaauie left a comment

IIRC, the original requirement is to have a non-200 status code if the status is not met.

@estolfo
Contributor Author

estolfo commented Oct 29, 2025

note: handle the unknown status being in the Status enum. Should it be removed from the HEALTH_STATUS constant?
Update: removed unknown as a valid Status.

@estolfo
Contributor Author

estolfo commented Nov 4, 2025

Note: I tested with Elasticsearch and found that some assumptions about its behavior were incorrect:

  • Elasticsearch returns HTTP status code 408 when the request times out waiting for the target status, not 503. Fixed in 6fdf433
  • If no timeout is provided, the request blocks indefinitely until the target status is reached. This is surprising given that the documentation says "By default, will not wait for any status."

@estolfo
Contributor Author

estolfo commented Nov 5, 2025

Update: changed behavior to require a valid timeout with a valid status. This differs from Elasticsearch's behavior; Elasticsearch will wait until the network request timeout for the status if no timeout query param is provided.

@elasticmachine
Collaborator

💛 Build succeeded, but was flaky

Comment on lines +38 to +44
if input_status
  return status_error_response(input_status) unless target_status = parse_status(input_status)
end

if input_timeout
  return timeout_error_response(input_timeout) unless timeout_s = parse_timeout_s(input_timeout)
end
Member

The assignment in the nested modifier condition is easy to lose track of. The Ruby Style Guide calls out wrapping assignment-in-conditionals in parentheses, which helps a bit:

Suggested change
- if input_status
-   return status_error_response(input_status) unless target_status = parse_status(input_status)
- end
- if input_timeout
-   return timeout_error_response(input_timeout) unless timeout_s = parse_timeout_s(input_timeout)
- end
+ if input_status
+   return status_error_response(input_status) unless (target_status = parse_status(input_status))
+ end
+ if input_timeout
+   return timeout_error_response(input_timeout) unless (timeout_s = parse_timeout_s(input_timeout))
+ end

But I think that the complexity is still buried.

If we pull the assignment up into the top conditional, I think it meaningfully pulls the complexity of the assignment forward (instead of deferring it to the modifier clause):

Suggested change
- if input_status
-   return status_error_response(input_status) unless target_status = parse_status(input_status)
- end
- if input_timeout
-   return timeout_error_response(input_timeout) unless timeout_s = parse_timeout_s(input_timeout)
- end
+ if input_status && !(target_status = parse_status(input_status))
+   return status_error_response(input_status)
+ end
+ if input_timeout && !(timeout_s = parse_timeout_s(input_timeout))
+   return timeout_error_response(input_timeout)
+ end

current_status = HEALTH_STATUS.index(agent.health_observer.status.external_value)
break if current_status <= HEALTH_STATUS.index(target_status)

if Time.now > deadline
Member

If we're already at the deadline, no use spinning another wait cycle:

Suggested change
- if Time.now > deadline
+ if Time.now >= deadline

Comment on lines +88 to +93
if Time.now > deadline
  return respond_with(RequestTimeout.new(TIMED_OUT_WAITING_FOR_STATUS_MESSAGE % [target_status]))
end

sleep(wait_interval)
wait_interval = wait_interval * 2
Member

If we are doubling our sleep with each attempt, we risk over-sleeping.

iteration  wait_interval (s)  effective timeout (s)
1          0.2                0.2
2          0.4                0.6
3          0.8                1.4
4          1.6                3.0
5          3.2                6.2
6          6.4                12.6
7          12.8               25.4
8          25.6               51.0

For example, given a request with timeout=30s, if the wait_for_status condition has not been met after ~25.4s, the current code will sleep another 25.6s and not check again until a total of 51s has elapsed.
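
The arithmetic behind the table can be reproduced with a short sketch. It assumes an initial interval of 0.2s (inferred from the first row, not confirmed by the diff) and works in integer milliseconds to avoid floating-point drift; `backoff_schedule` is a hypothetical helper.

```ruby
# Returns rows of [iteration, wait_interval_ms, elapsed_ms] for a
# sleep interval that doubles after every attempt.
def backoff_schedule(initial_ms, attempts)
  elapsed_ms = 0
  interval_ms = initial_ms
  (1..attempts).map do |iteration|
    elapsed_ms += interval_ms
    row = [iteration, interval_ms, elapsed_ms]
    interval_ms *= 2
    row
  end
end

schedule = backoff_schedule(200, 8)
# First point at which the loop wakes up at or after a 30s timeout.
first_check_after_timeout = schedule.find { |_, _, elapsed_ms| elapsed_ms >= 30_000 }
```

With these inputs, `first_check_after_timeout` is the eighth row, which is the 51s overshoot described above.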

We can keep the doubling factor and limit the last sleep to no more than the requested amount:

Suggested change
- if Time.now > deadline
-   return respond_with(RequestTimeout.new(TIMED_OUT_WAITING_FOR_STATUS_MESSAGE % [target_status]))
- end
- sleep(wait_interval)
- wait_interval = wait_interval * 2
+ time_remaining = deadline - Time.now
+ if time_remaining <= 0
+   return respond_with(RequestTimeout.new(TIMED_OUT_WAITING_FOR_STATUS_MESSAGE % [target_status]))
+ end
+ sleep((time_remaining <= wait_interval) ? time_remaining : wait_interval)
+ wait_interval = wait_interval * 2
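
A self-contained sketch of the capped backoff idea, for illustration only. The method `wait_until` and its `clock:` and `sleeper:` parameters are hypothetical; they are injected so the loop can be exercised without real waiting. The default clock uses Ruby's monotonic clock instead of the `Time.now` seen in the diff, which is a substitution on my part.

```ruby
# Polls the given block with doubling sleep intervals, never sleeping
# past the deadline. Returns true once the block returns a truthy value,
# or false when the timeout expires first.
def wait_until(timeout_s,
               initial_interval: 0.2,
               clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) },
               sleeper: ->(seconds) { sleep(seconds) })
  deadline = clock.call + timeout_s
  interval = initial_interval

  loop do
    return true if yield

    remaining = deadline - clock.call
    return false if remaining <= 0

    # Cap the final sleep so the last check lands on the deadline.
    sleeper.call([remaining, interval].min)
    interval *= 2
  end
end
```

With a 30s timeout and a 1s initial interval, this sketch sleeps 1, 2, 4, 8 and then 15 seconds, so the last check happens at 30s rather than 31s.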

Development

Successfully merging this pull request may close these issues.

Set HTTP status code based on status in health report API

3 participants