
Reload race condition causes persistent 500 responses on /aggregated_metrics endpoint #5310

@roryabraham

Description

Describe the bug

When fluentd receives SIGHUP (e.g. via systemctl reload fluentd.service) to reload its
configuration, worker threads restart in place. A new worker thread hits Errno::EADDRINUSE
when trying to bind its port because the old thread has not yet fully released the socket. The
crashing thread terminates silently, leaving the fluent-plugin-prometheus HTTP server on port
24231 running but unable to collect stats from the dead thread. All subsequent requests to the
/aggregated_metrics endpoint return HTTP 500 with body
"Connection refused - connect(2) for 127.0.0.1:<port>".
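The race can be reproduced outside of fluentd with a minimal Ruby sketch (a standalone illustration, not fluentd's actual bind path): binding a second listener to a port that another socket still holds raises the same Errno::EADDRINUSE seen in the error log below.

```ruby
require "socket"

# Bind a listener on an ephemeral port, then try to bind a second
# listener to the same port while the first is still open.
first = TCPServer.new("127.0.0.1", 0)
port = first.addr[1]

begin
  TCPServer.new("127.0.0.1", port)
rescue Errno::EADDRINUSE => e
  puts "second bind failed: #{e.class}"   # → second bind failed: Errno::EADDRINUSE
end

first.close
```

This is the same failure mode as the new worker thread binding before the old worker's socket is released.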

To Reproduce

  1. Install fluent-package with the following configuration (see Your Configuration below)
  2. Start Fluentd: systemctl start fluentd
  3. Send a reload signal: systemctl reload fluentd.service (sends SIGHUP)
  4. Observe the warn log for EADDRINUSE on one of the worker threads
  5. Request http://localhost:24231/aggregated_metrics — it returns HTTP 500 with body
    "Connection refused - connect(2) for 127.0.0.1:<port>"
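Step 5 can be scripted. The helper below is hypothetical (the metrics_status name and its defaults are ours, not part of fluentd or fluent-plugin-prometheus); it just returns the HTTP status code of the aggregation endpoint.

```ruby
require "net/http"

# Hypothetical helper: fetch the HTTP status code of the aggregated
# metrics endpoint. Host, port, and path default to the values used
# in the configuration below.
def metrics_status(host: "localhost", port: 24231, path: "/aggregated_metrics")
  Net::HTTP.get_response(host, path, port).code.to_i
end

# After a clean start this prints 200; after the broken reload, 500.
# puts metrics_status
```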

Expected behavior

After a reload (SIGHUP), all worker threads should successfully rebind their ports and Fluentd should continue serving /aggregated_metrics with HTTP 200. If a thread cannot rebind, it should either retry or fall back gracefully rather than crashing and leaving the supervisor in a broken state.
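One shape the "retry" behavior could take is a bounded retry with backoff around the bind, giving the old worker's socket time to be released. A minimal sketch (bind_with_retry is a hypothetical helper, not fluentd API):

```ruby
require "socket"

# Hypothetical helper: retry a bind a few times with exponential
# backoff instead of crashing on the first Errno::EADDRINUSE.
def bind_with_retry(host, port, attempts: 5, delay: 0.2)
  attempts.times do |i|
    begin
      return TCPServer.new(host, port)
    rescue Errno::EADDRINUSE
      raise if i == attempts - 1   # give up after the last attempt
      sleep delay * (2**i)         # 0.2s, 0.4s, 0.8s, ...
    end
  end
end
```

If every attempt fails, the error still surfaces, but it does so loudly instead of silently killing the thread.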

Your Environment

- Fluentd version: fluentd 1.19.1 (efdc4dca81c23480c9b55e13e55de6aa925b1cf5)
- Package version: fluent-package 6.0.1
- Operating system: Ubuntu 24.04.3
- Kernel version: 6.8.0-94-generic

Your Configuration

<system>
  workers 4
</system>

<source>
  @type prometheus
  bind 0.0.0.0
  port 24231
  metrics_path /metrics
</source>

<source>
  @type prometheus_output_monitor
</source>

<source>
  @type prometheus_monitor
</source>

<source>
  @type http
  bind 127.0.0.1
  port 24224
  <parse>
    @type json
  </parse>
</source>

<match **>
  @type null
</match>

Your Error Log

2026-03-31 22:47:09 +0000 [warn]: #3  0.07s: Async::Task
      | Task may have ended with unhandled exception.
      |   Errno::EADDRINUSE: Address already in use - bind
      |   → /opt/fluent/lib/ruby/gems/3.4.0/gems/io-endpoint-0.15.2/lib/io/endpoint/wrapper.rb:152 in 'Socket#bind'
      |     /opt/fluent/lib/ruby/gems/3.4.0/gems/io-endpoint-0.15.2/lib/io/endpoint/wrapper.rb:152 in 'IO::Endpoint::Wrapper#bind'
      |     /opt/fluent/lib/ruby/gems/3.4.0/gems/io-endpoint-0.15.2/lib/io/endpoint/host_endpoint.rb:68 in 'block in IO::Endpoint::HostEndpoint#bind'
      |     /opt/fluent/lib/ruby/gems/3.4.0/gems/io-endpoint-0.15.2/lib/io/endpoint/host_endpoint.rb:67 in 'Array#each'
      |     /opt/fluent/lib/ruby/gems/3.4.0/gems/io-endpoint-0.15.2/lib/io/endpoint/host_endpoint.rb:67 in 'Enumerator#each'
      |     /opt/fluent/lib/ruby/gems/3.4.0/gems/io-endpoint-0.15.2/lib/io/endpoint/host_endpoint.rb:67 in 'Enumerable#map'
      |     /opt/fluent/lib/ruby/gems/3.4.0/gems/io-endpoint-0.15.2/lib/io/endpoint/host_endpoint.rb:67 in 'IO::Endpoint::HostEndpoint#bind'
      |     /opt/fluent/lib/ruby/gems/3.4.0/gems/async-http-0.89.0/lib/async/http/endpoint.rb:216 in 'Async::HTTP::Endpoint#bind'
      |     /opt/fluent/lib/ruby/gems/3.4.0/gems/io-endpoint-0.15.2/lib/io/endpoint/generic.rb:82 in 'IO::Endpoint::Generic#accept'
      |     /opt/fluent/lib/ruby/gems/3.4.0/gems/async-http-0.89.0/lib/async/http/server.rb:67 in 'block in Async::HTTP::Server#run'
      |     /opt/fluent/lib/ruby/gems/3.4.0/gems/async-2.24.0/lib/async/task.rb:200 in 'block in Async::Task#run'
      |     /opt/fluent/lib/ruby/gems/3.4.0/gems/async-2.24.0/lib/async/task.rb:438 in 'block in Async::Task#schedule'

Additional context

The crash leaves the fluent-plugin-prometheus /aggregated_metrics endpoint permanently
returning HTTP 500 (body: "Connection refused - connect(2) for 127.0.0.1:<port>") until
Fluentd is fully restarted. The affected thread is one of the per-worker Prometheus stats
collectors that the aggregation endpoint queries internally.

The stack trace points to io-endpoint-0.15.2 and async-http-0.89.0 — the new thread starts
its HTTP server bind before the old thread's socket is fully closed, suggesting either a missing
SO_REUSEPORT/SO_REUSEADDR option or insufficient drain time before rebinding during a
SIGHUP-triggered reload.
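If the root cause is the missing socket option, a fix might resemble the Linux-specific sketch below: with SO_REUSEPORT set on both sockets, a replacement listener can bind while the old one is still open. The reuseport_listener helper is ours; whether io-endpoint should set this option is for the maintainers to judge.

```ruby
require "socket"

# Build a listening socket with SO_REUSEPORT (Linux 3.9+), so that a
# replacement worker can bind the port before the old socket closes.
def reuseport_listener(host, port)
  sock = Socket.new(:INET, :STREAM)
  sock.setsockopt(:SOCKET, :REUSEPORT, true)
  sock.setsockopt(:SOCKET, :REUSEADDR, true)
  sock.bind(Addrinfo.tcp(host, port))
  sock.listen(16)
  sock
end

old_worker = reuseport_listener("127.0.0.1", 0)
port = old_worker.local_address.ip_port

# With SO_REUSEPORT on both sockets this second bind succeeds instead
# of raising Errno::EADDRINUSE.
new_worker = reuseport_listener("127.0.0.1", port)

old_worker.close
new_worker.close
```

Note that SO_REUSEPORT load-balances incoming connections between the two sockets while both are open, which may or may not be acceptable during the brief reload window.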

Our workaround is to avoid systemctl reload (SIGHUP) in favor of systemctl restart (full
stop + start), which guarantees the old process is dead and all sockets released before the new
one starts.

Metadata

Labels

waiting-for-user: similar to "moreinfo", but especially needs feedback from the user
