Reload race condition causes persistent 500 responses on /aggregated_metrics endpoint #5310
Describe the bug
When fluentd receives SIGHUP (e.g. via systemctl reload fluentd.service) to reload its
configuration, worker threads restart in-place. A new worker thread hits Errno::EADDRINUSE
when trying to bind its port because the old thread has not yet fully released the socket. The
crashing thread terminates silently, leaving the fluent-plugin-prometheus HTTP server on port
24231 running but unable to collect stats from the dead thread. All subsequent requests to the
/aggregated_metrics endpoint return HTTP 500 with body
"Connection refused - connect(2) for 127.0.0.1:<port>".
To Reproduce
- Install fluent-package with the following configuration (see Your Configuration below)
- Start Fluentd: systemctl start fluentd
- Send a reload signal: systemctl reload fluentd.service (sends SIGHUP)
- Observe the warn log for EADDRINUSE on one of the worker threads
- Request http://localhost:24231/aggregated_metrics; it returns HTTP 500 with body
"Connection refused - connect(2) for 127.0.0.1:<port>"
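The last step can be scripted. Below is a minimal Ruby probe for the endpoint; the URL comes from the configuration in this issue, and probe is just an illustrative helper, not part of fluentd:

```ruby
require "net/http"
require "uri"

# Fetch an HTTP endpoint and return [status_code, body].
# After the failed reload described here, the aggregated_metrics
# endpoint answers with code "500" and a "Connection refused" body.
def probe(url)
  res = Net::HTTP.get_response(URI(url))
  [res.code, res.body]
end

# Usage after reproducing the bug:
#   code, body = probe("http://localhost:24231/aggregated_metrics")
#   # code == "500", body like "Connection refused - connect(2) for 127.0.0.1:<port>"
```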
Expected behavior
After a reload (SIGHUP), all worker threads should successfully rebind their ports and Fluentd should continue serving /aggregated_metrics with HTTP 200. If a thread cannot rebind, it should either retry or fall back gracefully rather than crashing and leaving the supervisor in a broken state.
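The retry behavior suggested above could look like the following minimal Ruby sketch. The helper name, attempt count, and delay are illustrative values of mine, not fluentd's actual reload code:

```ruby
require "socket"

# Retry binding a TCP listener a few times, since the old worker's socket
# may not be released immediately after a SIGHUP reload.
# `attempts` and `delay` are illustrative defaults, not fluentd's.
def bind_with_retry(host, port, attempts: 5, delay: 0.5)
  attempts.times do |i|
    begin
      return TCPServer.new(host, port)
    rescue Errno::EADDRINUSE
      raise if i == attempts - 1  # give up only after the last attempt
      sleep delay
    end
  end
end

server = bind_with_retry("127.0.0.1", 0)  # port 0 lets the OS pick a free port
server.close
```

If all attempts fail, the exception propagates so the supervisor can log it instead of the thread dying silently.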
Your Environment
- Fluentd version: fluentd 1.19.1 (efdc4dca81c23480c9b55e13e55de6aa925b1cf5)
- Package version: fluent-package 6.0.1
- Operating system: Ubuntu 24.04.3
- Kernel version: 6.8.0-94-generic

Your Configuration
<system>
workers 4
</system>
<source>
@type prometheus
bind 0.0.0.0
port 24231
metrics_path /metrics
</source>
<source>
@type prometheus_output_monitor
</source>
<source>
@type prometheus_monitor
</source>
<source>
@type http
bind 127.0.0.1
port 24224
<parse>
@type json
</parse>
</source>
<match **>
@type null
</match>

Your Error Log
2026-03-31 22:47:09 +0000 [warn]: #3 0.07s: Async::Task
| Task may have ended with unhandled exception.
| Errno::EADDRINUSE: Address already in use - bind
| → /opt/fluent/lib/ruby/gems/3.4.0/gems/io-endpoint-0.15.2/lib/io/endpoint/wrapper.rb:152 in 'Socket#bind'
| /opt/fluent/lib/ruby/gems/3.4.0/gems/io-endpoint-0.15.2/lib/io/endpoint/wrapper.rb:152 in 'IO::Endpoint::Wrapper#bind'
| /opt/fluent/lib/ruby/gems/3.4.0/gems/io-endpoint-0.15.2/lib/io/endpoint/host_endpoint.rb:68 in 'block in IO::Endpoint::HostEndpoint#bind'
| /opt/fluent/lib/ruby/gems/3.4.0/gems/io-endpoint-0.15.2/lib/io/endpoint/host_endpoint.rb:67 in 'Array#each'
| /opt/fluent/lib/ruby/gems/3.4.0/gems/io-endpoint-0.15.2/lib/io/endpoint/host_endpoint.rb:67 in 'Enumerator#each'
| /opt/fluent/lib/ruby/gems/3.4.0/gems/io-endpoint-0.15.2/lib/io/endpoint/host_endpoint.rb:67 in 'Enumerable#map'
| /opt/fluent/lib/ruby/gems/3.4.0/gems/io-endpoint-0.15.2/lib/io/endpoint/host_endpoint.rb:67 in 'IO::Endpoint::HostEndpoint#bind'
| /opt/fluent/lib/ruby/gems/3.4.0/gems/async-http-0.89.0/lib/async/http/endpoint.rb:216 in 'Async::HTTP::Endpoint#bind'
| /opt/fluent/lib/ruby/gems/3.4.0/gems/io-endpoint-0.15.2/lib/io/endpoint/generic.rb:82 in 'IO::Endpoint::Generic#accept'
| /opt/fluent/lib/ruby/gems/3.4.0/gems/async-http-0.89.0/lib/async/http/server.rb:67 in 'block in Async::HTTP::Server#run'
| /opt/fluent/lib/ruby/gems/3.4.0/gems/async-2.24.0/lib/async/task.rb:200 in 'block in Async::Task#run'
| /opt/fluent/lib/ruby/gems/3.4.0/gems/async-2.24.0/lib/async/task.rb:438 in 'block in Async::Task#schedule'

Additional context
The crash leaves the fluent-plugin-prometheus /aggregated_metrics endpoint permanently
returning HTTP 500 (body: "Connection refused - connect(2) for 127.0.0.1:<port>") until
Fluentd is fully restarted. The affected thread is one of the per-worker Prometheus stats
collectors that the aggregation endpoint queries internally.
The stack trace points to io-endpoint-0.15.2 and async-http-0.89.0 — the new thread starts
its HTTP server bind before the old thread's socket is fully closed, suggesting either a missing
SO_REUSEPORT/SO_REUSEADDR option or insufficient drain time before rebinding during a
SIGHUP-triggered reload.
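If the cause is a lingering TIME_WAIT socket, setting SO_REUSEADDR before bind(2) would let the new thread rebind immediately (SO_REUSEPORT would additionally allow overlapping live binds). A plain-socket Ruby sketch of the idea, not io-endpoint's actual wrapper:

```ruby
require "socket"

# Build a listener that can rebind a port whose previous socket is still
# in TIME_WAIT from the old worker. SO_REUSEADDR must be set before bind(2).
def reusable_listener(host, port)
  sock = Socket.new(:INET, :STREAM)
  sock.setsockopt(Socket::SOL_SOCKET, Socket::SO_REUSEADDR, true)
  sock.bind(Addrinfo.tcp(host, port))
  sock.listen(128)
  sock
end

listener = reusable_listener("127.0.0.1", 0)  # port 0: OS picks a free port
listener.close
```

Note that SO_REUSEADDR only helps when the old socket is in TIME_WAIT; it does not allow binding over a socket that is still actively listening, which is why a short drain or retry may still be needed.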
Our workaround is to avoid systemctl reload (SIGHUP) in favor of systemctl restart (full
stop + start), which guarantees the old process is dead and all sockets released before the new
one starts.