Harden data races between threaded input workers and the main engine by erain · Pull Request #12007 · fluent/fluent-bit

erain · 2026-06-26T21:17:29Z

Summary

Three data races between threaded input worker threads and the main
engine thread, all surfaced by ThreadSanitizer while running a threaded
tail input (with multiline) alongside the metrics exporter. They are
undefined under the C memory model and benign on the supported hardware, but
worth closing:

metrics — flb_metrics_sum() updates a counter from the owning
input/output worker thread while the metrics exporter reads the same counter
from the main thread. This one is live whenever metrics are scraped together
with a threaded input.
input — flb_input_collector_fd() runs on the main thread but iterated a
threaded input's collectors, racing the worker thread that initializes those
collector fds at startup. A threaded input's collectors are registered and
dispatched in the input's own thread/event loop and are never matched here,
so the handler now skips threaded inputs (also a small optimization).
engine / bin — config->grace_input is published by the engine thread
during startup and read by the supervisor on the main thread.

A small include/fluent-bit/flb_atomic.h helper is added (relaxed GCC/Clang
__atomic builtins, plain-access fallback) and used for the metric counter and
grace_input.

Note: these are not the ARM64 SIGSEGV in
flb_input_chunk_ring_buffer_collector. That one is a separate unsynchronized
SPSC hand-off that is already serialized by the flb_ring_buffer mutex on
master. These changes were found with the same ThreadSanitizer setup while
investigating threaded-mode data races (cf. #9835).

Testing

Built with ThreadSanitizer (-fsanitize=thread, jemalloc disabled) and run for
~45s against a threaded tail + multiline.parser cri + null pipeline driven
by a CRI log generator (~6M lines, 6 files):

Before: TSan reports exactly these 3 races (flb_metrics_sum,
flb_input_collector_fd, grace_input).
After: 0 ThreadSanitizer reports, clean shutdown, no crash.

Example configuration used:

[SERVICE]
    flush     0.2
    log_level info

[INPUT]
    name              tail
    path              /tmp/repro/logs/*.log
    read_from_head    true
    threaded          on
    multiline.parser  cri
    tag               kube.*

[OUTPUT]
    name   null
    match  *

Example configuration file for the change (above)
Debug log output from testing the change (ThreadSanitizer before/after summarized above; full logs available on request)
[N/A] Valgrind output — ThreadSanitizer is the appropriate tool for these data races; memcheck does not detect them. TSan before/after provided instead.

Documentation

[N/A] Documentation required for this feature (no user-facing/behavioral change)

Backporting

Backport to latest stable release (safe, self-contained; at maintainers' discretion)

Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.

Summary by CodeRabbit

Bug Fixes
- Improved thread-safe handling for shared runtime values using relaxed atomic operations.
- Metric counters now update and display/export via atomic relaxed fetch/load to reduce data races.
- Collector discovery during initialization now skips threaded collectors to avoid unstable concurrent access.
- Supervisor grace-period state updates now use a consistent atomic snapshot across threads.
New Features
- Added a shared atomic utility interface to support safe cross-thread scalar operations across supported compilers.

coderabbitai · 2026-06-26T21:17:54Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 6e4e4169-3d2a-4f98-a3ca-8fb49389bfc4

📥 Commits

Reviewing files that changed from the base of the PR and between 144901f and 6395b8c.

📒 Files selected for processing (5)

include/fluent-bit/flb_atomic.h
src/flb_engine.c
src/flb_input.c
src/flb_metrics.c
src/fluent-bit.c

🚧 Files skipped from review as they are similar to previous changes (3)

src/flb_input.c
src/flb_metrics.c
src/flb_engine.c

📝 Walkthrough

Walkthrough

Adds relaxed atomic helpers and updates shared scalar reads and writes in engine, metrics, input, and supervisor paths.

Changes

Shared state atomics

Layer / File(s)	Summary
Atomic helper header `include/fluent-bit/flb_atomic.h`	Defines relaxed load, store, and fetch-add macros with compiler builtins, an MSVC backend, and a compile-time failure for unsupported compilers.
Grace input publication `src/flb_engine.c`, `src/fluent-bit.c`	Stores `config->grace_input` atomically in the engine and reads the published value atomically in the supervisor update path.
Metric counter atomics `src/flb_metrics.c`	Uses atomic fetch-add for accumulation and atomic loads when printing and dumping metric values.
Threaded collector skip `src/flb_input.c`	Skips threaded input instances in `flb_input_collector_fd()` during the collector scan for the main-thread fd path.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested reviewers

cosmo0920

Poem

Hop hop, I found a safer way to share,
relaxed atoms glinting in the air.
Metrics count, and grace hops cleanly by,
while threaded fds drift past my bunny eye.
Thump thump, concurrency’s carrot feels fair.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 10.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly summarizes the PR’s main goal of hardening cross-thread data races in the engine path.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands.}

coderabbitai

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@include/fluent-bit/flb_atomic.h`:
- Around line 44-53: The fallback in flb_atomic.h currently maps
flb_atomic_load, flb_atomic_store, and flb_atomic_fetch_add to plain accesses,
which is unsafe for shared cross-thread state. Update the `#else` branch to use a
real atomic implementation for non-GCC/Clang compilers, or fail the build with
an explicit error if no atomic backend is available. Keep the fix scoped to the
flb_atomic_* macros so all existing call sites get proper atomic semantics.

In `@src/flb_engine.c`:
- Around line 1138-1139: The grace window update in flb_engine_started setup is
happening too late, so the main thread can observe a stale grace_input value
after FLB_ENGINE_STARTED is published. Move the flb_atomic_store for
config->grace_input to occur before calling flb_engine_started(config), keeping
the update in the startup path around the existing grace_input logic so
src/fluent-bit.c reads the correct supervisor grace window.

In `@src/fluent-bit.c`:
- Around line 1477-1479: The supervisor grace publication is only using an
atomic load in the fixed path, but the later hot-reload paths in the same
function still read ctx->config->grace_input directly and can reintroduce the
race. Update those additional grace publication sites in the function that
handles hot reloads to use flb_atomic_load on ctx->config->grace_input before
calling flb_supervisor_child_update_grace, keeping the access pattern consistent
everywhere in this flow.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: bb82bdf9-9dc4-4eb0-ab1c-9e06c68eaf41

📥 Commits

Reviewing files that changed from the base of the PR and between f317143 and 144901f.

📒 Files selected for processing (5)

include/fluent-bit/flb_atomic.h
src/flb_engine.c
src/flb_input.c
src/flb_metrics.c
src/fluent-bit.c

flb_metrics_sum() updates a counter from the owning input or output worker thread while the metrics exporter reads the same counter from the main engine thread. With a threaded input this is a data race reported by ThreadSanitizer; it is benign on the supported hardware but undefined under the C memory model. Add a small flb_atomic.h helper with relaxed load/store/fetch_add and use it for the metric value on both the summing and the reading paths. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Yu Yi <yiyu@google.com>

flb_input_collector_fd() runs on the main engine thread but iterated the collectors of every input, including threaded ones. A threaded input initializes its collector descriptors from its own worker thread, so this races with the main thread reading them at startup. Those collectors are registered and dispatched in the input's own thread and event loop, so they are never matched here. Skipping threaded inputs removes the race and avoids needless iteration. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Yu Yi <yiyu@google.com>

config->grace_input is written once by the engine thread during startup and read concurrently by the supervisor on the main thread, which ThreadSanitizer reports as a data race. Store it with a relaxed atomic; the matching atomic read is done at the supervisor entry point. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Yu Yi <yiyu@google.com>

The supervisor entry point reads config->grace_input, which the engine thread publishes during startup. Read it with a relaxed atomic so the cross-thread access is well defined, matching the engine-side store. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Yu Yi <yiyu@google.com>

erain requested review from cosmo0920 and edsiper as code owners June 26, 2026 21:17

github-actions Bot added the docs-required label Jun 26, 2026

erain temporarily deployed to pr June 26, 2026 21:17 — with GitHub Actions Inactive

coderabbitai Bot reviewed Jun 26, 2026

View reviewed changes

Comment thread include/fluent-bit/flb_atomic.h Outdated

Comment thread src/flb_engine.c Outdated

Comment thread src/fluent-bit.c

erain temporarily deployed to pr June 26, 2026 21:48 — with GitHub Actions Inactive

erain and others added 4 commits June 27, 2026 20:59

erain force-pushed the upstream-pr/threaded-input-data-races branch from 144901f to 6395b8c Compare June 28, 2026 01:00

erain temporarily deployed to pr June 28, 2026 01:00 — with GitHub Actions Inactive

erain temporarily deployed to pr June 28, 2026 01:31 — with GitHub Actions Inactive

erain deployed to pr June 28, 2026 01:31 — with GitHub Actions Active

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Harden data races between threaded input workers and the main engine#12007

Harden data races between threaded input workers and the main engine#12007
erain wants to merge 4 commits into
fluent:masterfrom
erain:upstream-pr/threaded-input-data-races

erain commented Jun 26, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Jun 26, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Suggested reviewers

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

erain commented Jun 26, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Jun 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Suggested reviewers

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

erain commented Jun 26, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 26, 2026 •

edited

Loading