-
Notifications
You must be signed in to change notification settings - Fork 1.8k
log: Update log related metrics with a certian interval #11084
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
log: Update log related metrics with a certian interval #11084
Conversation
Signed-off-by: Hiroshi Hatake <[email protected]>
WalkthroughAdded heartbeat-driven metrics emission with cross-platform monotonic timing, new metrics lifecycle APIs, and heartbeat scheduling fields on the public log struct; heartbeat checks are invoked from the log print path to emit per-type counters at configured intervals. Changes
Sequence DiagramsequenceDiagram
participant Print as flb_log_print()
participant Time as monotonic_now_ns()
participant HB as Heartbeat Check
participant Metrics as flb_log_metrics_heartbeat()
Print->>Time: get current monotonic ns
Print->>HB: check (log->next_hb_ns, current_ns)
alt current_ns >= log->next_hb_ns
HB->>Metrics: emit per-type counters
Metrics-->>HB: update metrics state
HB->>Print: advance log->next_hb_ns
else not yet
HB-->>Print: no-op
end
Print->>Print: increment per-type counter
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes
Poem
Pre-merge checks and finishing touches❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (3 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/flb_log.c (1)
667-674: Incomplete error cleanup leads to resource leaks.If
flb_log_metrics_create()fails, the error path at lines 670-673 destroys the event loop and frees the log structure, but doesn't clean up thech_mngpipe created at line 644. This leaks the pipe file descriptors.Apply this diff:
log->metrics = flb_log_metrics_create(); if (log->metrics == NULL) { fprintf(stderr, "[log] could not create log metrics\n"); + flb_pipe_destroy(log->ch_mng); mk_event_loop_destroy(log->evl); flb_free(log); config->log = NULL; return NULL; }
Additionally, the error path at lines 694-700 is also incomplete: it should call
flb_log_metrics_destroy(log->metrics)before freeing the log structure.Apply this diff:
ret = flb_log_worker_init(worker); if (ret == -1) { flb_errno(); + flb_log_metrics_destroy(log->metrics); + flb_pipe_destroy(log->ch_mng); mk_event_loop_destroy(log->evl); flb_free(log); config->log = NULL;
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
include/fluent-bit/flb_log.h(1 hunks)src/flb_log.c(3 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
src/flb_log.c (1)
lib/cmetrics/src/cmt_counter.c (1)
cmt_counter_add(119-135)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (30)
- GitHub Check: pr-windows-build / call-build-windows-package (Windows 32bit, x86, x86-windows-static, 3.31.6)
- GitHub Check: pr-windows-build / call-build-windows-package (Windows 64bit (Arm64), amd64_arm64, -DCMAKE_SYSTEM_NAME=Windows -DCMA...
- GitHub Check: pr-windows-build / call-build-windows-package (Windows 64bit, x64, x64-windows-static, 3.31.6)
- GitHub Check: pr-compile-system-libs (-DFLB_PREFER_SYSTEM_LIBS=On, 3.31.6, clang, clang++, ubuntu-22.04, clang-12)
- GitHub Check: pr-compile-system-libs (-DFLB_PREFER_SYSTEM_LIBS=On, 3.31.6, gcc, g++, ubuntu-24.04, clang-14)
- GitHub Check: pr-compile-centos-7
- GitHub Check: pr-compile-system-libs (-DFLB_PREFER_SYSTEM_LIBS=On, 3.31.6, clang, clang++, ubuntu-24.04, clang-14)
- GitHub Check: pr-compile-system-libs (-DFLB_PREFER_SYSTEM_LIBS=On, 3.31.6, gcc, g++, ubuntu-22.04, clang-12)
- GitHub Check: pr-compile-without-cxx (3.31.6)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_SANITIZE_MEMORY=On, 3.31.6, clang, clang++)
- GitHub Check: PR - fuzzing test
- GitHub Check: run-ubuntu-unit-tests (-DFLB_SIMD=Off, 3.31.6, clang, clang++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_SIMD=On, 3.31.6, gcc, g++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_SANITIZE_THREAD=On, 3.31.6, gcc, g++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_SIMD=Off, 3.31.6, gcc, g++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_ARROW=On, 3.31.6, gcc, g++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_SANITIZE_THREAD=On, 3.31.6, clang, clang++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_SIMD=On, 3.31.6, clang, clang++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_JEMALLOC=On, 3.31.6, gcc, g++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_JEMALLOC=Off, 3.31.6, gcc, g++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_SANITIZE_MEMORY=On, 3.31.6, gcc, g++)
- GitHub Check: run-ubuntu-unit-tests (-DSANITIZE_ADDRESS=On, 3.31.6, gcc, g++)
- GitHub Check: run-ubuntu-unit-tests (-DSANITIZE_UNDEFINED=On, 3.31.6, gcc, g++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_COVERAGE=On, 3.31.6, gcc, g++)
- GitHub Check: run-ubuntu-unit-tests (-DSANITIZE_ADDRESS=On, 3.31.6, clang, clang++)
- GitHub Check: run-ubuntu-unit-tests (-DSANITIZE_UNDEFINED=On, 3.31.6, clang, clang++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_SMALL=On, 3.31.6, clang, clang++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_SMALL=On, 3.31.6, gcc, g++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_JEMALLOC=Off, 3.31.6, clang, clang++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_JEMALLOC=On, 3.31.6, clang, clang++)
🔇 Additional comments (4)
src/flb_log.c (4)
544-553: LGTM!Proper null checks and cleanup order. The function safely handles null inputs and destroys resources in the correct sequence.
559-609: LGTM!The metrics creation logic is well-structured with proper error handling. Each failure path correctly calls
flb_log_metrics_destroy()to clean up partial state. The use ofcmt_counter_set()for initialization is appropriate.Note: The loop at lines 592-606 has the same range limitation as the heartbeat function (excludes HELP and IDEBUG), which is consistent but worth confirming as intentional.
892-892: LGTM!The heartbeat integration is correctly placed in the log print path. While this adds a small overhead to every log call, the heartbeat function has an early return (line 510 in
flb_log_metrics_heartbeat) that minimizes impact when the heartbeat isn't due. The use of wall-clock time (cfl_time_now()) for metric timestamps while the heartbeat uses monotonic time for scheduling is the correct approach.
963-963: LGTM!The metrics cleanup is properly integrated into the log destruction sequence, ensuring resources are released before the log structure is freed.
| log->hb_interval_ns = 60ULL * 1000000000ULL; | ||
| log->next_hb_ns = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hardcoded heartbeat interval should be configurable.
The heartbeat interval is hardcoded to 60 seconds (line 641), but the PR test configuration uses scrape_interval=60s in the fluentbit_metrics input. If users configure a different scrape interval, the hardcoded heartbeat won't align with the expected metric update frequency.
Consider:
- Reading the scrape interval from the
fluentbit_metricsinput configuration - Making the heartbeat interval configurable via a parameter
- Documenting why 60 seconds was chosen if it's intentional to be fixed
Initial heartbeat fires immediately.
Setting next_hb_ns = 0 (line 642) means the first heartbeat will trigger on the first log message, since monotonic_now_ns() returns a positive value. This may be intentional to ensure metrics are emitted early, but consider setting next_hb_ns = monotonic_now_ns() + hb_interval_ns if you want the first heartbeat to fire after one full interval.
🤖 Prompt for AI Agents
In src/flb_log.c around lines 641-642, the heartbeat interval is hardcoded to
60s and next_hb_ns is set to 0; change this so the hb interval is configurable
and the initial heartbeat is scheduled correctly: read the scrape interval from
the fluentbit_metrics input (or expose a new parameter) and use that value (with
a sensible default of 60s) to set log->hb_interval_ns, and set log->next_hb_ns =
monotonic_now_ns() + log->hb_interval_ns so the first heartbeat fires after one
interval (also add a short comment and update docs/config example to note the
new parameter or why 60s is default).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a draft proposal so we wouldn't be needed to make configurable for now.
So, we'll choose option 3 for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cosmo0920, understood! Since this is a draft proposal, keeping the hardcoded 60-second interval is perfectly reasonable for now. You can make it configurable later if needed once the approach is validated. Thanks for clarifying! 👍
Signed-off-by: Hiroshi Hatake <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (2)
src/flb_log.c (2)
517-521: Usecontinueinstead ofbreakfor robustness.Line 520 uses
breakwhenmessage_type_stris NULL, which stops processing all remaining log types. While NULL is not expected for the current loop range (FLB_LOG_ERROR through FLB_LOG_TRACE are all defined in the switch), usingcontinuewould be more defensive and allow processing of subsequent types if the loop range or type definitions change in the future.Apply this diff:
if (!message_type_str) { - break; + continue; }
502-528: Consider thread-safety for heartbeat scheduling.The
next_hb_nsfield (lines 513, 527) is read and written without synchronization. Whenflb_log_print()is called concurrently from multiple threads, multiple threads might observe an expirednext_hb_nsand all emit heartbeats simultaneously. Whilecmt_counter_add()is atomic and prevents data corruption, this could result in redundant heartbeat emissions.For improved accuracy, consider adding a compare-and-swap or mutex to ensure only one thread advances
next_hb_nsper interval. However, since duplicate heartbeats are not critical, this optimization can be deferred.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
src/flb_log.c(3 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
src/flb_log.c (1)
lib/cmetrics/src/cmt_counter.c (1)
cmt_counter_add(119-135)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (30)
- GitHub Check: pr-windows-build / call-build-windows-package (Windows 64bit, x64, x64-windows-static, 3.31.6)
- GitHub Check: pr-windows-build / call-build-windows-package (Windows 64bit (Arm64), amd64_arm64, -DCMAKE_SYSTEM_NAME=Windows -DCMA...
- GitHub Check: pr-windows-build / call-build-windows-package (Windows 32bit, x86, x86-windows-static, 3.31.6)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_SIMD=On, 3.31.6, gcc, g++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_ARROW=On, 3.31.6, gcc, g++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_SIMD=Off, 3.31.6, gcc, g++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_SIMD=Off, 3.31.6, clang, clang++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_SANITIZE_THREAD=On, 3.31.6, clang, clang++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_SANITIZE_MEMORY=On, 3.31.6, clang, clang++)
- GitHub Check: run-ubuntu-unit-tests (-DSANITIZE_UNDEFINED=On, 3.31.6, gcc, g++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_SANITIZE_MEMORY=On, 3.31.6, gcc, g++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_JEMALLOC=On, 3.31.6, gcc, g++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_SIMD=On, 3.31.6, clang, clang++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_SANITIZE_THREAD=On, 3.31.6, gcc, g++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_COVERAGE=On, 3.31.6, gcc, g++)
- GitHub Check: run-ubuntu-unit-tests (-DSANITIZE_ADDRESS=On, 3.31.6, clang, clang++)
- GitHub Check: run-ubuntu-unit-tests (-DSANITIZE_UNDEFINED=On, 3.31.6, clang, clang++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_JEMALLOC=On, 3.31.6, clang, clang++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_JEMALLOC=Off, 3.31.6, gcc, g++)
- GitHub Check: run-ubuntu-unit-tests (-DSANITIZE_ADDRESS=On, 3.31.6, gcc, g++)
- GitHub Check: PR - fuzzing test
- GitHub Check: run-ubuntu-unit-tests (-DFLB_SMALL=On, 3.31.6, clang, clang++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_SMALL=On, 3.31.6, gcc, g++)
- GitHub Check: run-ubuntu-unit-tests (-DFLB_JEMALLOC=Off, 3.31.6, clang, clang++)
- GitHub Check: pr-compile-system-libs (-DFLB_PREFER_SYSTEM_LIBS=On, 3.31.6, clang, clang++, ubuntu-24.04, clang-14)
- GitHub Check: pr-compile-system-libs (-DFLB_PREFER_SYSTEM_LIBS=On, 3.31.6, clang, clang++, ubuntu-22.04, clang-12)
- GitHub Check: pr-compile-centos-7
- GitHub Check: pr-compile-system-libs (-DFLB_PREFER_SYSTEM_LIBS=On, 3.31.6, gcc, g++, ubuntu-24.04, clang-14)
- GitHub Check: pr-compile-system-libs (-DFLB_PREFER_SYSTEM_LIBS=On, 3.31.6, gcc, g++, ubuntu-22.04, clang-12)
- GitHub Check: pr-compile-without-cxx (3.31.6)
🔇 Additional comments (4)
src/flb_log.c (4)
466-491: LGTM: Overflow mitigation is sound.The Windows monotonic time calculation now splits the division to avoid overflow (lines 487-490), addressing the previous review concern. The approach is correct for all practical purposes.
Note: A theoretical overflow still exists if
rem * 1000000000ULLexceedsUINT64_MAX(requires performance counter frequency > 18 billion Hz), but this is extremely unlikely in real-world systems.
493-500: LGTM: Standard POSIX monotonic clock implementation.The non-Windows implementation correctly uses
CLOCK_MONOTONICwithclock_gettime, which is the standard approach for obtaining monotonic timestamps.
644-645: LGTM: Heartbeat configuration is appropriate for draft proposal.The 60-second interval aligns with the test configuration's scrape interval, and setting
next_hb_ns = 0ensures the first heartbeat emits on the first log message, which is sensible for initialization.As discussed in previous reviews, making the interval configurable can be deferred until the draft is finalized.
895-895: LGTM: Heartbeat invocation placement is correct.Calling
flb_log_metrics_heartbeat()before incrementing the log counter ensures metrics are refreshed at the configured interval, which addresses the PR objective of keeping fluentbit_* metrics alive for prometheus_remote_write.Note: This adds a
monotonic_now_ns()call overhead to every log message, even when the heartbeat interval hasn't elapsed. For most workloads this overhead is negligible, but if profiling reveals performance issues with high-frequency logging, consider sampling or caching the monotonic time check.
|
This could be first aid approach. So, we need to reevaluate to mitigate this type of issue. |
Fixes #11082
Enter
[N/A]in the box, if an item is not applicable to your change.Testing
Before we can approve your change; please submit the following in a comment:
With the following configuration and Prometheus:
Launch Prometheus with:
$ prometheus --config.file=/path/to/prometheus.yaml --web.enable-remote-write-receiverIf this is a change to packaging of containers or native binaries then please confirm it works for all targets.
ok-package-testlabel to test for all targets (requires maintainer to do).Documentation
Backporting
Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.
Summary by CodeRabbit