out_loki: Handle loki tenant_id_key properly#11832

Open
cosmo0920 wants to merge 6 commits into master from cosmo0920-handle-loki-tenant-id-properly

Conversation

@cosmo0920
Contributor

@cosmo0920 cosmo0920 commented May 21, 2026

The previous implementation only peeked at the first tenant_id_key value found in a chunk and used it for the whole chunk.
Instead, we have to inspect the records inside each chunk, because a single chunk can contain records with different tenant IDs.
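
The difference between the two approaches can be illustrated with a minimal sketch in plain C. This is not the plugin's code: `struct record`, `first_tenant()` and `collect_tenants()` are hypothetical names, and a record is reduced to the value found under tenant_id_key. Records without a tenant value are simply skipped here, which is an assumption of the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a log record; only the tenant value matters. */
struct record {
    const char *tenant;   /* value found under tenant_id_key, or NULL */
    const char *line;     /* log line, unused by the sketch */
};

/* Chunk-level peek: return the first tenant value found in the chunk. */
const char *first_tenant(const struct record *recs, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++) {
        if (recs[i].tenant != NULL) {
            return recs[i].tenant;
        }
    }
    return NULL;
}

/*
 * Per-record inspection: store each distinct tenant in out[] (at most
 * cap entries, in first-seen order) and return how many were stored.
 */
size_t collect_tenants(const struct record *recs, size_t n,
                       const char **out, size_t cap)
{
    size_t i, j, count = 0;

    assert(recs != NULL || n == 0);

    for (i = 0; i < n; i++) {
        if (recs[i].tenant == NULL) {
            continue;   /* assumption: records without a tenant are skipped */
        }
        for (j = 0; j < count; j++) {
            if (strcmp(out[j], recs[i].tenant) == 0) {
                break;
            }
        }
        if (j == count && count < cap) {
            out[count++] = recs[i].tenant;
        }
    }
    return count;
}
```

For a chunk holding records for tenant-a, tenant-b and tenant-a again, `first_tenant()` only ever reports tenant-a, while `collect_tenants()` reports both tenants.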

Closes #11824.


Enter [N/A] in the box, if an item is not applicable to your change.

Testing
Before we can approve your change, please submit the following in a comment:

  • Example configuration file for the change
  • Debug log output from testing the change
  • Attached Valgrind output that shows no leaks or memory corruption was found

If this is a change to packaging of containers or native binaries then please confirm it works for all targets.

  • Run local packaging test showing all targets (including any new ones) build.
  • Set ok-package-test label to test for all targets (requires maintainer to do).

Documentation

  • Documentation required for this feature

fluent/fluent-bit-docs#2581

Backporting

  • Backport to latest stable release.

Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.

Summary by CodeRabbit

  • Improvements

    • Logs are grouped per tenant and sent as separate payloads when a tenant key is configured; single-payload behavior remains when not set.
    • More consistent payload handling, cleanup, and retry signaling across tenant-scoped deliveries.
  • New Features

    • New configuration option to control flush outcome on mixed-tenant results (partial_success vs partial_error).
    • Payload submission explicitly scoped per tenant via request headers.
  • Tests

    • End-to-end tests validating tenant-based request splitting and both error-handling modes.
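
One way to picture the two modes is as a reduction over the per-tenant send results. The sketch below is an assumption, not the plugin's logic: it supposes that partial_success reports the flush as OK when only some tenant groups failed, and that partial_error reports it as failed in the same situation. The names `flush_outcome`, `mixed_policy` and `aggregate_outcome()` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum flush_outcome { FLUSH_OK = 0, FLUSH_FAILED = 1 };
enum mixed_policy { POLICY_PARTIAL_SUCCESS, POLICY_PARTIAL_ERROR };

/*
 * group_ok[i] is non-zero when the payload for tenant group i was accepted.
 * ASSUMPTION: the policy only matters when results are mixed.
 */
enum flush_outcome aggregate_outcome(const int *group_ok, size_t n,
                                     enum mixed_policy policy)
{
    size_t i, ok = 0;

    assert(group_ok != NULL || n == 0);

    for (i = 0; i < n; i++) {
        if (group_ok[i]) {
            ok++;
        }
    }
    if (ok == n) {
        return FLUSH_OK;       /* every group succeeded */
    }
    if (ok == 0) {
        return FLUSH_FAILED;   /* every group failed */
    }
    /* Mixed result: the configured policy decides (assumed semantics). */
    return policy == POLICY_PARTIAL_SUCCESS ? FLUSH_OK : FLUSH_FAILED;
}
```

Whether a failed flush is retried or dropped is outside the scope of this sketch.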

Review Change Stack

cosmo0920 added 2 commits May 21, 2026 15:49
Signed-off-by: Hiroshi Hatake <hiroshi@chronosphere.io>
Signed-off-by: Hiroshi Hatake <hiroshi@chronosphere.io>
@coderabbitai

coderabbitai Bot commented May 21, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

This PR replaces dynamic per-thread tenant tracking with per-flush tenant grouping: it extracts tenant IDs from records, groups records by tenant during flush, composes tenant-filtered payloads, and sends one HTTP POST per tenant group with configurable error-handling.

Changes

Loki Tenant Grouping Refactor

| Layer | File(s) | Summary |
|---|---|---|
| Tenant grouping data structures and helpers | plugins/out_loki/loki.c | New flb_loki_tenant_group and helpers extract tenant IDs from records, implement null-safe tenant comparison and effective-tenant selection, and remove pre–remove-keys tenant-id mutation in pack_record(). |
| Initialization, config parsing, and struct updates | plugins/out_loki/loki.c, plugins/out_loki/loki.h | Remove dynamic-tenant TLS/list fields and their init/cleanup; initialize remove-MPA TLS only; add parsing for tenant_id_key_error_handling in loki_config_create(); extend struct flb_loki with tenant_id_key_error_handling and out_tenant_id_key_error_handling; add FLB_LOKI_TENANT_ID_KEY_ERROR_* macros and config_map entry. |
| Payload composition with per-record tenant filtering | plugins/out_loki/loki.c | loki_compose_payload() signature now accepts tenant_filter/filter_tenant; per-record tenant extraction and tenant_id_matches() gate whether records are packed during both cached-labels and non-cached-labels composition paths; cb_loki_format_test updated accordingly. |
| Tenant-group payload send and HTTP submission | plugins/out_loki/loki.c | Add collect_tenant_groups(), tenant_group_* helpers, and send_loki_payload() to optionally gzip, create HTTP requests, set auth and X-Scope-OrgID from the explicit tenant_id argument, and consistently cleanup and return retry on failures. |
| Flush orchestration and per-tenant dispatch | plugins/out_loki/loki.c | cb_loki_flush rewritten: if tenant_id_key unset, compose/send a single payload; otherwise collect tenant groups, compose/send one payload per group, and aggregate per-group outcomes according to tenant_id_key_error_handling. |
| Runtime tests for tenant behavior | tests/runtime/out_loki.c | Add mock tenant-policy server, capture buffers, helpers to validate captured tenant payloads, and three integration tests: tenant split, partial_success, and partial_error modes; register tests in TEST_LIST. |
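
The "null-safe tenant comparison and effective-tenant selection" mentioned in the summary could look roughly like the sketch below. The function names are hypothetical and are not taken from loki.c. The fallback to a configured default tenant when a record carries no tenant value is an assumption, as is the meaning given to the tenant_filter/filter_tenant pair.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Null-safe equality: two NULLs match, NULL never matches a string. */
int tenant_eq(const char *a, const char *b)
{
    if (a == NULL || b == NULL) {
        return a == b;
    }
    return strcmp(a, b) == 0;
}

/*
 * Effective tenant for a record. ASSUMPTION: a missing or empty
 * per-record value falls back to the configured default tenant.
 */
const char *effective_tenant(const char *record_tenant,
                             const char *default_tenant)
{
    if (record_tenant != NULL && record_tenant[0] != '\0') {
        return record_tenant;
    }
    return default_tenant;
}

/*
 * Gate used while composing a payload: when filter_tenant is zero every
 * record passes, otherwise only records whose effective tenant equals
 * tenant_filter are packed.
 */
int record_passes_filter(const char *record_tenant,
                         const char *default_tenant,
                         const char *tenant_filter, int filter_tenant)
{
    if (!filter_tenant) {
        return 1;
    }
    return tenant_eq(effective_tenant(record_tenant, default_tenant),
                     tenant_filter);
}
```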

Sequence Diagram

sequenceDiagram
  participant FlushCB as cb_loki_flush
  participant GroupHelper as collect_tenant_groups
  participant PayloadCompose as loki_compose_payload
  participant SendPayload as send_loki_payload
  participant HTTPClient as HTTP Client

  FlushCB->>GroupHelper: Check if tenant_id_key configured
  alt No tenant_id_key
    GroupHelper-->>FlushCB: Single payload (no grouping)
    FlushCB->>PayloadCompose: Compose with tenant_filter=NULL
    PayloadCompose-->>FlushCB: Payload buffer
    FlushCB->>SendPayload: Send with default tenant_id
    SendPayload->>HTTPClient: POST with X-Scope-OrgID (default)
  else tenant_id_key configured
    GroupHelper->>GroupHelper: Iterate records, extract effective tenant IDs
    GroupHelper-->>FlushCB: Array of tenant groups with record counts
    loop For each tenant group
      FlushCB->>PayloadCompose: Compose with tenant_filter=group_tenant_id
      PayloadCompose-->>FlushCB: Tenant-filtered payload
      FlushCB->>SendPayload: Send with group_tenant_id
      SendPayload->>HTTPClient: POST with X-Scope-OrgID (group tenant)
    end
  end
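
The flow in the diagram can be approximated in a few lines of C. This is a simplified sketch under stated assumptions, not the code in cb_loki_flush: the HTTP POST is replaced by a hypothetical `send_stub()` that records the X-Scope-OrgID value it would use, every record is assumed to already carry a resolved, non-NULL tenant, and grouping is done with a naive quadratic scan.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_SENT 8

/* What the stub "sent": the tenant header value and the record count. */
struct sent_request {
    const char *org_id;
    size_t records;
};

struct sent_request sent_log[MAX_SENT];
size_t sent_count = 0;

/* Hypothetical replacement for the HTTP POST. */
int send_stub(const char *org_id, size_t records)
{
    if (sent_count >= MAX_SENT) {
        return -1;
    }
    sent_log[sent_count].org_id = org_id;
    sent_log[sent_count].records = records;
    sent_count++;
    return 0;
}

/*
 * tenants[i] is the resolved tenant of record i (assumed non-NULL).
 * Returns the number of payloads handed to send_stub().
 */
size_t flush_sketch(const char **tenants, size_t n,
                    int split_by_tenant, const char *default_tenant)
{
    size_t i, j, k, matched, groups = 0;

    assert(tenants != NULL || n == 0);

    if (!split_by_tenant) {
        /* No tenant key configured: one payload with the default tenant. */
        send_stub(default_tenant, n);
        return 1;
    }
    for (i = 0; i < n; i++) {
        /* Skip tenants already handled by an earlier record. */
        for (j = 0; j < i; j++) {
            if (strcmp(tenants[j], tenants[i]) == 0) {
                break;
            }
        }
        if (j < i) {
            continue;
        }
        /* "Compose" the tenant-filtered payload by counting its records. */
        matched = 0;
        for (k = i; k < n; k++) {
            if (strcmp(tenants[k], tenants[i]) == 0) {
                matched++;
            }
        }
        send_stub(tenants[i], matched);
        groups++;
    }
    return groups;
}
```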

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested reviewers

  • edsiper

Poem

🐰 A tenant hops into a flush so neat,
Records gathered by ID, no threads to meet,
One payload each, headers set just right,
Tests listen close through the mock server night.
Hooray — grouped sends keep the logs polite!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 4.65% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title 'out_loki: Handle loki tenant_id_key properly' clearly summarizes the main change: fixing proper handling of tenant_id_key in the Loki output plugin. |
| Linked Issues check | ✅ Passed | The PR directly addresses issue #11824 by implementing per-record tenant routing instead of chunk-level tenant inspection, enabling correct multi-tenant log distribution to Loki. |
| Out of Scope Changes check | ✅ Passed | All changes are scoped to implementing tenant_id_key handling: modifications to loki.c/loki.h plugin code and corresponding runtime tests validate the new functionality. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 47a2f77672

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread plugins/out_loki/loki.c Outdated

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (1)
tests/runtime/out_loki.c (1)

76-101: 💤 Low value

The callback ordering dependency is implicit.

The headers callback captures data at slot = tenant_request_count, then the payload callback captures at the same slot and increments the count. This assumes headers callback is always invoked before payload callback for each request. While this matches Fluent Bit's current HTTP client behavior, consider adding a brief comment documenting this assumption.
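
A stripped-down illustration of the slot logic, with the kind of comment being asked for, might look like this. The names are hypothetical stand-ins for cb_loki_debug_headers, cb_loki_debug_payload and tenant_request_count, and the mutex mentioned in the review is left out to keep the sketch single-threaded.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_REQUESTS 4
#define CAPTURE_SIZE 128

char captured_headers[MAX_REQUESTS][CAPTURE_SIZE];
char captured_payloads[MAX_REQUESTS][CAPTURE_SIZE];
int request_count = 0;

/*
 * ORDERING ASSUMPTION: for each request the headers callback runs before
 * the payload callback. Both write to slot == request_count, and only the
 * payload callback advances the counter, so a reversed order would pair
 * a payload with the headers of the next request.
 */
void on_headers(const char *headers)
{
    int slot = request_count;

    assert(headers != NULL);
    if (slot < MAX_REQUESTS) {
        snprintf(captured_headers[slot], CAPTURE_SIZE, "%s", headers);
    }
}

void on_payload(const char *payload)
{
    int slot = request_count;

    assert(payload != NULL);
    if (slot < MAX_REQUESTS) {
        snprintf(captured_payloads[slot], CAPTURE_SIZE, "%s", payload);
        request_count++;   /* the request is complete; move to the next slot */
    }
}
```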

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/runtime/out_loki.c` around lines 76 - 101, The callbacks
cb_loki_debug_headers and cb_loki_debug_payload rely on an implicit ordering
(headers callback runs before payload callback) because cb_loki_debug_headers
records tenant_headers at slot = tenant_request_count while
cb_loki_debug_payload writes tenant_payloads at the same slot and then
increments tenant_request_count; add a concise comment above these functions (or
at their shared mutex/slot logic) explicitly documenting this assumption and
referencing the dependency between tenant_request_count, cb_loki_debug_headers,
and cb_loki_debug_payload so future readers know the ordering is relied upon and
why.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 2ea6b796-3361-4827-9545-b2c16a9980cb

📥 Commits

Reviewing files that changed from the base of the PR and between bcc2436 and 47a2f77.

📒 Files selected for processing (3)
  • plugins/out_loki/loki.c
  • plugins/out_loki/loki.h
  • tests/runtime/out_loki.c
💤 Files with no reviewable changes (1)
  • plugins/out_loki/loki.h

Signed-off-by: Hiroshi Hatake <hiroshi@chronosphere.io>

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (1)
tests/runtime/out_loki.c (1)

278-283: 💤 Low value

Missing flb_http_server_stop call on pthread_create failure.

If pthread_create fails after the HTTP server was successfully started on line 268, the server is destroyed without being stopped first. This could leak listening resources.

Suggested fix
     ret = pthread_create(&mock->thread, NULL, tenant_policy_server_loop, mock);
     if (ret != 0) {
+        flb_http_server_stop(&mock->server);
         flb_http_server_destroy(&mock->server);
         mk_event_loop_destroy(mock->event_loop);
         mock->event_loop = NULL;
         return -1;
     }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/runtime/out_loki.c` around lines 278 - 283, On pthread_create failure
path where ret != 0, stop the HTTP server before destroying it to avoid leaking
listening resources: call flb_http_server_stop(&mock->server) prior to
flb_http_server_destroy(&mock->server) (same block that clears mock->event_loop
and returns -1), ensuring the server is cleanly stopped when pthread_create
fails after a successful start.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: fdc3dad1-a832-49c3-a11e-c1fa89c7d1e6

📥 Commits

Reviewing files that changed from the base of the PR and between 47a2f77 and 607c3ae.

📒 Files selected for processing (3)
  • plugins/out_loki/loki.c
  • plugins/out_loki/loki.h
  • tests/runtime/out_loki.c

Signed-off-by: Hiroshi Hatake <hiroshi@chronosphere.io>

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/runtime/out_loki.c`:
- Around line 47-53: start_tenant_policy_server can fail before the thread is
created but stop_tenant_policy_server unconditionally joins/destroys fields and
also writes to mock->stop without synchronization; modify teardown to check
whether the thread was successfully started before joining or destroying (guard
joins/destroys on a flag such as server->thread_started or thread != 0),
initialize and atomically access/mock->stop (use atomic or mutex) or set stop
only if thread exists, and ensure start_tenant_policy_server sets the guard on
successful thread creation so stop_tenant_policy_server safely handles partially
initialized servers (update references to struct tenant_policy_server,
start_tenant_policy_server, stop_tenant_policy_server, mock->stop, and thread).
- Around line 1069-1072: The test is nondeterministic because
flb_service_set(...) doesn't restrict the scheduler window, so the
partial_success branch may schedule a retry after the 2.5s assertion; update the
helper that creates ctx (where flb_service_set is called) to explicitly set the
scheduler/retry window to a value that covers the full retry backoff (e.g., >=
expected retry interval, such as 3000 ms) by passing the appropriate
scheduler/retry window option into flb_service_set for ctx so the assertion
reliably covers a full retry interval.
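
For the first item, a guarded start/stop pair could be sketched as follows. This is an illustration only: `struct mock_server`, `mock_start()` and `mock_stop()` are hypothetical and leave out the HTTP server and event loop that the real test helper manages. The sketch uses a `thread_started` flag so teardown never joins a thread that was not created, and a C11 `atomic_int` for the stop flag.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

struct mock_server {
    pthread_t thread;
    int thread_started;   /* set only after pthread_create() succeeds */
    atomic_int stop;      /* written by the owner, read by the worker */
};

/* Worker loop: spin until the owner asks for a stop. */
static void *server_loop(void *arg)
{
    struct mock_server *m = arg;

    while (!atomic_load(&m->stop)) {
        /* a real loop would wait on events here */
    }
    return NULL;
}

int mock_start(struct mock_server *m)
{
    assert(m != NULL);

    m->thread_started = 0;
    atomic_init(&m->stop, 0);
    if (pthread_create(&m->thread, NULL, server_loop, m) != 0) {
        return -1;        /* thread_started stays 0 */
    }
    m->thread_started = 1;
    return 0;
}

/* Returns 1 if a thread was joined, 0 if there was nothing to stop. */
int mock_stop(struct mock_server *m)
{
    assert(m != NULL);

    if (!m->thread_started) {
        return 0;         /* partially initialized: nothing to join */
    }
    atomic_store(&m->stop, 1);
    pthread_join(m->thread, NULL);
    m->thread_started = 0;
    return 1;
}
```

Calling `mock_stop()` on a server whose start failed, or calling it twice, is then harmless.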

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: bdc616ec-a541-413a-8115-ac0dd06173b7

📥 Commits

Reviewing files that changed from the base of the PR and between 607c3ae and 4146c30.

📒 Files selected for processing (1)
  • tests/runtime/out_loki.c

Comment thread tests/runtime/out_loki.c
Comment thread tests/runtime/out_loki.c
Signed-off-by: Hiroshi Hatake <hiroshi@chronosphere.io>
Signed-off-by: Hiroshi Hatake <hiroshi@chronosphere.io>

@coderabbitai coderabbitai Bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tests/runtime/out_loki.c (1)

1042-1045: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Make the mixed-tenant input deterministic.

These tests send tenant-a and tenant-b in two separate flb_lib_push() calls, but the regression here only exists when different tenants share the same chunk. If those pushes end up in separate chunks, the old implementation would still pass and these tests won't prove the fix. Please force both records into one chunk/batch, or add an assertion that this path is exercising a mixed-tenant chunk.

Also applies to: 1115-1118
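
One possible shape for the single-push variant is to build one buffer that holds both records and hand it to a single flb_lib_push() call. The helper below is a hypothetical sketch, not part of the test file. It assumes the input accepts back-to-back records in one buffer, and the record layout in the usage note is only an illustrative assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Concatenate two serialized records into buf.
 * Returns the number of bytes written (excluding the terminating NUL),
 * or -1 if the result does not fit into cap bytes.
 */
int build_mixed_batch(char *buf, size_t cap,
                      const char *rec_a, const char *rec_b)
{
    int n;

    assert(buf != NULL && rec_a != NULL && rec_b != NULL);

    n = snprintf(buf, cap, "%s%s", rec_a, rec_b);
    if (n < 0 || (size_t) n >= cap) {
        return -1;
    }
    return n;
}
```

With records such as `[1, {"tenant":"tenant-a"}]` and `[2, {"tenant":"tenant-b"}]`, the returned length and buffer would be passed to one `flb_lib_push(ctx, in_ffd, buf, len)` call in place of the two separate pushes. Whether that guarantees a shared chunk still needs to be verified against the test.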

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/runtime/out_loki.c` around lines 1042 - 1045, The test is
non-deterministic because two separate flb_lib_push() calls (using ctx, in_ffd,
tenant_a and tenant_b) may create two chunks; change the test to ensure both
tenant_a and tenant_b records are pushed into the same chunk by sending them in
a single flb_lib_push() payload (e.g. concatenate the two record lines into one
buffer separated by the expected record delimiter or wrap them in a single JSON
array) or otherwise force batch/chunking (e.g. adjust the input to emit both
records in one call) and keep the same TEST_CHECK(ret >= 0) assertion; update
both occurrences (the block around the shown calls and the similar block at
lines 1115-1118) and reference ctx and in_ffd when making the single push so the
mixed-tenant path is deterministically exercised.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e572ec43-9288-406b-a42a-11a511a9696c

📥 Commits

Reviewing files that changed from the base of the PR and between 046bb6f and 0033f99.

📒 Files selected for processing (1)
  • tests/runtime/out_loki.c

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Fluentbit Loki output Tenant_Id_Key routing not working properly.

1 participant