feat(telemetry): propagate session IDs and process lineage to libdatadog #216

Open: mabdinur wants to merge 1 commit into main from munir/add-runtime-id-propagation


@mabdinur mabdinur commented Apr 30, 2026

Background

An app often runs as a tree of processes — workers, subcommands, a Lambda re-exec'ing. Each process initializes its own tracer with a fresh runtime_id, so without extra signal the backend sees them as unrelated. This PR tags every telemetry event with session_id, parent_session_id, and root_session_id so the backend can stitch them into one logical instrumentation session.

Wire protocol

Two env vars carry lineage between processes:

  • _DD_PARENT_RS_SESSION_ID: the immediate parent's runtime_id
  • _DD_ROOT_RS_SESSION_ID: the runtime_id of the root of the spawn chain

The first call to sessions_from_runtime_id (from the tracer init pipeline) does both halves atomically under a OnceLock:

  1. Capture inbound — read the env vars set by the spawning process and build TelemetryInstrumentationSessions { session_id = runtime_id, root, parent }. Values matching runtime_id are dropped.
  2. Install outbound — write _DD_PARENT_RS_SESSION_ID = runtime_id, and _DD_ROOT_RS_SESSION_ID = runtime_id if it was unset. Subsequent Command::spawn inherits these automatically.

Capture runs before install so we don't reflect our own id back as our parent.
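
The capture and install halves described above can be sketched as two pure functions over an injected env lookup. This is a minimal illustration, not the PR's code: the `Sessions` struct is a hypothetical stand-in for `TelemetryInstrumentationSessions`, `outbound_env` is a hypothetical helper that only returns the variables to write, and the exact signature of `sessions_from_env` is an assumption.

```rust
/// Hypothetical stand-in for libdatadog's `TelemetryInstrumentationSessions`.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Sessions {
    pub session_id: String,
    pub parent_session_id: Option<String>,
    pub root_session_id: Option<String>,
}

const ENV_PARENT: &str = "_DD_PARENT_RS_SESSION_ID";
const ENV_ROOT: &str = "_DD_ROOT_RS_SESSION_ID";

/// Capture inbound: read the lineage variables through `get_env` and drop any
/// value equal to our own runtime_id.
pub fn sessions_from_env<F>(runtime_id: &str, get_env: F) -> Sessions
where
    F: Fn(&str) -> Option<String>,
{
    let inherited = |key: &str| get_env(key).filter(|v| v.as_str() != runtime_id);
    Sessions {
        session_id: runtime_id.to_owned(),
        parent_session_id: inherited(ENV_PARENT),
        root_session_id: inherited(ENV_ROOT),
    }
}

/// Install outbound, modelled as data: the variables a process would write
/// for its children. The parent is always overwritten; the root is written
/// only when no upstream process has set it.
pub fn outbound_env<F>(runtime_id: &str, get_env: F) -> Vec<(&'static str, String)>
where
    F: Fn(&str) -> Option<String>,
{
    let mut writes = vec![(ENV_PARENT, runtime_id.to_owned())];
    if get_env(ENV_ROOT).is_none() {
        writes.push((ENV_ROOT, runtime_id.to_owned()));
    }
    writes
}
```

Because both functions take the lookup as a closure, a test can pass a `HashMap`-backed closure and never touch the real process environment.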

Env scope

set_var writes to this process's env block — there is no host-wide registry. Effects:

  • Inherited by children we spawn after init (env is snapshotted at spawn).
  • Not visible to sibling processes, already-running processes, other shells/containers/users, or the parent shell after we exit.
  • Shared with threads in our own process, which is why the install is gated on the "init early in main, before threads" contract (see Caveats).
  • Broken by Command::env_clear / env_remove. If any process in the chain wipes its env before spawning (common in sandboxes, hermetic builders, container runtimes), the child starts a new lineage tree. Graceful degradation: no panic, just a missed correlation.
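
The inheritance and env-wipe behaviour in the list above can be observed with a small Unix-only experiment. This sketch assumes `/bin/sh` and an `env` utility are available, and it sets the variable with `Command::env` on an intermediate shell rather than with `set_var`; the helper name `grandchild_sees` is hypothetical.

```rust
use std::process::Command;

/// Starts `/bin/sh` with the lineage variable in its env block, has that
/// shell start a second shell, and returns what the second shell sees.
/// With `wipe_env`, the second shell is started through `env -i`, which
/// plays the role of a sandbox or builder that clears the environment.
pub fn grandchild_sees(parent_id: &str, wipe_env: bool) -> String {
    let script = if wipe_env {
        r#"env -i /bin/sh -c 'printf %s "$_DD_PARENT_RS_SESSION_ID"'"#
    } else {
        r#"/bin/sh -c 'printf %s "$_DD_PARENT_RS_SESSION_ID"'"#
    };
    let output = Command::new("/bin/sh")
        .arg("-c")
        .arg(script)
        .env("_DD_PARENT_RS_SESSION_ID", parent_id)
        .output()
        .expect("failed to run /bin/sh");
    String::from_utf8_lossy(&output.stdout).into_owned()
}
```

Calling `grandchild_sees("abc", false)` returns the id because the inner shell inherits a snapshot of the outer shell's environment, while `grandchild_sees("abc", true)` returns an empty string: the lineage is lost, but nothing fails.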

Caveats

Threading. set_var is unsound on POSIX when raced with getenv from another thread. OnceLock makes the install one-shot; callers must invoke DatadogTracingBuilder::init (or a sibling) early in main before spawning threads. This matches the convention every other Datadog tracer relies on.
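
The one-shot behaviour that `OnceLock` provides can be sketched as follows. The names `CAPTURED`, `INSTALLS`, and `capture_once` are illustrative, and the counter stands in for the real env write so the example never calls `set_var`.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;

// Process-wide slot for the captured id, filled at most once.
static CAPTURED: OnceLock<String> = OnceLock::new();
// Counts how many times the init closure ran (stand-in for the env install).
static INSTALLS: AtomicUsize = AtomicUsize::new(0);

/// Returns the runtime_id captured by the first call. Later calls, even with
/// a different id, get the first value back and do not rerun the closure.
pub fn capture_once(runtime_id: &str) -> &'static str {
    CAPTURED
        .get_or_init(|| {
            // The real code would capture inbound lineage and install the
            // outbound env here, exactly once per process.
            INSTALLS.fetch_add(1, Ordering::SeqCst);
            runtime_id.to_owned()
        })
        .as_str()
}
```

`OnceLock` guarantees that the install happens once, but not that it happens before other threads exist, which is why the "init early in main" contract is still required.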

Fork. Lineage is correct across fork+exec (the default for Command::spawn) because exec resets memory and the child re-captures fresh. Bare fork() (Unix daemons, nix::unistd::fork) is not handled: the child inherits both the parent's Config::process_runtime_id() OnceLock and our CAPTURED OnceLock, so it reports the parent's session. A fix requires resetting both from a pthread_atfork child handler — out of scope here; noted in a module comment.

Implementation

  • Bumps the published libdd-* workspace stack to 4.x/5.x; adds libdd-capabilities-impl 2.0.0. The trace exporter is now TraceExporter<NativeCapabilities>.
  • core/telemetry_session.rs (new, pub(crate)): capture-and-install lives here; pure sessions_from_env is unit-tested with an injected env closure.
  • core/telemetry.rs and span_exporter.rs consume the captured value directly. No new public types or APIs.
  • Under test-utils, wait_agent_info_ready runs on a dedicated current-thread tokio runtime since it's invoked from a plain thread::spawn.

Tests

  • cargo nextest run --workspace --locked: 326 passed
  • cargo test --workspace --doc --locked: 13 passed
  • cargo clippy --workspace --all-targets --locked -- -D warnings: clean

@mabdinur
Contributor Author

Blocked by the release: DataDog/libdatadog#1822

@mabdinur
Contributor Author

Next steps before merging

This PR carried [patch.crates-io] git overrides for six libdatadog crates because the APIs it depends on (TelemetryInstrumentationSessions, NativeCapabilities, and the session_id fields on the telemetry Config) were added in DataDog/libdatadog#1822 without bumping the crate versions, so they cannot be consumed directly from crates.io.

The latest commit drops the patch overrides and points instead at the versions that will exist once DataDog/libdatadog#1989 merges and the crates are published:

Crate                     Before               After
libdd-data-pipeline       3.0.1 (git patch)    3.1.0
libdd-telemetry           4.0.0 (git patch)    5.0.0
libdd-capabilities-impl   1.0.0 (git patch)    1.1.0

This PR will not build until the following sequence completes:

  1. DataDog/libdatadog#1989 — bumps libdd-capabilities-impl to 1.1.0 and libdd-data-pipeline to 3.1.0 (already open)
  2. libdatadog publishes libdd-capabilities-impl 1.1.0, libdd-data-pipeline 3.1.0, and libdd-telemetry 5.0.0 to crates.io
  3. Regenerate lock files (Cargo.lock and instrumentation/Cargo.lock) against the new published versions
  4. Merge this PR — no system-tests changes required once the crates are published

@datadog-prod-us1-5

datadog-prod-us1-5 Bot commented May 19, 2026


⚠️ Warnings

🚦 1 Pipeline job failed

Check Pull Request CI Status | ensure-ci-success   View in Datadog   GitHub Actions

🔄 Retry job. This looks flaky and may succeed on retry. Job failed due to some checks still running, no retries performed.

ℹ️ Info

No other issues found (see more)

🧪 All tests passed
❄️ No new flaky tests detected

🔗 Commit SHA: 76fb076

@mabdinur mabdinur force-pushed the munir/add-runtime-id-propagation branch 3 times, most recently from b5aa9a2 to ced353f Compare May 21, 2026 01:52
@mabdinur mabdinur changed the title from "feat: track process lineage for telemetry" to "feat(datadog-opentelemetry): propagate session IDs and process lineage to libdatadog telemetry" May 21, 2026
@mabdinur mabdinur changed the title from "feat(datadog-opentelemetry): propagate session IDs and process lineage to libdatadog telemetry" to "feat(telemetry): propagate session IDs and process lineage to libdatadog" May 21, 2026
@mabdinur mabdinur marked this pull request as ready for review May 21, 2026 01:59
@mabdinur mabdinur requested a review from a team as a code owner May 21, 2026 01:59
@mabdinur mabdinur force-pushed the munir/add-runtime-id-propagation branch from ced353f to e750785 Compare May 21, 2026 02:08
Bump the published libdatadog workspace stack to 4.x/5.x for
TelemetryInstrumentationSessions and TraceExporter<NativeCapabilities>.
Propagate runtime lineage env into config, telemetry, and the trace exporter,
and refresh root and instrumentation lockfiles plus third-party license data.

Capture session lineage as a pure LineageContext value: a single, narrowly
scoped env read replaces the previous std::env::set_var write inside
ConfigBuilder::build(), eliminating the POSIX-unsound mutation and the TOCTOU
race between the var_os check and set_var. Tests inject env via a closure so
no global state is touched. The public surface re-exports only LineageContext
and TelemetryInstrumentationSessions; the env keys, install helper, and free
spawn helpers are no longer exposed.

Co-authored-by: Cursor <cursoragent@cursor.com>
@mabdinur mabdinur force-pushed the munir/add-runtime-id-propagation branch from e750785 to 76fb076 Compare May 21, 2026 02:11
@mabdinur mabdinur requested a review from khanayan123 May 21, 2026 02:37
Comment thread Cargo.toml
libdd-common = { version = "3.0.1", default-features = false }
libdd-tinybytes = { version = "1.1.0", default-features = false }
libdd-data-pipeline = { version = "4.0.0", default-features = false, features = [
    "https",
Collaborator

Do you need https, or has it slipped in?
If you need it, it might be best to add a new https feature that is off by default.

Comment on lines -610 to +622
-        // Wait for the agent info to be fetched to get deterministic output when deciding
-        // to drop traces or not
-        self.trace_exporter
-            .wait_agent_info_ready(Duration::from_secs(5))
+        // Block on the async `wait_agent_info_ready` from this plain
+        // `thread::spawn`; needed for deterministic sampling in tests.
+        tokio::runtime::Builder::new_current_thread()
+            .enable_all()
+            .build()
+            .expect("tokio runtime for wait_agent_info_ready")
+            .block_on(
+                self.trace_exporter
+                    .wait_agent_info_ready(Duration::from_secs(5)),
+            )
Collaborator

This conflicts with #226. You should wait for #226 to be merged before rebasing this PR.

Comment on lines +40 to +50
// `set_var` is unsound on POSIX when raced with `getenv` from another thread.
// Callers must invoke `DatadogTracingBuilder::init` early in `main` before
// spawning threads; `OnceLock` makes the install one-shot.
#[allow(clippy::disallowed_methods)]
fn install_outbound_env(runtime_id: &str) {
    // Preserve root from an upstream parent; otherwise we become the root.
    if std::env::var_os(ENV_ROOT_RS_SESSION_ID).is_none() {
        std::env::set_var(ENV_ROOT_RS_SESSION_ID, runtime_id);
    }
    std::env::set_var(ENV_PARENT_RS_SESSION_ID, runtime_id);
}
Collaborator

This is still not great, as there is no guarantee that this is called early enough.
I would prefer that you use the ctor crate to ensure this runs when the process starts.
The ctor function should also call libc::setenv and getenv instead of the Rust stdlib: https://docs.rs/libc/latest/libc/fn.setenv.html

//! `_DD_ROOT_RS_SESSION_ID` / `_DD_PARENT_RS_SESSION_ID` and installs the
//! outbound env so subprocesses inherit it via `Command::spawn`.
//!
//! Best-effort: not refreshed on bare `fork()`. The child shares the parent's
Collaborator

We could do something on fork using the pthread_atfork libc call, but Rust apps in general don't continue after forking because it's quite dangerous.
