feat(telemetry): propagate session IDs and process lineage to libdatadog#216
feat(telemetry): propagate session IDs and process lineage to libdatadog#216mabdinur wants to merge 1 commit into
Conversation
|
Blocked by the release: DataDog/libdatadog#1822 |
Next steps before mergingThis PR currently carries The patch overrides have now been dropped in the latest commit, pointing instead at the versions that will exist once DataDog/libdatadog#1989 merges and the crates are published:
This PR will not build until the following sequence completes:
|
|
b5aa9a2 to
ced353f
Compare
ced353f to
e750785
Compare
Bump the published libdatadog workspace stack to 4.x/5.x for TelemetryInstrumentationSessions and TraceExporter<NativeCapabilities>. Propagate runtime lineage env into config, telemetry, and the trace exporter, and refresh root and instrumentation lockfiles plus third-party license data. Capture session lineage as a pure LineageContext value: a single, narrowly scoped env read replaces the previous std::env::set_var write inside ConfigBuilder::build(), eliminating the POSIX-unsound mutation and the TOCTOU race between the var_os check and set_var. Tests inject env via a closure so no global state is touched. The public surface re-exports only LineageContext and TelemetryInstrumentationSessions; the env keys, install helper, and free spawn helpers are no longer exposed. Co-authored-by: Cursor <cursoragent@cursor.com>
e750785 to
76fb076
Compare
| libdd-common = { version = "3.0.1", default-features = false } | ||
| libdd-tinybytes = { version = "1.1.0", default-features = false } | ||
| libdd-data-pipeline = { version = "4.0.0", default-features = false, features = [ | ||
| "https", |
There was a problem hiding this comment.
do you need https or it has slipped in?
if you need it might be best to add a new https feature that is off by default.
| // Wait for the agent info to be fetched to get deterministic output when deciding | ||
| // to drop traces or not | ||
| self.trace_exporter | ||
| .wait_agent_info_ready(Duration::from_secs(5)) | ||
| // Block on the async `wait_agent_info_ready` from this plain | ||
| // `thread::spawn`; needed for deterministic sampling in tests. | ||
| tokio::runtime::Builder::new_current_thread() | ||
| .enable_all() | ||
| .build() | ||
| .expect("tokio runtime for wait_agent_info_ready") | ||
| .block_on( | ||
| self.trace_exporter | ||
| .wait_agent_info_ready(Duration::from_secs(5)), | ||
| ) |
There was a problem hiding this comment.
This conflicts with #226, you should wait for this to be merged before rebasing this PR
| // `set_var` is unsound on POSIX when raced with `getenv` from another thread. | ||
| // Callers must invoke `DatadogTracingBuilder::init` early in `main` before | ||
| // spawning threads; `OnceLock` makes the install one-shot. | ||
| #[allow(clippy::disallowed_methods)] | ||
| fn install_outbound_env(runtime_id: &str) { | ||
| // Preserve root from an upstream parent; otherwise we become the root. | ||
| if std::env::var_os(ENV_ROOT_RS_SESSION_ID).is_none() { | ||
| std::env::set_var(ENV_ROOT_RS_SESSION_ID, runtime_id); | ||
| } | ||
| std::env::set_var(ENV_PARENT_RS_SESSION_ID, runtime_id); | ||
| } |
There was a problem hiding this comment.
This is still not great as there is no guarantee that this is called early enough.
I would prefer if you used the ctor crate to verify that this is called whenever the process starts.
The ctor function should also call libc::setenv and getenv instead of the rust stdlib https://docs.rs/libc/latest/libc/fn.setenv.html
| //! `_DD_ROOT_RS_SESSION_ID` / `_DD_PARENT_RS_SESSION_ID` and installs the | ||
| //! outbound env so subprocesses inherit it via `Command::spawn`. | ||
| //! | ||
| //! Best-effort: not refreshed on bare `fork()`. The child shares the parent's |
There was a problem hiding this comment.
We could do something on fork using the pthread_atfork libc call, but rust apps in general don't continue after forking because it's quite dangerous
Background
An app often runs as a tree of processes — workers, subcommands, a Lambda re-exec'ing. Each process initializes its own tracer with a fresh
runtime_id, so without extra signal the backend sees them as unrelated. This PR tags every telemetry event withsession_id,parent_session_id, androot_session_idso the backend can stitch them into one logical instrumentation session.Wire protocol
Two env vars carry lineage between processes:
_DD_PARENT_RS_SESSION_ID— immediate parent'sruntime_id_DD_ROOT_RS_SESSION_ID—runtime_idof the root of the spawn chainThe first call to
sessions_from_runtime_id(from the tracer init pipeline) does both halves atomically under aOnceLock:TelemetryInstrumentationSessions { session_id = runtime_id, root, parent }. Values matchingruntime_idare dropped._DD_PARENT_RS_SESSION_ID = runtime_id, and_DD_ROOT_RS_SESSION_ID = runtime_idif it was unset. SubsequentCommand::spawninherits these automatically.Capture runs before install so we don't reflect our own id back as our parent.
Env scope
set_varwrites to this process's env block — there is no host-wide registry. Effects:main, before threads" contract (see Caveats).Command::env_clear/env_remove. If any process in the chain wipes its env before spawning (common in sandboxes, hermetic builders, container runtimes), the child starts a new lineage tree. Graceful degradation: no panic, just a missed correlation.Caveats
Threading.
set_varis unsound on POSIX when raced withgetenvfrom another thread.OnceLockmakes the install one-shot; callers must invokeDatadogTracingBuilder::init(or a sibling) early inmainbefore spawning threads. This matches the convention every other Datadog tracer relies on.Fork. Lineage is correct across
fork+exec(the default forCommand::spawn) becauseexecresets memory and the child re-captures fresh. Barefork()(Unix daemons,nix::unistd::fork) is not handled: the child inherits both the parent'sConfig::process_runtime_id()OnceLockand ourCAPTUREDOnceLock, so it reports the parent's session. A fix requires resetting both from apthread_atforkchild handler — out of scope here; noted in a module comment.Implementation
libdd-*workspace stack to 4.x/5.x; addslibdd-capabilities-impl2.0.0. The trace exporter is nowTraceExporter<NativeCapabilities>.core/telemetry_session.rs(new,pub(crate)): capture-and-install lives here; puresessions_from_envis unit-tested with an injected env closure.core/telemetry.rsandspan_exporter.rsconsume the captured value directly. No new public types or APIs.test-utils,wait_agent_info_readyruns on a dedicated current-thread tokio runtime since it's invoked from a plainthread::spawn.Tests
cargo nextest run --workspace --locked: 326 passedcargo test --workspace --doc --locked: 13 passedcargo clippy --workspace --all-targets --locked -- -D warnings: clean