Jmw/fentrytracer1#51142
Conversation
This change makes the fentry-based network tracer the default over kprobe, with graceful fallback to kprobe on unsupported kernels. Key changes: - Port missing features to fentry: tcp_done, tcp_read_sock, failed TCP connections, protocol classification (hybrid socket filter approach), SSL/TLS certificate collection - Add all missing eBPF maps to fentry manager (protocol classification, SSL certs, TCP retransmits, TLS tags, failure telemetry) - Change enable_fentry default to true with DD_SYSTEM_PROBE_NETWORK_ENABLE_FENTRY env var override - Graceful fallback: fentry failures now log a warning and fall back to kprobe instead of hard-failing - Add KMT fentry test sets (fentry_no_usm, fentry_only_usm) running on kernels >= 5.8 across both x86_64 and arm64 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…match - Remove netDevQueueRawTracepoint from programs map to avoid double registration (once via loop, once via explicit append with TracepointName). Enable it directly in enabledPrograms() instead. - Fix SSL sched_process_exit probe activation to use UID "ssl-certs" instead of "net", matching how it was declared via ssluprobes. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adding DD_SYSTEM_PROBE_NETWORK_ENABLE_FENTRY dropped the auto-derived DD_NETWORK_CONFIG_ENABLE_FENTRY. Keep both to avoid breaking existing deployments. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The tcp_enter_loss/recovery/probe0 kprobes (merged from main) use BPF_BYPASSABLE_KPROBE which is defined in bpf_bypass.h. This header was not included in tracer-fentry.c, causing eBPF compilation failure. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
If m.InitWithOptions fails after HeadlessSocketFilter opened a raw socket, close it to avoid leaking the FD when falling back to kprobe. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ompatible with CO-RE) - Remove unused "log" import from fentry/tracer.go - Remove bpf_bypass.h include from tracer-fentry.c — the bypass mechanism's inline assembly references an undefined symbol in CO-RE compilation mode - Use BPF_KPROBE instead of BPF_BYPASSABLE_KPROBE for the three congestion signal kprobes (tcp_enter_loss, tcp_enter_recovery, tcp_send_probe0) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Remove RETURN_IF_NOT_IN_SYSPROBE_TASK from tcp_done fentry handler. tcp_done fires from timeout/RST/softirq context where the PID namespace check fails, silently dropping failed connection tracking. - Propagate DefaultKprobeAttachMethod to fentry manager options so the static kprobes (tcp_enter_loss/recovery/probe0) respect the configured attach method. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
tcp_done_exit must match tcp_done entry which intentionally runs without RETURN_IF_NOT_IN_SYSPROBE_TASK for timeout/RST/softirq context. Keeping the guard on the exit path would leave close batches unflushed for failed connections. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adam's branch added fexit/tcp_close and fexit/tcp_done programs whose sole purpose was calling flush_conn_close_if_full(ctx). Upstream PR #49046 ("Remove system-probe perf batching logic") subsequently deleted flush_conn_close_if_full along with the network_config.enable_custom_batching flag, leaving these handlers calling an undefined function and breaking the eBPF build with implicit-function-declaration errors. The handlers had no other purpose, so remove them entirely along with the tcpDoneReturn constant, its programs-map entry, and the c.CustomBatchingEnabled-gated enableProgram block (the field itself was also removed by #49046). Update a stale tracer.go comment that referenced flush_conn_close_if_full. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The TCPCloseProgsMap was used as a tail-call program array for the connection-close flush path, alongside flush_conn_close_if_full(). PR #49046 ("Remove system-probe perf batching logic") removed the map along with the rest of the batching infrastructure, so referencing probes.TCPCloseProgsMap here is now a Go undefined-symbol error. Same root cause as the previous commit: rebase fallout from #49046. The fentry tracer never used this tail-call path (the comment in tracer.go's protocolClassificationTailCalls notes that fentry/tcp_close does inline cleanup rather than tail-calling tcp_close_progs), so the map registration was already vestigial. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The rebase produced two issues in the SK→fentry→kprobe selector chain:
1. The SK error-check `if` block was missing its closing brace, so a
stray fentry.LoadTracer call from Adam's branch ended up outside any
scope. This caused a Go syntax error ("unexpected name
initClosedConnEventHandler, expected (") when the parser hit the
next top-level declaration while still inside the unclosed block.
2. Adam's "warn and fall back to kprobe" pattern landed in the wrong
place — outside the err-handling cascade — duplicating the fentry
load attempt while the canonical fentry block (inside `if err
!= nil`) still hard-failed via `return nil, err`.
Fix both in a single edit: close the SK if-block correctly, drop the
duplicate fentry call, and apply Adam's warn-and-fall-through pattern
to the canonical fentry block. Result is the intended selector
behavior: try SK, fall through to fentry on ErrorDisabled, fall
through to kprobe on fentry failure or ErrorDisabled.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… SSL cert probes have been ported
Removes four explicit t.Skip("...not (yet) supported on fentry") guards
and three setupTracer/testConfig branches that forced
ProtocolClassificationEnabled=false under fentry. Keeps httpSupported /
httpsSupported / TestKprobeAttachWithKprobeEvents fentry gates in place
since USM transaction tracking and kprobe-specific paths are
intentionally out of scope.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Logs each tracer attempt and outcome so the SK -> fentry -> kprobe cascade is visible in system-probe logs. Use grep "JMW network tracer:" to see the full selection trail. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
@codex review |
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: fb37158950
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
Files inventory check summaryFile checks results against ancestor 9ae29b68: Results for datadog-agent_7.81.0~devel.git.71.7a0af9d.pipeline.114554832-1_amd64.deb:Detected file changes:
|
Static quality checks❌ Please find below the results from static quality gates Error
Gate failure full details
Static quality gates prevent the PR to merge! Successful checksInfo
16 successful checks with minimal change (< 2 KiB)
|
Regression DetectorRegression Detector ResultsMetrics dashboard Baseline: 9ae29b6 Optimization Goals: ✅ No significant changes detected
|
| perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
|---|---|---|---|---|---|---|
| ➖ | docker_containers_cpu | % cpu utilization | +0.49 | [-2.39, +3.37] | 1 | Logs |
Fine details of change detection per experiment
| perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
|---|---|---|---|---|---|---|
| ➖ | docker_containers_cpu | % cpu utilization | +0.49 | [-2.39, +3.37] | 1 | Logs |
| ➖ | ddot_metrics | memory utilization | +0.35 | [+0.15, +0.54] | 1 | Logs |
| ➖ | quality_gate_idle_all_features | memory utilization | +0.20 | [+0.16, +0.24] | 1 | Logs bounds checks dashboard |
| ➖ | file_to_blackhole_0ms_latency | egress throughput | +0.03 | [-0.50, +0.55] | 1 | Logs |
| ➖ | file_to_blackhole_1000ms_latency | egress throughput | +0.01 | [-0.44, +0.46] | 1 | Logs |
| ➖ | uds_dogstatsd_to_api | ingress throughput | +0.01 | [-0.20, +0.22] | 1 | Logs |
| ➖ | uds_dogstatsd_to_api_v3 | ingress throughput | +0.00 | [-0.21, +0.21] | 1 | Logs |
| ➖ | tcp_dd_logs_filter_exclude | ingress throughput | -0.01 | [-0.10, +0.09] | 1 | Logs |
| ➖ | uds_dogstatsd_20mb_12k_contexts_20_senders | memory utilization | -0.02 | [-0.07, +0.03] | 1 | Logs |
| ➖ | file_to_blackhole_100ms_latency | egress throughput | -0.02 | [-0.15, +0.11] | 1 | Logs |
| ➖ | file_to_blackhole_500ms_latency | egress throughput | -0.08 | [-0.48, +0.31] | 1 | Logs |
| ➖ | ddot_metrics_sum_cumulative | memory utilization | -0.10 | [-0.26, +0.06] | 1 | Logs |
| ➖ | otlp_ingest_logs | memory utilization | -0.10 | [-0.21, +0.01] | 1 | Logs |
| ➖ | docker_containers_memory | memory utilization | -0.13 | [-0.23, -0.03] | 1 | Logs |
| ➖ | ddot_metrics_sum_delta | memory utilization | -0.18 | [-0.37, +0.00] | 1 | Logs |
| ➖ | otlp_ingest_metrics | memory utilization | -0.21 | [-0.37, -0.04] | 1 | Logs |
| ➖ | file_tree | memory utilization | -0.24 | [-0.28, -0.19] | 1 | Logs |
| ➖ | ddot_metrics_sum_cumulativetodelta_exporter | memory utilization | -0.28 | [-0.52, -0.05] | 1 | Logs |
| ➖ | ddot_logs | memory utilization | -0.49 | [-0.55, -0.43] | 1 | Logs |
| ➖ | tcp_syslog_to_blackhole | ingress throughput | -0.59 | [-0.77, -0.41] | 1 | Logs |
| ➖ | quality_gate_idle | memory utilization | -0.62 | [-0.67, -0.57] | 1 | Logs bounds checks dashboard |
| ➖ | quality_gate_logs | % cpu utilization | -0.66 | [-1.65, +0.33] | 1 | Logs bounds checks dashboard |
| ➖ | quality_gate_metrics_logs | memory utilization | -1.68 | [-1.93, -1.43] | 1 | Logs bounds checks dashboard |
Bounds Checks: ✅ Passed
| perf | experiment | bounds_check_name | replicates_passed | observed_value | links |
|---|---|---|---|---|---|
| ✅ | docker_containers_cpu | simple_check_run | 10/10 | 700 ≥ 26 | |
| ✅ | docker_containers_memory | memory_usage | 10/10 | 246.59MiB ≤ 370MiB | |
| ✅ | docker_containers_memory | simple_check_run | 10/10 | 721 ≥ 26 | |
| ✅ | file_to_blackhole_0ms_latency | memory_usage | 10/10 | 0.17GiB ≤ 1.20GiB | |
| ✅ | file_to_blackhole_0ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| ✅ | file_to_blackhole_1000ms_latency | memory_usage | 10/10 | 0.21GiB ≤ 1.20GiB | |
| ✅ | file_to_blackhole_1000ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| ✅ | file_to_blackhole_100ms_latency | memory_usage | 10/10 | 0.17GiB ≤ 1.20GiB | |
| ✅ | file_to_blackhole_100ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| ✅ | file_to_blackhole_500ms_latency | memory_usage | 10/10 | 0.18GiB ≤ 1.20GiB | |
| ✅ | file_to_blackhole_500ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| ✅ | quality_gate_idle | intake_connections | 10/10 | 3 ≤ 4 | bounds checks dashboard |
| ✅ | quality_gate_idle | memory_usage | 10/10 | 144.60MiB ≤ 147MiB | bounds checks dashboard |
| ✅ | quality_gate_idle_all_features | intake_connections | 10/10 | 3 ≤ 4 | bounds checks dashboard |
| ✅ | quality_gate_idle_all_features | memory_usage | 10/10 | 480.63MiB ≤ 495MiB | bounds checks dashboard |
| ✅ | quality_gate_logs | intake_connections | 10/10 | 3 ≤ 6 | bounds checks dashboard |
| ✅ | quality_gate_logs | memory_usage | 10/10 | 175.55MiB ≤ 195MiB | bounds checks dashboard |
| ✅ | quality_gate_logs | missed_bytes | 10/10 | 0B = 0B | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | cpu_usage | 10/10 | 361.12 ≤ 2000 | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | intake_connections | 10/10 | 4 ≤ 6 | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | memory_usage | 10/10 | 375.89MiB ≤ 430MiB | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | missed_bytes | 10/10 | 0B = 0B | bounds checks dashboard |
Explanation
Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%
Performance changes are noted in the perf column of each table:
- ✅ = significantly better comparison variant performance
- ❌ = significantly worse comparison variant performance
- ➖ = no significant change in performance
A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".
For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:
-
Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.
-
Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.
-
Its configuration does not mark it "erratic".
CI Pass/Fail Decision
✅ Passed. All Quality Gates passed.
- quality_gate_idle_all_features, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_idle_all_features, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_idle, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_idle, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check missed_bytes: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check cpu_usage: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check missed_bytes: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
Adam's "feat(npm): make fentry tracer the default network tracer" commit message says the default should be true, but the diff left it at false. Flip it for testing so fentry actually loads without DD_NETWORK_CONFIG_ENABLE_FENTRY needing to be set. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
fb37158 to
7a0af9d
Compare
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7a0af9dd04
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| EnableEbpfless bool | ||
|
|
||
| // EnableFentry enables the experimental fentry tracer (disabled by default) | ||
| // EnableFentry enables the fentry tracer (disabled by default, enable via config or DD_SYSTEM_PROBE_NETWORK_ENABLE_FENTRY) |
There was a problem hiding this comment.
Align fentry default comment with actual default
The field comment says fentry is "disabled by default," but this commit sets network_config.enable_fentry default to true in InitSystemProbeConfig (pkg/config/setup/system_probe.go). This mismatch creates stale in-repo documentation and can mislead operators or tooling that reads these comments about how the tracer is selected by default.
Useful? React with 👍 / 👎.
Adam's earlier switch to BPF_KPROBE (in 73ae536 "bpf_bypass.h incompatible with CO-RE") was based on a misdiagnosis. The CO-RE incompatibility claim was empirically disproven: a pre-Adam mainline agent on Azure AKS loads the BPF_BYPASSABLE_KPROBE handlers in the CO-RE fentry .o without issue, with kprobe__tcp_enter_loss attached and firing. Reasons to restore the bypassable variant: - tracer.c (runtime kprobe tracer) uses BPF_BYPASSABLE_KPROBE for the same three handlers. The two NPM tracer variants should behave consistently. - The bypass mechanism wires into tracer.Pause()/Resume() via ebpf-manager's BypassEnabled option. With BPF_KPROBE here, the fentry tracer silently leaves these probes firing under Pause() while the kprobe tracer pauses them — a test-time asymmetry. - Cost is negligible (~2-4 extra BPF instructions per call on cold TCP state transitions). Also add an explicit #include "bpf_bypass.h" to make the dependency clear, rather than relying on the transitive chain through tracer/stats.h → protocols/tls/tls-certs.h. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5afca455db
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| cfg.BindEnvAndSetDefault("network_config.enable_co_re", true) | ||
| cfg.BindEnvAndSetDefault("network_config.enable_fentry", false) | ||
| cfg.BindEnvAndSetDefault("network_config.enable_sk_tracer", false) | ||
| cfg.BindEnvAndSetDefault("network_config.enable_fentry", true, "DD_SYSTEM_PROBE_NETWORK_ENABLE_FENTRY", "DD_NETWORK_CONFIG_ENABLE_FENTRY") |
There was a problem hiding this comment.
Keep fentry default aligned with config schema
This flips network_config.enable_fentry to default true at runtime, but the schema still declares it as false (pkg/config/schema/system-probe_schema.yaml), so schema-driven outputs and tooling can disagree with actual agent behavior. That mismatch makes troubleshooting and rollout control error-prone (for example, generated/default config views show fentry off while the process enables it). Please update the schema/default metadata in the same change so all config paths agree.
Useful? React with 👍 / 👎.
What does this PR do?
Motivation
Describe how you validated your changes
Additional Notes