Use container-scoped outputBase for Windows GitLab jobs#51154
Conversation
|
@codex review |
|
|
Codex Review: Didn't find any major issues. Chef's kiss. ℹ️ About Codex in GitHubCodex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback". |
d3a949d to
add6623
Compare
|
@codex review |
There was a problem hiding this comment.
Pull request overview
This PR updates the Windows Bazel wrapper (tools/bazel.bat) to avoid Bazel server/state collisions in GitLab CI by forcing a custom --output_base when running in CI (but not on GitHub Actions), so per-workspace Bazel state does not live under a shared/bind-mounted --output_user_root.
Changes:
- In GitLab CI (non-GitHub Actions), append
--output_base=%SYSTEMDRIVE%\bobas a Bazel startup option. - Keep using
--config=cifor CI runs while leaving other caches under the shared--output_user_root.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
Codex Review: Didn't find any major issues. Breezy! ℹ️ About Codex in GitHubCodex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback". |
### What does this PR do? `tools/bazel.bat` now also passes `--output_base=%SYSTEMDRIVE%\bob` whenever it detects we're running inside a Windows container (via `DOTNET_RUNNING_IN_CONTAINER`, itself set unconditionally from `sc query CExecSvc`). The container's per-workspace Bazel state (`server\`, `lock`, `action_cache\`, `execroot\`, `external\`, `bazel-out\`) lands on its private writable layer rather than the shared `c:\bzl` bind mount. `install_base`, `disk_cache` and `repository_cache` continue to live under `--output_user_root=!XDG_CACHE_HOME!\bazel`. ### Motivation GitLab Windows runners bind-mount the same host `c:\bzl` into several concurrent containers (cf. `tools/ci/docker-run-with-bazel-cache.ps1` and `.gitlab/build/bazel/defs.yml`). Inside each container Bazel resolves the same `--output_user_root` and hashes the same in-container workspace path to the same output base directory, so concurrent JVM spawns race on `<output_base>\server\jvm.out`: \`\`\` FATAL: ExecuteDaemon(...): CreateJvmOutputFile( c:\bzl\bazel\bv4fleb2\server\jvm.out) failed: (error: 32): The process cannot access the file because it is being used by another process. \`\`\` Gating on `DOTNET_RUNNING_IN_CONTAINER` rather than a CI-system proxy (`CI && !GITHUB_ACTIONS`) targets the actual cause: any Windows container sharing a host volume. The container-local path stays short (`C:\bob`) to leave headroom under Bazel's Windows `MAX_PATH` budget. ### Describe how you validated your changes Best-effort: reviewed `bazel/src/main/cpp/blaze_util_windows.cc` and `startup_options.cc` to confirm `--output_base` takes priority over the default `<output_user_root>/<hashed-workspace>` and that `install_base` still derives from `--output_user_root`. Inspected the docker bind-mount layout in `tools/ci/docker-run-with-bazel-cache.ps1` and the shared `XDG_CACHE_HOME=c:/bzl` config in `.gitlab/build/bazel/defs.yml`.
add6623 to
9f561f0
Compare
Files inventory check summaryFile checks results against ancestor 7e031281: Results for datadog-agent_7.81.0~devel.git.139.9f561f0.pipeline.114443009-1_amd64.deb:No change detected |
Static quality checks✅ Please find below the results from static quality gates 33 successful checks with minimal change (< 2 KiB)
|
Regression DetectorRegression Detector ResultsMetrics dashboard Baseline: 832f42e Optimization Goals: ✅ No significant changes detected
|
| perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
|---|---|---|---|---|---|---|
| ✅ | docker_containers_cpu | % cpu utilization | -5.29 | [-8.15, -2.43] | 1 | Logs |
Fine details of change detection per experiment
| perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
|---|---|---|---|---|---|---|
| ➖ | quality_gate_logs | % cpu utilization | +1.56 | [+0.53, +2.59] | 1 | Logs bounds checks dashboard |
| ➖ | quality_gate_metrics_logs | memory utilization | +0.53 | [+0.28, +0.78] | 1 | Logs bounds checks dashboard |
| ➖ | otlp_ingest_metrics | memory utilization | +0.43 | [+0.27, +0.59] | 1 | Logs |
| ➖ | ddot_metrics | memory utilization | +0.36 | [+0.16, +0.57] | 1 | Logs |
| ➖ | docker_containers_memory | memory utilization | +0.32 | [+0.22, +0.42] | 1 | Logs |
| ➖ | ddot_logs | memory utilization | +0.24 | [+0.19, +0.30] | 1 | Logs |
| ➖ | file_tree | memory utilization | +0.17 | [+0.13, +0.22] | 1 | Logs |
| ➖ | uds_dogstatsd_to_api_v3 | ingress throughput | +0.03 | [-0.18, +0.24] | 1 | Logs |
| ➖ | uds_dogstatsd_to_api | ingress throughput | -0.01 | [-0.22, +0.19] | 1 | Logs |
| ➖ | tcp_dd_logs_filter_exclude | ingress throughput | -0.02 | [-0.11, +0.07] | 1 | Logs |
| ➖ | file_to_blackhole_100ms_latency | egress throughput | -0.02 | [-0.15, +0.11] | 1 | Logs |
| ➖ | file_to_blackhole_1000ms_latency | egress throughput | -0.03 | [-0.48, +0.41] | 1 | Logs |
| ➖ | file_to_blackhole_500ms_latency | egress throughput | -0.07 | [-0.46, +0.33] | 1 | Logs |
| ➖ | file_to_blackhole_0ms_latency | egress throughput | -0.07 | [-0.60, +0.46] | 1 | Logs |
| ➖ | uds_dogstatsd_20mb_12k_contexts_20_senders | memory utilization | -0.09 | [-0.14, -0.05] | 1 | Logs |
| ➖ | ddot_metrics_sum_cumulative | memory utilization | -0.19 | [-0.35, -0.03] | 1 | Logs |
| ➖ | ddot_metrics_sum_cumulativetodelta_exporter | memory utilization | -0.20 | [-0.44, +0.03] | 1 | Logs |
| ➖ | quality_gate_idle_all_features | memory utilization | -0.23 | [-0.29, -0.18] | 1 | Logs bounds checks dashboard |
| ➖ | quality_gate_idle | memory utilization | -0.24 | [-0.29, -0.19] | 1 | Logs bounds checks dashboard |
| ➖ | ddot_metrics_sum_delta | memory utilization | -0.43 | [-0.62, -0.24] | 1 | Logs |
| ➖ | tcp_syslog_to_blackhole | ingress throughput | -0.75 | [-0.93, -0.57] | 1 | Logs |
| ➖ | otlp_ingest_logs | memory utilization | -0.87 | [-0.99, -0.76] | 1 | Logs |
| ✅ | docker_containers_cpu | % cpu utilization | -5.29 | [-8.15, -2.43] | 1 | Logs |
Bounds Checks: ✅ Passed
| perf | experiment | bounds_check_name | replicates_passed | observed_value | links |
|---|---|---|---|---|---|
| ✅ | docker_containers_cpu | simple_check_run | 10/10 | 713 ≥ 26 | |
| ✅ | docker_containers_memory | memory_usage | 10/10 | 247.70MiB ≤ 370MiB | |
| ✅ | docker_containers_memory | simple_check_run | 10/10 | 718 ≥ 26 | |
| ✅ | file_to_blackhole_0ms_latency | memory_usage | 10/10 | 0.16GiB ≤ 1.20GiB | |
| ✅ | file_to_blackhole_0ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| ✅ | file_to_blackhole_1000ms_latency | memory_usage | 10/10 | 0.21GiB ≤ 1.20GiB | |
| ✅ | file_to_blackhole_1000ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| ✅ | file_to_blackhole_100ms_latency | memory_usage | 10/10 | 0.17GiB ≤ 1.20GiB | |
| ✅ | file_to_blackhole_100ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| ✅ | file_to_blackhole_500ms_latency | memory_usage | 10/10 | 0.18GiB ≤ 1.20GiB | |
| ✅ | file_to_blackhole_500ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| ✅ | quality_gate_idle | intake_connections | 10/10 | 3 ≤ 4 | bounds checks dashboard |
| ✅ | quality_gate_idle | memory_usage | 10/10 | 143.48MiB ≤ 147MiB | bounds checks dashboard |
| ✅ | quality_gate_idle | total_bytes_received | 10/10 | 739.52KiB ≤ 819.20KiB | bounds checks dashboard |
| ✅ | quality_gate_idle_all_features | intake_connections | 10/10 | 2 ≤ 4 | bounds checks dashboard |
| ✅ | quality_gate_idle_all_features | memory_usage | 10/10 | 427.74MiB ≤ 495MiB | bounds checks dashboard |
| ✅ | quality_gate_idle_all_features | total_bytes_received | 10/10 | 1.12MiB ≤ 1.25MiB | bounds checks dashboard |
| ✅ | quality_gate_logs | intake_connections | 10/10 | 4 ≤ 6 | bounds checks dashboard |
| ✅ | quality_gate_logs | memory_usage | 10/10 | 173.58MiB ≤ 195MiB | bounds checks dashboard |
| ✅ | quality_gate_logs | missed_bytes | 10/10 | 0B = 0B | bounds checks dashboard |
| ✅ | quality_gate_logs | total_bytes_received | 10/10 | 264.38MiB ≤ 292MiB | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | cpu_usage | 10/10 | 357.46 ≤ 2000 | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | intake_connections | 10/10 | 4 ≤ 6 | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | memory_usage | 10/10 | 374.12MiB ≤ 430MiB | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | missed_bytes | 10/10 | 0B = 0B | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | total_bytes_received | 10/10 | 0.94GiB ≤ 1.04GiB | bounds checks dashboard |
Explanation
Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%
Performance changes are noted in the perf column of each table:
- ✅ = significantly better comparison variant performance
- ❌ = significantly worse comparison variant performance
- ➖ = no significant change in performance
A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".
For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:
-
Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.
-
Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.
-
Its configuration does not mark it "erratic".
CI Pass/Fail Decision
✅ Passed. All Quality Gates passed.
- quality_gate_metrics_logs, bounds check cpu_usage: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check total_bytes_received: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check missed_bytes: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check total_bytes_received: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check missed_bytes: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_idle_all_features, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_idle_all_features, bounds check total_bytes_received: 10/10 replicas passed. Gate passed.
- quality_gate_idle_all_features, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_idle, bounds check total_bytes_received: 10/10 replicas passed. Gate passed.
- quality_gate_idle, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_idle, bounds check intake_connections: 10/10 replicas passed. Gate passed.
alopezz
left a comment
There was a problem hiding this comment.
This is fine, but the reason looks wrong.
GitLab Windows runners bind-mount the same host c:\bzl into several concurrent containers.
This should not happen in the current setup where runners are set up to run only one job at a time. I think a more likely hypothesis is unfinished (leaked) processes after a job. This PR would still probably mitigate this.
@alopezz seems like it, indeed. |
What does this PR do?
tools/bazel.batnow also passes--output_base=%SYSTEMDRIVE%\bobwhenever it detects we're running inside a Windows container, so each container's per-workspace Bazel state (server\,lock,action_cache\,execroot\,external\,bazel-out\) lives on its own writableoutputBaserather than any sharedoutputUserRoot(under a bind mount).install_base,disk_cacheandrepository_cachecontinue to live under the sharedoutputUserRoot.Motivation
GitLab Windows runners might bind-mount the same host
c:\bzlinto several concurrent containers, or more likely a filesystem handle to a file under that directory may leak when a job is interrupted.Inside each container, Bazel indeed resolves the same
outputUserRootand hashes the same in-container workspace path to the sameoutputBasedirectory:... so concurrent bazel server (JVM) spawns race: