Commit 7624555

Jens-G and claude committed
Harden the MSVC build workflow against transient Docker daemon unavailability
GitHub-hosted Windows runners occasionally start the job before the Docker engine is listening on \\.\pipe\docker_engine. The first docker command then fails instantly with "failed to connect to the docker API", which surfaced as a confusing "LastTest.log not found" and a ~27s red build unrelated to any code change.

- Add a "Wait for Docker daemon" step that polls `docker version` until the engine responds (up to 4 minutes) before any docker command, and fails with a clear message if it never comes up.
- Retry `docker pull` (cached-image step) up to 3 times on transient registry/network/daemon errors; a "not found" result still falls through to building the image from scratch.
- Retry the `docker run` build step up to 3 times, but only when the daemon was unreachable -- a genuine build or test failure is propagated immediately and never triggers a costly rebuild. Output is streamed via Tee-Object so the build log still appears live.

CI-only change; no library code is affected.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
1 parent 7107195 commit 7624555

1 file changed

Lines changed: 70 additions & 18 deletions

.github/workflows/msvc.yml

@@ -68,6 +68,31 @@ jobs:
           $hash = (Get-FileHash -Algorithm SHA256 'build/docker/msvc/Dockerfile').Hash.ToLower().Substring(0, 12)
           "IMAGE_TAG=msvc-$hash" | Out-File -FilePath $env:GITHUB_ENV -Append
 
+      - name: Wait for Docker daemon
+        shell: pwsh
+        timeout-minutes: 5
+        run: |
+          # GitHub-hosted Windows runners occasionally have the Docker engine not
+          # yet ready when the job starts (the named pipe
+          # \\.\pipe\docker_engine is not listening). The subsequent docker
+          # pull/run then fails instantly with "failed to connect to the docker
+          # API", which previously surfaced as a confusing "LastTest.log not
+          # found". Poll until the daemon answers (or fail with a clear message).
+          $deadline = (Get-Date).AddMinutes(4)
+          while ($true) {
+            docker version --format '{{.Server.Version}}' 2>$null | Out-Null
+            if ($LASTEXITCODE -eq 0) {
+              Write-Host "Docker daemon is ready."
+              break
+            }
+            if ((Get-Date) -gt $deadline) {
+              Write-Error "Docker daemon did not become ready within the timeout."
+              exit 1
+            }
+            Write-Host "Docker daemon not ready yet; retrying in 5s..."
+            Start-Sleep -Seconds 5
+          }
+
       - name: Log in to GHCR
         if: github.event_name != 'pull_request'
         shell: pwsh
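
The polling pattern used by the "Wait for Docker daemon" step can be sketched as a reusable helper. This is only an illustration: `Wait-ForCondition` and its parameters are hypothetical names that do not appear in the commit, and the defaults simply mirror the 4-minute deadline and 5-second interval from the step above.

```powershell
# Hypothetical helper, not part of the commit: poll a probe until it
# reports success or a deadline passes.
function Wait-ForCondition {
    param(
        [scriptblock]$Probe,           # should return $true once the service is ready
        [int]$TimeoutSeconds = 240,    # mirrors the 4-minute deadline in the step
        [int]$IntervalSeconds = 5      # mirrors the 5-second sleep in the step
    )
    $deadline = (Get-Date).AddSeconds($TimeoutSeconds)
    while ($true) {
        if (& $Probe) { return $true }
        if ((Get-Date) -gt $deadline) { return $false }
        Start-Sleep -Seconds $IntervalSeconds
    }
}

# Usage mirroring the workflow step (needs the docker CLI on PATH):
# $ready = Wait-ForCondition -Probe {
#     docker version --format '{{.Server.Version}}' 2>$null | Out-Null
#     $LASTEXITCODE -eq 0
# }
# if (-not $ready) { Write-Host "##[error]Docker daemon never became ready."; exit 1 }
```

Returning a boolean instead of calling `exit` inside the helper leaves the caller free to decide how to fail the step.
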
@@ -94,20 +119,28 @@ jobs:
             exit 1
           }
 
-          Write-Host "Attempting to pull hash-based tag: $($env:DOCKER_IMAGE):$($env:IMAGE_TAG)"
-          $output = docker pull "$($env:DOCKER_IMAGE):$($env:IMAGE_TAG)" 2>&1
-          $output | Out-Host
-
-          if ($LASTEXITCODE -eq 0) {
-            Write-Host "Successfully pulled cached image with hash tag"
-            $needBuild = $false
-          } elseif ($output -match 'not found|does not exist|404|manifest unknown') {
-            Write-Host "Image not found in registry, will build from scratch."
-            $needBuild = $true
-          } else {
-            Write-Host "##[error]Docker pull failed with unexpected error. Check logs above."
-            Write-Host "This may indicate an authentication issue, registry problem, or network error."
-            exit 1
+          $maxAttempts = 3
+          for ($attempt = 1; $attempt -le $maxAttempts; $attempt++) {
+            Write-Host "Attempting to pull hash-based tag (attempt $attempt/$maxAttempts): $($env:DOCKER_IMAGE):$($env:IMAGE_TAG)"
+            $output = docker pull "$($env:DOCKER_IMAGE):$($env:IMAGE_TAG)" 2>&1
+            $output | Out-Host
+
+            if ($LASTEXITCODE -eq 0) {
+              Write-Host "Successfully pulled cached image with hash tag"
+              $needBuild = $false
+              break
+            } elseif ($output -match 'not found|does not exist|404|manifest unknown') {
+              Write-Host "Image not found in registry, will build from scratch."
+              $needBuild = $true
+              break
+            } elseif ($attempt -lt $maxAttempts) {
+              Write-Host "##[warning]Docker pull failed (attempt $attempt/$maxAttempts) - transient registry/network/daemon error; retrying in 10s..."
+              Start-Sleep -Seconds 10
+            } else {
+              Write-Host "##[error]Docker pull failed after $maxAttempts attempts. Check logs above."
+              Write-Host "This may indicate an authentication issue, registry problem, or network error."
+              exit 1
+            }
           }
 
           Write-Host "Setting outputs: need_build=$needBuild"
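
The three-way handling of a `docker pull` result in the hunk above (pulled, not found, anything else treated as transient) can be isolated into a pure function, which makes the classification easy to exercise without a registry. The sketch below is an assumption-laden illustration: `Get-PullOutcome` and its outcome labels are hypothetical and not part of the commit; only the regular expression is taken from the workflow.

```powershell
# Hypothetical helper, not part of the commit: classify a `docker pull`
# result using the same checks as the workflow's if/elseif chain.
function Get-PullOutcome {
    param(
        [int]$ExitCode,
        [string[]]$Output
    )
    if ($ExitCode -eq 0) { return 'Pulled' }

    # Join the captured lines so -match yields a single boolean.
    $text = $Output -join "`n"
    if ($text -match 'not found|does not exist|404|manifest unknown') {
        return 'NotFound'      # fall through to building the image from scratch
    }
    return 'Transient'         # candidate for a retry
}

Get-PullOutcome -ExitCode 1 -Output @('Error response from daemon: manifest unknown')
```

In this sketch only `Transient` would lead to another attempt; `Pulled` and `NotFound` both end the loop, matching the two `break` statements in the hunk.
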
@@ -144,10 +177,29 @@ jobs:
         timeout-minutes: 120
         shell: pwsh
         run: |
-          docker run -v c:\src\thrift:C:\Thrift -v "${env:THRIFT_BUILD_DIR}:C:\build" --rm -t "$($env:DOCKER_IMAGE):$($env:IMAGE_TAG)" c:\thrift\build\docker\msvc\build.bat
-          if ($LASTEXITCODE -ne 0) {
-            Write-Error "Container build failed with exit code $LASTEXITCODE"
-            exit $LASTEXITCODE
+          $maxAttempts = 3
+          for ($attempt = 1; $attempt -le $maxAttempts; $attempt++) {
+            Write-Host "Running build container (attempt $attempt/$maxAttempts)..."
+            # Tee so the build log streams live AND is captured for the
+            # daemon-error check below.
+            docker run -v c:\src\thrift:C:\Thrift -v "${env:THRIFT_BUILD_DIR}:C:\build" --rm -t "$($env:DOCKER_IMAGE):$($env:IMAGE_TAG)" c:\thrift\build\docker\msvc\build.bat 2>&1 | Tee-Object -Variable output
+            $code = $LASTEXITCODE
+            if ($code -eq 0) {
+              break
+            }
+
+            # Retry only when the daemon was unreachable / the container never
+            # started; a genuine build or test failure must NOT trigger a costly
+            # rebuild and is propagated to the "Check test results" step.
+            $daemonError = ($output -join "`n") -match 'pipe/docker_engine|error during connect|cannot connect to the Docker daemon|failed to connect to the docker API'
+            if ($daemonError -and $attempt -lt $maxAttempts) {
+              Write-Host "##[warning]Docker daemon not reachable (attempt $attempt/$maxAttempts); retrying in 10s..."
+              Start-Sleep -Seconds 10
+              continue
+            }
+
+            Write-Error "Container build failed with exit code $code"
+            exit $code
           }
 
       - name: Check test results
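
The retry rule of the build step above (retry only when the output looks like a daemon connection error, otherwise propagate the exit code at once) can be sketched as a small wrapper. Everything named here is hypothetical and not part of the commit: `Invoke-WithDaemonRetry`, its parameters, and the convention that the action returns a single hashtable with `Code` and `Output` keys. Only the error pattern is copied from the workflow.

```powershell
# Hypothetical helper, not part of the commit: retry an action only when its
# captured output indicates that the Docker daemon was unreachable.
function Invoke-WithDaemonRetry {
    param(
        [scriptblock]$Action,        # assumed to return @{ Code = <int>; Output = <string[]> }
        [int]$MaxAttempts = 3,
        [int]$DelaySeconds = 10
    )
    $pattern = 'pipe/docker_engine|error during connect|cannot connect to the Docker daemon|failed to connect to the docker API'
    for ($attempt = 1; $attempt -le $MaxAttempts; $attempt++) {
        $result = & $Action
        if ($result.Code -eq 0) { return 0 }

        $daemonError = ($result.Output -join "`n") -match $pattern
        if ($daemonError -and $attempt -lt $MaxAttempts) {
            Start-Sleep -Seconds $DelaySeconds
            continue
        }
        # Genuine failure, or retries exhausted: hand the exit code back unchanged.
        return $result.Code
    }
}
```

A caller would wrap the `docker run` invocation in the script block and pass the returned code to `exit`, so a real build or test failure costs exactly one container run.
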
