fix(ci): make unit-tests gate poll tolerant of transient API errors#1852
fix(ci): make unit-tests gate poll tolerant of transient API errors#1852anurag-atlan wants to merge 1 commit into
Conversation
✅ Snyk checks have passed. No issues have been found so far.
💻 Catch issues earlier using the plugins for VS Code, JetBrains IDEs, Visual Studio, and Eclipse. |
📜 Docstring Coverage ReportRESULT: PASSED (minimum: 30.0%, actual: 77.3%) Detailed Coverage Report |
📚 Capability manifest drift detected
To resolve, pick one:
This check is informational, not blocking. Reviewers should ensure the manifest is |
📦 Trivy Vulnerability Scan Results
Report SummaryCould not generate summary table (data length mismatch: 9 vs 8). Scan Result Detailsrequirements.txtuv.lock |
📦 Trivy Secret Scan Results
Report SummaryCould not generate summary table (data length mismatch: 9 vs 8). Scan Result Detailsrequirements.txtuv.lock |
☂️ Python Coverage
Overall Coverage
New FilesNo new covered files... Modified FilesNo covered modified files...
|
|
🛠 Full Test Coverage Report: https://k.atlan.dev/coverage/application-sdk/pr/1852 |
🧪 SDR Integration Tests (testcontainer) — mysqlStatus: Failed ❌ What this validates
Re-runPush a new commit, or re-run from the Actions tab. |
|
@sdk-review |
|
🔍 SDK Review (mothership) triggered by @vaibhavatlan at 2026-05-29T11:07:56.833Z. Watch the workflow run live — the review summary will appear as a separate comment when complete (typical: 5–30 min, hard cap 2h). ✅ Completed — status |
Test SDK Review (mothership): PR #1852 — fix(ci): make unit-tests gate poll tolerant of transient API errorsVerdict: READY TO MERGE
Findings
Holistic Recommendations
Strengths
Note on CISeveral pre-existing checks are red on this PR ( CI: 4 failing (pre-existing, unrelated to this change) |
| RESULT=$(gh run view "${RUN_ID}" --repo "${REPO}" --json status,conclusion,url) | ||
| RESULT="" | ||
| for attempt in 1 2 3; do | ||
| if RESULT=$(gh run view "${RUN_ID}" --repo "${REPO}" --json status,conclusion,url 2>&1); then |
There was a problem hiding this comment.
[Important] [CORRECTNESS] — 2>&1 merges stderr into RESULT on success, which can defeat the resilience this PR is adding.
Evidence: RESULT=$(gh run view "${RUN_ID}" --repo "${REPO}" --json status,conclusion,url 2>&1)
The if RESULT=...; then break only checks the exit code. If gh exits 0 but also wrote anything to stderr (update notice, deprecation warning, future flag warning, etc.), RESULT contains mixed <stderr text> + JSON. The subsequent STATUS=$(echo "${RESULT}" | jq -r '.status') then fails with a parse error, and because set -euo pipefail is active (line 310), the script exits — killing the gate on a successful gh call. That's the exact failure mode this PR is trying to eliminate.
In practice this rarely fires (gh is usually quiet on --json), but the regression risk lives forever.
Path Forward (immediate fix): capture stderr separately so RESULT only ever holds JSON on success, and only the failure path sees the error text:
ERR_FILE=$(mktemp)
RESULT=""
for attempt in 1 2 3; do
if RESULT=$(gh run view "${RUN_ID}" --repo "${REPO}" --json status,conclusion,url 2>"${ERR_FILE}"); then
break
fi
echo " iteration ${i} attempt ${attempt}/3: gh run view failed: $(cat "${ERR_FILE}")"
RESULT=""
sleep 5
done
rm -f "${ERR_FILE}"Or, if you don't care about prefixing the error with the iteration label, drop 2>&1 entirely — gh's stderr will still surface in the GitHub Actions log just below the iteration line, which is good enough for debugging.
atlan-ci
left a comment
There was a problem hiding this comment.
SDK reviewer's verdict: READY TO MERGE.
Full review summary is in the comment posted on this PR.
|
Ignore above review. Just testing |
Human activity by @vaibhavatlan (issue_comment) detected on this PR — the @sdk-review bot's auto-approval has been dismissed. Re-comment @sdk-review once the PR is ready for the bot to re-approve, or have a code owner approve manually.
Summary
Make the `unit-tests-gate` poller resilient to transient GitHub API errors so a single `gh run view` hiccup doesn't kill publish on a passing test.
Problem
The poller has `set -euo pipefail` and runs `gh run view` once per 25s iteration. Any non-zero exit — 401 "Bad credentials", 5xx, transient network blip — kills the whole step. Worse, the warn-only safety net at the next step (`if: steps.wait.outputs.conclusion != 'success'`) doesn't fire when the wait step itself errors out, so all downstream publish jobs cascade-skip via their `!failure()` guards. Result: warn-only mode silently becomes fail-closed on infra blips.
Real incident
Run `atlanhq/atlan-databricks-app/actions/runs/26329578246`:
Tests passed; the gate blocked publish on a one-off API blip.
Fix
Wrap `gh run view` in a 3-attempt retry with 5s backoff inside each polling iteration. If all retries fail in one iteration, log and `continue` to the next 25s tick rather than failing the step. The 30-minute outer cap (`for i in $(seq 1 70)`) is unchanged, so eternal API outages still time out cleanly.
Test plan
Follow-ups (not in this PR)
The broader warn-only semantic issue still stands: even with this retry, an exhausted-retry exit `1` would still cascade-skip publish. Worth a separate change to either (a) write `conclusion=unknown` on retry exhaustion and `exit 0`, letting the warn-only annotate step decide, or (b) change the warn-only step's `if:` to `always()` and tighten downstream `needs:` conditions. Punting that to a follow-up so this fix is small and focused.