Rewrite: agentic failure-triage, self-healing & RCA engine (0.4)#13

Merged
manykarim merged 81 commits into main from rewrite/agentic-heal
Jun 15, 2026

Conversation

@manykarim

Owner

Ground-up rewrite of robotframework-heal from a locator-only healer into a failure-triage, self-healing and root-cause-analysis engine built on pydantic-ai. Every change in this PR was developed spec-first (OpenSpec) and validated with real experiments against live model backends.

Highlights

  • Failure taxonomy — 7 classes (locator-drift, timing, viewport, overlay, form-state, assertion-drift, unknown). Deterministic detectors run first; an LLM triage agent only when needed.
  • Tiered locator healing — deterministic candidates → LLM index-pick → full-DOM generation fallback. Experiment-measured ~65–70% fewer tokens at equal accuracy, +27 accuracy points on 8B-class models.
  • Verified, never guessed — every proposed fix is checked against the live page before the keyword reruns; verification lives in output validators so it works on every model output mode.
  • Capability-tiered models — any OpenAI-compatible endpoint (vLLM, Ollama, LiteLLM, MiniMax, OpenRouter) or a pydantic-ai provider string; capability is probed (heal doctor), not assumed. Small models work via a prompted-JSON floor.
  • Frames & shadow DOM — open shadow roots and same-origin iframes heal via frame >>> inner selectors; an interaction-target blocklist fixes a demonstrated iframe false-heal.
  • Cross-run heal memory — known fixes warm-start from history.sqlite; repeat heals cost zero tokens.
  • Fix engine — read-only healed copies + word-highlighted HTML diffs by default (sources never touched); AST-based origin resolution (literal / variable / prefix${VAR}suffix / user-keyword argument call-site tracing) with blast-radius safety; opt-in patch/in-place tiers.
  • Reporting — crash-safe JSONL run store, self-contained HTML dashboard, summary.json for CI gates, cross-run history.
  • Surfaces — RF listener, heal CLI (triage/report/apply/doctor/mcp/corpus/history), MCP server + agent skill, SeleniumLibrary support ([selenium] extra), Appium.
  • Comprehensive docs site — Material for MkDocs, Diátaxis structure, config/CLI reference auto-generated from code with a drift guard, live at the GitHub Pages URL.
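The tier order described under "Tiered locator healing" and "Verified, never guessed" can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: `heal_locator`, `verify`, `pick_index` and `generate` are hypothetical names, and the rule that tier 1 accepts only when exactly one candidate verifies is an assumption made for the example.

```python
from typing import Callable, Optional, Sequence, Tuple

# All names below are hypothetical; they only illustrate the tier order.
Verifier = Callable[[str], bool]                         # True if the selector checks out on the live page
IndexPicker = Callable[[Sequence[str]], Optional[int]]   # stands in for the LLM index-pick call
Generator = Callable[[], Optional[str]]                  # stands in for full-DOM generation


def heal_locator(
    candidates: Sequence[str],
    verify: Verifier,
    pick_index: IndexPicker,
    generate: Generator,
) -> Optional[Tuple[str, str]]:
    """Return (selector, tier) for the first proposal that verifies, else None."""
    # Tier 1: deterministic candidates, no model call at all.
    # Assumption: accept only when exactly one candidate verifies.
    verified = [c for c in candidates if verify(c)]
    if len(verified) == 1:
        return verified[0], "deterministic"

    # Tier 2: the model only chooses an index into the candidate list.
    if candidates:
        idx = pick_index(candidates)
        if idx is not None and 0 <= idx < len(candidates) and verify(candidates[idx]):
            return candidates[idx], "index-pick"

    # Tier 3: fall back to generating a selector from the full DOM.
    proposal = generate()
    if proposal is not None and verify(proposal):
        return proposal, "generation"

    # Nothing verified: report no fix rather than guess.
    return None
```

Every tier passes through the same `verify` callback, so an unverified proposal never reaches the rerun regardless of which tier produced it.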

Breaking changes

  • Configuration moves to HEAL_* env vars (pydantic-settings); LLM_* no longer read.
  • Packaging migrates to PEP 621 + uv (hatchling); new heal package and heal console script.
  • New canonical listener import Library Heal. Back-compat preserved: Library SelfHealing (deprecated shim) and Library heal.rf.HealListener both still work.

See docs/reference/migration.md for the full 0.3 → 0.4 migration.
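As a rough illustration of the env-var change, the helper below maps the 0.3 names to HEAL_* names. The legacy names and `HEAL_MODEL` appear in this PR; `HEAL_API_KEY` and `HEAL_API_BASE` are assumed target names used only for the sketch, so check docs/reference/migration.md for the real ones. The `openai/` prefix stripping mirrors what the e2e workflow commit describes.

```python
from typing import Dict, Mapping

# Legacy names are taken from this PR. HEAL_API_KEY and HEAL_API_BASE are
# ASSUMED target names for illustration; only HEAL_MODEL is named in the PR.
LEGACY_TO_HEAL = {
    "LLM_API_KEY": "HEAL_API_KEY",
    "LLM_API_BASE": "HEAL_API_BASE",
    "LLM_LOCATOR_MODEL": "HEAL_MODEL",
}


def migrate_env(env: Mapping[str, str]) -> Dict[str, str]:
    """Build a HEAL_* mapping from an environment that may still use LLM_* names."""
    # Explicit HEAL_* values always win over migrated legacy values.
    out = {key: value for key, value in env.items() if key.startswith("HEAL_")}
    for old, new in LEGACY_TO_HEAL.items():
        if old in env and new not in out:
            value = env[old]
            if new == "HEAL_MODEL":
                # 0.3 model ids carried a litellm-style "openai/" prefix.
                value = value.removeprefix("openai/")
            out[new] = value
    return out
```

Calling `migrate_env(os.environ)` returns only the HEAL_* view; nothing is written back to the process environment.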

Testing

  • 172 unit tests (no LLM, no browser) green via invoke heal-utests.
  • Acceptance tests heal on a real browser through Library Heal (invoke heal-atests; locator suites via --live-llm).
  • Replay/eval harness over a 60-fixture corpus grades healing quality per model tier.
  • Every architectural assumption is backed by a recorded experiment (experiments/*/FINDINGS.md): the threading model (4/4 spike), small-model output modes, selection-vs-generation, frame piercing, Selenium primitives.

Notes

  • All four OpenSpec changes are archived; openspec/specs/ holds the deployed 19-capability baseline.
  • Merging triggers the new docs workflow to keep the published site current.
  • Legacy demo atests intentionally remain on the SelfHealing shim to exercise the deprecation path.

🤖 Generated with Claude Code

manykarim and others added 30 commits June 10, 2026 17:21
…sign

Proposal, design, 12 capability specs and phased tasks for rewriting
robotframework-heal as a pydantic-ai based failure-triage and RCA agent.

Includes MiniMax/OpenRouter probe suite (experiments/minimax-probe):
- root cause of small-model tool failures: forced tool_choice
  (fix: openai_supports_tool_choice_required=False, 311s -> 14s)
- ModelRetry verification works in prompted mode on all reachable models
- per-model capability matrix recorded in FINDINGS.md

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…g mapping

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…ved output modes

Encodes probe findings: MiniMax tool_choice fix, vLLM strict-stripping,
prompted floor for unknown backends, tool gating by reliability tier.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…tion

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…ted on MiniMax)

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Ports simplify_dom, unique-selector generation, candidate predicates and
fuzzy filtering from SelfHealing.utils; fixes nth-of-type identity bug and
fuzz-median score/item misalignment.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Deterministic detectors (timing/locator-drift/overlay/viewport) resolve
without LLM; triage agent fallback on silence; budget suppression and
per-failure timeout; timing plugin heals by wait+rerun.
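A minimal sketch of the detectors-first flow, assuming a hypothetical `triage` function and a simple call-count budget. The seven class names come from the PR description; the detector signature, the `Budget` type and the choice to report `unknown` on suppression are illustrative assumptions.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional, Sequence, Tuple


class FailureClass(str, Enum):
    # The seven classes listed in the PR description.
    LOCATOR_DRIFT = "locator-drift"
    TIMING = "timing"
    VIEWPORT = "viewport"
    OVERLAY = "overlay"
    FORM_STATE = "form-state"
    ASSERTION_DRIFT = "assertion-drift"
    UNKNOWN = "unknown"


@dataclass
class Budget:
    """Hypothetical budget: how many LLM calls are still allowed."""
    remaining_llm_calls: int


# A detector returns a class when it recognises the failure, or None when silent.
Detector = Callable[[str], Optional[FailureClass]]


def triage(
    message: str,
    detectors: Sequence[Detector],
    llm_triage: Callable[[str], FailureClass],
    budget: Budget,
) -> Tuple[FailureClass, str]:
    """Classify a failure message, calling the LLM only if every detector is silent."""
    for detect in detectors:
        verdict = detect(message)
        if verdict is not None:
            return verdict, "deterministic"
    if budget.remaining_llm_calls <= 0:
        # Budget exhausted: suppress the LLM call instead of overspending.
        return FailureClass.UNKNOWN, "suppressed"
    budget.remaining_llm_calls -= 1
    return llm_triage(message), "llm"
```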

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
TransactionRuntime: persistent healer loop, main-thread call marshalling,
abandonment with grace. Spike (4/4 PASS) proves rerun, assignment, parallel
async work and timeout unblock inside listener v3.
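The marshalling idea can be sketched with only the standard library. This is not the TransactionRuntime code: `MainThreadBridge` is a hypothetical name, it starts one thread per run instead of keeping a persistent healer loop, and "abandonment" here simply means the main thread stops waiting while the daemon worker is left behind.

```python
import queue
import threading
import time
from concurrent.futures import Future


class MainThreadBridge:
    """Hypothetical bridge: a healer thread asks the main thread to run calls."""

    def __init__(self):
        self._calls = queue.Queue()

    def call_on_main(self, fn, *args):
        """Called from the healer thread; blocks until the main thread ran fn."""
        request = Future()
        self._calls.put((fn, args, request))
        return request.result()

    def run(self, healer, timeout):
        """Called from the main thread; returns the healer result or None on timeout."""
        outcome = Future()

        def worker():
            try:
                outcome.set_result(healer(self))
            except Exception as exc:
                outcome.set_exception(exc)

        threading.Thread(target=worker, daemon=True).start()
        deadline = time.monotonic() + timeout
        while not outcome.done():
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                return None  # abandon: stop waiting so the test run can continue
            try:
                fn, args, request = self._calls.get(timeout=min(remaining, 0.05))
            except queue.Empty:
                continue
            try:
                # Driver calls execute here, on the main thread.
                request.set_result(fn(*args))
            except Exception as exc:
                request.set_exception(exc)
        return outcome.result()
```

The healer can do slow model work on its own thread, while anything that must touch the browser or Robot Framework goes through `call_on_main`.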

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
… rerun fallback

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
… via slow server

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…iniMax preset fixed to prompted

End-to-end via SelfHealing shim: locator drift healed (MiniMax 18.6s,
gpt-4.1-nano 6.9s; greedy reuse 0.2s, no LLM), timing healed deterministically.
NativeOutput proved unreliable under ModelRetry loops on MiniMax -> preset
now resolves prompted; findings recorded.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…istory

Verified live: timing atest produces events.jsonl, heal_report.html,
summary.json and history.sqlite on close.
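One common way to make a JSONL event log survive a crash is to append one flushed, fsynced line per event and to tolerate a torn final line when reading. The sketch below shows that pattern under those assumptions; `append_event` and `read_events` are hypothetical names, and the actual run store may work differently.

```python
import json
import os


def append_event(path, event):
    """Append one event as a single JSON line and force it to disk."""
    line = json.dumps(event, separators=(",", ":"))
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(line + "\n")
        fh.flush()
        os.fsync(fh.fileno())  # survive a crash right after this call


def read_events(path):
    """Read all intact events, skipping blank or torn lines."""
    events = []
    with open(path, encoding="utf-8") as fh:
        for raw in fh:
            raw = raw.strip()
            if not raw:
                continue
            try:
                events.append(json.loads(raw))
            except json.JSONDecodeError:
                # A partially written line from an interrupted run is ignored.
                continue
    return events
```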

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…rough), overlay dismissal

Mobile specifics: page source covers only the current screen, so absence
may mean off-screen; swipe search heals it and falls through to locator
healing (single engine hop) when nothing is found. Locator proposals on
viewport-limited drivers verify via swipe search before rejection.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…orm/assertion healing

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
… RCA agent enrichment

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
… tiers

resolve_fix handles literal / variable / variable+suffix origins incl.
imported resources; cross-file usage scan drives blast radius; unified
patches verified git-appliable; in-place refused on dirty trees; listener
enriches event proposals and writes heal.patch + healed copies at close.
Validated live: gpt-4.1-nano heal produced a correct patch for the atest
suite.
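To illustrate the literal / variable / variable-with-affix distinction, here is a simplified sketch that works on a single argument token with a regular expression. The real engine is described as AST-based and also handles imported resources and blast radius; `resolve_origin` and its return shape are hypothetical, and only one `${VAR}` per token is handled.

```python
import re
from typing import Optional, Tuple

# Matches an optional prefix, one ${VAR}, and an optional suffix.
VAR_TOKEN = re.compile(r"^(?P<prefix>.*?)\$\{(?P<name>[^}]+)\}(?P<suffix>.*)$")


def resolve_origin(argument: str, healed: str) -> Tuple[str, Optional[str], str]:
    """Decide what to rewrite so that `argument` resolves to `healed`.

    Returns (origin_kind, variable_name, new_text):
    - ("literal", None, healed): replace the argument text itself.
    - ("variable", name, value): change the variable's value instead.
    """
    match = VAR_TOKEN.match(argument)
    if match is None:
        return "literal", None, healed
    prefix, suffix = match["prefix"], match["suffix"]
    fits = (
        healed.startswith(prefix)
        and healed.endswith(suffix)
        and len(healed) >= len(prefix) + len(suffix)
    )
    if fits:
        # Keep prefix and suffix in the test source; only the variable value changes.
        value = healed[len(prefix):len(healed) - len(suffix)]
        return "variable", match["name"], value
    # The healed locator no longer fits the pattern: fall back to a literal edit.
    return "literal", None, healed
```

For example, `resolve_origin("css=${ID} >> button", "css=#save >> button")` suggests changing `ID` to `#save` rather than touching the call site.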

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
src/SelfHealing reduced to the deprecation shim; tests/utest superseded by
tests/unit; removed pyautogui/tinydb/cssify/parsimonious/litellm/opencv/lxml.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…, pipeline diagram, changelog

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
… configured

HealSettings loads the nearest .env (cwd upwards) via python-dotenv with
override=True, exporting non-HEAL_ keys too. Unconfigured-model healing now
suppresses with an actionable message instead of 'Engine error'. Verified
end-to-end: legacy atest heals with config purely from a local .env.
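The lookup and override behaviour can be sketched with the standard library alone. The project itself uses python-dotenv, which provides `find_dotenv` and `load_dotenv(override=True)` for this; the two helpers below are hypothetical stand-ins with a deliberately simplified parser (no quoting rules beyond stripping one layer of quotes, no `export` handling).

```python
from pathlib import Path
from typing import Dict, MutableMapping, Optional


def find_nearest_dotenv(start: Path) -> Optional[Path]:
    """Walk from `start` up to the filesystem root and return the first .env found."""
    for directory in (start, *start.parents):
        candidate = directory / ".env"
        if candidate.is_file():
            return candidate
    return None


def apply_dotenv(path: Path, environ: MutableMapping[str, str]) -> Dict[str, str]:
    """Load KEY=VALUE pairs into `environ`, overriding existing values."""
    loaded = {}
    for raw in path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        value = value.strip().strip("'\"")
        # Override semantics: the file wins over what is already set.
        # Non-HEAL_ keys are exported too, as the commit message describes.
        environ[key] = value
        loaded[key] = value
    return loaded
```

Typical use would be `apply_dotenv(find_nearest_dotenv(Path.cwd()), os.environ)` after checking that a file was found.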

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
- assertion parsing handles RF's "Text 'a' (str) should be 'b' (str)" shape
- ambiguous locators (count>1, strict-mode violations) detected as drift;
  driver-level disambiguation (:visible >> nth=0 legacy parity)
- overlay heal without dismiss control falls through to locator healing
- argument-aware select verification: proposal must contain wanted options
- richer validator feedback for multi-match rejections
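The assertion message shape quoted in the first item can be parsed with a single regular expression. This sketch covers only the "should be" form shown in the commit message; `parse_assertion` is a hypothetical helper and other operators or message shapes are out of scope.

```python
import re
from typing import Dict, Optional

# Matches messages shaped like: Text 'a' (str) should be 'b' (str)
ASSERTION = re.compile(
    r"^(?P<what>\w[\w ]*?) '(?P<actual>.*)' \((?P<actual_type>\w+)\) "
    r"should be '(?P<expected>.*)' \((?P<expected_type>\w+)\)$"
)


def parse_assertion(message: str) -> Optional[Dict[str, str]]:
    """Return the parts of an assertion failure message, or None if it does not match."""
    match = ASSERTION.match(message.strip())
    if match is None:
        return None
    return match.groupdict()
```

An assertion-drift heal could then compare `actual` with `expected` to decide whether the expectation, rather than the page, is out of date.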

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
GLOBAL listener scope shares fixed_locators across suites; the proactive
swap in start_keyword now skips keywords inside expected-failure wrappers
(Run Keyword And Return Status etc.), matching the end_keyword rule.
Found via run_keywork_tests atest after a heal leaked from ait_llm.
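The skip rule can be pictured as a check over the stack of enclosing keywords. `should_swap` and the contents of the wrapper set are assumptions: the commit names only Run Keyword And Return Status explicitly ("etc."), and the two other entries are BuiltIn keywords chosen here because they also expect failures.

```python
from typing import Iterable

# "Run Keyword And Return Status" is named in the commit; the other two
# BuiltIn keywords are ASSUMED members of the set for illustration.
EXPECTED_FAILURE_WRAPPERS = frozenset(
    {
        "runkeywordandreturnstatus",
        "runkeywordandexpecterror",
        "runkeywordandignoreerror",
    }
)


def _normalize(name: str) -> str:
    """Drop a library prefix and compare names without case, spaces or underscores."""
    short = name.split(".")[-1]
    return short.replace("_", "").replace(" ", "").lower()


def should_swap(keyword_stack: Iterable[str]) -> bool:
    """Return False when any enclosing keyword expects its body to fail."""
    return not any(_normalize(name) in EXPECTED_FAILURE_WRAPPERS for name in keyword_stack)
```

Applying the same predicate in both `start_keyword` and `end_keyword` keeps the proactive swap and the reactive heal consistent.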

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…ith findings

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Frame-safe healing (false-heal fix + pierced proposals), tiered locator
selection (experiment-backed: -68% tokens, +27pts on 8B models), cross-run
heal memory, SeleniumLibrary driver, self-growing eval corpus.
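A minimal sketch of cross-run heal memory using the standard `sqlite3` module. The table layout and the `open_memory`, `remember` and `recall` names are hypothetical; the one property carried over from the PR description is that a remembered fix is still verified against the live page before reuse.

```python
import sqlite3
from typing import Callable, Optional

# Hypothetical schema: one remembered fix per broken locator.
SCHEMA = """
CREATE TABLE IF NOT EXISTS heals (
    broken TEXT PRIMARY KEY,
    fixed  TEXT NOT NULL
)
"""


def open_memory(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the heal memory; pass a file path to persist across runs."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def remember(conn: sqlite3.Connection, broken: str, fixed: str) -> None:
    """Store or update the fix for a broken locator."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO heals (broken, fixed) VALUES (?, ?)",
            (broken, fixed),
        )


def recall(
    conn: sqlite3.Connection, broken: str, verify: Callable[[str], bool]
) -> Optional[str]:
    """Return a remembered fix only if it still verifies; no model call is involved."""
    row = conn.execute("SELECT fixed FROM heals WHERE broken = ?", (broken,)).fetchone()
    if row is not None and verify(row[0]):
        return row[0]
    return None
```

When `recall` returns a value the heal costs zero tokens; when it returns None the normal healing tiers take over.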

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
manykarim and others added 29 commits June 15, 2026 15:29
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…reen

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…erence

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…cs/_refgen

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…tion

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…, MCP)

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…stores green build)

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…rkflow

PRs validate with strict build; main deploys rolling 'latest'; release tags
add a pinned version. mike verified locally (one version, default set).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…deploy flow

The repository Pages toggle and live-URL verification are one-time manual
steps (no repo-admin access from here) documented in docs/CONTRIBUTING.md.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…onsistency

The docs taught 'Library heal.rf.HealListener' (verbose, never CI-tested)
while all atests used the deprecated 'Library SelfHealing' shim. Add a short,
idiomatic top-level 'Heal' module (Heal.py — a file, not a 'Heal/' dir, to
avoid a case-insensitive-FS collision with the 'heal/' package) re-exporting
the listener. Convert the 6 heal atests and all docs examples to 'Library
Heal' so the public API is dogfooded and CI-covered; keep SelfHealing and
heal.rf.HealListener working. Verified in a real RF run + new unit test.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…reference specs

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…ec deltas

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…, harden .gitignore

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…test step

- ci.yml ran 'invoke tests' (the legacy task) which executes the entire
  tests/atest/ dir — all live-LLM and external-site demo suites — against no
  configured model, so every seeded-failure test failed. Replace with the
  deterministic heal-utests + heal-atests (no LLM, no external sites).
- heal-utests now writes results/pytest.xml so the Test Report step has input
  (it was failing on a missing file).
- Remove the dead LLM_* env (superseded by HEAL_*); install --all-extras.
- Fix a duplicate 'Force Tags' line in heal_dom_edge_cases.robot (invalid RF).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The deterministic PR CI alone doesn't exercise the core value (LLM-driven
healing). Add a dedicated e2e workflow that runs the live-llm acceptance
suites — locator drift, keyword-arg call-site fixing, Selenium, shadow DOM /
iframes — on push to main, weekly, and manual dispatch (not on PRs: secrets
are unavailable to fork PRs and cost tokens).

- Reuses the existing 0.3 secrets (LLM_API_KEY/LLM_API_BASE/LLM_LOCATOR_MODEL,
  stripping the litellm 'openai/' prefix) with new HEAL_* overrides; skips
  cleanly when no key is configured.
- Restructure heal-atests to tag-driven sets: deterministic 'heal-atest'
  (both timing suites) for PR CI, 'live-llm' for e2e.
- Validated end-to-end: 8/8 live suites heal via Library Heal with a real
  model (shadow DOM, iframe pierce, keyword-arg trace, Selenium all pass).
- Flagged in docs: the 0.3 grok model id 404s on OpenRouter — set a current
  HEAL_MODEL variable.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Pre-checked all three models against the live MiniMax API: each resolves via
heal doctor and heals the canonical fixture correctly (M2.5 827tok, M2.7
671tok, M3 727tok), and MiniMax-M2.5 heals through the full real-browser
listener path. Add a matrix job to e2e.yml running the live-llm suites per
model on schedule/manual dispatch (token-cost controlled); skips without the
secret. Documented the required MINIMAX_API_KEY.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
… before merge)

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…only

MiniMax sweep validated green on real runners (M2.5/M2.7/M3 all passed).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Per request, the MiniMax e2e matrix (M2.5/M2.7/M3 x all live suites) now runs
on pull requests as well as push-main/schedule/dispatch — it skips cleanly on
fork PRs without the secret. The generic live-heal job becomes opt-in (runs
only when a HEAL_MODEL repo variable is set) so it never fails PRs on an
unconfigured default.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…acts

Add grouped configuration tables (model + per-role overrides, behaviour
options, budgets, reporting) verified against the settings schema; document
the canonical 'Library Heal' import and its alternatives; list the CLI
commands and report artifacts; add the --live-llm dev command.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Markdown/docs/mkdocs changes no longer trigger the token-heavy 3-model live
sweep on pull requests. Code changes still run it; scheduled and manual
dispatch always run fully.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@manykarim manykarim merged commit 56752b2 into main Jun 15, 2026
7 checks passed