Visual review dashboard, result sidecars, and unified uv packaging by manykarim · Pull Request #139 · manykarim/robotframework-doctestlibrary

manykarim · 2026-06-12T13:09:30Z

Summary

Three bodies of work, developed and verified together:

1. Visual Review Dashboard (`[dashboard]` extra)

A locally runnable web app to review visual comparison results outside log.html — see docs/dashboard.md:

Ingest output.xml by path, via API, or by uploading a whole results folder from the browser (relative paths preserved); sidecar-first with HTML-scraping fallback for legacy runs (shown as degraded)
Review: run/test grids with diff thumbnails, diff viewer (side-by-side / overlay / blink / swipe, keyboard-driven diff-region navigation), per-page accept (baseline promotion as a plain file copy with SHA-256 audit trail), reject with bug-data ZIP export; honest document-granularity handling for multi-page PDFs
Mask editor (react-konva): coordinate/area/pattern masks with schema-exact masks.json I/O, unit/DPI-faithful display, live regex preview through the library's own text-extraction path, one-click create mask from diff region, and instant recompare of stored runs with adjusted masks
Local-first security: binds 127.0.0.1, all file access confined to configured roots, optional bearer token for team mode; UI detects backend version skew and says "restart" instead of failing cryptically
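
As a rough illustration of "baseline promotion as a plain file copy with SHA-256 audit trail", the following minimal sketch copies a candidate over its reference and appends one audit record. The function names (`sha256_of`, `promote_baseline`), the record fields, and the JSON-lines log are assumptions for illustration, not the dashboard's actual code or storage format.

```python
import hashlib
import json
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def promote_baseline(candidate: Path, reference: Path, audit_log: Path) -> dict:
    """Copy candidate over reference and append one audit record (hypothetical format)."""
    record = {
        "reference": str(reference),
        "previous_sha256": sha256_of(reference) if reference.exists() else None,
        "accepted_sha256": sha256_of(candidate),
    }
    # Plain file copy: the baseline stays an ordinary file, so git shows a normal diff.
    shutil.copyfile(candidate, reference)
    with audit_log.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record) + "\n")
    return record


if __name__ == "__main__":
    workdir = Path(tempfile.mkdtemp())
    (workdir / "candidate.png").write_bytes(b"new rendering")
    (workdir / "reference.png").write_bytes(b"old rendering")
    print(promote_baseline(workdir / "candidate.png", workdir / "reference.png", workdir / "audit.jsonl"))
```

Recording both the previous and the accepted digest lets a reviewer later confirm which bytes were replaced without keeping a copy of the old file.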

2. Core library

result_json sidecar (schema v1): opt-in machine-readable result per comparison (per-page SSIM, diff regions, resolved masks, lossless renderings) + DOCTEST_RESULT: log line — the data contract for the dashboard and any future tooling
reference_run implemented: the documented save-candidate-as-reference behavior was a no-op; now functional in VisualTest and PdfTest
Bug fixes: _convert_to_pixels truncated values before unit conversion (25.4 mm @ 200 DPI resolved to 196 px, now exactly 200 px); OCR pattern masks uppercased tokens before matching (mixed-case patterns like .*Robot.* never matched); line_pattern was silently applied at word level, and now matches at line level, with patterns containing spaces masking exactly the regex match span
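
The truncation bug can be reproduced with a few lines of arithmetic. This is a sketch of the failure mode only: the two function names are hypothetical, and the use of `round` in the corrected version is an assumption rather than the library's actual implementation of `_convert_to_pixels`.

```python
MM_PER_INCH = 25.4


def mm_to_pixels_truncating(value_mm: float, dpi: int) -> int:
    """Behaviour as described for the bug: truncate the value first, then convert."""
    return int(int(value_mm) / MM_PER_INCH * dpi)


def mm_to_pixels(value_mm: float, dpi: int) -> int:
    """Convert at full precision and round once at the end (assumed fix)."""
    return round(value_mm / MM_PER_INCH * dpi)


print(mm_to_pixels_truncating(25.4, 200))  # 196, because int(25.4) == 25 and 25 / 25.4 * 200 is about 196.85
print(mm_to_pixels(25.4, 200))  # 200
```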

3. Unified uv packaging

One PEP 621 distribution robotframework-doctestlibrary with extras [ai], [dashboard], [all]; poetry fully removed (hatchling build, universal uv.lock)
Poetry's version-conditional pins translated to environment markers, validated per interpreter (3.9–3.13) by scripts/audit_resolved_versions.py and the new invoke multipython task
Wheel/sdist parity gate: scripts/compare_wheel_contents.py asserts the new artifacts match a baseline captured from the poetry-core build (identical DocTest content, exclusions, marker-evaluated dependency equivalence) — runs in CI and on the publish path
CI reworked to uv across the matrix; publish workflow builds the frontend before the wheel; version bumped to 0.34.0

Verification

Merged unit suite: 633 passed, 3 skipped (core + dashboard + parity gate in one uv run)
Acceptance suites: 85/85 via uv; Playwright e2e journeys: 12/12 (real robot runs, no mocked backend)
Dashboard tests pass under Python 3.9 (one typing fix); resolved-version audit green on 3.9/3.10/3.11/3.12/3.13
Fresh-venv install matrix from the wheel: base is lean (no fastapi/pydantic-ai, friendly CLI hint), [dashboard] serves the bundled UI, [all] is the union

Notes for reviewers

Baselines stay plain files: accepting a change produces a normal git diff in your test-data tree
The first workflow_dispatch of the new publish pipeline will exercise the frontend-build + parity gate on real runners for the first time
OpenSpec artifacts for both changes are archived under openspec/changes/archive/ with synced capability specs in openspec/specs/

🤖 Generated with claude-flow

Dashboard (new, ships as the [dashboard] extra):
- FastAPI backend: output.xml ingestion (sidecar-first with HTML-scraping fallback), SQLite store, accept/reject with SHA-256 audit trail and bug-data export, root-confined asset serving, masks API with library-parity normalization, embedded comparison engine for live mask preview and recompare of stored runs, file browser, local file and results-folder upload, feature-advertising health endpoint
- React/Vite frontend: run/test grids, diff viewer (side-by-side, overlay, blink, swipe, region navigation), react-konva mask editor with live pattern preview, create-mask-from-diff-region, version-skew banner
- Tests: backend suites against real robot-generated runs, Playwright end-to-end journeys (ingest, review, accept, reject, mask editing, uploads, version skew), wheel parity gate

Core library:
- result_json sidecar (schema v1) with per-page scores, diff regions, resolved masks, and lossless renderings; DOCTEST_RESULT log line
- implement previously non-functional reference_run promotion
- fix _convert_to_pixels truncation (25.4mm @200dpi is now 200px)
- OCR pattern fixes: original-case matching, line-level matching for line_pattern, phrase span matching for patterns containing whitespace

Packaging (uv-unified-packaging):
- single PEP 621 distribution with ai/dashboard/all extras, hatchling build, universal uv.lock; poetry removed
- poetry multi-constraints translated to environment markers, validated per interpreter (3.9-3.13) by scripts/audit_resolved_versions.py
- wheel/sdist parity with the poetry baseline gated by scripts/compare_wheel_contents.py
- uv-based invoke tasks (incl. multipython), CI, publish workflow with frontend build step, docs

Co-Authored-By: claude-flow <ruv@ruv.net>
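
The OCR pattern fixes listed in this commit can be illustrated with the standard `re` module alone. The token and line strings below are made up for the example, and the snippet only shows the matching behaviour described, not the library's OCR code path.

```python
import re

pattern = re.compile(r".*Robot.*")
token = "RobotFramework"  # hypothetical OCR token

# Behaviour as described for the bug: the token is uppercased before matching,
# so a mixed-case pattern can never match.
print(bool(pattern.match(token.upper())))  # False

# Original-case matching: the same pattern now matches.
print(bool(pattern.match(token)))  # True

# A phrase containing whitespace has to be matched against the whole line;
# only the matched span would then be masked.
line = "Generated by Robot Framework on 2026-06-12"  # hypothetical OCR line
match = re.search(r"Robot Framework", line)
print(match.span())  # (13, 28)
```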

…ld state

uv sync on a fresh checkout failed because the wheel config force-included frontend/dist, which only exists after an npm build; hatchling errors on a missing force-include source even for editable installs (this broke the smoke/test CI jobs, which never build the frontend). Vite now builds straight into doctest_dashboard/static and the wheel picks it up via an artifacts pattern instead, which is simply absent when not built: fresh clones sync without Node, release/dashboard CI builds the UI first as before, and the parity gate still asserts the static bundle is present in shipped wheels. The frontend/dist dev fallback in the server is gone; the package-internal static dir serves dev and installs alike.

Co-Authored-By: claude-flow <ruv@ruv.net>

The 3.9 matrix job failed all LLM acceptance tests: pydantic-ai dropped 3.9 support, so the [ai] extra resolved the ancient pydantic-ai-slim 0.8.1 there, which cannot parse current OpenAI chat-completion responses (3.10+ resolves 1.x and passes). Per maintainer decision, the floor moves to Python 3.10 (3.9 is past end of life): requires-python >=3.10, 3.9-only dependency branches removed, pydantic-ai-slim pinned >=1, CI matrix/multipython/audit/docs/specs updated. The wheel-parity gate correctly flagged the new pin; it now carries a documented intentional-tightenings allowlist instead of weakening the pin.

Co-Authored-By: claude-flow <ruv@ruv.net>

…tations

Real fixes:
- upload endpoints wrote files synchronously inside async handlers, blocking the event loop during large uploads; now streamed through anyio.open_file with oversize cleanup preserved
- clickable file-browser rows and mask-list rows were mouse-only; they are now keyboard-operable (role=button, tabIndex, Enter/Space)
- asset tokens are URL-encoded when building asset URLs

By-design findings annotated with justification instead of changed: path construction from user input is the dashboard's purpose and every flagged site sits behind the root-confinement layer (is_within_roots with symlink-safe strict resolve) verified by dedicated traversal tests; regex construction from mask patterns is the library's documented feature, validated with re.compile at the API boundaries.

Co-Authored-By: claude-flow <ruv@ruv.net>
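
For readers unfamiliar with the root-confinement idea mentioned above, here is a minimal sketch of a symlink-safe check built on `Path.resolve(strict=True)`. The name `is_within_roots` comes from the commit message, but the signature and body below are assumptions and may differ from the dashboard's real implementation.

```python
import tempfile
from pathlib import Path
from typing import Iterable


def is_within_roots(candidate: str, roots: Iterable[Path]) -> bool:
    """Return True only if candidate resolves to an existing path inside one of the roots."""
    try:
        # strict=True follows symlinks and fails if the path does not exist,
        # so "../" segments and symlinks are collapsed before the comparison.
        resolved = Path(candidate).resolve(strict=True)
    except (OSError, RuntimeError):
        return False
    for root in roots:
        try:
            resolved.relative_to(root.resolve(strict=True))
            return True
        except (OSError, RuntimeError, ValueError):
            continue
    return False


if __name__ == "__main__":
    base = Path(tempfile.mkdtemp())
    results = base / "results"
    results.mkdir()
    (results / "page1.png").write_bytes(b"x")
    (base / "secret.txt").write_text("s", encoding="utf-8")
    print(is_within_roots(str(results / "page1.png"), [results]))  # True
    print(is_within_roots(str(results / ".." / "secret.txt"), [results]))  # False
```

Comparing resolved paths rather than raw strings is what makes the check robust against traversal segments and symlinks that point outside the configured roots.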

The remaining Sonar reliability findings were the modal backdrop and container divs (presentation roles) and entry rows whose keyboard handlers arrived via prop spread and were therefore invisible to static analysis; the handlers are now inlined. CI previously ran the full pipeline twice per PR commit (push and pull_request events); push triggers are now limited to main, halving cost and exposure to live-LLM acceptance-test flakiness.

Co-Authored-By: claude-flow <ruv@ruv.net>

CI runs failed nondeterministically across different LLM acceptance tests with 'UnexpectedModelBehavior: Exceeded maximum output retries (1)': the remote model intermittently returns malformed structured output and pydantic-ai's default budget is a single retry. Agents now run with output_retries=3 (configurable via DOCTEST_LLM_OUTPUT_RETRIES or the llm_output_retries keyword override) and _run_agent performs one full re-run when the budget is still exhausted, turning a 1x1 budget into 3x2. Regression-tested with a fake agent failing once then recovering, the give-up path, and settings parsing.

Co-Authored-By: claude-flow <ruv@ruv.net>
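
The re-run layer described in this commit can be sketched independently of pydantic-ai. Everything below is hypothetical scaffolding: `OutputRetriesExceeded` stands in for the real `UnexpectedModelBehavior` exception, and `output_retries_from_env`, `run_with_rerun`, and `FlakyAgent` are illustrative names; only the environment variable name and the default of 3 come from the commit message.

```python
import os


class OutputRetriesExceeded(Exception):
    """Stand-in for the exception raised when the output-retry budget is exhausted."""


def output_retries_from_env(default: int = 3) -> int:
    """Read DOCTEST_LLM_OUTPUT_RETRIES; fall back to the default on missing or bad input."""
    raw = os.environ.get("DOCTEST_LLM_OUTPUT_RETRIES", "")
    try:
        value = int(raw)
    except ValueError:
        return default
    return value if value >= 1 else default


def run_with_rerun(run_once, reruns: int = 1):
    """Call run_once; if its retry budget is exhausted, re-run it up to `reruns` more times."""
    for attempt in range(reruns + 1):
        try:
            return run_once()
        except OutputRetriesExceeded:
            if attempt == reruns:
                raise  # give-up path: every full run exhausted its budget


class FlakyAgent:
    """Fake agent that fails a fixed number of times, then succeeds."""

    def __init__(self, failures: int) -> None:
        self.failures = failures
        self.calls = 0

    def run(self) -> str:
        self.calls += 1
        if self.calls <= self.failures:
            raise OutputRetriesExceeded("Exceeded maximum output retries")
        return "ok"


if __name__ == "__main__":
    agent = FlakyAgent(failures=1)
    print(run_with_rerun(agent.run), agent.calls)  # ok 2
```

Keeping the re-run outside the agent means the per-run retry count and the number of full runs can be tuned separately.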

sonarqubecloud · 2026-06-12T21:55:10Z

Quality Gate failed

Failed conditions
4 Security Hotspots
E Security Rating on New Code (required ≥ A)


manykarim and others added 6 commits June 12, 2026 15:08

manykarim wants to merge 6 commits into main from feature/dashboard
