Visual review dashboard, result sidecars, and unified uv packaging #139

Open: manykarim wants to merge 6 commits into main from feature/dashboard

@manykarim

Owner

Summary

Three bodies of work, developed and verified together:

1. Visual Review Dashboard ([dashboard] extra)

A locally runnable web app to review visual comparison results outside log.html — see docs/dashboard.md:

  • Ingest output.xml by path, via API, or by uploading a whole results folder from the browser (relative paths preserved); sidecar-first with HTML-scraping fallback for legacy runs (shown as degraded)
  • Review: run/test grids with diff thumbnails, diff viewer (side-by-side / overlay / blink / swipe, keyboard-driven diff-region navigation), per-page accept (baseline promotion as a plain file copy with SHA-256 audit trail), reject with bug-data ZIP export; honest document-granularity handling for multi-page PDFs
  • Mask editor (react-konva): coordinate/area/pattern masks with schema-exact masks.json I/O, unit/DPI-faithful display, live regex preview through the library's own text-extraction path, one-click create mask from diff region, and instant recompare of stored runs with adjusted masks
  • Local-first security: binds 127.0.0.1, all file access confined to configured roots, optional bearer token for team mode; UI detects backend version skew and says "restart" instead of failing cryptically
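The per-page accept described above (baseline promotion as a plain file copy with a SHA-256 audit trail) could look roughly like the following. This is a minimal sketch, not the dashboard's code: the function names, the JSON-lines audit format, and the record fields are all assumptions made for illustration.

```python
import hashlib
import json
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def promote_baseline(candidate: Path, reference: Path, audit_log: Path) -> dict:
    """Copy candidate over reference and append one audit record.

    The record layout is hypothetical; it only illustrates hashing both
    sides of the copy so the promotion can be traced later.
    """
    record = {
        "reference": str(reference),
        "old_sha256": sha256_of(reference) if reference.exists() else None,
        "new_sha256": sha256_of(candidate),
    }
    shutil.copyfile(candidate, reference)  # plain file copy, no database blob
    with audit_log.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
    return record


# Demo in a throwaway directory
workdir = Path(tempfile.mkdtemp())
candidate = workdir / "candidate.png"
reference = workdir / "reference.png"
audit_log = workdir / "audit.jsonl"
candidate.write_bytes(b"new rendering")
reference.write_bytes(b"old rendering")

record = promote_baseline(candidate, reference, audit_log)
```

Because the reference is overwritten in place, the change shows up as an ordinary modified file in the test-data tree.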

2. Core library

  • result_json sidecar (schema v1): opt-in machine-readable result per comparison (per-page SSIM, diff regions, resolved masks, lossless renderings) + DOCTEST_RESULT: log line — the data contract for the dashboard and any future tooling
  • reference_run implemented: the documented save-candidate-as-reference behavior was a no-op; now functional in VisualTest and PdfTest
  • Bug fixes: _convert_to_pixels truncated values before unit conversion (25.4 mm @ 200 DPI resolved to 196 px, now exactly 200 px); OCR pattern masks uppercased tokens before matching, so mixed-case patterns like .*Robot.* never matched; line_pattern was silently matched word by word and now matches at line level, so phrases with spaces match and exactly the regex match span is masked
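The arithmetic behind the _convert_to_pixels fix can be reproduced in a few lines. The two helper functions below are illustrative stand-ins, not the library's implementation; in particular, rounding at the end of the fixed variant is an assumption.

```python
MM_PER_INCH = 25.4


def to_pixels_truncating_first(value_mm: float, dpi: int) -> int:
    """Buggy order: drop the fraction of the millimetre value, then convert."""
    return int(int(value_mm) / MM_PER_INCH * dpi)


def to_pixels(value_mm: float, dpi: int) -> int:
    """Fixed order: convert in floating point, round once at the end."""
    return round(value_mm / MM_PER_INCH * dpi)


buggy = to_pixels_truncating_first(25.4, 200)  # 25 mm -> 196.85 -> 196
fixed = to_pixels(25.4, 200)                   # 1 inch at 200 DPI -> 200
```

Truncating 25.4 mm to 25 mm before the conversion loses 0.4 mm, which at 200 DPI is about 3 px, hence the 196 px result.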

3. Unified uv packaging

  • One PEP 621 distribution robotframework-doctestlibrary with extras [ai], [dashboard], [all]; poetry fully removed (hatchling build, universal uv.lock)
  • Poetry's version-conditional pins translated to environment markers, validated per interpreter (3.9–3.13) by scripts/audit_resolved_versions.py and the new invoke multipython task
  • Wheel/sdist parity gate: scripts/compare_wheel_contents.py asserts the new artifacts match a baseline captured from the poetry-core build (identical DocTest content, exclusions, marker-evaluated dependency equivalence) — runs in CI and on the publish path
  • CI reworked to uv across the matrix; publish workflow builds the frontend before the wheel; version bumped to 0.34.0
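The content half of the wheel parity gate amounts to comparing archive member lists. The sketch below shows that idea with in-memory archives; it is not scripts/compare_wheel_contents.py, and the helper names and the DocTest/ prefix filter are assumptions.

```python
import io
import zipfile


def make_wheel(names: list[str]) -> bytes:
    """Build a tiny in-memory zip standing in for a wheel (demo helper)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name in names:
            zf.writestr(name, "")
    return buf.getvalue()


def wheel_members(wheel_bytes: bytes, prefix: str = "DocTest/") -> set[str]:
    """Return the archive members under the package prefix."""
    with zipfile.ZipFile(io.BytesIO(wheel_bytes)) as zf:
        return {n for n in zf.namelist() if n.startswith(prefix)}


def parity_report(baseline: bytes, candidate: bytes) -> dict[str, set[str]]:
    """List files that disappeared from, or were added to, the package."""
    old, new = wheel_members(baseline), wheel_members(candidate)
    return {"missing": old - new, "unexpected": new - old}


baseline = make_wheel(["DocTest/__init__.py", "DocTest/VisualTest.py",
                       "pkg-1.0.dist-info/METADATA"])
candidate = make_wheel(["DocTest/__init__.py", "DocTest/tests/test_x.py",
                        "pkg-2.0.dist-info/METADATA"])
report = parity_report(baseline, candidate)
```

A gate built this way would fail the build whenever either set in the report is non-empty; dependency-marker equivalence would need a separate check on the metadata.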

Verification

  • Merged unit suite: 633 passed, 3 skipped (core + dashboard + parity gate in one uv run)
  • Acceptance suites: 85/85 via uv; Playwright e2e journeys: 12/12 (real robot runs, no mocked backend)
  • Dashboard tests pass under Python 3.9 (one typing fix); resolved-version audit green on 3.9/3.10/3.11/3.12/3.13
  • Fresh-venv install matrix from the wheel: base is lean (no fastapi/pydantic-ai, friendly CLI hint), [dashboard] serves the bundled UI, [all] is the union

Notes for reviewers

  • Baselines stay plain files: accepting a change produces a normal git diff in your test-data tree
  • The frontend build and parity gate have not yet run on real runners; the first workflow_dispatch of the new publish pipeline will exercise them there
  • OpenSpec artifacts for both changes are archived under openspec/changes/archive/ with synced capability specs in openspec/specs/

🤖 Generated with claude-flow

manykarim and others added 6 commits June 12, 2026 15:08
Dashboard (new, ships as the [dashboard] extra):
- FastAPI backend: output.xml ingestion (sidecar-first with HTML-scraping
  fallback), SQLite store, accept/reject with SHA-256 audit trail and
  bug-data export, root-confined asset serving, masks API with library-
  parity normalization, embedded comparison engine for live mask preview
  and recompare of stored runs, file browser, local file and results-folder
  upload, feature-advertising health endpoint
- React/Vite frontend: run/test grids, diff viewer (side-by-side, overlay,
  blink, swipe, region navigation), react-konva mask editor with live
  pattern preview, create-mask-from-diff-region, version-skew banner
- Tests: backend suites against real robot-generated runs, Playwright
  end-to-end journeys (ingest, review, accept, reject, mask editing,
  uploads, version skew), wheel parity gate

Core library:
- result_json sidecar (schema v1) with per-page scores, diff regions,
  resolved masks, and lossless renderings; DOCTEST_RESULT log line
- implement previously non-functional reference_run promotion
- fix _convert_to_pixels truncation (25.4mm @200dpi is now 200px)
- OCR pattern fixes: original-case matching, line-level matching for
  line_pattern, phrase span matching for patterns containing whitespace
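The two OCR pattern problems are easy to reproduce with plain re calls. The snippet is a sketch of the described behavior only: the sample token and line are made up, and the library's real text-extraction path is not involved.

```python
import re

# 1. Case handling: uppercasing the token first defeats a mixed-case pattern.
pattern = re.compile(r".*Robot.*")
token = "RobotFramework"
old_match = pattern.match(token.upper())  # no match against "ROBOTFRAMEWORK"
new_match = pattern.match(token)          # matches the original-case token

# 2. Granularity: a phrase containing spaces can never match a single word.
line = "Printed on 2024-01-31 by Robot"
phrase = re.compile(r"Printed on \d{4}-\d{2}-\d{2}")
word_hits = [w for w in line.split() if phrase.search(w)]  # word-level: nothing
span = phrase.search(line).span()  # line-level: character span of the match
```

Masking only the characters in span, rather than the whole line, corresponds to the "phrase span matching" described above.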

Packaging (uv-unified-packaging):
- single PEP 621 distribution with ai/dashboard/all extras, hatchling
  build, universal uv.lock; poetry removed
- poetry multi-constraints translated to environment markers, validated
  per interpreter (3.9-3.13) by scripts/audit_resolved_versions.py
- wheel/sdist parity with the poetry baseline gated by
  scripts/compare_wheel_contents.py
- uv-based invoke tasks (incl. multipython), CI, publish workflow with
  frontend build step, docs

Co-Authored-By: claude-flow <ruv@ruv.net>
…ld state

uv sync on a fresh checkout failed because the wheel config force-included
frontend/dist, which only exists after an npm build — hatchling errors on
a missing force-include source even for editable installs (broke the
smoke/test CI jobs, which never build the frontend).

Vite now builds straight into doctest_dashboard/static and the wheel picks
it up via an artifacts pattern instead, which is simply absent when not
built: fresh clones sync without Node, release/dashboard CI builds the UI
first as before, and the parity gate still asserts the static bundle is
present in shipped wheels. The frontend/dist dev fallback in the server is
gone — the package-internal static dir serves dev and installs alike.

Co-Authored-By: claude-flow <ruv@ruv.net>

The 3.9 matrix job failed all LLM acceptance tests: pydantic-ai dropped
3.9 support, so the [ai] extra resolved the ancient pydantic-ai-slim
0.8.1 there, which cannot parse current OpenAI chat-completion responses
(3.10+ resolves 1.x and passes). Per maintainer decision, the floor moves
to Python 3.10 (3.9 is past end of life): requires-python >=3.10,
3.9-only dependency branches removed, pydantic-ai-slim pinned >=1, CI
matrix/multipython/audit/docs/specs updated. The wheel-parity gate
correctly flagged the new pin; it now carries a documented
intentional-tightenings allowlist instead of weakening the pin.

Co-Authored-By: claude-flow <ruv@ruv.net>
…tations

Real fixes:
- upload endpoints wrote files synchronously inside async handlers,
  blocking the event loop during large uploads; now streamed through
  anyio.open_file with oversize cleanup preserved
- clickable file-browser rows and mask-list rows were mouse-only; they
  are now keyboard-operable (role=button, tabIndex, Enter/Space)
- asset tokens are URL-encoded when building asset URLs

By-design findings annotated with justification instead of changed:
path construction from user input is the dashboard's purpose and every
flagged site sits behind the root-confinement layer (is_within_roots
with symlink-safe strict resolve) verified by dedicated traversal tests;
regex construction from mask patterns is the library's documented
feature, validated with re.compile at the API boundaries.

Co-Authored-By: claude-flow <ruv@ruv.net>
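A root-confinement check of the kind named above (is_within_roots with a symlink-safe strict resolve) can be sketched as follows. Only the function name comes from the commit message; the body is an assumption about how such a check is typically written with pathlib.

```python
import tempfile
from pathlib import Path


def is_within_roots(candidate: Path, roots: list[Path]) -> bool:
    """Return True only if candidate exists and resolves inside a root."""
    try:
        # strict=True follows symlinks and fails for missing paths
        resolved = candidate.resolve(strict=True)
    except (OSError, RuntimeError):
        return False
    for root in roots:
        try:
            resolved.relative_to(root.resolve(strict=True))
            return True
        except (OSError, RuntimeError, ValueError):
            continue  # not under this root, try the next one
    return False


# Demo: one configured root and a file that sits just outside it
base = Path(tempfile.mkdtemp())
root = base / "results"
root.mkdir()
(root / "output.xml").write_text("<robot/>")
(base / "secret.txt").write_text("not for serving")

inside = is_within_roots(root / "output.xml", [root])
escaped = is_within_roots(root / ".." / "secret.txt", [root])
missing = is_within_roots(root / "nope.png", [root])
```

Resolving before comparing is what defeats ".." segments and symlinks that point outside the root; comparing raw strings would not.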
The remaining Sonar reliability findings were the modal backdrop and
container divs (presentation roles) and entry rows whose keyboard
handlers arrived via prop spread, invisible to static analysis — now
inlined. CI previously ran the full pipeline twice per PR commit (push
and pull_request events); push triggers are limited to main, halving
cost and exposure to live-LLM acceptance-test flakiness.

Co-Authored-By: claude-flow <ruv@ruv.net>

CI runs failed nondeterministically across different LLM acceptance
tests with 'UnexpectedModelBehavior: Exceeded maximum output retries
(1)' — the remote model intermittently returns malformed structured
output and pydantic-ai's default budget is a single retry. Agents now
run with output_retries=3 (configurable via DOCTEST_LLM_OUTPUT_RETRIES
or the llm_output_retries keyword override) and _run_agent performs one
full re-run when the budget is still exhausted, raising the budget from
1 output retry x 1 run to 3 output retries x 2 runs. Regression-tested
with a fake agent failing once then recovering, the give-up path, and
settings parsing.

Co-Authored-By: claude-flow <ruv@ruv.net>
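The outer re-run and the settings parsing described in this commit can be illustrated without pydantic-ai. Everything below is a stand-in: OutputRetriesExceeded replaces UnexpectedModelBehavior, FlakyAgent mimics the fake agent from the regression tests, and falling back to the default on an invalid environment value is an assumption.

```python
class OutputRetriesExceeded(Exception):
    """Local stand-in for pydantic-ai's UnexpectedModelBehavior."""


class FlakyAgent:
    """Fake agent that exhausts its retry budget a fixed number of times."""

    def __init__(self, failures: int) -> None:
        self.failures = failures
        self.calls = 0

    def run(self, prompt: str) -> str:
        self.calls += 1
        if self.calls <= self.failures:
            raise OutputRetriesExceeded("Exceeded maximum output retries")
        return f"ok: {prompt}"


def run_agent(agent: FlakyAgent, prompt: str, reruns: int = 1) -> str:
    """Run the agent, allowing `reruns` full re-runs before giving up."""
    attempts = 1 + reruns
    for attempt in range(attempts):
        try:
            return agent.run(prompt)
        except OutputRetriesExceeded:
            if attempt == attempts - 1:
                raise  # give-up path: budget exhausted on the last run
    raise AssertionError("unreachable")


def output_retries_from_env(env: dict, default: int = 3) -> int:
    """Read DOCTEST_LLM_OUTPUT_RETRIES, falling back to the default."""
    raw = env.get("DOCTEST_LLM_OUTPUT_RETRIES")
    try:
        return max(0, int(raw)) if raw is not None else default
    except ValueError:
        return default


recovering = FlakyAgent(failures=1)
result = run_agent(recovering, "compare")
```

An agent that fails on both runs still raises, so a persistent model problem is reported rather than hidden.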
@sonarqubecloud


Quality Gate failed

Failed conditions
4 Security Hotspots
E Security Rating on New Code (required ≥ A)

See analysis details on SonarQube Cloud

