Visual review dashboard, result sidecars, and unified uv packaging #139
Open
manykarim wants to merge 6 commits into
Dashboard (new, ships as the [dashboard] extra):
- FastAPI backend: output.xml ingestion (sidecar-first with HTML-scraping fallback), SQLite store, accept/reject with SHA-256 audit trail and bug-data export, root-confined asset serving, masks API with library-parity normalization, embedded comparison engine for live mask preview and recompare of stored runs, file browser, local file and results-folder upload, feature-advertising health endpoint
- React/Vite frontend: run/test grids, diff viewer (side-by-side, overlay, blink, swipe, region navigation), react-konva mask editor with live pattern preview, create-mask-from-diff-region, version-skew banner
- Tests: backend suites against real robot-generated runs, Playwright end-to-end journeys (ingest, review, accept, reject, mask editing, uploads, version skew), wheel parity gate

Core library:
- result_json sidecar (schema v1) with per-page scores, diff regions, resolved masks, and lossless renderings; DOCTEST_RESULT log line
- implement previously non-functional reference_run promotion
- fix _convert_to_pixels truncation (25.4mm @200dpi is now 200px)
- OCR pattern fixes: original-case matching, line-level matching for line_pattern, phrase span matching for patterns containing whitespace

Packaging (uv-unified-packaging):
- single PEP 621 distribution with ai/dashboard/all extras, hatchling build, universal uv.lock; poetry removed
- poetry multi-constraints translated to environment markers, validated per interpreter (3.9-3.13) by scripts/audit_resolved_versions.py
- wheel/sdist parity with the poetry baseline gated by scripts/compare_wheel_contents.py
- uv-based invoke tasks (incl. multipython), CI, publish workflow with frontend build step, docs

Co-Authored-By: claude-flow <ruv@ruv.net>
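The _convert_to_pixels fix in the commit above is easiest to follow as arithmetic. The sketch below is not the library's code: both function names are hypothetical, and it assumes the old behavior truncated the millimetre value to an integer before converting, which reproduces the 196 px figure quoted in the PR summary.

```python
MM_PER_INCH = 25.4


def to_pixels_truncating(value_mm: float, dpi: int) -> int:
    """Assumed old behavior: truncate the measurement, then convert."""
    # int(25.4) == 25, so part of the measurement is lost before scaling.
    return int(int(value_mm) / MM_PER_INCH * dpi)


def to_pixels(value_mm: float, dpi: int) -> int:
    """Convert at full precision and round only the final pixel count."""
    return round(value_mm / MM_PER_INCH * dpi)


print(to_pixels_truncating(25.4, 200))  # 196
print(to_pixels(25.4, 200))  # 200
```

Whether the real fix rounds or truncates the final pixel count is not stated in the commit; rounding is used here only to keep the sketch robust against floating-point noise.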
…ld state

uv sync on a fresh checkout failed because the wheel config force-included frontend/dist, which only exists after an npm build; hatchling errors on a missing force-include source even for editable installs (this broke the smoke/test CI jobs, which never build the frontend). Vite now builds straight into doctest_dashboard/static and the wheel picks it up via an artifacts pattern instead, which is simply absent when not built: fresh clones sync without Node, release/dashboard CI builds the UI first as before, and the parity gate still asserts the static bundle is present in shipped wheels. The frontend/dist dev fallback in the server is gone; the package-internal static dir serves dev and installs alike.

Co-Authored-By: claude-flow <ruv@ruv.net>
The 3.9 matrix job failed all LLM acceptance tests: pydantic-ai dropped 3.9 support, so the [ai] extra resolved the ancient pydantic-ai-slim 0.8.1 there, which cannot parse current OpenAI chat-completion responses (3.10+ resolves 1.x and passes). Per maintainer decision, the floor moves to Python 3.10 (3.9 is past end of life): requires-python >=3.10, 3.9-only dependency branches removed, pydantic-ai-slim pinned >=1, CI matrix/multipython/audit/docs/specs updated. The wheel-parity gate correctly flagged the new pin; it now carries a documented intentional-tightenings allowlist instead of weakening the pin.

Co-Authored-By: claude-flow <ruv@ruv.net>
…tations

Real fixes:
- upload endpoints wrote files synchronously inside async handlers, blocking the event loop during large uploads; now streamed through anyio.open_file with oversize cleanup preserved
- clickable file-browser rows and mask-list rows were mouse-only; they are now keyboard-operable (role=button, tabIndex, Enter/Space)
- asset tokens are URL-encoded when building asset URLs

By-design findings annotated with justification instead of changed: path construction from user input is the dashboard's purpose and every flagged site sits behind the root-confinement layer (is_within_roots with symlink-safe strict resolve) verified by dedicated traversal tests; regex construction from mask patterns is the library's documented feature, validated with re.compile at the API boundaries.

Co-Authored-By: claude-flow <ruv@ruv.net>
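As a rough illustration of the root-confinement idea named in the commit above, here is a minimal sketch of a symlink-safe check built on a strict resolve. Only the name is_within_roots comes from the commit message; the signature, the error handling, and the choice to treat unresolvable paths as outside are assumptions, not the dashboard's actual implementation.

```python
from pathlib import Path
from typing import Iterable


def is_within_roots(candidate: str, roots: Iterable[Path]) -> bool:
    """Return True only if candidate resolves to a path under an allowed root."""
    try:
        # strict=True follows symlinks and fails if the target does not exist,
        # so "..", dangling links, and missing files cannot slip through.
        resolved = Path(candidate).resolve(strict=True)
    except (OSError, RuntimeError):
        return False
    for root in roots:
        try:
            # relative_to raises ValueError when resolved is not below root.
            resolved.relative_to(root.resolve(strict=True))
            return True
        except (OSError, ValueError):
            continue
    return False
```

Resolving both the candidate and each root before comparing means a symlink inside a root that points elsewhere is judged by its real target, which is the property the traversal tests mentioned above would need to exercise.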
The remaining Sonar reliability findings were the modal backdrop and container divs (presentation roles) and entry rows whose keyboard handlers arrived via prop spread, invisible to static analysis; these are now inlined. CI previously ran the full pipeline twice per PR commit (push and pull_request events); push triggers are limited to main, halving cost and exposure to live-LLM acceptance-test flakiness.

Co-Authored-By: claude-flow <ruv@ruv.net>
CI runs failed nondeterministically across different LLM acceptance tests with 'UnexpectedModelBehavior: Exceeded maximum output retries (1)': the remote model intermittently returns malformed structured output and pydantic-ai's default budget is a single retry. Agents now run with output_retries=3 (configurable via DOCTEST_LLM_OUTPUT_RETRIES or the llm_output_retries keyword override) and _run_agent performs one full re-run when the budget is still exhausted, turning a 1x1 budget into 3x2. Regression-tested with a fake agent failing once then recovering, the give-up path, and settings parsing.

Co-Authored-By: claude-flow <ruv@ruv.net>
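The re-run logic described in the commit above can be sketched without any LLM dependency. Everything below is illustrative: OutputRetriesExceeded stands in for the UnexpectedModelBehavior failure, run_with_rerun and FlakyAgent are hypothetical names, and the fallback to the default on an unparsable DOCTEST_LLM_OUTPUT_RETRIES value is an assumption about the settings parsing, not a description of it.

```python
import os


class OutputRetriesExceeded(Exception):
    """Hypothetical stand-in for the 'Exceeded maximum output retries' error."""


def output_retries_from_env(default: int = 3) -> int:
    """Read the retry budget from the environment, falling back to the default."""
    raw = os.environ.get("DOCTEST_LLM_OUTPUT_RETRIES", "")
    try:
        value = int(raw)
    except ValueError:
        return default  # assumption: unset or invalid values use the default
    return value if value >= 0 else default


def run_with_rerun(run_once, max_runs: int = 2):
    """Call run_once, repeating the whole run if its retry budget is exhausted."""
    if max_runs < 1:
        raise ValueError("max_runs must be at least 1")
    last_error = None
    for _ in range(max_runs):
        try:
            return run_once()
        except OutputRetriesExceeded as exc:
            last_error = exc
    raise last_error


class FlakyAgent:
    """Fake agent that fails a fixed number of times, then succeeds."""

    def __init__(self, failures: int) -> None:
        self.failures = failures
        self.calls = 0

    def __call__(self) -> str:
        self.calls += 1
        if self.calls <= self.failures:
            raise OutputRetriesExceeded("malformed structured output")
        return "ok"
```

With max_runs=2, an agent that fails once recovers on the second run, while one that fails twice surfaces the original error, which mirrors the recover and give-up paths the regression tests are said to cover.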




Summary
Three bodies of work, developed and verified together:
1. Visual Review Dashboard ([dashboard] extra)
A locally runnable web app to review visual comparison results outside log.html; see docs/dashboard.md:
- output.xml by path, via API, or by uploading a whole results folder from the browser (relative paths preserved); sidecar-first with HTML-scraping fallback for legacy runs (shown as degraded)
- masks.json I/O, unit/DPI-faithful display, live regex preview through the library's own text-extraction path, one-click create mask from diff region, and instant recompare of stored runs with adjusted masks
2. Core library
- result_json sidecar (schema v1): opt-in machine-readable result per comparison (per-page SSIM, diff regions, resolved masks, lossless renderings) plus a DOCTEST_RESULT: log line; this is the data contract for the dashboard and any future tooling
- reference_run implemented: the documented save-candidate-as-reference behavior was a no-op; now functional in VisualTest and PdfTest
- _convert_to_pixels truncated values before unit conversion (25.4 mm @ 200 DPI resolved to 196 px, now exactly 200 px); OCR pattern masks uppercased tokens before matching (mixed-case patterns like .*Robot.* never matched); line_pattern was silently word-level; phrases with spaces now match at line level, masking exactly the regex match span
3. Unified uv packaging
- robotframework-doctestlibrary with extras [ai], [dashboard], [all]; poetry fully removed (hatchling build, universal uv.lock)
- scripts/audit_resolved_versions.py and the new invoke multipython task
- scripts/compare_wheel_contents.py asserts the new artifacts match a baseline captured from the poetry-core build (identical DocTest content, exclusions, marker-evaluated dependency equivalence); runs in CI and on the publish path
Verification
uv run)
- [dashboard] serves the bundled UI, [all] is the union
Notes for reviewers
- workflow_dispatch of the new publish pipeline will exercise the frontend-build + parity gate on real runners for the first time
- openspec/changes/archive/ with synced capability specs in openspec/specs/
🤖 Generated with claude-flow