90 changes: 59 additions & 31 deletions .github/workflows/ci.yml
@@ -1,6 +1,12 @@
name: CI

on: [push, pull_request]
# pull_request covers PR branches; push only on main — otherwise every PR
# commit runs the whole pipeline twice (doubling cost and exposure to
# live-LLM test flakiness).
on:
push:
branches: [main]
pull_request:

jobs:
smoke-no-ai:
@@ -9,10 +15,6 @@ jobs:
- uses: actions/checkout@v4
with:
fetch-depth: 0
- name: Set up Python
uses: actions/setup-python@v4
with:
python-version: "3.11"
- name: Install Packages and Binaries
run: |
sudo apt-get update
@@ -27,25 +29,23 @@ jobs:
cp ${archive_name}/gpcl6* ${archive_name}/pcl6
sudo cp ${archive_name}/* /usr/bin
sudo cp policy.xml /etc/ImageMagick-6/
- name: Install and configure Poetry
uses: snok/install-poetry@v1
- name: Set up uv
uses: astral-sh/setup-uv@v5
with:
version: 1.8.5
virtualenvs-create: true
virtualenvs-in-project: false
virtualenvs-path: ~/.virtualenvs
installer-parallel: true
python-version: "3.11"
- name: Install base dependencies (without AI extras)
run: poetry install
run: uv sync
- name: Run smoke end-to-end tests without AI extras
run: |
poetry run robot \
uv run robot \
--loglevel=TRACE:INFO \
--listener RobotStackTracer \
--xunit xunit.xml \
-d results-smoke \
atest/Compare.robot \
atest/PdfContent.robot
atest/PdfContent.robot \
atest/ReferenceRun.robot \
atest/ResultJson.robot
- name: Store Smoke Artifacts
uses: actions/upload-artifact@v4
if: success() || failure()
@@ -75,16 +75,12 @@ jobs:
runs-on: ubuntu-latest
strategy:
matrix:
python-version: ["3.9", "3.11", "3.12", "3.13"]
python-version: ["3.10", "3.11", "3.12", "3.13"]
fail-fast: false
steps:
- uses: actions/checkout@v4
with:
fetch-depth: 0
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v4
with:
python-version: ${{ matrix.python-version }}
- name: Install Packages and Binaries
run: |
sudo apt-get update
@@ -99,20 +95,17 @@ jobs:
cp ${archive_name}/gpcl6* ${archive_name}/pcl6
sudo cp ${archive_name}/* /usr/bin
sudo cp policy.xml /etc/ImageMagick-6/
- name: Install and configure Poetry
uses: snok/install-poetry@v1
- name: Set up uv
uses: astral-sh/setup-uv@v5
with:
version: 1.8.5
virtualenvs-create: true
virtualenvs-in-project: false
virtualenvs-path: ~/.virtualenvs
installer-parallel: true
- name: Install dependencies
run: poetry install --extras ai
if: steps.cache.outputs.cache-hit != 'true'
python-version: ${{ matrix.python-version }}
- name: Install dependencies (all extras)
run: uv sync --all-extras --python ${{ matrix.python-version }}
- name: Audit resolved dependency versions
run: uv run python scripts/audit_resolved_versions.py
- name: Run tests
run: |
poetry run invoke tests
uv run invoke tests
- name: Store Artifact
uses: actions/upload-artifact@v4
if: success() || failure()
@@ -133,3 +126,38 @@ jobs:
path: results/pytest.xml,results/xunit.xml # Path to test results
reporter: java-junit # Format of test results

dashboard:
needs: smoke-no-ai
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Install Packages and Binaries
run: |
sudo apt-get update
sudo apt-get install -y imagemagick tesseract-ocr ghostscript libdmtx0b libzbar0
sudo cp policy.xml /etc/ImageMagick-6/
- name: Set up uv
uses: astral-sh/setup-uv@v5
with:
python-version: "3.13"
- name: Set up Node
uses: actions/setup-node@v4
with:
node-version: "22"
- name: Build frontend
working-directory: frontend
run: |
npm ci || npm install
npm run build
- name: Sync environment (all extras)
run: uv sync --all-extras
- name: Backend tests
run: uv run pytest utest/dashboard -v
- name: Wheel parity gate
run: |
uv build
uv run python scripts/compare_wheel_contents.py dist/*.whl --sdist dist/*.tar.gz
- name: Install Playwright browser
run: uv run playwright install chromium --with-deps
- name: End-to-end journeys
run: uv run pytest e2e -v --browser chromium
36 changes: 21 additions & 15 deletions .github/workflows/python-publish.yml
@@ -1,10 +1,7 @@
# This workflow will upload a Python Package using Twine when a release is created
# For more information see: https://help.github.com/en/actions/language-and-framework-guides/using-python-with-github-actions#publishing-to-package-registries

# This workflow uses actions that are not certified by GitHub.
# They are provided by a third-party and are governed by
# separate terms of service, privacy policy, and support
# documentation.
# Builds and uploads the package to PyPI.
# The wheel bundles the dashboard web UI, so the frontend must be built
# before the Python package; a parity gate compares the artifacts against
# the committed baseline manifest before anything is uploaded.

name: Upload Python Package

@@ -16,17 +13,26 @@ jobs:
runs-on: ubuntu-latest

steps:
- uses: actions/checkout@v2
- name: Set up Python
uses: actions/setup-python@v2
- uses: actions/checkout@v4
- name: Set up uv
uses: astral-sh/setup-uv@v5
with:
python-version: "3.12"
- name: Set up Node
uses: actions/setup-node@v4
with:
python-version: '3.x'
- name: Install dependencies
node-version: "22"
- name: Build frontend (bundled into the wheel)
working-directory: frontend
run: |
python -m pip install --upgrade pip
pip install build
npm ci || npm install
npm run build
- name: Build package
run: python -m build
run: uv build
- name: Parity gate against poetry baseline
run: |
uv sync
uv run python scripts/compare_wheel_contents.py dist/*.whl --sdist dist/*.tar.gz
- name: Publish package
uses: pypa/gh-action-pypi-publish@27b31702a0e7fc50959f5ad993c78deac1bdfc29
with:
7 changes: 7 additions & 0 deletions .gitignore
@@ -86,3 +86,10 @@ RELEASE*.md
.claude/
.*cache/
research/

# Frontend build artifacts (dashboard web UI)
frontend/node_modules/
doctest_dashboard/static/

# Dashboard runtime data
.doctest_dashboard/
5 changes: 3 additions & 2 deletions .gitpod.yml
@@ -5,9 +5,10 @@ tasks:
sudo apt-get update
sudo apt-get install -y imagemagick tesseract-ocr ghostscript libdmtx0b libzbar0 allure
sudo cp policy.xml /etc/ImageMagick-6/
poetry install
curl -LsSf https://astral.sh/uv/install.sh | sh
uv sync --all-extras
command: |
poetry run invoke tests
uv run invoke tests
image: gitpod/workspace-full-vnc
vscode:
extensions:
151 changes: 125 additions & 26 deletions DocTest/DocumentRepresentation.py
@@ -630,36 +630,126 @@
self._process_area_ignore_area(ignore_area)

def _process_pattern_ignore_area_from_ocr(self, ignore_area: Dict):
"""Handle pattern-based ignore areas by searching the OCR text for text patterns."""
"""Handle pattern-based ignore areas by searching the OCR text for text patterns.

Matching levels mirror the PDF text path:

- ``word_pattern``: individual OCR tokens
- ``line_pattern``: whole OCR lines (anchored match on the joined line)
- ``pattern``: individual tokens; when the pattern contains
whitespace (phrases like ``Robot Framework`` cannot match a single
token) it is searched anywhere inside each line and only the words
covered by the match span are masked — wrap the phrase in ``.*``
to mask the entire line instead.
"""
import re
pattern = ignore_area.get('pattern')
pattern_type = ignore_area.get('type') or 'pattern'
xoffset = int(ignore_area.get('xoffset', 0))
yoffset = int(ignore_area.get('yoffset', 0))

# Iterate through text data to identify matching patterns and mark as ignore areas
n_boxes = len(self.ocr_text_data['text'])
for j in range(n_boxes):
raw_text = self.ocr_text_data['text'][j]
normalized_text = self._normalize_token(raw_text)
if not normalized_text:
continue
match_target = normalized_text.upper()
if not re.match(pattern, match_target):
continue
def matches(text: str) -> bool:
# Match the original-case text first (consistent with the PDF text
# path); keep the legacy uppercased match as fallback so patterns
# written against uppercase targets continue to work.
return bool(re.match(pattern, text) or re.match(pattern, text.upper()))

x, y, w, h = (
self.ocr_text_data['left'][j],
self.ocr_text_data['top'][j],
self.ocr_text_data['width'][j],
self.ocr_text_data['height'][j],
)
text_mask = {
def add_area(x, y, w, h):
self.pixel_ignore_areas.append({
"x": int(x) - xoffset,
"y": int(y) - yoffset,
"width": int(w) + 2 * xoffset,
"height": int(h) + 2 * yoffset,
}
self.pixel_ignore_areas.append(text_mask)
})

has_line_info = all(
key in self.ocr_text_data for key in ('block_num', 'par_num', 'line_num')
)
word_level = pattern_type in ('pattern', 'word_pattern')
line_level = has_line_info and (
pattern_type == 'line_pattern'
or (pattern_type == 'pattern' and re.search(r'\s|\\s', pattern or ''))
)
if pattern_type == 'line_pattern' and not has_line_info:
# e.g. EAST engine output has no line structure — fall back to words
word_level = True

n_boxes = len(self.ocr_text_data['text'])

if word_level:
for j in range(n_boxes):
normalized_text = self._normalize_token(self.ocr_text_data['text'][j])
if not normalized_text or not matches(normalized_text):
continue
add_area(
self.ocr_text_data['left'][j],
self.ocr_text_data['top'][j],
self.ocr_text_data['width'][j],
self.ocr_text_data['height'][j],
)

if line_level:
def add_union_area(token_indices):
x1 = min(self.ocr_text_data['left'][j] for j in token_indices)
y1 = min(self.ocr_text_data['top'][j] for j in token_indices)
x2 = max(
self.ocr_text_data['left'][j] + self.ocr_text_data['width'][j]
for j in token_indices
)
y2 = max(
self.ocr_text_data['top'][j] + self.ocr_text_data['height'][j]
for j in token_indices
)
add_area(x1, y1, x2 - x1, y2 - y1)

lines: Dict[tuple, list] = {}
for j in range(n_boxes):
normalized_text = self._normalize_token(self.ocr_text_data['text'][j])
if not normalized_text:
continue
key = (
self.ocr_text_data['block_num'][j],
self.ocr_text_data['par_num'][j],
self.ocr_text_data['line_num'][j],
)
lines.setdefault(key, []).append(j)
for indices in lines.values():
tokens = [
self._normalize_token(self.ocr_text_data['text'][j]) for j in indices
]
line_text = " ".join(tokens)

if pattern_type == 'line_pattern':
if matches(line_text):
add_union_area(indices)
continue

# type 'pattern' with whitespace: search the phrase anywhere
# in the line and mask only the words the match span covers
spans = [
m.span() for m in re.finditer(pattern, line_text) if m.end() > m.start() # NOSONAR: patterns are test-author-supplied mask definitions, validated with re.compile at API boundaries

Check failure on line 730 in DocTest/DocumentRepresentation.py (SonarQubeCloud / SonarCloud Code Analysis): Change this code to not construct the regular expression from user-controlled data.
See more on https://sonarcloud.io/project/issues?id=manykarim_robotframework-doctestlibrary&issues=AZ67-WQvvxKY-zko9yOa&open=AZ67-WQvvxKY-zko9yOa&pullRequest=139
]
if not spans:
# legacy fallback: patterns written against uppercase text
spans = [
m.span()
for m in re.finditer(pattern, line_text, re.IGNORECASE) # NOSONAR: same test-author-supplied pattern as above
if m.end() > m.start()
]
if not spans:
continue
offsets = []
position = 0
for token in tokens:
offsets.append((position, position + len(token)))
position += len(token) + 1
for start, end in spans:
covered = [
j for j, (token_start, token_end) in zip(indices, offsets)
if token_start < end and token_end > start
]
if covered:
add_union_area(covered)
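
Review note: the phrase-matching branch above (type `pattern` with whitespace) can be sketched in isolation as follows. This is a minimal illustration under stated assumptions, not the library's API: `tokens_covered_by_matches` and `union_box` are hypothetical helper names, tokens are assumed to be already normalized, and boxes are assumed to be `(left, top, width, height)` tuples in the same order as the tokens.

```python
import re
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, width, height), assumed layout


def tokens_covered_by_matches(tokens: List[str], pattern: str) -> List[List[int]]:
    """Return, per non-empty regex match, the indices of the tokens it overlaps."""
    line_text = " ".join(tokens)

    # Character offsets of each token inside the joined line
    # (+1 accounts for the single joining space).
    offsets = []
    position = 0
    for token in tokens:
        offsets.append((position, position + len(token)))
        position += len(token) + 1

    groups = []
    for match in re.finditer(pattern, line_text):
        start, end = match.span()
        if end <= start:  # skip empty matches
            continue
        covered = [
            index
            for index, (token_start, token_end) in enumerate(offsets)
            if token_start < end and token_end > start  # half-open overlap test
        ]
        if covered:
            groups.append(covered)
    return groups


def union_box(boxes: List[Box]) -> Box:
    """Smallest box enclosing all given boxes."""
    x1 = min(left for left, _, _, _ in boxes)
    y1 = min(top for _, top, _, _ in boxes)
    x2 = max(left + width for left, _, width, _ in boxes)
    y2 = max(top + height for _, top, _, height in boxes)
    return (x1, y1, x2 - x1, y2 - y1)


tokens = ["Built", "with", "Robot", "Framework", "today"]
phrase_only = tokens_covered_by_matches(tokens, r"Robot Framework")
whole_line = tokens_covered_by_matches(tokens, r".*Robot Framework.*")
```

With these inputs, `phrase_only` selects only the two words under the match span, while wrapping the phrase in `.*` selects every token on the line, which is the behaviour the docstring describes. Passing the boxes of the selected tokens to `union_box` then gives a single mask rectangle per match.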

def _process_pattern_ignore_area_from_pdf(self, ignore_area: Dict):
import re
@@ -709,18 +799,27 @@
self.pixel_ignore_areas.append({"x": x, "y": y, "height": h, "width": w})

def _convert_to_pixels(self, area: Dict, unit: str):
"""Convert dimensions from cm, mm, or px to pixel units."""
x, y, w, h = int(area['x']), int(area['y']), int(area['width']), int(area['height'])
"""Convert dimensions from cm, mm, pt, or px to pixel units.

Conversion is applied to the original (possibly fractional) values
and rounded only once at the end, so e.g. 25.4 mm at 200 DPI
resolves to exactly 200 px.
"""
x, y, w, h = float(area['x']), float(area['y']), float(area['width']), float(area['height'])
if unit == 'mm':
constant = self.dpi / 25.4
x, y, w, h = int(x * constant), int(y * constant), int(w * constant), int(h * constant)
elif unit == 'cm':
constant = self.dpi / 2.54
x, y, w, h = int(x * constant), int(y * constant), int(w * constant), int(h * constant)
elif unit == 'pt':
constant = self.dpi / 72.0
x, y, w, h = int(x * constant), int(y * constant), int(w * constant), int(h * constant)
return x, y, w, h
else:
constant = 1.0
return (
int(round(x * constant)),
int(round(y * constant)),
int(round(w * constant)),
int(round(h * constant)),
)
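
Review note: the effect of rounding once at the end can be checked with a small standalone sketch. It assumes numeric (not string) inputs; `to_pixels` and `to_pixels_truncating` are hypothetical names used only for this comparison, with the second one mimicking the removed lines that cast to `int` before scaling.

```python
def unit_factor(unit: str, dpi: int) -> float:
    """Pixels per unit; anything that is not mm, cm or pt is treated as pixels."""
    if unit == "mm":
        return dpi / 25.4
    if unit == "cm":
        return dpi / 2.54
    if unit == "pt":
        return dpi / 72.0
    return 1.0


def to_pixels(value: float, unit: str, dpi: int) -> int:
    """Scale the original (possibly fractional) value, then round once."""
    return int(round(float(value) * unit_factor(unit, dpi)))


def to_pixels_truncating(value: float, unit: str, dpi: int) -> int:
    """Earlier behaviour: truncate the input first, then truncate the result."""
    return int(int(value) * unit_factor(unit, dpi))


new_result = to_pixels(25.4, "mm", 200)
old_result = to_pixels_truncating(25.4, "mm", 200)
```

For 25.4 mm at 200 DPI the truncating variant drops the fractional 0.4 mm before scaling and ends up several pixels short, while the round-once variant resolves to the full inch.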

def _process_area_ignore_area(self, ignore_area: Dict):
"""Handle area-based ignore areas (e.g., 'top', 'bottom', 'left', 'right') as percentages."""
@@ -1215,7 +1314,7 @@
text_content += page.get_text("text")
return text_content if text_content.strip() else ""
except Exception as e: # defensive: covers ImportError, fitz internal errors, and I/O failures
logger.error("Failed to extract text from PDF: %s", e)

Check failure on line 1317 in DocTest/DocumentRepresentation.py (SonarQubeCloud / SonarCloud Code Analysis): Use "logging.exception()" instead.
See more on https://sonarcloud.io/project/issues?id=manykarim_robotframework-doctestlibrary&issues=AZ67-WQvvxKY-zko9yOY&open=AZ67-WQvvxKY-zko9yOY&pullRequest=139
return ""

def compare_with(self, other_doc: 'DocumentRepresentation') -> bool: