| 1 | +# CLAUDE.md — Bank Statements Processor |
| 2 | + |
| 3 | +## Repo Overview |
| 4 | + |
| 5 | +This is the **free-tier** open-source repo. The private `bankstatements-premium` repo holds the paid-tier Docker image published to GHCR. Do not conflate the two. |
| 6 | + |
| 7 | +- **Local Docker image name:** `bankstatementsprocessor` (built from `Dockerfile`) |
| 8 | +- **Production image:** `ghcr.io/longieirl/bankstatements-premium:latest` (private repo only) |
| 9 | +- Legitimate references to `ghcr.io/longieirl/bankstatements` belong only in `.env.remote`, `Makefile docker-push`, and `.github/workflows/`. |
| 10 | + |
| 11 | +Current version: **0.1.6** |
| 12 | + |
| 13 | +--- |
| 14 | + |
| 15 | +## Package Layout |
| 16 | + |
| 17 | +``` |
| 18 | +packages/ |
| 19 | + parser-core/ bankstatements-core (PyPI) — PDF extraction, services, templates |
| 20 | + parser-free/ bankstatements-free (free-tier CLI) — thin wrapper around parser-core |
| 21 | +templates/ shared bank template JSON files |
| 22 | +custom_templates/ user-overridable templates |
| 23 | +``` |
| 24 | + |
| 25 | +**Source of truth for Docker:** `packages/parser-core/` and `packages/parser-free/`. |
| 26 | +`src/` at the repo root is a mirror/symlink for local test running only — never edit it. |
| 27 | + |
| 28 | +Real source: `packages/parser-core/src/bankstatements_core/` |
| 29 | + |
| 30 | +### Module structure (`bankstatements_core`) |
| 31 | + |
| 32 | +``` |
| 33 | +adapters/ pdfplumber adapter |
| 34 | +analysis/ bbox utils, column/table analysis, template generator |
| 35 | +builders/ ProcessorBuilder |
| 36 | +commands/ CLI commands (analyze-pdf, init) |
| 37 | +config/ AppConfig, ProcessorConfig, EnvironmentParser |
| 38 | +domain/ models, protocols, services, converters, currency |
| 39 | +extraction/ PDFExtractor, IBANExtractor, RowBuilder, WordUtils |
| 40 | +facades/ ProcessingFacade |
| 41 | +patterns/ factories, repositories, strategies |
| 42 | +services/ all business logic services |
| 43 | +templates/ bank JSON templates + detectors |
| 44 | +entitlements.py |
| 45 | +processor.py |
| 46 | +pdf_table_extractor.py # legacy shim — delegates to extraction/, treat as deprecated |
| 47 | +``` |
| 48 | + |
| 49 | +--- |
| 50 | + |
| 51 | +## Development Setup |
| 52 | + |
| 53 | +```bash |
| 54 | +pip install -e "packages/parser-core[dev,test]" |
| 55 | +pip install -e "packages/parser-free[test]" |
| 56 | +``` |
| 57 | + |
| 58 | +--- |
| 59 | + |
| 60 | +## Running Tests |
| 61 | + |
| 62 | +```bash |
| 63 | +# parser-core (run from repo root) |
| 64 | +pytest packages/parser-core/tests/ --cov=bankstatements_core --cov-fail-under=91 |
| 65 | + |
| 66 | +# parser-free |
| 67 | +pytest packages/parser-free/tests/ |
| 68 | + |
| 69 | +# integration (run from repo root) |
| 70 | +python -m pytest packages/parser-core/tests/integration/ -m integration --no-cov |
| 71 | + |
| 72 | +# re-baseline integration snapshot |
| 73 | +pytest packages/parser-core/tests/integration/ -m integration --snapshot-update --no-cov |
| 74 | + |
| 75 | +# parallel (faster) |
| 76 | +pytest packages/parser-core/tests/ -n auto |
| 77 | +``` |
| 78 | + |
| 79 | +Tests default to `not integration` — run integration tests explicitly with `-m integration`. |
| 80 | +Coverage minimum: **91%** on `bankstatements-core`. |
| 81 | + |
| 82 | +--- |
| 83 | + |
| 84 | +## Linting & Formatting |
| 85 | + |
| 86 | +Run these together before every push (CI checks all four): |
| 87 | + |
| 88 | +```bash |
| 89 | +black packages/parser-core/src packages/parser-core/tests |
| 90 | +isort packages/parser-core/src packages/parser-core/tests |
| 91 | +ruff check packages/parser-core/src packages/parser-core/tests |
| 92 | +mypy packages/parser-core/src |
| 93 | +``` |
| 94 | + |
| 95 | +For `parser-free`, run isort **from within `packages/parser-free/`** — CI sort order differs from root. |
| 96 | + |
| 97 | +**Black gotcha:** Black collapses a multi-line `raise`/`return` onto one line whenever the result fits in 88 characters. Write such statements on a single line from the start so Black has nothing to reformat: |
| 98 | +- `raise ValueError(f"...")`, not a multi-line form |
| 99 | +- `raise TypeError(f"...")`, not a multi-line form |
| 100 | + |
| 101 | +**Logging:** use `%`-formatting, not f-strings — enforced by ruff rule G004. |
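The two conventions above (single-line `raise`, `%`-style logging) can be illustrated together. This is a minimal sketch only: `parse_page_count` and the logger name are hypothetical and do not come from the codebase.

```python
import logging

logger = logging.getLogger("bankstatements_example")  # hypothetical logger name


def parse_page_count(raw: str, path: str) -> int:
    """Hypothetical helper, used only to illustrate the two conventions."""
    if not raw.isdigit():
        # Short enough for 88 characters, so it is written on one line up front.
        raise ValueError(f"Invalid page count {raw!r} in {path}")
    pages = int(raw)
    # %-style arguments are formatted lazily by the logging module.
    # The f-string form, logger.info(f"Read {pages} pages"), would trip ruff G004.
    logger.info("Read %d pages from %s", pages, path)
    return pages
```

Note that f-strings remain fine inside exception messages; G004 targets logging calls only.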
| 102 | + |
| 103 | +--- |
| 104 | + |
| 105 | +## Make Targets |
| 106 | + |
| 107 | +```bash |
| 108 | +make docker-local # build from source + run |
| 109 | +make docker-remote # pull production image + run |
| 110 | +make docker-build # build only |
| 111 | +make docker-integration # snapshot-based Docker integration test |
| 112 | +make docker-scan-trivy # trivy HIGH/CRITICAL scan |
| 113 | +make docker-secure-run # network-isolated (GDPR mode) |
| 114 | +``` |
| 115 | + |
| 116 | +--- |
| 117 | + |
| 118 | +## Version Bumping |
| 119 | + |
| 120 | +Three files must always match — CI compares them and fails on mismatch: |
| 121 | + |
| 122 | +1. `packages/parser-core/pyproject.toml` → `version = "x.y.z"` |
| 123 | +2. `packages/parser-core/src/bankstatements_core/__version__.py` |
| 124 | +3. `packages/parser-free/pyproject.toml` → `version = "x.y.z"` |
| 125 | + |
| 126 | +```bash |
| 127 | +make version-bump-patch # bump x.x.N |
| 128 | +make version-bump-minor # bump x.N.0 |
| 129 | +make version-bump-major # bump N.0.0 |
| 130 | +``` |
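To check the three-file rule locally before pushing, something like the following could work. It is a sketch, not the check CI actually runs: the file paths come from the list above, while the regexes assume each file holds a plain `version = "x.y.z"` or `__version__ = "x.y.z"` line, and the function names are made up for illustration.

```python
import re
from pathlib import Path

# Assumed line shapes; adjust if the real files are formatted differently.
PYPROJECT_RE = r'^version\s*=\s*"([^"]+)"'
DUNDER_RE = r'^__version__\s*=\s*"([^"]+)"'

VERSION_FILES = {
    "packages/parser-core/pyproject.toml": PYPROJECT_RE,
    "packages/parser-core/src/bankstatements_core/__version__.py": DUNDER_RE,
    "packages/parser-free/pyproject.toml": PYPROJECT_RE,
}


def read_versions(repo_root: Path) -> dict:
    """Return {relative path: version string} for the three version files."""
    versions = {}
    for rel_path, pattern in VERSION_FILES.items():
        text = (repo_root / rel_path).read_text(encoding="utf-8")
        match = re.search(pattern, text, flags=re.MULTILINE)
        if match is None:
            raise ValueError(f"No version string found in {rel_path}")
        versions[rel_path] = match.group(1)
    return versions


def versions_match(versions: dict) -> bool:
    """True when every file reports the same version."""
    return len(set(versions.values())) == 1
```

Calling `versions_match(read_versions(Path(".")))` from the repo root would then report whether the three files agree.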
| 131 | + |
| 132 | +--- |
| 133 | + |
| 134 | +## Creating Pull Requests |
| 135 | + |
| 136 | +**Never push directly to `main`.** Always create a feature branch, push the branch, and open a PR. Branch protection requires CI to pass before merge. |
| 137 | + |
| 138 | +```bash |
| 139 | +git checkout -b <branch-name> |
| 140 | +git push -u origin <branch-name> |
| 141 | +``` |
| 142 | + |
| 143 | +Always use `.github/PULL_REQUEST_TEMPLATE.md`. Pass `--assignee @me` to `gh pr create` itself; assigning afterwards with `gh pr edit` fails because the token lacks the required scope. |
| 144 | + |
| 145 | +```bash |
| 146 | +gh pr create --assignee @me --title "..." --body "$(cat <<'EOF' |
| 147 | +...populated template... |
| 148 | +EOF |
| 149 | +)" |
| 150 | +``` |
| 151 | + |
| 152 | +--- |
| 153 | + |
| 154 | +## Key Architecture Notes |
| 155 | + |
| 156 | +- `ExtractionResult.card_number: str | None` — `None` = bank statement, string = credit card (last-4 suffix) |
| 157 | +- `BankTemplate.column_aliases` — renames template keys to canonical column names; `RowPostProcessor._apply_column_aliases()` is the sole owner |
| 158 | +- `CCGroupingService` in `services/card_grouping.py` — groups CC results by last-4 card suffix |
| 159 | +- `processor.run()` splits on `card_number is None`: bank → `group_by_iban`, CC → `group_by_card` |
| 160 | +- `PDFProcessingOrchestrator.process_all_pdfs()` returns `tuple[list[ExtractionResult], int, int]` → `(results, pdf_count, pages_read)` |
| 161 | +- `ServiceRegistry.from_config(ProcessorConfig, Entitlements)` is the primary factory |
| 162 | +- Credit card support is **paid tier only** via `require_iban=False` in `Entitlements.paid_tier()` |
| 163 | +- Service layer uses `list[Transaction]` throughout — no dict round-trips internally; conversion at output boundary via `transactions_to_dicts()` |
| 164 | +- Architecture test (`test_architecture.py`) enforces module placement and bans circular imports |
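The `card_number is None` split described in the notes above can be sketched as follows. The dataclass is a simplified stand-in for the real `ExtractionResult` (only `card_number` is taken from the notes; `source` is invented), and `split_results` and `group_by_card_suffix` are hypothetical names, not the actual `processor.run()` or `CCGroupingService` code.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ExtractionResult:
    """Simplified stand-in; the real model carries more fields."""

    source: str  # hypothetical field, for illustration only
    card_number: str | None = None  # None = bank statement, str = card last-4


def split_results(
    results: list[ExtractionResult],
) -> tuple[list[ExtractionResult], list[ExtractionResult]]:
    """Partition results into (bank statements, credit card statements)."""
    bank = [r for r in results if r.card_number is None]
    cards = [r for r in results if r.card_number is not None]
    return bank, cards


def group_by_card_suffix(
    cards: list[ExtractionResult],
) -> dict[str, list[ExtractionResult]]:
    """Group credit card results by their last-4 suffix."""
    groups: dict[str, list[ExtractionResult]] = {}
    for result in cards:
        if result.card_number is None:
            continue  # bank statements are grouped elsewhere (by IBAN)
        groups.setdefault(result.card_number, []).append(result)
    return groups
```

Testing `is None` explicitly, rather than truthiness, keeps an empty-string card number on the credit card side of the split.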
| 165 | + |
| 166 | +--- |
| 167 | + |
| 168 | +## CI Workflows |
| 169 | + |
| 170 | +| Workflow | File | Trigger | |
| 171 | +|---|---|---| |
| 172 | +| Main CI | `ci.yml` | push/PR to main | |
| 173 | +| Release (root) | `release.yml` | tag push | |
| 174 | +| Release (core) | `release-core.yml` | tag push | |
| 175 | +| Security scan | `security-scan.yml` | schedule + push | |
| 176 | +| Boundary check | `boundary-check.yml` | push/PR | |
| 177 | +| PR labeler | `pr-labeler.yml` | PR open/sync | |
| 178 | + |
| 179 | +CI enforces: ruff, black, isort, mypy, pylint design gates (Xenon), bandit, pip-audit, trivy (0 critical), coverage ≥ 91%. |
| 180 | + |
| 181 | +**Security:** workflows use quoted shell variables and avoid `${{ github.* }}` interpolation directly in `run:` steps to prevent shell injection (hardened in PRs #168–#171). The production image runs `apt-get upgrade -y` on every build to pull the latest Debian patches. |
| 182 | + |
| 183 | +--- |
| 184 | + |
| 185 | +## Open Issues |
| 186 | + |
| 187 | +- **#59** — Docker integration CI job (blocked — needs fake PDFs; local tooling done in PR #70) |
| 188 | + |
| 189 | +--- |
| 190 | + |
| 191 | +## Gitignored Files (never commit) |
| 192 | + |
| 193 | +- `HANDOFF.md`, `MEMORY.md` |
| 194 | +- `.env.local` (may contain tokens) |
| 195 | +- `logs/processing_activity.jsonl` |
| 196 | +- `input/`, `output/` contents |