Codecov Report

❌ Patch coverage is 18.00%. Your patch status has failed because the patch coverage (18.00%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage.

Coverage Diff (main vs. #379):

| | main | #379 | +/- |
|---|---|---|---|
| Coverage | 56.42% | 54.85% | -1.58% |
| Files | 60 | 61 | +1 |
| Lines | 5366 | 5589 | +223 |
| Branches | 484 | 525 | +41 |
| Hits | 3028 | 3066 | +38 |
| Misses | 2292 | 2477 | +185 |
| Partials | 46 | 46 | |
Pull request overview
This PR completes missing parts of the one-shot schema extraction plugin by adding LLM-generated website keywords, heuristic keyword generation, and schema-based text extraction, and adds a CLI example for running COMPASS against known local documents.
Changes:
- Add LLM-driven generators + caching for query templates, website keywords, and heuristic keyword lists in the one-shot plugin.
- Implement schema-based text extraction (structured output) and update plugin/extractor call paths accordingly (see the sketch after this list).
- Add a CLI “parse existing docs” example and wire it into the Sphinx examples index.
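
For concreteness, here is a minimal sketch of what schema-based, verbatim-or-null text extraction can look like. The schema fields, prompt wording, and the `llm.call` helper are illustrative assumptions, not the plugin's actual `extract_text.json5` schema or call path:

```python
# Hypothetical JSON schema: the model must return either a verbatim
# excerpt or null, never a paraphrase. Field names are illustrative.
EXTRACT_TEXT_SCHEMA = {
    "type": "object",
    "properties": {
        "text": {
            "anyOf": [{"type": "string"}, {"type": "null"}],
            "description": "Verbatim excerpt answering the question, or null if absent.",
        }
    },
    "required": ["text"],
}


async def extract_text(llm, document: str, question: str) -> str | None:
    """Ask the LLM for a verbatim excerpt, constrained by the schema above.

    ``llm`` is assumed to expose an async ``call(prompt, schema)`` method that
    returns the parsed JSON object; this is a stand-in, not COMPASS's API.
    """
    prompt = (
        f"{question}\n\nReturn only text copied verbatim from the document "
        f"below, or null if no relevant text exists.\n\n{document}"
    )
    response = await llm.call(prompt, schema=EXTRACT_TEXT_SCHEMA)
    return response.get("text")
```

Constraining the output to a verbatim string or null makes paraphrased or hallucinated extractions easy to reject downstream.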
Reviewed changes
Copilot reviewed 21 out of 22 changed files in this pull request and generated 8 comments.
Summary per file:
| File | Description |
|---|---|
| examples/parse_existing_docs/CLI/local_docs_minimal.json5 | Adds minimal local-doc mapping example (currently references a non-existent PDF filename). |
| examples/parse_existing_docs/CLI/local_docs.json5 | Adds fuller local-doc mapping example with metadata fields. |
| examples/parse_existing_docs/CLI/jurisdictions.csv | Adds sample jurisdictions input for the local-docs CLI run. |
| examples/parse_existing_docs/CLI/config.json5 | Adds sample run config demonstrating known_local_docs + disabled search. |
| examples/parse_existing_docs/CLI/README.rst | Adds CLI walkthrough for processing local docs (contains a couple of typos). |
| examples/one_shot_schema_extraction/plugin_config_simple.json5 | Updates config option name + enables heuristic keyword auto-generation. |
| examples/one_shot_schema_extraction/plugin_config.yaml | Refreshes website keywords and adds heuristic keyword lists example. |
| examples/one_shot_schema_extraction/README.rst | Updates option name and fixes a doc link. |
| docs/source/examples/index.rst | Adds the “parse existing docs via CLI” example to the docs toctree. |
| compass/services/threaded.py | Adjusts jurisdiction document info dumping (currently breaks filename reporting for local docs). |
| compass/plugin/ordinance.py | Refactors text extractors to be direct LLM callers; updates usage labeling + call path. |
| compass/plugin/one_shot/schemas/website_keywords.json5 | Adds schema for LLM-generated website keyword weights. |
| compass/plugin/one_shot/schemas/heuristic_keywords.json5 | Adds schema for LLM-generated heuristic keyword lists. |
| compass/plugin/one_shot/schemas/extract_text.json5 | Adds schema for structured-output text extraction (verbatim or null). |
| compass/plugin/one_shot/generators.py | Adds website keyword + heuristic keyword generators and keyword normalization/deduping (see the normalization sketch after the table). |
| compass/plugin/one_shot/components.py | Implements schema-based text extractor/collector components (has a prompt typo). |
| compass/plugin/one_shot/cache.py | Adds a disk cache for LLM-generated content (hashing is not stable; see the stable-key sketch after the table). |
| compass/plugin/one_shot/base.py | Wires in new generators, caching, heuristic support, and schema-based text extraction. |
| compass/plugin/noop.py | Removes legacy llm_caller init pattern for NoOp text extractor. |
| compass/plugin/interface.py | Updates text extraction instantiation and uses async get_heuristic() in filtering. |
| compass/extraction/apply.py | Improves attempt-count logging format for ngram-checked extraction retries. |
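
The keyword normalization/deduping mentioned for `generators.py` could, in the simplest case, look like the sketch below; the exact rules (casing, whitespace handling) are assumptions, not the module's actual behaviour:

```python
def normalize_keywords(keywords: list[str]) -> list[str]:
    """Lowercase, collapse whitespace, and drop duplicates while preserving order.

    A minimal sketch of what "normalization/deduping" for LLM-generated
    keyword lists could look like; generators.py's real rules may differ.
    """
    seen: set[str] = set()
    out: list[str] = []
    for kw in keywords:
        cleaned = " ".join(kw.lower().split())  # trim and collapse internal whitespace
        if cleaned and cleaned not in seen:
            seen.add(cleaned)
            out.append(cleaned)
    return out
```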
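Regarding the note that hashing in `cache.py` is not stable: one common remedy is to derive the cache key from a canonical serialization hashed with sha256, rather than Python's built-in `hash()`, which is salted per interpreter run for strings and so produces different keys across processes. The function names and key contents below are illustrative assumptions, not the plugin's actual cache API:

```python
import hashlib
import json
from pathlib import Path


def stable_cache_key(model: str, prompt: str, schema: dict) -> str:
    """Content-addressed key that is identical across runs and processes.

    Hashing a canonical JSON dump with sha256 avoids the per-process salting
    of Python's built-in ``hash()`` for str/bytes, which would otherwise make
    a disk cache silently miss after every restart.
    """
    payload = json.dumps(
        {"model": model, "prompt": prompt, "schema": schema},
        sort_keys=True,
        ensure_ascii=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def cached_path(cache_dir: Path, key: str) -> Path:
    """Location of the cached LLM response for ``key``."""
    return cache_dir / f"{key}.json"
```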
Add missing components, including LLM-generated keywords, heuristic keyword generation, and text extraction.