server: semantic similarity queries via sqlite-vec + canonical scoring via cq-schema #247
javiermtorres wants to merge 20 commits into
@copilot check if conflicts can be solved easily
Conflicts resolved in f7e3464. Here's what was needed:
Also added
@copilot rebase from origin/main and solve conflicts.
Done in 45f715e. The branch is now rebased on origin/main — the diff from main contains exactly our 3 intended files:
Also fixed an unintended regression where
Since the SDK and server stores were split at some point during the development of this PR, the agent tried to reconcile changes and this resulted in unwanted functionality.
45f715e to 588ae9a
Removed the SDK changes and restricted semsearch to the server store.
Pull request overview
This PR adds an optional semantic-search path to the server’s SQLite-backed RemoteStore using sqlite-vec + a remote embedding endpoint, and refactors the server store/API/tests to use async store methods so vector/HTTP work can be awaited during indexing and querying.
Changes:
- Introduces `cq_server.semsearch` for embedding generation, sqlite-vec schema management, and vector distance queries.
- Refactors `RemoteStore` APIs (and FastAPI routes + tests) to be `async` and integrates semantic upsert/query behavior.
- Updates server packaging/tooling to install semsearch deps via an optional extra and updates setup script target.
Reviewed changes
Copilot reviewed 13 out of 14 changed files in this pull request and generated 14 comments.
| File | Description |
|---|---|
| server/backend/src/cq_server/semsearch.py | New semantic search helper module (embedding fetch + sqlite-vec integration). |
| server/backend/src/cq_server/store/__init__.py | Integrates semsearch into RemoteStore, converts many methods to async, and changes query scoring behavior. |
| server/backend/src/cq_server/app.py | Makes API routes async and awaits store operations. |
| server/backend/src/cq_server/review.py | Makes review endpoints async where they call async store methods. |
| server/backend/src/cq_server/auth.py | Makes auth endpoints async where they call async store methods. |
| server/backend/tests/test_app.py | Converts tests to async and adds semsearch-specific query tests/expectations. |
| server/backend/tests/test_store.py | Converts tests to async and adjusts confidence expectations based on query-time scoring. |
| server/backend/tests/test_review.py | Converts tests to async to align with async auth/store calls. |
| server/backend/tests/test_queries.py | Converts tests to async to align with async store calls. |
| server/backend/tests/test_migrations.py | Converts seeding helper to async; currently has a missing await. |
| server/backend/tests/test_auth.py | Converts tests/helpers to async for async auth/store. |
| server/backend/pyproject.toml | Adds semsearch optional extra deps; adjusts pytest logging config. |
| Makefile | Installs server backend dev deps with --extra semsearch. |
        row_data_by_id.get(unit.id, ()),
        field_logits,
    )
    unit.evidence.confidence = combined_relevance  # Update confidence to combined score for sorting
query() mutates the returned KnowledgeUnit by overwriting unit.evidence.confidence with a query-time relevance score. This breaks the semantics of confidence (which is validated as 0..1 in the model), can exceed bounds or become negative, and can leak into persistence if a caller later passes the mutated unit back to update() (tests here do exactly that). Keep confidence unchanged and use a separate local score for sorting (or return a (unit, score) structure at an API boundary that can represent ranking).
Still valid. During the merge the mutation moved to _sqlite.py:_score_semsearch_rows (unit.evidence.confidence = combined). Same concern: blends ranking score with the model's persisted 0..1 invariant, and a caller round-tripping a query() result through update() would persist the modulated value. The new test_query_returns_best_result_with_domain_overlap depends on the mutation for relative-order assertions, so resolving requires re-shaping that test too. Tracking together with #3153159271.
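One way to address the concern above is to carry the ranking score next to the unit instead of writing it into `confidence`. The following is a minimal sketch, not the PR's code: `Evidence`, `KnowledgeUnit`, `ScoredUnit`, and `rank` are hypothetical stand-ins defined here for illustration, and the score is simplified to `relevance * confidence`.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    confidence: float  # persisted value, expected to stay within 0..1


@dataclass
class KnowledgeUnit:  # hypothetical stand-in, not the real model
    id: str
    evidence: Evidence


@dataclass(frozen=True)
class ScoredUnit:
    """Pairs a unit with its query-time score without touching the unit."""

    unit: KnowledgeUnit
    score: float


def rank(units, relevance_by_id):
    """Sort by relevance * confidence; unit.evidence.confidence is never written."""
    scored = [
        ScoredUnit(u, relevance_by_id.get(u.id, 0.0) * u.evidence.confidence)
        for u in units
    ]
    return sorted(scored, key=lambda s: s.score, reverse=True)


units = [KnowledgeUnit("a", Evidence(0.5)), KnowledgeUnit("b", Evidence(0.9))]
ranked = rank(units, {"a": 0.8, "b": 0.3})
```

With this shape, a caller that round-trips a query result through `update()` would persist the original confidence, because the ranking score lives only on the wrapper.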
for field_idx, logit_map in field_logits.items():
    if field_idx < len(row_data):
        value = row_data[field_idx]
        logit = logit_map.get(value, 0.0)
        # Use (1.0 + logit) to keep positive: high logits boost, low logits diminish
        combined *= 1.0 + logit
combined *= 1.0 + logit can flip the score negative when logit < -1 (e.g., if v/mean_val > e with invert=True). If this score is later exposed as/derived from confidence, it can violate expected non-negativity and ordering. Consider using a strictly-positive transform (e.g., combined *= exp(logit)), clamping, or otherwise bounding the adjustment so scores remain well-behaved.
Still valid. The same multiplicative form lives in semsearch.compute_combined_relevance (moved during the merge from the deleted monolithic store/__init__.py to semsearch.py). If logit < -1 the score flips sign; needs real embeddings to trigger but matters for ranking robustness. Tracking.
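The sign flip is easy to reproduce in isolation. The sketch below compares the multiplicative `1.0 + logit` form with the `exp(logit)` transform suggested in the review; `modulate_linear` and `modulate_exp` are hypothetical helper names, and the logit values are made up for illustration.

```python
import math


def modulate_linear(base, logits):
    """Multiply by (1 + logit) per field; goes negative once a logit is below -1."""
    score = base
    for logit in logits:
        score *= 1.0 + logit
    return score


def modulate_exp(base, logits):
    """Multiply by exp(logit) per field; strictly positive for any finite logit."""
    return base * math.exp(sum(logits))


logits = [0.2, -1.5]  # illustrative values; the second is below -1
linear = modulate_linear(0.5, logits)
exponential = modulate_exp(0.5, logits)
```

Both forms boost on positive logits and diminish on negative ones, but only the exponential form preserves the sign of the base score, so ordering stays meaningful.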
 # Query returns the unit.
 resp = client.get(
     "/query",
     params={"domains": ["api", "payments"], "languages": ["python"]},
 )
 assert len(resp.json()) == 1
-assert resp.json()[0]["evidence"]["confidence"] == 0.5
+assert resp.json()[0]["evidence"]["confidence"] == pytest.approx(0.35)

 # Confirm boosts confidence.
 resp = client.post(f"/confirm/{unit_id}")
 assert resp.status_code == 200

 resp = client.get("/query", params={"domains": ["api", "payments"]})
-assert resp.json()[0]["evidence"]["confidence"] == pytest.approx(0.6)
+assert resp.json()[0]["evidence"]["confidence"] == pytest.approx(0.33)

 # Flag reduces confidence.
 resp = client.post(f"/flag/{unit_id}", json={"reason": "stale"})
 assert resp.status_code == 200

 resp = client.get("/query", params={"domains": ["api", "payments"]})
 result = resp.json()[0]
-assert result["evidence"]["confidence"] == pytest.approx(0.45)
+assert result["evidence"]["confidence"] == pytest.approx(0.2475)
These assertions assume /query returns a modified evidence.confidence that reflects query-time ranking (0.35/0.33/0.2475). That makes the test (and API behavior) dependent on semsearch enablement and on the ranking formula, and it conflates persisted evidence confidence with retrieval score. Prefer asserting the returned unit IDs/order and keep confidence checks tied to the persisted confidence updates from confirm/flag (which should stay within 0..1).
Still partially valid. The brittle exact-value assertions are gone, but the new test_query_returns_best_result_with_domain_overlap still asserts on result['evidence']['confidence'] returned from /query, which depends on the mutation in _sqlite.py:_score_semsearch_rows. Same root cause as #3153158840; tracked together.
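A test shaped along the lines suggested above could assert on result order and treat `confidence` only as a bounded value. This is a rough sketch under assumptions: `assert_ranked` is a hypothetical helper, the `id` key is assumed to be present in each result, and the payload is invented rather than taken from a real `/query` response.

```python
def assert_ranked(results, expected_ids):
    """Check result order by ID, and only sanity-check confidence bounds."""
    ids = [item["id"] for item in results]
    assert ids == expected_ids, f"unexpected order: {ids}"
    for item in results:
        confidence = item["evidence"]["confidence"]
        assert 0.0 <= confidence <= 1.0, f"confidence out of range: {confidence}"


# Hypothetical /query payload; real responses carry more fields.
payload = [
    {"id": "ku-payments", "evidence": {"confidence": 0.6}},
    {"id": "ku-api", "evidence": {"confidence": 0.45}},
]
assert_ranked(payload, ["ku-payments", "ku-api"])
```

A test written this way would not need to change when the ranking formula is tuned, as long as the relative order of the fixtures holds.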
Brings in PR #337 (server: implement SqliteStore against async Store protocol), which absorbs the async-store + SqliteStore work this branch was carrying inline. Reconciles by:
- Dropping the branch's monolithic store/__init__.py in favour of main's thin re-export + dedicated _sqlite.py implementation.
- Wiring the semsearch hooks (load, ensure_schema, upsert_unit, combined_query) into the new SqliteStore: load on every pooled DBAPI connect via the SQLAlchemy event, ensure_schema during __init__, upsert_unit after insert/update, and combined_query + logit-modulated scoring in the query path when semsearch is enabled.
- Moving the field-logit modulation helpers (build_field_logits, compute_combined_relevance) from the old store body into semsearch.py where they belong; dropping the debug print() calls that had crept in.
- Taking main's sync test pattern (sync test methods + asyncio.run bridge for store calls) over the branch's async test conversion; carrying over the two semsearch-specific tests in test_app.py adapted to the sync style.
305 backend tests pass; the embedding-server-dependent test_query_returns_best_result_with_domain_overlap is correctly skipped when TOKEN_EMBEDDING_URL is unset. Lint clean.
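For readers unfamiliar with the sync test pattern mentioned in the commit message, it might look roughly like the following. This is a sketch only: `FakeAsyncStore` and `run` are hypothetical names, and the in-memory store stands in for the real `SqliteStore`, whose methods are not reproduced here.

```python
import asyncio


class FakeAsyncStore:
    """In-memory stand-in for an async store (hypothetical, for illustration)."""

    def __init__(self):
        self._units = {}

    async def insert(self, unit_id, payload):
        await asyncio.sleep(0)  # yield once, as a real I/O call would
        self._units[unit_id] = payload

    async def get(self, unit_id):
        await asyncio.sleep(0)
        return self._units.get(unit_id)


def run(coro):
    """Bridge: drive one coroutine to completion from synchronous test code."""
    return asyncio.run(coro)


def test_insert_then_get():
    # Plain sync test method; each store call goes through the bridge.
    store = FakeAsyncStore()
    run(store.insert("ku-1", {"domain": "api"}))
    assert run(store.get("ku-1")) == {"domain": "api"}


test_insert_then_get()
```

Each `asyncio.run` call starts a fresh event loop, so this pattern assumes the store holds no loop-bound resources between calls.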
Closes #332. cq-sdk 0.10.0 reads its scoring constants from cq-schema's canonical scoring.values.json, removing the drift hazard that the server's hand-mirrored cq_server.scoring carried. With this change the server consumes those constants transitively via the SDK rather than holding its own copy.
- bump cq-sdk pin from ~=0.9.1 to ~=0.10.0 (pulls cq-schema 0.0.1)
- swap `from .scoring` / `from cq_server.scoring` imports in app.py, store/_sqlite.py, tests/test_store.py to `from cq.scoring`
- delete server/backend/src/cq_server/scoring.py (89 lines); the SDK's cq.scoring is the canonical Python implementation
- delete server/backend/tests/test_scoring.py (65 lines); the same pattern-boost / signal-overlap behaviours are already covered by sdk/python/tests/test_scoring.py against the canonical implementation
- add server/backend/tests/test_schema_oracle.py: validates that KnowledgeUnits as serialised by the server (model_dump_json with exclude_none=True, matching the SqliteStore persistence path) round-trip through cq_schema.load_schema("knowledge_unit"). Mirrors sdk/python/tests/test_schema_oracle.py
- add jsonschema>=4.23.0 to the tests dependency group for the oracle
302 backend tests pass (previously 305: -5 from deleting test_scoring, +2 from the oracle); 1 skip remains for the embedding-server-dependent semsearch test. Lint clean.
Behaviour note: cq.scoring.calculate_relevance clamps the result to [0, 1]; the deleted local copy did not. In practice this is a no-op (Jaccard ∈ [0, 1] and the four weights sum to 1.0, so the unclamped result is already in range), but the clamp removes a class of latent bug from any future weight tweak that doesn't preserve the sum.
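The behaviour note can be illustrated with a toy weighted-Jaccard score. This is a sketch, not the `cq.scoring` implementation: `relevance_sketch`, the field names, and the weight values are all assumptions chosen so that four weights sum to 1.0.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections; always within [0, 1]."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical weights; the real values live in cq-schema's scoring.values.json.
WEIGHTS = {"domains": 0.4, "languages": 0.3, "frameworks": 0.2, "patterns": 0.1}


def relevance_sketch(unit, query, weights=WEIGHTS):
    """Weighted sum of per-field Jaccard scores, clamped to [0, 1]."""
    raw = sum(
        weight * jaccard(unit.get(name, ()), query.get(name, ()))
        for name, weight in weights.items()
    )
    return max(0.0, min(1.0, raw))


unit = {"domains": ["api"], "languages": ["python"], "frameworks": ["fastapi"], "patterns": ["retry"]}
in_range = relevance_sketch(unit, unit)  # weights sum to 1.0: clamp is a no-op

# A future tweak that breaks the sum (here 1.2) is caught by the clamp.
skewed = {"domains": 0.6, "languages": 0.3, "frameworks": 0.2, "patterns": 0.1}
clamped = relevance_sketch(unit, unit, skewed)
```

With the balanced weights the clamp never fires; with the skewed set the raw score would exceed 1 and the clamp pins it at the upper bound.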
…ver README Agent-Logs-Url: https://github.com/mozilla-ai/cq/sessions/61a0b30e-050e-4ada-a8df-e34dafa13d75 Co-authored-by: javiermtorres <4608715+javiermtorres@users.noreply.github.com>
Agent-Logs-Url: https://github.com/mozilla-ai/cq/sessions/c4c9b17a-e6cb-418f-a37e-eaa67609478d Co-authored-by: javiermtorres <4608715+javiermtorres@users.noreply.github.com>
491dab7 to 27b3671
Adds optional semantic similarity to the server's `/query` path using sqlite-vec and remote token embeddings, and migrates the server's scoring constants onto the canonical `cq-schema` source via the cq-sdk.

Closes #21.
Closes #332.

Semantic search (#21)

When `TOKEN_EMBEDDING_URL` is set and the `semsearch` extra is installed (`sqlite-vec`, `numpy`, `httpx`), `SqliteStore`:
- loads the sqlite-vec extension on every pooled DBAPI `connect` event;
- ensures the `knowledge_units_vec` table on schema bootstrap;
- upserts the unit into the vector table after `insert`/`update`;
- in `query()`, fetches the per-domain candidate set joined with cosine distance via `combined_query`, then ranks units by `relevance × confidence × Π(1 + logit(field))` so semantically closer units boost their score and far ones diminish it.

When the env var is unset (or the extra isn't installed) the path short-circuits via `semsearch.is_enabled()` and behaviour matches `origin/main` exactly. CI runs without the embedding server, so the modulation-specific test (`test_query_returns_best_result_with_domain_overlap`) skips by design.

Scoring canonicalisation (#332)

cq-sdk 0.10.0 reads its scoring constants from cq-schema's canonical `scoring.values.json`. With this change the server consumes those constants transitively via the SDK rather than holding its own copy.
- Bump the `cq-sdk` pin from `~=0.9.1` to `~=0.10.0` (pulls `cq-schema 0.0.1`).
- Swap `from .scoring` imports in `app.py`, `store/_sqlite.py`, `tests/test_store.py` to `from cq.scoring`.
- Delete `server/backend/src/cq_server/scoring.py` and `server/backend/tests/test_scoring.py` (the SDK's `cq.scoring` is now the canonical Python implementation; SDK tests cover the same behaviours).
- Add `server/backend/tests/test_schema_oracle.py` to validate that server-serialised KnowledgeUnits round-trip through `cq_schema.load_schema("knowledge_unit")`.

Behaviour note: `cq.scoring.calculate_relevance` clamps to `[0, 1]`; the deleted local copy did not. In practice this is a no-op (Jaccard ∈ [0, 1] and the four weights sum to 1.0), but it removes a class of latent bug from any future weight tweak.

Test plan

- `make test-server-backend`: 302 passed, 1 skipped (embedding-server-dependent)
- `make lint-server-backend`
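The short-circuit described under Semantic search could be gated along these lines. This is a guess at the shape, not the contents of `semsearch.is_enabled()`: the function body, the `env` parameter, and the `REQUIRED_MODULES` tuple are assumptions; only the environment variable name and the three dependencies come from the description above.

```python
import importlib.util
import os

# Import names for the semsearch extra (sqlite-vec is imported as sqlite_vec).
REQUIRED_MODULES = ("sqlite_vec", "numpy", "httpx")


def is_enabled(env=None):
    """Return True only if the embedding URL is set and the extras are importable."""
    env = os.environ if env is None else env
    if not env.get("TOKEN_EMBEDDING_URL"):
        return False
    # find_spec returns None for a missing top-level module instead of raising.
    return all(importlib.util.find_spec(name) is not None for name in REQUIRED_MODULES)


# With no URL configured, the semantic path is off regardless of installed packages.
disabled = is_enabled({})
```

Taking the environment as an optional mapping keeps the check testable without mutating `os.environ`; a test that needs the embedding server could skip itself whenever this returns `False`.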