server: semantic similarity queries via sqlite-vec + canonical scoring via cq-schema by javiermtorres · Pull Request #247 · mozilla-ai/cq

javiermtorres · 2026-04-07T13:52:39Z

Adds optional semantic similarity to the server's /query path using sqlite-vec and remote token embeddings, and migrates the server's scoring constants onto the canonical cq-schema source via the cq-sdk.

Closes #21.
Closes #332.

Semantic search (#21)

When TOKEN_EMBEDDING_URL is set and the semsearch extra is installed (sqlite-vec, numpy, httpx), SqliteStore:

loads sqlite-vec into every pooled DBAPI connection via the SQLAlchemy connect event;
creates a knowledge_units_vec table on schema bootstrap;
upserts an embedding row for each unit on insert/update;
in query(), fetches the per-domain candidate set joined with cosine distance via combined_query, then ranks units by relevance × confidence × Π(1 + logit(field)), so semantically closer units get a boosted score and distant ones a diminished one.
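
The ranking rule in the last step can be sketched as follows. This is a minimal illustration rather than the PR's code: the function name `combined_score` and the sample logit values are assumptions made for the example.

```python
import math


def combined_score(relevance: float, confidence: float, field_logits: list[float]) -> float:
    """Return relevance * confidence * product of (1 + logit) over the per-field logits."""
    modulation = math.prod(1.0 + logit for logit in field_logits)
    return relevance * confidence * modulation


# With no logits the score is just relevance * confidence.
base = combined_score(0.8, 0.5, [])
# A positive logit (semantically close) boosts the score.
near = combined_score(0.8, 0.5, [0.25])
# A negative logit (semantically far) diminishes it.
far = combined_score(0.8, 0.5, [-0.25])
print(base, near, far)
```

Note that the factor (1 + logit) only stays positive while logit > -1, which is the concern raised later in the review.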

When the env var is unset (or the extra isn't installed) the path short-circuits via semsearch.is_enabled() and behaviour matches origin/main exactly. CI runs without the embedding server, so the modulation-specific test (test_query_returns_best_result_with_domain_overlap) skips by design.
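
A gate of this kind might look roughly like the sketch below. The body of the real semsearch.is_enabled() is not shown in this PR description, so everything here is an assumption: the `env` and `required` parameters exist only to make the example testable, and the module names are inferred from the packages listed for the extra.

```python
import importlib.util
import os

# Assumed import names for the packages in the semsearch extra.
_REQUIRED_MODULES = ("sqlite_vec", "numpy", "httpx")


def is_enabled(env=None, required=_REQUIRED_MODULES) -> bool:
    """Return True only if the embedding URL is set and every required module is importable."""
    env = os.environ if env is None else env
    if not env.get("TOKEN_EMBEDDING_URL"):
        return False
    return all(importlib.util.find_spec(name) is not None for name in required)


# Unset variable: the semantic path is skipped regardless of installed packages.
print(is_enabled(env={}))
```

Checking availability with `importlib.util.find_spec` avoids importing the optional packages just to decide whether the feature is on.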

Scoring canonicalisation (#332)

cq-sdk 0.10.0 reads its scoring constants from cq-schema's canonical scoring.values.json. With this change the server consumes those constants transitively via the SDK rather than holding its own copy.

bump cq-sdk pin from ~=0.9.1 to ~=0.10.0 (pulls cq-schema 0.0.1)
swap `from .scoring` imports in app.py, store/_sqlite.py, tests/test_store.py to `from cq.scoring`
delete server/backend/src/cq_server/scoring.py and server/backend/tests/test_scoring.py (the SDK's cq.scoring is now the canonical Python implementation; SDK tests cover the same behaviours)
add server/backend/tests/test_schema_oracle.py to validate that server-serialised KnowledgeUnits round-trip through cq_schema.load_schema("knowledge_unit")

Behaviour note: cq.scoring.calculate_relevance clamps to [0, 1]; the deleted local copy did not. In practice a no-op (Jaccard ∈ [0,1] and the four weights sum to 1.0), but it removes a class of latent bug from any future weight tweak.
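
The reasoning behind "in practice a no-op" can be illustrated with a small sketch. The weight names and values below are hypothetical (the real constants live in scoring.values.json and are not reproduced here); the only property carried over from the note is that there are four weights summing to 1.0.

```python
def clamp01(value: float) -> float:
    """Clamp a value into the closed interval [0, 1]."""
    return max(0.0, min(1.0, value))


def relevance(similarities: dict[str, float], weights: dict[str, float], clamp: bool = True) -> float:
    """Weighted sum of per-field Jaccard similarities, optionally clamped."""
    raw = sum(weights[name] * similarities[name] for name in weights)
    return clamp01(raw) if clamp else raw


# Hypothetical weights: four values that sum to 1.0.
WEIGHTS = {"domains": 0.4, "languages": 0.3, "frameworks": 0.2, "pattern": 0.1}
# A future tweak that accidentally pushes the sum to 1.3.
TWEAKED = {**WEIGHTS, "domains": 0.7}

perfect = {name: 1.0 for name in WEIGHTS}  # every Jaccard similarity at its maximum
print(relevance(perfect, WEIGHTS, clamp=False))   # stays within [0, 1] without clamping
print(relevance(perfect, TWEAKED, clamp=False))   # exceeds 1 without clamping
print(relevance(perfect, TWEAKED))                # clamped back to 1.0
```

With weights that sum to 1.0 the clamp never fires; it only matters once a weight change breaks that invariant.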

Test plan

make test-server-backend — 302 passed, 1 skipped (embedding-server-dependent)
make lint-server-backend
Manual smoke against a live embedding endpoint (out of CI scope)

Summary by CodeRabbit

New Features
- Optional semantic search added (disabled by default) to improve knowledge discovery when enabled.
Documentation
- Backend README documents semantic search activation, required env vars and expected behaviour.
Tests
- New and extended unit/integration tests covering semantic-search scoring, insertion and query routing; conditional e2e checks when embeddings available.
Chores
- Optional “semsearch” extra added, test logging/pytest tweaks, and dev sync target updated to include the semsearch extra.
CI
- New CI workflow to run lint, schema checks and embedding-enabled tests.

javiermtorres · 2026-04-07T13:53:15Z

@copilot check if conflicts can be solved easily

Copilot · 2026-04-07T14:12:47Z

Conflicts resolved in f7e3464. Here's what was needed:

plugins/cq/server/ (modify/delete): Main deleted the Python MCP server (replaced by Go CLI). Accepted the deletion and ported our sqlite-vec changes to the new location.
sdk/python/src/cq/store.py (content, rename-detected): Git detected that plugins/cq/server/cq_mcp/local_store.py was renamed to sdk/python/src/cq/store.py on main. Merged our sqlite-vec additions (_VEC_SCHEMA_SQL, embedding_model param, vec insert/update/delete/query logic) into main's updated version.
sdk/python/tests/test_store.py (content, rename-detected): Similarly ported embedding_model fixture and updated LocalStore constructor calls.
server/backend/uv.lock (content): Took main's pygments 2.20.0 version.
team-api/pyproject.toml (modify/delete): Accepted deletion (renamed to server/backend/pyproject.toml on main; sqlite-vec now goes via the SDK dependency).

Also added sqlite-vec==0.1.7a10 and sentence-transformers[onnx] to sdk/python/pyproject.toml.

peteski22 · 2026-04-09T19:19:48Z

@copilot rebase from origin/main and solve conflicts.

Copilot · 2026-04-09T19:25:48Z

Done in 45f715e. The branch is now rebased on origin/main — the diff from main contains exactly our 3 intended files:

sdk/python/src/cq/store.py — sqlite-vec virtual table + embedding model
sdk/python/tests/test_store.py — embedding_model fixture + updated constructors
sdk/python/pyproject.toml — sqlite-vec==0.1.7a10 and sentence-transformers[onnx] deps

Also fixed an unintended regression where server/backend/uv.lock had been modified by an old merge commit (restored to match main).

javiermtorres · 2026-04-17T07:29:15Z

Since the SDK and server stores were split at some point during the development of this PR, the agent tried to reconcile changes and this resulted in unwanted functionality.
@peteski22 as discussed, this will be implemented in the server sqlite instance using an encoderfile, behind a clean interface so other alternatives can be plugged in as well.

javiermtorres · 2026-04-24T07:21:46Z

Removed the SDK changes and confined semsearch to the server store.

Copilot

Pull request overview

This PR adds an optional semantic-search path to the server’s SQLite-backed RemoteStore using sqlite-vec + a remote embedding endpoint, and refactors the server store/API/tests to use async store methods so vector/HTTP work can be awaited during indexing and querying.

Changes:

Introduces cq_server.semsearch for embedding generation, sqlite-vec schema management, and vector distance queries.
Refactors RemoteStore APIs (and FastAPI routes + tests) to be async and integrates semantic upsert/query behavior.
Updates server packaging/tooling to install semsearch deps via an optional extra and updates setup script target.

Reviewed changes

Copilot reviewed 13 out of 14 changed files in this pull request and generated 14 comments.

File	Description
server/backend/src/cq_server/semsearch.py	New semantic search helper module (embedding fetch + sqlite-vec integration).
server/backend/src/cq_server/store/__init__.py	Integrates semsearch into `RemoteStore`, converts many methods to async, and changes query scoring behavior.
server/backend/src/cq_server/app.py	Makes API routes async and awaits store operations.
server/backend/src/cq_server/review.py	Makes review endpoints async where they call async store methods.
server/backend/src/cq_server/auth.py	Makes auth endpoints async where they call async store methods.
server/backend/tests/test_app.py	Converts tests to async and adds semsearch-specific query tests/expectations.
server/backend/tests/test_store.py	Converts tests to async and adjusts confidence expectations based on query-time scoring.
server/backend/tests/test_review.py	Converts tests to async to align with async auth/store calls.
server/backend/tests/test_queries.py	Converts tests to async to align with async store calls.
server/backend/tests/test_migrations.py	Converts seeding helper to async; currently has a missing await.
server/backend/tests/test_auth.py	Converts tests/helpers to async for async auth/store.
server/backend/pyproject.toml	Adds `semsearch` optional extra deps; adjusts pytest logging config.
Makefile	Installs server backend dev deps with `--extra semsearch`.

Copilot · 2026-04-28T10:01:13Z

+                row_data_by_id.get(unit.id, ()),
+                field_logits,
+            )
+            unit.evidence.confidence = combined_relevance  # Update confidence to combined score for sorting


query() mutates the returned KnowledgeUnit by overwriting unit.evidence.confidence with a query-time relevance score. This breaks the semantics of confidence (which is validated as 0..1 in the model), can exceed bounds or become negative, and can leak into persistence if a caller later passes the mutated unit back to update() (tests here do exactly that). Keep confidence unchanged and use a separate local score for sorting (or return a (unit, score) structure at an API boundary that can represent ranking).

Suggested change

unit.evidence.confidence = combined_relevance # Update confidence to combined score for sorting

Still valid. During the merge the mutation moved to _sqlite.py:_score_semsearch_rows (unit.evidence.confidence = combined). Same concern: blends ranking score with the model's persisted 0..1 invariant, and a caller round-tripping a query() result through update() would persist the modulated value. The new test_query_returns_best_result_with_domain_overlap depends on the mutation for relative-order assertions, so resolving requires re-shaping that test too. Tracking together with #3153159271.
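
One possible shape for the fix, sketched under assumptions: the `Unit` and `Evidence` classes below are simplified stand-ins for KnowledgeUnit, and `rank_units` is a hypothetical helper, not a function from this PR. The point is only that the ranking score travels next to the unit instead of overwriting its confidence.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    confidence: float  # persisted value, expected to stay within 0..1


@dataclass
class Unit:
    id: str
    evidence: Evidence


def rank_units(units: list[Unit], scores: dict[str, float]) -> list[tuple[Unit, float]]:
    """Pair each unit with its query-time score and sort best-first, leaving the units untouched."""
    scored = [(unit, scores[unit.id]) for unit in units]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


units = [Unit("ku1", Evidence(0.5)), Unit("ku2", Evidence(0.5))]
ranked = rank_units(units, {"ku1": 0.35, "ku2": 0.62})
ranked_ids = [unit.id for unit, _ in ranked]
print(ranked_ids)
```

A caller that later passes one of these units to update() would then persist the original confidence, and tests can assert on the order of IDs rather than on the confidence field.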

Copilot · 2026-04-28T10:01:13Z

+    for field_idx, logit_map in field_logits.items():
+        if field_idx < len(row_data):
+            value = row_data[field_idx]
+            logit = logit_map.get(value, 0.0)
+            # Use (1.0 + logit) to keep positive: high logits boost, low logits diminish
+            combined *= 1.0 + logit
+


combined *= 1.0 + logit can flip the score negative when logit < -1 (e.g., if v/mean_val > e with invert=True). If this score is later exposed as/derived from confidence, it can violate expected non-negativity and ordering. Consider using a strictly-positive transform (e.g., combined *= exp(logit)), clamping, or otherwise bounding the adjustment so scores remain well-behaved.

Still valid. The same multiplicative form lives in semsearch.compute_combined_relevance (moved during the merge from the deleted monolithic store/__init__.py to semsearch.py). If logit < -1 the score flips sign; needs real embeddings to trigger but matters for ranking robustness. Tracking.
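
The difference between the two transforms can be seen in a short sketch. The function names are made up for this comparison; only the expressions `1.0 + logit` and `exp(logit)` come from the review comment.

```python
import math


def modulate_linear(score: float, logit: float) -> float:
    """Current form: multiply by (1 + logit), which goes negative when logit < -1."""
    return score * (1.0 + logit)


def modulate_exp(score: float, logit: float) -> float:
    """Suggested alternative: multiply by exp(logit), which is always positive."""
    return score * math.exp(logit)


for logit in (-1.5, 0.0, 0.5):
    print(logit, modulate_linear(0.4, logit), modulate_exp(0.4, logit))
```

Near logit = 0 the two agree to first order, since exp(x) ≈ 1 + x for small x, so the exponential form changes the ranking mainly for the extreme values where the linear form misbehaves.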

Copilot · 2026-04-28T10:01:17Z

        # Query returns the unit.
        resp = client.get(
            "/query",
            params={"domains": ["api", "payments"], "languages": ["python"]},
        )
        assert len(resp.json()) == 1
-        assert resp.json()[0]["evidence"]["confidence"] == 0.5
+        assert resp.json()[0]["evidence"]["confidence"] == pytest.approx(0.35)

        # Confirm boosts confidence.
        resp = client.post(f"/confirm/{unit_id}")
        assert resp.status_code == 200

        resp = client.get("/query", params={"domains": ["api", "payments"]})
-        assert resp.json()[0]["evidence"]["confidence"] == pytest.approx(0.6)
+        assert resp.json()[0]["evidence"]["confidence"] == pytest.approx(0.33)

        # Flag reduces confidence.
        resp = client.post(f"/flag/{unit_id}", json={"reason": "stale"})
        assert resp.status_code == 200

        resp = client.get("/query", params={"domains": ["api", "payments"]})
        result = resp.json()[0]
-        assert result["evidence"]["confidence"] == pytest.approx(0.45)
+        assert result["evidence"]["confidence"] == pytest.approx(0.2475)


These assertions assume /query returns a modified evidence.confidence that reflects query-time ranking (0.35/0.33/0.2475). That makes the test (and API behavior) dependent on semsearch enablement and on the ranking formula, and it conflates persisted evidence confidence with retrieval score. Prefer asserting the returned unit IDs/order and keep confidence checks tied to the persisted confidence updates from confirm/flag (which should stay within 0..1).

Still partially valid. The brittle exact-value assertions are gone, but the new test_query_returns_best_result_with_domain_overlap still asserts on result['evidence']['confidence'] returned from /query, which depends on the mutation in _sqlite.py:_score_semsearch_rows. Same root cause as #3153158840; tracked together.

coderabbitai

Actionable comments posted: 24

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

server/backend/pyproject.toml (1)
44-48: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Duplicate jsonschema dependency in tests group.

Line 44 and line 48 both declare jsonschema>=4.23.0. This duplication is harmless at install time but violates DRY and will confuse maintainers reviewing the dependency list.
🔧 Proposed fix
 tests = [
     "jsonschema>=4.23.0",
     "pytest>=9.0.3",
     "pytest-asyncio>=1.3.0",
     "httpx",
-    "jsonschema>=4.23.0",
 ]
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@server/backend/pyproject.toml` around lines 44 - 48, Remove the duplicated
dependency entry "jsonschema>=4.23.0" from the tests dependency list in
pyproject.toml so it appears only once; locate the duplicate occurrences of
"jsonschema>=4.23.0" in the tests extras/dependencies block and delete the
redundant line, leaving a single entry (ensure commas/formatting around the
remaining lines remain valid).

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@server/backend/README.md`:
- Around line 85-89: The fenced code block in server/backend/README.md lacks a
language identifier; update the three-backtick block containing the commands "uv
sync --extra semsearch" and 'pip install "cq-server[semsearch]"' to use a
shell/bash language tag (e.g., ```bash) so the commands render with proper
syntax highlighting.
- Around line 104-109: Update the fenced code block that shows the HTTP request
starting with "POST {TOKEN_EMBEDDING_URL}/predict" to include the language
identifier "http" (i.e., change the opening triple backticks to ```http) so the
example is syntax-highlighted properly; ensure only the fence marker is updated
and the request contents remain unchanged.
- Line 76: Remove the unused Markdown link reference `[issue-310]` from the
README by deleting the reference line "[issue-310]:
https://github.com/mozilla-ai/cq/issues/310" (or alternatively add an inline
reference to `[issue-310]` in the relevant section such as the semantic search
paragraph if the issue should be cited); update the README.md so there are no
dangling/unused link reference definitions.

In `@server/backend/src/cq_server/repositories/knowledge.py`:
- Around line 70-72: The current code calls await sem_insert_unit(...) inside
with self._db.engine.begin(), which opens a DB transaction while performing an
async HTTP embedding fetch; refactor so the HTTP call happens before opening the
transaction: call sem_fetch_embedding(unit) (or refactor sem_insert_unit into
sem_fetch_embedding + sem_store_embedding) to fetch/return the embedding async
outside the with self._db.engine.begin() block, then open with
self._db.engine.begin() and call sem_store_embedding(conn, unit.id,
embedding_data) (or the DB-only part of sem_insert_unit) so the DB transaction
is short and opened only during the SQL insert.
- Around line 84-93: The query path currently awaits an external HTTP embedding
call while holding a DB connection (see _SEMSEARCH_ENABLED and the with
self._db.engine.connect() block calling sem_query), which can exhaust the pool;
change the flow to fetch the query embedding before opening the DB connection
(use an async call like sem_fetch_query_embedding/domains), then open a
short-lived connection and call a DB-only function (e.g.,
sem_query_with_embedding(conn, query_embedding, languages, frameworks, pattern,
limit)) so no awaitable HTTP call occurs while the connection is held.
- Line 31: Remove the unused import text_clause from the top of the module;
locate the import statement "from sqlalchemy.sql.expression import text as
text_clause" in knowledge.py and delete it (ensure no other references to
text_clause or the alias remain in functions or classes such as any repository
methods in this file).
- Line 6: The import statement "from email.mime import text" is unused and
should be removed; delete that import line from the top of the file (the unused
symbol "text") so there are no unused imports remaining in knowledge.py and run
linting to confirm no further unused imports.
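
The first two items in this group describe the same pattern: finish the awaited HTTP call first, then open a short transaction for the SQL only. A minimal sketch follows, using the standard-library sqlite3 module in place of the SQLAlchemy engine and a stub in place of the remote embedding call; `fetch_embedding`, `store_embedding`, `insert_unit` and the `unit_embeddings` table are hypothetical names for illustration.

```python
import asyncio
import json
import sqlite3


async def fetch_embedding(text: str) -> list[float]:
    """Stand-in for the remote embedding request (no real HTTP call is made here)."""
    await asyncio.sleep(0)
    return [float(len(text)), 0.0]


def store_embedding(conn: sqlite3.Connection, unit_id: str, embedding: list[float]) -> None:
    """DB-only step: write the already-computed embedding."""
    conn.execute(
        "INSERT OR REPLACE INTO unit_embeddings (unit_id, embedding) VALUES (?, ?)",
        (unit_id, json.dumps(embedding)),
    )


async def insert_unit(conn: sqlite3.Connection, unit_id: str, text: str) -> None:
    # Await the slow network call before any transaction is open.
    embedding = await fetch_embedding(text)
    # The connection context manager commits on success and rolls back on error.
    with conn:
        store_embedding(conn, unit_id, embedding)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE unit_embeddings (unit_id TEXT PRIMARY KEY, embedding TEXT)")
asyncio.run(insert_unit(conn, "ku-1", "exoplanet transit"))
rows = conn.execute("SELECT unit_id, embedding FROM unit_embeddings").fetchall()
print(rows)
```

The same split applies to the query path: compute the query embedding first, then hold the connection only for the vector lookup.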

In `@server/backend/src/cq_server/semsearch/__init__.py`:
- Around line 93-100: The load function currently enables extension loading,
then calls sqlite_vec.load(conn) and later disables loading, but if
sqlite_vec.load raises an exception the conn remains in
enable_load_extension(True) state; fix this by wrapping the extension-loading
sequence in a try/finally: call conn.enable_load_extension(True) before trying
sqlite_vec.load(conn) and ensure conn.enable_load_extension(False) is executed
in the finally block; call ensure_schema(conn) after a successful load (inside
the try) and re-raise or let the original exception propagate so errors aren't
swallowed.
- Around line 30-33: The logger.warning message in __init__.py incorrectly tells
users to install "cq-sdk[embedding]"—update that message to reference the
correct package and extra: instruct users to install the cq-server package with
the "semsearch" extra (e.g., pip install cq-server[semsearch]) and clarify this
is for the cq-server package; modify the call to logger.warning (the string
passed to logger.warning) accordingly to mention TOKEN_EMBEDDING_URL, cq-server,
and the "semsearch" extra.
- Around line 15-16: The module-level int() conversion for
SEMSEARCH_EMBEDDING_DIM is unguarded and can raise ValueError during import;
update the initialization of _DIM to validate the environment value (referencing
SEMSEARCH_EMBEDDING_DIM and the module-level _DIM) by checking if the env var is
present and numeric, try/except or use str.isdigit() to parse it safely, fall
back to the default 768 when invalid, and log or warn about the invalid value so
the server doesn't crash even if _ENABLED is False.
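
The try/finally shape requested for the extension-loading item can be sketched as below. To keep the example runnable without sqlite-vec, a recording stub stands in for the sqlite3 connection and a deliberately failing function stands in for sqlite_vec.load; `load_extension`, `RecordingConnection` and `failing_loader` are hypothetical names.

```python
class RecordingConnection:
    """Stub that records enable_load_extension calls (stands in for sqlite3.Connection)."""

    def __init__(self) -> None:
        self.calls: list[bool] = []

    def enable_load_extension(self, enabled: bool) -> None:
        self.calls.append(enabled)


def load_extension(conn, loader) -> None:
    """Enable extension loading, run the loader, and always disable loading again."""
    conn.enable_load_extension(True)
    try:
        loader(conn)
    finally:
        # Runs on success and on failure, so the connection never stays in the enabled state.
        conn.enable_load_extension(False)


def failing_loader(conn) -> None:
    raise RuntimeError("extension failed to load")


conn = RecordingConnection()
try:
    load_extension(conn, failing_loader)
except RuntimeError as exc:
    error = str(exc)  # the original exception still propagates to the caller

print(conn.calls, error)
```

In the real module the loader would be sqlite_vec.load, with ensure_schema(conn) called inside the try block after a successful load.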

In `@server/backend/src/cq_server/semsearch/queries.py`:
- Line 77: The assignment to the local variable `res` from the call to
`conn.execute(text_clause(_VEC_INSERT_SQL), {"unit_id": unit.id, "embedding":
serialized})` is unused; remove the unnecessary `res =` assignment and call
`conn.execute(...)` without capturing its return (or, if you intended to check
the result, handle `res` accordingly), so eliminate the unused `res` variable in
`queries.py` around the `_VEC_INSERT_SQL` insertion call.
- Around line 35-38: The AsyncClient usage in the semsearch embedding call lacks
a timeout, so configure a request timeout to avoid indefinite hangs: when
creating AsyncClient (the instance using base_url _TOKEN_EMBEDDING_URL) supply a
timeout (or an httpx.Timeout) and/or pass a timeout to the client.post call that
sends request_data with "inputs": wordlist to "/predict", e.g., a sensible
numeric timeout (e.g., 5–30s) or per-stage timeout object; ensure the change is
applied where AsyncClient is constructed and that error handling around
response.raise_for_status remains intact.
- Line 70: Replace the root-logger call logging.warning("Attempted to insert
embedding while semantic search is disabled; skipping embedding insert") with
the module-level logger (logger.warning(...)) to match the rest of the module;
locate the statement in semsearch/queries.py (the embedding-insert branch where
semantic search is disabled) and change logging.warning to logger.warning so the
message uses the module logger.
- Line 149: combined_query currently calls _get_embeddings(domains) while
query() uses _get_embeddings(normalized), causing inconsistent embeddings and
ranking; update combined_query to call _get_embeddings with the same normalized
input (i.e., use the normalized variable or run the same normalization step
before calling _get_embeddings) so both functions compute embeddings from
identical input (refer to combined_query, query, _get_embeddings, domains,
normalized).
- Line 10: Remove the unused import "from pydoc import text" at the top of the
module; it is never referenced and is shadowed by the local variable text in the
functions upsert_unit and insert_unit, so simply delete that import line to
eliminate the unused import.
- Around line 53-54: Remove the stray TODO-style comment and add a precise type
annotation for the `conn` parameter in `upsert_unit` to clarify the expected
SQLAlchemy connection object; replace `# check type of sqlalchemy conns` with a
typed signature such as using `sqlalchemy.engine.Connection` or
`AsyncConnection`/`AsyncEngine` as appropriate for this async function, and
update imports to include the chosen SQLAlchemy type and adjust any call sites
of `upsert_unit` if needed so types remain consistent with `KnowledgeUnit`.
- Around line 80-81: The TODO about typing in the async query function should be
removed and the conn parameter given a concrete SQLAlchemy async type to match
upsert_unit; update the signature of async def query to annotate conn as
sqlalchemy.ext.asyncio.AsyncConnection (or sqlalchemy.ext.asyncio.AsyncEngine if
the function expects an engine) and import that type at top of the module,
mirroring the annotation style used in upsert_unit so the parameter is strongly
typed and the TODO can be deleted.
- Around line 225-228: The loop in semsearch/queries.py that computes
field_logits can raise a math domain error when v == 0 because ratio becomes 0
and math.log(0) is invalid; update the logic in the for v in values loop (the
block that computes ratio, logit and assigns to field_logits[field_idx][v]) to
clamp ratio to a small positive epsilon (e.g., max(ratio, 1e-12)) or explicitly
handle v == 0 so you never call math.log(0), preserving the invert behavior when
computing logit.
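
The clamping suggested in the last item might look like this. The formula is an assumption inferred from the review comments (a log of v/mean_val, negated when invert is set); `safe_logit` and `_EPSILON` are hypothetical names, and the sketch assumes mean_val > 0.

```python
import math

_EPSILON = 1e-12  # small positive floor, as suggested in the comment


def safe_logit(value: float, mean_val: float, invert: bool = False) -> float:
    """Log of value/mean_val with the ratio floored at _EPSILON, so log(0) is never evaluated."""
    ratio = max(value / mean_val, _EPSILON)
    logit = math.log(ratio)
    return -logit if invert else logit


print(safe_logit(0.0, 2.0))               # large negative number instead of a math domain error
print(safe_logit(2.0, 2.0))               # 0.0: a value equal to the mean leaves the score unchanged
print(safe_logit(4.0, 2.0, invert=True))  # negative: above-mean values are penalised when inverted
```

The floor removes the domain error but still produces a very large negative logit for v == 0, so it is best combined with a bounded modulation of the score.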

In `@server/backend/src/cq_server/services/knowledge.py`:
- Line 11: Remove the redundant import of apply_confirmation and apply_flag from
cq.scoring in knowledge.py so those functions are only imported once; locate the
duplicate import statement that also imports apply_confirmation and apply_flag
and delete it, leaving the original import intact to avoid duplicate imports.

In `@server/backend/tests/db_helpers.py`:
- Line 69: The module is missing the KnowledgeUnit name used in _RepoBundle type
hints (e.g., async def get, list, upsert, delete), causing NameError when
annotations are evaluated; fix by importing KnowledgeUnit into this module (or
enable postponing evaluation of annotations via "from __future__ import
annotations" or wrap the annotations in quotes), so add an appropriate "import
KnowledgeUnit" from its defining module or convert the annotations to
forward-reference strings in the _RepoBundle methods.

In `@server/backend/tests/test_app.py`:
- Line 298: The initial assertion uses assert len(results) >= 1 but the test
immediately expects both ku1 and ku2 in result_confidences, so update the check
to assert len(results) >= 2 (or otherwise adjust the downstream expectations) to
ensure the test requires at least two results; modify the assertion near the top
of the test that references results so it matches the later checks for ku1 and
ku2.
- Line 301: The assertion is vacuous because both units use the default 0.5
confidence; instead assert that the semantic search actually ranked ku1 above
ku2 by checking the ordered results (e.g., verify results[0]["id"] == ku1["id"])
or compute and compare the semsearch-derived relevance scores rather than the
stored evidence.confidence; update the test in test_app.py to either assert the
top result id is ku1 or to extract the relevance score used by semsearch (see
semsearch/queries.py scoring logic) and compare those scores for ku1 and ku2,
and ensure you reference ku1, ku2, result_confidences, _propose_payload and
_approve_unit when locating the insertion/setup code to avoid relying on default
0.5 confidence.
- Around line 262-264: The test is calling the private API
semsearch.queries._get_embeddings and using a broad except Exception which can
mask real issues; change the call to the public API (e.g.,
semsearch.queries.get_embeddings) to avoid coupling to internals, and replace
the broad except Exception with a narrow catch for connectivity/service errors
such as ConnectionError, asyncio.TimeoutError and OSError so only
network/service failures trigger pytest.skip while other exceptions propagate.

In `@server/backend/tests/test_semsearch.py`:
- Around line 248-249: Replace the informal comment above the two logging lines
with a clear explanation that Alembic's env.py modifies/overrides Python logging
configuration during migration setup (disabling package loggers), which breaks
test logging; therefore tests must explicitly re-enable the specific loggers
"cq_server.semsearch" and "cq_server.semsearch.queries". Update the comment to
state that these two getLogger(...) calls are restoring those loggers for test
output and include a brief note that this is a targeted workaround limited to
the test environment.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: c4755acf-e515-4291-bbf6-315c16792b4d

📥 Commits

Reviewing files that changed from the base of the PR and between ff10e61 and 69f94ff.

⛔ Files ignored due to path filters (1)

server/backend/uv.lock is excluded by !**/*.lock

📒 Files selected for processing (14)

Makefile
server/backend/README.md
server/backend/pyproject.toml
server/backend/src/cq_server/core/db.py
server/backend/src/cq_server/repositories/knowledge.py
server/backend/src/cq_server/semsearch/__init__.py
server/backend/src/cq_server/semsearch/queries.py
server/backend/src/cq_server/services/knowledge.py
server/backend/tests/conftest.py
server/backend/tests/db_helpers.py
server/backend/tests/test_app.py
server/backend/tests/test_semsearch.py
server/backend/tests/test_sqlite_store.py
server/backend/tests/test_store.py

coderabbitai · 2026-06-10T09:37:12Z

+        assert len(results) >= 1
+        assert ku1["id"] in result_confidences and ku2["id"] in result_confidences
+        assert results[0]["domains"] == ["astronomy"]
+        assert result_confidences[ku1["id"]] >= result_confidences[ku2["id"]]


⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Critical: assertion does not verify semantic ranking.

Line 301 checks that result_confidences[ku1["id"]] >= result_confidences[ku2["id"]], but both units are inserted with the default confidence of 0.5 (from _propose_payload and _approve_unit). The semsearch scoring multiplies relevance by confidence but does not modify the returned unit's evidence.confidence field (see context snippet 3, semsearch/queries.py:129-193). Therefore, this assertion is 0.5 >= 0.5, which is trivially true and does not verify that ku1 ranks above ku2 based on semantic similarity. The test name and PR description indicate that semantic ranking should be validated, yet results[0]["id"] is never checked.

Verify that ku1 is the top-ranked result

         assert len(results) >= 1
         assert ku1["id"] in result_confidences and ku2["id"] in result_confidences
         assert results[0]["domains"] == ["astronomy"]
-        assert result_confidences[ku1["id"]] >= result_confidences[ku2["id"]]
+        assert results[0]["id"] == ku1["id"], "Expected ku1 (exoplanet transit) to rank higher than ku2 (HII enrichment) via semantic similarity"


🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@server/backend/tests/test_app.py` at line 301, The assertion is vacuous because both units use the default 0.5 confidence; instead assert that the semantic search actually ranked ku1 above ku2 by checking the ordered results (e.g., verify results[0]["id"] == ku1["id"]) or compute and compare the semsearch-derived relevance scores rather than the stored evidence.confidence; update the test in test_app.py to either assert the top result id is ku1 or to extract the relevance score used by semsearch (see semsearch/queries.py scoring logic) and compare those scores for ku1 and ku2, and ensure you reference ku1, ku2, result_confidences, _propose_payload and _approve_unit when locating the insertion/setup code to avoid relying on default 0.5 confidence.

coderabbitai

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (5)

server/backend/tests/test_app.py (1)
268-271: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Fix stale response parsing for /api/v1/knowledge in both query tests.

Line 268 and Line 313 still treat the response as a raw list, but this endpoint now returns an envelope ({"data": [...]}). That causes the KeyError: 0 seen in CI and leaves the semsearch-path test parsing the wrong shape.
Suggested patch
-        results = resp.json()
+        results = resp.json()["data"]
         assert len(results) == 1
         assert results[0]["domains"] == ["astronomy"]
         assert "exoplanets" in results[0]["insight"]["summary"].lower()
@@
-        results = resp.json()
+        results = resp.json()["data"]
         result_confidences = {result["id"]: result["evidence"]["confidence"] for result in results}
Also applies to: 313-314
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@server/backend/tests/test_app.py` around lines 268 - 271, The tests currently
parse resp.json() as a raw list but the /api/v1/knowledge endpoint now returns
an envelope {"data": [...]}; update both occurrences where results = resp.json()
(around the assertions at the block using results[0]["domains"] and the
semsearch-path test) to set results = resp.json().get("data", []) (or
resp.json()["data"]) before asserting length and accessing results[0], and
adjust any downstream assertions to use this extracted list.
Source: Pipeline failures
server/backend/pyproject.toml (3)
44-44: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Duplicate test dependency: jsonschema listed twice.

jsonschema>=4.23.0 appears at both lines 44 and 48 in the tests dependency group. This duplication is redundant and should be removed.
🔧 Proposed fix to remove duplicate entry
 tests = [
     "jsonschema>=4.23.0",
     "pytest>=9.0.3",
     "pytest-asyncio>=1.3.0",
     "httpx",
-    "jsonschema>=4.23.0",
 ]
Also applies to: 48-48
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@server/backend/pyproject.toml` at line 44, The tests dependency list in
pyproject.toml contains a duplicated entry "jsonschema>=4.23.0"; remove the
redundant occurrence so "jsonschema>=4.23.0" appears only once in the tests
dependencies block, leaving the rest of the tests dependency list unchanged and
preserving formatting.
65-67: 🧹 Nitpick | 🔵 Trivial | ⚡ Quick win

Pytest CLI logging configuration contradicts explanation in past review.

Lines 65-67 set log_cli = true, log_cli_level = "DEBUG", and log_level = "DEBUG". In a past review, peteski22 questioned this configuration. The response stated "Logs will only be printed if -vv is used anyway."

However, log_cli = true unconditionally enables console logging during test runs—no -vv flag is required. With log_cli_level = "DEBUG", every test run will emit DEBUG-level logs to the console, increasing noise and slowing output.

If the intent is to enable debug logs only on demand, remove these settings and use pytest -o log_cli=true -o log_cli_level=DEBUG for one-off debugging.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@server/backend/pyproject.toml` around lines 65 - 67, Remove the unconditional
pytest CLI logging settings so DEBUG logs are only enabled on demand: delete the
log_cli, log_cli_level, and log_level entries (the keys "log_cli",
"log_cli_level", and "log_level") from pyproject.toml and rely on invoking
pytest with flags like `pytest -o log_cli=true -o log_cli_level=DEBUG` when you
need verbose debug output.
1-68: ⚠️ Potential issue | 🔴 Critical

Fix test_query_finds_expected_unit_across_distinct_domains parsing of the /api/v1/knowledge response.

/api/v1/knowledge returns a KnowledgeUnitList envelope ({"data": [...]}) from server/backend/src/cq_server/api/routes/knowledge.py, but server/backend/tests/test_app.py::TestQuery::test_query_finds_expected_unit_across_distinct_domains treats resp.json() as a bare list (results[0]), triggering KeyError: 0. Update the test to read resp.json()["data"] before indexing, instead of tying this failure to semsearch activation or the cq-sdk bump.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@server/backend/pyproject.toml` around lines 1 - 68, The test
TestQuery::test_query_finds_expected_unit_across_distinct_domains is indexing
resp.json() as if it were a bare list causing KeyError; update that test to
extract the KnowledgeUnitList envelope by reading resp.json()["data"] (i.e.,
assign results = resp.json()["data"] before indexing) so it matches the
/api/v1/knowledge route (server/backend/src/cq_server/api/routes/knowledge.py)
which returns {"data": [...]}; ensure any other assertions in that test use the
unwrapped list.
server/backend/src/cq_server/semsearch/__init__.py (1)
29-29: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Do not log the full embedding endpoint.

TOKEN_EMBEDDING_URL can contain credentials or signed query params. Emitting it verbatim at info level leaks secrets into application logs.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@server/backend/src/cq_server/semsearch/__init__.py` at line 29, The current
logger.info call prints the full _TOKEN_EMBEDDING_URL which may contain secrets;
change the log to avoid emitting the full URL in __init__.py by either logging a
generic message like "Token embedding enabled using encoderfile endpoint" or log
a safely redacted version (e.g., only the host/port or a masked string) instead
of the full _TOKEN_EMBEDDING_URL; update the logger.info that references
_TOKEN_EMBEDDING_URL so it never dumps credentials or query params.
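One way to produce the redacted form is sketched below, using only the standard library. `redact_url` is a hypothetical helper and the example URL is invented; it keeps the scheme, host and port and drops userinfo, path, query and fragment.

```python
from urllib.parse import urlsplit


def redact_url(url: str) -> str:
    """Keep scheme, host and port; drop userinfo, path, query and fragment."""
    parts = urlsplit(url)
    host = parts.hostname or "unknown-host"
    if ":" in host:  # IPv6 literal: restore the brackets that urlsplit strips
        host = f"[{host}]"
    port = f":{parts.port}" if parts.port is not None else ""
    scheme = parts.scheme or "unknown-scheme"
    return f"{scheme}://{host}{port}"


# Invented URL carrying both userinfo and a signed query parameter.
safe = redact_url("https://user:s3cret@embed.example.com:8443/predict?sig=abc123")
```

The log call could then pass `redact_url(...)` in place of the raw value, so credentials and signed parameters never reach the log handler.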

♻️ Duplicate comments (5)

server/backend/pyproject.toml (1)
24-24: 🧹 Nitpick | 🔵 Trivial | ⚡ Quick win

Unpinned httpx in semsearch extra persists despite past review.

Line 24 lists httpx without a version constraint. A past review comment by peteski22 on line 23 flagged this exact issue, noting that httpx is pre-1.0 and recommending a constraint (e.g., httpx>=0.27,<1). The response "Specified" suggests this was addressed, but the constraint is still missing.

Pre-1.0 libraries can introduce breaking API changes. The semsearch module's AsyncClient usage could break on httpx updates.
🔒 Proposed fix to constrain httpx version
 semsearch = [
-    "httpx",
+    "httpx>=0.27,<1",
     "sqlite_vec~=0.1.9",
     "numpy==2.*",
 ]
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@server/backend/pyproject.toml` at line 24, the semsearch extra lists "httpx"
without a version constraint, which is risky because httpx is pre-1.0; update
the dependency entry in pyproject.toml to a constrained range (for example
"httpx>=0.27,<1") so consumers get a stable API surface, then run the tests to
verify that httpx.AsyncClient usage in the semsearch module remains compatible
with the chosen range.
server/backend/src/cq_server/semsearch/__init__.py (2)
94-101: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Always disable extension loading in a finally block.

If sqlite_vec.load(conn) or ensure_schema(conn) fails, this connection is left with extension loading still enabled. That widens the attack surface for every later use of the same handle.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@server/backend/src/cq_server/semsearch/__init__.py` around lines 94 - 101,
The load function leaves extension loading enabled if sqlite_vec.load(conn) or
ensure_schema(conn) raises; change load to enable extension loading, then call
sqlite_vec.load(conn) and ensure_schema(conn) inside a try block and always call
conn.enable_load_extension(False) in a finally block so the connection is never
left with extensions enabled; refer to the load function and the calls to
conn.enable_load_extension, sqlite_vec.load, and ensure_schema to locate and
implement the try/finally.
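The try/finally shape described above can be sketched as follows. `load_with_guard` and `FakeConn` are hypothetical names; in the real module the loaders would be the `sqlite_vec.load` and `ensure_schema` calls named in the comment, and the connection would be a `sqlite3.Connection`, whose `enable_load_extension` method is part of the standard library. A stand-in connection is used so the sketch runs without sqlite-vec installed.

```python
from collections.abc import Callable, Iterable
from typing import Protocol


class SupportsExtensionToggle(Protocol):
    def enable_load_extension(self, enabled: bool) -> None: ...


def load_with_guard(
    conn: SupportsExtensionToggle,
    loaders: Iterable[Callable[[SupportsExtensionToggle], None]],
) -> None:
    """Enable extension loading, run the loaders, and always disable it again."""
    conn.enable_load_extension(True)
    try:
        for loader in loaders:
            loader(conn)
    finally:
        # Runs on success and on failure, so the handle is never left open.
        conn.enable_load_extension(False)


class FakeConn:
    """Stand-in connection that records each toggle (illustration only)."""

    def __init__(self) -> None:
        self.calls: list[bool] = []

    def enable_load_extension(self, enabled: bool) -> None:
        self.calls.append(enabled)


def failing_loader(conn: SupportsExtensionToggle) -> None:
    raise RuntimeError("simulated extension failure")


conn = FakeConn()
try:
    load_with_guard(conn, [failing_loader])
except RuntimeError:
    pass  # the error still propagates to the caller; only the cleanup is guaranteed
```

After the simulated failure, `conn.calls` ends with `False`, which is the property the review asks for.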
17-17: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Guard SEMSEARCH_EMBEDDING_DIM parsing at import.

A non-numeric value here raises during module import, so one bad env var can stop the server starting before _ENABLED is even consulted.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@server/backend/src/cq_server/semsearch/__init__.py` at line 17, The
module-level line setting _DIM from os.environ["SEMSEARCH_EMBEDDING_DIM"] must
be guarded so a non-numeric env var doesn't raise at import; update the code
around _DIM to parse the env var inside a try/except (catch ValueError and
TypeError), fall back to the default 768 when parsing fails or the var is
missing, and optionally emit a warning (use the module logger or processLogger)
so startup continues and _ENABLED can still be checked; refer to the symbol _DIM
and the env name SEMSEARCH_EMBEDDING_DIM when applying the change.
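A minimal sketch of the guarded parse, assuming the default of 768 named in the prompt above. `read_embedding_dim` is a hypothetical helper, and the rejection of non-positive values is an extra assumption that the review does not ask for.

```python
import logging
import os
from collections.abc import Mapping

logger = logging.getLogger(__name__)

DEFAULT_DIM = 768  # default named in the review prompt


def read_embedding_dim(env: Mapping[str, str] = os.environ, default: int = DEFAULT_DIM) -> int:
    """Parse SEMSEARCH_EMBEDDING_DIM without ever raising at import time."""
    raw = env.get("SEMSEARCH_EMBEDDING_DIM")
    if raw is None:
        return default
    try:
        dim = int(raw)
    except (TypeError, ValueError):
        logger.warning("Ignoring non-numeric SEMSEARCH_EMBEDDING_DIM=%r; using %d", raw, default)
        return default
    if dim <= 0:  # assumption: a vector dimension must be positive
        logger.warning("Ignoring non-positive SEMSEARCH_EMBEDDING_DIM=%r; using %d", raw, default)
        return default
    return dim


dim_missing = read_embedding_dim({})
dim_bad = read_embedding_dim({"SEMSEARCH_EMBEDDING_DIM": "abc"})
dim_ok = read_embedding_dim({"SEMSEARCH_EMBEDDING_DIM": "384"})
```

Passing the environment as a mapping keeps the function testable without mutating `os.environ`.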
server/backend/src/cq_server/semsearch/queries.py (1)
38-40: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Set a timeout on the embedding HTTP call.

Without one, a slow or dead embedding service can hang semsearch requests indefinitely. In the current repository wiring that also means a DB connection stays tied up for the whole wait.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@server/backend/src/cq_server/semsearch/queries.py` around lines 38 - 40, The
embedding HTTP call using AsyncClient (created with base_url
_TOKEN_EMBEDDING_URL) currently awaits client.post("/predict",
json=request_data) with no timeout; add a sensible timeout (e.g., 5–10s) by
passing a timeout to AsyncClient(...) or to client.post(..., timeout=...) and
ensure you handle/propagate httpx timeout exceptions (httpx.TimeoutException) so
semsearch functions (where this code lives) don't hang and DB connections are
released.
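The prompt above suggests httpx's own `timeout=` argument, which both `AsyncClient(...)` and `client.post(...)` accept. httpx also documents a default timeout of 5 seconds of network inactivity unless `timeout=None` is passed, so it is worth checking how the client is constructed. The sketch below swaps in a different technique, an overall deadline via `asyncio.wait_for`, so that it runs with the standard library alone. `embed_with_deadline`, `EmbeddingTimeout` and the two fake services are hypothetical.

```python
import asyncio
from collections.abc import Awaitable, Callable


class EmbeddingTimeout(RuntimeError):
    """Raised when the embedding call exceeds its deadline (hypothetical type)."""


async def embed_with_deadline(
    call: Callable[[], Awaitable[list[float]]], seconds: float
) -> list[float]:
    """Bound the whole call, not just periods of network inactivity."""
    try:
        return await asyncio.wait_for(call(), timeout=seconds)
    except asyncio.TimeoutError as exc:
        raise EmbeddingTimeout(f"embedding call exceeded {seconds}s") from exc


async def fast_service() -> list[float]:
    return [0.1, 0.2]


async def slow_service() -> list[float]:
    await asyncio.sleep(1.0)  # simulates a stalled embedding server
    return [0.0]


async def _demo() -> tuple[list[float], bool]:
    ok = await embed_with_deadline(fast_service, 0.5)
    try:
        await embed_with_deadline(slow_service, 0.05)
    except EmbeddingTimeout:
        return ok, True
    return ok, False


ok, timed_out = asyncio.run(_demo())
```

Raising a dedicated exception lets the caller decide whether to fall back to non-semantic ranking or surface the error.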
server/backend/src/cq_server/repositories/knowledge.py (1)
68-70: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Do not hold synchronous engine connections open across await.

Both paths open a sync SQLAlchemy connection/transaction and then await semsearch helpers that perform network I/O. That bypasses Database.run_sync()’s threadpool boundary and can both block the event loop and exhaust the SQLite pool under concurrent load.

Also applies to: 82-86
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@server/backend/src/cq_server/repositories/knowledge.py` around lines 68 - 70,
The code opens a synchronous DB transaction with self._db.engine.begin() and
then awaits semsearch helpers (sem_insert_unit / sem_update_unit), which holds
the sync connection across await; instead, perform all DB work inside the with
self._db.engine.begin() block, extract the minimal data needed for semsearch
(e.g., unit id or payload), exit/commit to close the connection, and only then
call await sem_insert_unit(...) or await sem_update_unit(...). Apply the same
change to both insertion (where _SEMSEARCH_ENABLED + sem_insert_unit is used)
and update paths (lines referenced by sem_update_unit), or alternatively
dispatch the semsearch call to a background task (asyncio.create_task) after
closing the sync connection so no await runs while the sync transaction is open.
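The first option in the prompt above (commit, close, then await) can be sketched with the standard library alone. This uses plain `sqlite3` and `asyncio.to_thread` in place of the project's SQLAlchemy engine and `Database.run_sync()`; the table, the `insert_unit_sync` and `insert_then_embed` functions, and `fake_embed` are all hypothetical.

```python
import asyncio
import os
import sqlite3
import tempfile


def insert_unit_sync(db_path: str, unit_id: str, summary: str) -> str:
    """Do all blocking DB work here; the connection is closed before returning."""
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("CREATE TABLE IF NOT EXISTS units (id TEXT PRIMARY KEY, summary TEXT)")
            conn.execute("INSERT INTO units (id, summary) VALUES (?, ?)", (unit_id, summary))
    finally:
        conn.close()
    return summary  # minimal payload needed by the embedding step


async def fake_embed(text: str) -> list[float]:
    """Stand-in for the network call to the embedding service."""
    await asyncio.sleep(0)
    return [float(len(text))]


async def insert_then_embed(db_path: str, unit_id: str, summary: str) -> list[float]:
    # Blocking DB work runs in a worker thread so the event loop stays free.
    payload = await asyncio.to_thread(insert_unit_sync, db_path, unit_id, summary)
    # No connection or transaction is open while this await is pending.
    return await fake_embed(payload)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.db")
    vector = asyncio.run(insert_then_embed(path, "ku_1", "exoplanets"))
    check = sqlite3.connect(path)
    stored = check.execute("SELECT summary FROM units WHERE id = ?", ("ku_1",)).fetchone()
    check.close()
```

This ordering protects the pool and the event loop, but the canonical row is committed before the embedding exists, so a failed embedding call still needs a retry or repair path.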

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@Makefile`:
- Around line 27-32: The Makefile now includes unrelated Pi install/uninstall
targets and help text (targets: install-pi, uninstall-pi and the "Pi:" help
block) that don't belong in the semantic-search PR; remove the added Pi help
block and the install-pi/uninstall-pi targets from this diff so the PR only
contains sqlite-vec/semantic similarity changes, and open a separate PR/branch
that adds the Pi host support (including the help text and targets) if needed,
ensuring the functions/names install-pi and uninstall-pi are not present in this
branch.
- Line 271: The Makefile contains unrelated refactor changes replacing direct
lint commands with the centralized wrapper (e.g., the command "bash
scripts/lint-python-component.sh scripts/install"); revert those substitutions
in this PR so the lint targets are restored to their original direct commands,
and open a separate PR that introduces the centralized wrapper and updates the
Makefile to call "bash scripts/lint-python-component.sh" for the relevant
targets.

In `@server/backend/src/cq_server/repositories/knowledge.py`:
- Around line 13-15: The code inserts canonical knowledge rows but calls
sem_insert_unit only after _insert_sync commits, so failed embedding generation
leaves a canonical unit without a knowledge_units_vec entry and subsequent
update() calls never repair it; fix by making embedding insertion part of the
same logical operation: call sem_insert_unit inside _insert_sync (or before
committing) and if sem_insert_unit fails then rollback the canonical insert (or
surface the error) so they stay consistent, and also update the update() path to
always attempt/ensure sem_insert_unit for the unit (i.e., have update() call
sem_insert_unit when a knowledge_units_vec is missing) so missing embeddings are
repaired; refer to functions _insert_sync, sem_insert_unit, update(), and the
knowledge_units_vec concept when applying the changes.

In `@server/backend/src/cq_server/semsearch/queries.py`:
- Around line 183-192: The code currently passes raw distance into combine(),
which gives larger distances a positive boost because results are later sorted
descending; negate or invert the distance before combining so larger distances
reduce the score. Specifically, change the combine(...) call that currently
passes distance to instead pass a negated or inverted value (e.g. -distance or
an appropriate normalized inverse like 1 - distance depending on distance scale)
so that calculate_relevance(u, ...) remains positive for matches and the
distance term diminishes the combined score rather than promoting distant units.

---

Outside diff comments:
In `@server/backend/pyproject.toml`:
- Line 44: The tests dependency list in pyproject.toml contains a duplicated
entry "jsonschema>=4.23.0"; remove the redundant occurrence so
"jsonschema>=4.23.0" appears only once in the tests dependencies block, leaving
the rest of the tests dependency list unchanged and preserving formatting.
- Around line 65-67: Remove the unconditional pytest CLI logging settings so
DEBUG logs are only enabled on demand: delete the log_cli, log_cli_level, and
log_level entries (the keys "log_cli", "log_cli_level", and "log_level") from
pyproject.toml and rely on invoking pytest with flags like `pytest -o
log_cli=true -o log_cli_level=DEBUG` when you need verbose debug output.
- Around line 1-68: The test
TestQuery::test_query_finds_expected_unit_across_distinct_domains is indexing
resp.json() as if it were a bare list causing KeyError; update that test to
extract the KnowledgeUnitList envelope by reading resp.json()["data"] (i.e.,
assign results = resp.json()["data"] before indexing) so it matches the
/api/v1/knowledge route (server/backend/src/cq_server/api/routes/knowledge.py)
which returns {"data": [...]}; ensure any other assertions in that test use the
unwrapped list.

In `@server/backend/src/cq_server/semsearch/__init__.py`:
- Line 29: The current logger.info call prints the full _TOKEN_EMBEDDING_URL
which may contain secrets; change the log to avoid emitting the full URL in
__init__.py by either logging a generic message like "Token embedding enabled
using encoderfile endpoint" or log a safely redacted version (e.g., only the
host/port or a masked string) instead of the full _TOKEN_EMBEDDING_URL; update
the logger.info that references _TOKEN_EMBEDDING_URL so it never dumps
credentials or query params.

In `@server/backend/tests/test_app.py`:
- Around line 268-271: The tests currently parse resp.json() as a raw list but
the /api/v1/knowledge endpoint now returns an envelope {"data": [...]}; update
both occurrences where results = resp.json() (around the assertions at the block
using results[0]["domains"] and the semsearch-path test) to set results =
resp.json().get("data", []) (or resp.json()["data"]) before asserting length and
accessing results[0], and adjust any downstream assertions to use this extracted
list.

---

Duplicate comments:
In `@server/backend/pyproject.toml`:
- Line 24: The semsearch extra lists "httpx" without a version constraint, which
is risky because httpx is pre-1.0; update the dependency entry in
pyproject.toml to a constrained range (for example "httpx>=0.27,<1") so
consumers get a stable API surface, then run the tests to verify that
httpx.AsyncClient usage in the semsearch module remains compatible with the
chosen range.

In `@server/backend/src/cq_server/repositories/knowledge.py`:
- Around line 68-70: The code opens a synchronous DB transaction with
self._db.engine.begin() and then awaits semsearch helpers (sem_insert_unit /
sem_update_unit), which holds the sync connection across await; instead, perform
all DB work inside the with self._db.engine.begin() block, extract the minimal
data needed for semsearch (e.g., unit id or payload), exit/commit to close the
connection, and only then call await sem_insert_unit(...) or await
sem_update_unit(...). Apply the same change to both insertion (where
_SEMSEARCH_ENABLED + sem_insert_unit is used) and update paths (lines referenced
by sem_update_unit), or alternatively dispatch the semsearch call to a
background task (asyncio.create_task) after closing the sync connection so no
await runs while the sync transaction is open.

In `@server/backend/src/cq_server/semsearch/__init__.py`:
- Around line 94-101: The load function leaves extension loading enabled if
sqlite_vec.load(conn) or ensure_schema(conn) raises; change load to enable
extension loading, then call sqlite_vec.load(conn) and ensure_schema(conn)
inside a try block and always call conn.enable_load_extension(False) in a
finally block so the connection is never left with extensions enabled; refer to
the load function and the calls to conn.enable_load_extension, sqlite_vec.load,
and ensure_schema to locate and implement the try/finally.
- Line 17: The module-level line setting _DIM from
os.environ["SEMSEARCH_EMBEDDING_DIM"] must be guarded so a non-numeric env var
doesn't raise at import; update the code around _DIM to parse the env var inside
a try/except (catch ValueError and TypeError), fall back to the default 768 when
parsing fails or the var is missing, and optionally emit a warning (use the
module logger or processLogger) so startup continues and _ENABLED can still be
checked; refer to the symbol _DIM and the env name SEMSEARCH_EMBEDDING_DIM when
applying the change.

In `@server/backend/src/cq_server/semsearch/queries.py`:
- Around line 38-40: The embedding HTTP call using AsyncClient (created with
base_url _TOKEN_EMBEDDING_URL) currently awaits client.post("/predict",
json=request_data) with no timeout; add a sensible timeout (e.g., 5–10s) by
passing a timeout to AsyncClient(...) or to client.post(..., timeout=...) and
ensure you handle/propagate httpx timeout exceptions (httpx.TimeoutException) so
semsearch functions (where this code lives) don't hang and DB connections are
released.
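For the semsearch/queries.py item on lines 183-192 in the prompt above, the effect of the sign of the distance term can be sketched as follows. The real `combine()` signature is not shown in this review, so `combine_raw` and `combine_inverted` are hypothetical stand-ins, the cosine distance is assumed to lie in [0, 2], and the candidate values are made up.

```python
def combine_raw(base: float, distance: float) -> float:
    """Problematic shape: a larger distance yields a larger score."""
    return base * (1.0 + distance)


def combine_inverted(base: float, distance: float) -> float:
    """Convert cosine distance (assumed in [0, 2]) to similarity before combining."""
    similarity = 1.0 - distance  # 1.0 = same direction, -1.0 = opposite
    return base * (1.0 + similarity)


candidates = {"near": 0.1, "far": 0.9}  # unit id -> cosine distance (made-up values)
base = 0.5  # same relevance and confidence for both, to isolate the distance term

# Results are sorted descending, as the review describes.
rank_raw = sorted(candidates, key=lambda u: combine_raw(base, candidates[u]), reverse=True)
rank_inverted = sorted(
    candidates, key=lambda u: combine_inverted(base, candidates[u]), reverse=True
)
```

With the raw distance the far unit sorts first; after inversion the near unit does, which matches the behaviour the PR description intends.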

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 60213c82-e1b7-4106-bae8-3ec21c6fd236

📥 Commits

Reviewing files that changed from the base of the PR and between 69f94ff and 82ac6f3.

⛔ Files ignored due to path filters (2)

server/backend/.DS_Store is excluded by !**/.DS_Store
server/backend/uv.lock is excluded by !**/*.lock

📒 Files selected for processing (10)

Makefile
server/backend/pyproject.toml
server/backend/src/cq_server/core/db.py
server/backend/src/cq_server/repositories/knowledge.py
server/backend/src/cq_server/semsearch/__init__.py
server/backend/src/cq_server/semsearch/queries.py
server/backend/tests/conftest.py
server/backend/tests/db_helpers.py
server/backend/tests/test_app.py
server/backend/tests/test_semsearch.py

coderabbitai

Caution

Inline review comments failed to post. This is likely due to an internal GitHub server error or to limits on posting large numbers of comments. If you see this consistently, it is likely a permissions issue. Please check "Moderation" -> "Code review limits" under your organization settings.

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@Makefile`:
- Around line 27-32: The Makefile now includes unrelated Pi install/uninstall
targets and help text (targets: install-pi, uninstall-pi and the "Pi:" help
block) that don't belong in the semantic-search PR; remove the added Pi help
block and the install-pi/uninstall-pi targets from this diff so the PR only
contains sqlite-vec/semantic similarity changes, and open a separate PR/branch
that adds the Pi host support (including the help text and targets) if needed,
ensuring the functions/names install-pi and uninstall-pi are not present in this
branch.
- Line 271: The Makefile contains unrelated refactor changes replacing direct
lint commands with the centralized wrapper (e.g., the command "bash
scripts/lint-python-component.sh scripts/install"); revert those substitutions
in this PR so the lint targets are restored to their original direct commands,
and open a separate PR that introduces the centralized wrapper and updates the
Makefile to call "bash scripts/lint-python-component.sh" for the relevant
targets.

In `@server/backend/src/cq_server/repositories/knowledge.py`:
- Around line 13-15: The code inserts canonical knowledge rows but calls
sem_insert_unit only after _insert_sync commits, so failed embedding generation
leaves a canonical unit without a knowledge_units_vec entry and subsequent
update() calls never repair it; fix by making embedding insertion part of the
same logical operation: call sem_insert_unit inside _insert_sync (or before
committing) and if sem_insert_unit fails then rollback the canonical insert (or
surface the error) so they stay consistent, and also update the update() path to
always attempt/ensure sem_insert_unit for the unit (i.e., have update() call
sem_insert_unit when a knowledge_units_vec is missing) so missing embeddings are
repaired; refer to functions _insert_sync, sem_insert_unit, update(), and the
knowledge_units_vec concept when applying the changes.

In `@server/backend/src/cq_server/semsearch/queries.py`:
- Around line 183-192: The code currently passes raw distance into combine(),
which gives larger distances a positive boost because results are later sorted
descending; negate or invert the distance before combining so larger distances
reduce the score. Specifically, change the combine(...) call that currently
passes distance to instead pass a negated or inverted value (e.g. -distance or
an appropriate normalized inverse like 1 - distance depending on distance scale)
so that calculate_relevance(u, ...) remains positive for matches and the
distance term diminishes the combined score rather than promoting distant units.

---

Outside diff comments:
In `@server/backend/pyproject.toml`:
- Line 44: The tests dependency list in pyproject.toml contains a duplicated
entry "jsonschema>=4.23.0"; remove the redundant occurrence so
"jsonschema>=4.23.0" appears only once in the tests dependencies block, leaving
the rest of the tests dependency list unchanged and preserving formatting.
- Around line 65-67: Remove the unconditional pytest CLI logging settings so
DEBUG logs are only enabled on demand: delete the log_cli, log_cli_level, and
log_level entries (the keys "log_cli", "log_cli_level", and "log_level") from
pyproject.toml and rely on invoking pytest with flags like `pytest -o
log_cli=true -o log_cli_level=DEBUG` when you need verbose debug output.
- Around line 1-68: The test
TestQuery::test_query_finds_expected_unit_across_distinct_domains is indexing
resp.json() as if it were a bare list causing KeyError; update that test to
extract the KnowledgeUnitList envelope by reading resp.json()["data"] (i.e.,
assign results = resp.json()["data"] before indexing) so it matches the
/api/v1/knowledge route (server/backend/src/cq_server/api/routes/knowledge.py)
which returns {"data": [...]}; ensure any other assertions in that test use the
unwrapped list.

In `@server/backend/src/cq_server/semsearch/__init__.py`:
- Line 29: The current logger.info call prints the full _TOKEN_EMBEDDING_URL
which may contain secrets; change the log to avoid emitting the full URL in
__init__.py by either logging a generic message like "Token embedding enabled
using encoderfile endpoint" or log a safely redacted version (e.g., only the
host/port or a masked string) instead of the full _TOKEN_EMBEDDING_URL; update
the logger.info that references _TOKEN_EMBEDDING_URL so it never dumps
credentials or query params.

In `@server/backend/tests/test_app.py`:
- Around line 268-271: The tests currently parse resp.json() as a raw list but
the /api/v1/knowledge endpoint now returns an envelope {"data": [...]}; update
both occurrences where results = resp.json() (around the assertions at the block
using results[0]["domains"] and the semsearch-path test) to set results =
resp.json().get("data", []) (or resp.json()["data"]) before asserting length and
accessing results[0], and adjust any downstream assertions to use this extracted
list.

---

Duplicate comments:
In `@server/backend/pyproject.toml`:
- Line 24: The semsearch extra lists "httpx" without a version constraint which
is unsafe because httpx is pre-1.0; update the dependency entry in
pyproject.toml to a constrained range (for example "httpx>=0.27,<1") so
consumers get a stable API surface; ensure the semsearch code paths that use
httpx.AsyncClient still compile against the chosen range and run tests to verify
AsyncClient usage in the semsearch module remains compatible with the pinned
version.

In `@server/backend/src/cq_server/repositories/knowledge.py`:
- Around line 68-70: The code opens a synchronous DB transaction with
self._db.engine.begin() and then awaits semsearch helpers (sem_insert_unit /
sem_update_unit), which holds the sync connection across await; instead, perform
all DB work inside the with self._db.engine.begin() block, extract the minimal
data needed for semsearch (e.g., unit id or payload), exit/commit to close the
connection, and only then call await sem_insert_unit(...) or await
sem_update_unit(...). Apply the same change to both insertion (where
_SEMSEARCH_ENABLED + sem_insert_unit is used) and update paths (lines referenced
by sem_update_unit), or alternatively dispatch the semsearch call to a
background task (asyncio.create_task) after closing the sync connection so no
await runs while the sync transaction is open.

In `@server/backend/src/cq_server/semsearch/__init__.py`:
- Around line 94-101: The load function leaves extension loading enabled if
sqlite_vec.load(conn) or ensure_schema(conn) raises; change load to enable
extension loading, then call sqlite_vec.load(conn) and ensure_schema(conn)
inside a try block and always call conn.enable_load_extension(False) in a
finally block so the connection is never left with extensions enabled; refer to
the load function and the calls to conn.enable_load_extension, sqlite_vec.load,
and ensure_schema to locate and implement the try/finally.
- Line 17: The module-level line setting _DIM from
os.environ["SEMSEARCH_EMBEDDING_DIM"] must be guarded so a non-numeric env var
doesn't raise at import; update the code around _DIM to parse the env var inside
a try/except (catch ValueError and TypeError), fall back to the default 768 when
parsing fails or the var is missing, and optionally emit a warning (use the
module logger or processLogger) so startup continues and _ENABLED can still be
checked; refer to the symbol _DIM and the env name SEMSEARCH_EMBEDDING_DIM when
applying the change.

In `@server/backend/src/cq_server/semsearch/queries.py`:
- Around line 38-40: The embedding HTTP call using AsyncClient (created with
base_url _TOKEN_EMBEDDING_URL) currently awaits client.post("/predict",
json=request_data) with no timeout; add a sensible timeout (e.g., 5–10s) by
passing a timeout to AsyncClient(...) or to client.post(..., timeout=...) and
ensure you handle/propagate httpx timeout exceptions (httpx.TimeoutException) so
semsearch functions (where this code lives) don't hang and DB connections are
released.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 60213c82-e1b7-4106-bae8-3ec21c6fd236

📥 Commits

Reviewing files that changed from the base of the PR and between 69f94ff and 82ac6f3.

⛔ Files ignored due to path filters (2)

server/backend/.DS_Store is excluded by !**/.DS_Store
server/backend/uv.lock is excluded by !**/*.lock

📒 Files selected for processing (10)

Makefile
server/backend/pyproject.toml
server/backend/src/cq_server/core/db.py
server/backend/src/cq_server/repositories/knowledge.py
server/backend/src/cq_server/semsearch/__init__.py
server/backend/src/cq_server/semsearch/queries.py
server/backend/tests/conftest.py
server/backend/tests/db_helpers.py
server/backend/tests/test_app.py
server/backend/tests/test_semsearch.py

🛑 Comments failed to post (4)

Makefile (2)
27-32: 🧹 Nitpick | 🔵 Trivial | ⚖️ Poor tradeoff

Unrelated changes: Pi installation targets bundled in semantic search PR.

Lines 27-32, 176-191, 195, and 199 add Pi installation/uninstallation targets and help text. These changes are unrelated to the stated PR objective (semantic similarity queries via sqlite-vec). Bundling unrelated changes reduces reviewability and complicates rollback if issues arise.

Consider moving these changes to a separate PR focused on Pi host support.

Also applies to: 176-191, 195-195, 199-199
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@Makefile` around lines 27 - 32, The Makefile now includes unrelated Pi
install/uninstall targets and help text (targets: install-pi, uninstall-pi and
the "Pi:" help block) that don't belong in the semantic-search PR; remove the
added Pi help block and the install-pi/uninstall-pi targets from this diff so
the PR only contains sqlite-vec/semantic similarity changes, and open a separate
PR/branch that adds the Pi host support (including the help text and targets) if
needed, ensuring the functions/names install-pi and uninstall-pi are not present
in this branch.
271-271: 🧹 Nitpick | 🔵 Trivial | ⚖️ Poor tradeoff

Unrelated changes: Linting script refactor bundled in semantic search PR.

Lines 271, 275, 283, 294, and 298 refactor lint targets to use a centralised bash scripts/lint-python-component.sh wrapper instead of direct commands. This refactoring is unrelated to the stated PR objective (semantic similarity queries).

Consider moving these changes to a separate PR focused on lint infrastructure improvements.

Also applies to: 275-275, 283-283, 294-294, 298-298
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@Makefile` at line 271, The Makefile contains unrelated refactor changes
replacing direct lint commands with the centralized wrapper (e.g., the command
"bash scripts/lint-python-component.sh scripts/install"); revert those
substitutions in this PR so the lint targets are restored to their original
direct commands, and open a separate PR that introduces the centralized wrapper
and updates the Makefile to call "bash scripts/lint-python-component.sh" for the
relevant targets.
server/backend/src/cq_server/repositories/knowledge.py (1)
13-15: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

The semsearch index is not kept in sync with the source-of-truth row.

This wiring only ever calls insert_unit, and it does so after _insert_sync has already committed. If embedding generation fails, the canonical unit survives without a knowledge_units_vec row, and later update() calls never repair or refresh that embedding.

Also applies to: 67-70

🧰 Tools

🪛 GitHub Actions: Server CI / 1_Test (3.13).txt

[error] Test command failed: 'uv run pytest' exited with code 2 due to 1 failing test (297 collected; 1 failed).

🪛 GitHub Actions: Server CI / Test (3.11)

[error] Command failed: 'uv run pytest' (exit code 1). 1 test failed: tests/test_app.py::TestQuery::test_query_finds_expected_unit_across_distinct_domains (KeyError: 0).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@server/backend/src/cq_server/repositories/knowledge.py` around lines 13 - 15,
The code inserts canonical knowledge rows but calls sem_insert_unit only after
_insert_sync commits, so failed embedding generation leaves a canonical unit
without a knowledge_units_vec entry and subsequent update() calls never repair
it; fix by making embedding insertion part of the same logical operation: call
sem_insert_unit inside _insert_sync (or before committing) and if
sem_insert_unit fails then rollback the canonical insert (or surface the error)
so they stay consistent, and also update the update() path to always
attempt/ensure sem_insert_unit for the unit (i.e., have update() call
sem_insert_unit when a knowledge_units_vec is missing) so missing embeddings are
repaired; refer to functions _insert_sync, sem_insert_unit, update(), and the
knowledge_units_vec concept when applying the changes.
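
One way to keep the two rows consistent, sketched under assumptions: the embedding is computed before the transaction opens (so no connection is held during network I/O), and the canonical row and the vector row are then written together. Ordinary `sqlite3` tables stand in for the real schema and for the sqlite-vec virtual table; `create_schema` and `upsert_unit` are hypothetical names, not functions from the PR.

```python
import json
import sqlite3


def create_schema(conn: sqlite3.Connection) -> None:
    """Create simplified stand-ins for the canonical and vector tables."""
    with conn:
        conn.execute(
            "CREATE TABLE knowledge_units (id TEXT PRIMARY KEY, body TEXT NOT NULL)"
        )
        conn.execute(
            "CREATE TABLE knowledge_units_vec "
            "(unit_id TEXT PRIMARY KEY, embedding TEXT NOT NULL)"
        )


def upsert_unit(
    conn: sqlite3.Connection, unit_id: str, body: str, embedding: list[float]
) -> None:
    """Write the canonical row and its embedding row in a single transaction.

    If either statement fails, the `with conn` block rolls both back, so a
    unit never exists without its embedding. Because both statements are
    upserts, calling this from the update path also repairs a missing
    embedding row.
    """
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO knowledge_units (id, body) VALUES (?, ?)",
            (unit_id, body),
        )
        conn.execute(
            "INSERT OR REPLACE INTO knowledge_units_vec (unit_id, embedding) "
            "VALUES (?, ?)",
            (unit_id, json.dumps(embedding)),
        )
```

The trade-off is that an unavailable embedding server now fails the whole insert instead of leaving a unit that semantic search cannot see; whether that is acceptable is a product decision the review does not settle.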
server/backend/src/cq_server/semsearch/queries.py (1)
183-192: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Distance is being scored in the wrong direction.

combine() adds a positive term for larger distances and the results are then sorted descending, so semantically farther units are promoted. The helpers in this module are already shaped around the opposite contract: low distance should boost, high distance should diminish.

🧰 Tools

🪛 GitHub Actions: Server CI / 1_Test (3.13).txt

[error] Test command failed: 'uv run pytest' exited with code 2 due to 1 failing test (297 collected; 1 failed).

🪛 GitHub Actions: Server CI / Test (3.11)

[error] Command failed: 'uv run pytest' (exit code 1). 1 test failed: tests/test_app.py::TestQuery::test_query_finds_expected_unit_across_distinct_domains (KeyError: 0).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@server/backend/src/cq_server/semsearch/queries.py` around lines 183 - 192,
The code currently passes raw distance into combine(), which gives larger
distances a positive boost because results are later sorted descending; negate
or invert the distance before combining so larger distances reduce the score.
Specifically, change the combine(...) call that currently passes distance to
instead pass a negated or inverted value (e.g. -distance or an appropriate
normalized inverse like 1 - distance depending on distance scale) so that
calculate_relevance(u, ...) remains positive for matches and the distance term
diminishes the combined score rather than promoting distant units.
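
To make the intended direction concrete, here is a small sketch of a distance term that boosts near units and diminishes far ones. It is not the PR's `combine()` or `calculate_relevance()`: `distance_factor`, `combined_score`, `rank`, and the `weight` parameter are hypothetical, and the formula assumes cosine distance in the range [0, 2].

```python
def distance_factor(distance: float, weight: float = 0.5) -> float:
    """Map cosine distance in [0, 2] to a multiplier around 1.0.

    Similarity is 1 - distance, so it lies in [-1, 1]. A distance of 0
    yields 1 + weight (a boost), a distance of 1 yields exactly 1.0
    (neutral), and a distance of 2 yields 1 - weight (a penalty).
    """
    similarity = 1.0 - distance
    return 1.0 + weight * similarity


def combined_score(relevance: float, confidence: float, distance: float) -> float:
    """Multiply the base score by a factor that shrinks as distance grows."""
    return relevance * confidence * distance_factor(distance)


def rank(candidates: list[tuple[str, float, float, float]]) -> list[str]:
    """Sort (id, relevance, confidence, distance) tuples, best score first."""
    ordered = sorted(
        candidates,
        key=lambda c: combined_score(c[1], c[2], c[3]),
        reverse=True,
    )
    return [unit_id for unit_id, *_ in ordered]
```

Because the factor decreases monotonically with distance, a descending sort places the semantically closer unit first whenever relevance and confidence are equal, which is the property the finding says is currently inverted.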

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (2)

server/backend/tests/test_app.py (2)
320-320: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Assertion does not verify semantic ranking.

Line 320 checks result_confidences[ku1["id"]] >= result_confidences[ku2["id"]], but both units are inserted with the default confidence of 0.5 (from _propose_payload and _approve_unit). The semsearch scoring multiplies relevance by confidence but does not modify the returned unit's evidence.confidence field. Therefore, this assertion is 0.5 >= 0.5, which is trivially true and does not verify that ku1 ranks above ku2 based on semantic similarity. To properly validate semantic ranking, assert that ku1 is the top-ranked result.
🐛 Proposed fix — verify ku1 is the top-ranked result
         assert ku1["id"] in result_confidences and ku2["id"] in result_confidences
         assert result["domains"] == ["astronomy"]
-        assert result_confidences[ku1["id"]] >= result_confidences[ku2["id"]]
+        assert result["id"] == ku1["id"], "Expected ku1 (exoplanet transit) to rank higher than ku2 (HII enrichment) via semantic similarity"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@server/backend/tests/test_app.py` at line 320, The current assertion compares
result_confidences[ku1["id"]] >= result_confidences[ku2["id"]], which is
trivially true because both confidences default to 0.5; instead change the test
to assert semantic ranking by verifying that ku1 is the top-ranked result (e.g.,
assert that the first search result's id equals ku1["id"]). Update the assertion
in the test (around the existing result_confidences usage and after the search
call) to check the ordering of search results directly rather than comparing
evidence.confidence values produced by _propose_payload/_approve_unit and used
by semsearch. Ensure you reference ku1["id"] as the expected top result.
316-318: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Assertion threshold inconsistent with subsequent checks.

Line 316 asserts len(results['data']) >= 1, but line 318 immediately requires both ku1 and ku2 to be present in result_confidences, implying at least 2 results are expected. This inconsistency could mask a failure where only 1 result is returned.
🧹 Proposed fix
-        assert len(results['data']) >= 1
+        assert len(results['data']) >= 2
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@server/backend/tests/test_app.py` around lines 316 - 318, The assertion
expecting at least 1 result is inconsistent with the subsequent check that both
ku1 and ku2 must appear; update the test to require at least 2 results (e.g.,
change the length assertion to assert len(results['data']) >= 2) so it aligns
with the later membership checks for ku1["id"] and ku2["id"] in
result_confidences, ensuring the test fails when too few results are returned.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@server/backend/tests/test_app.py`:
- Around line 268-272: The test is incorrectly asserting len(results) == 1 on
the full JSON response object; update the assertion to check the number of items
in the data list instead. Replace the check with an assertion on results['data']
(e.g., assert len(results['data']) == 1 or assert len(results.get('data', []))
== 1) so that the test validates the actual result count before using result =
results['data'][0]; keep the subsequent assertions on result["domains"] and
result["insight"]["summary"] unchanged.

---

Duplicate comments:
In `@server/backend/tests/test_app.py`:
- Line 320: The current assertion compares result_confidences[ku1["id"]] >=
result_confidences[ku2["id"]], which is trivially true because both confidences
default to 0.5; instead change the test to assert semantic ranking by verifying
that ku1 is the top-ranked result (e.g., assert that the first search result's
id equals ku1["id"]). Update the assertion in the test (around the existing
result_confidences usage and after the search call) to check the ordering of
search results directly rather than comparing evidence.confidence values
produced by _propose_payload/_approve_unit and used by semsearch. Ensure you
reference ku1["id"] as the expected top result.
- Around line 316-318: The assertion expecting at least 1 result is inconsistent
with the subsequent check that both ku1 and ku2 must appear; update the test to
require at least 2 results (e.g., change the length assertion to assert
len(results['data']) >= 2) so it aligns with the later membership checks for
ku1["id"] and ku2["id"] in result_confidences, ensuring the test fails when too
few results are returned.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: c57f750a-53c2-41b2-ba15-838985affffc

📥 Commits

Reviewing files that changed from the base of the PR and between 82ac6f3 and 399d7b1.

📒 Files selected for processing (2)

server/backend/src/cq_server/semsearch/queries.py
server/backend/tests/test_app.py

coderabbitai · 2026-06-11T07:46:28Z

Note

Docstrings generation - SUCCESS
Generated docstrings and committed to branch 21-semantic-similarity-queries (commit: 7dc8563aa48c0f06c5b5cb674e4b5d800a2a96ca)

coderabbitai

Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (6)

server/backend/pyproject.toml (1)
43-49: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Remove the duplicated jsonschema test dependency entry.

tests declares jsonschema>=4.23.0 twice (Line 44 and Line 48). Keep one entry to avoid redundant dependency resolution noise and lock churn.
Proposed fix
 tests = [
     "jsonschema>=4.23.0",
     "pytest>=9.0.3",
     "pytest-asyncio>=1.3.0",
     "httpx",
-    "jsonschema>=4.23.0",
 ]
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@server/backend/pyproject.toml` around lines 43 - 49, In the tests dependency
list (the "tests" table entry) remove the duplicate "jsonschema>=4.23.0" so the
dependency appears only once; locate the tests = [...] block and delete the
redundant "jsonschema>=4.23.0" string, leaving a single instance to avoid
duplicate dependency entries and unnecessary lock churn.
server/backend/src/cq_server/semsearch/__init__.py (1)
29-29: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Avoid logging potentially sensitive URL.

If TOKEN_EMBEDDING_URL contains authentication credentials (e.g., http://user:pass@host/), they will be logged at INFO level. Consider sanitising the URL before logging or logging only the host portion.
🛡️ Proposed fix
-        logger.info(f"Token embedding enabled using encoderfile endpoint at {_TOKEN_EMBEDDING_URL}")
+        from urllib.parse import urlparse
+        parsed = urlparse(_TOKEN_EMBEDDING_URL)
+        safe_url = f"{parsed.scheme}://{parsed.hostname}" + (f":{parsed.port}" if parsed.port else "")
+        logger.info(f"Token embedding enabled using encoderfile endpoint at {safe_url}")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@server/backend/src/cq_server/semsearch/__init__.py` at line 29, The log
statement prints the full _TOKEN_EMBEDDING_URL which may contain credentials;
update the logging in __init__.py to sanitize the URL before logging by parsing
_TOKEN_EMBEDDING_URL (use urllib.parse.urlparse) and log only the safe parts
(hostname and port or reconstructed URL without username/password) or a redacted
form, replacing credentials with placeholders; change the logger.info call that
references _TOKEN_EMBEDDING_URL to use the sanitized/redacted value instead.
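
A standalone version of the redaction, written as a sketch: `redact_url` is a hypothetical helper name, and unlike the proposed diff it falls back to a placeholder when the URL has no parseable host. It uses only `urllib.parse.urlsplit` from the standard library.

```python
from urllib.parse import urlsplit


def redact_url(url: str) -> str:
    """Return scheme://host[:port] only, dropping userinfo, path, and query."""
    parts = urlsplit(url)
    scheme = parts.scheme or "<unknown-scheme>"
    # hostname excludes any "user:pass@" prefix and is None for malformed URLs.
    host = parts.hostname or "<unknown-host>"
    if parts.port is not None:
        host = f"{host}:{parts.port}"
    return f"{scheme}://{host}"
```

The log call would then pass `redact_url(_TOKEN_EMBEDDING_URL)` instead of the raw value, so credentials and query parameters never reach the INFO log.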
server/backend/src/cq_server/core/db.py (1)
18-19: 🧹 Nitpick | 🔵 Trivial | 💤 Low value

Both files import private _ENABLED instead of using the public is_enabled() API.

The semsearch module exposes is_enabled() as its public interface for checking enablement status. Both db.py and knowledge.py import the private _ENABLED variable directly, creating unnecessary coupling to the module's internal implementation.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@server/backend/src/cq_server/core/db.py` around lines 18 - 19, Replace direct
imports of the private _ENABLED symbol with the semsearch public API
is_enabled() — e.g., remove "from ..semsearch import _ENABLED" and instead
import and call "is_enabled" where enable checks occur (use "from ..semsearch
import is_enabled as semsearch_is_enabled" or call semsearch.is_enabled()).
Update both db.py (references to _ENABLED and any conditional logic around
semsearch_load) and knowledge.py to call semsearch_is_enabled() (or
semsearch.is_enabled()) before using semsearch_load/other features so you rely
on the public API rather than the private _ENABLED variable.
server/backend/src/cq_server/semsearch/queries.py (2)
108-108: 🧹 Nitpick | 🔵 Trivial | 💤 Low value

Remove stale TODO comment.

The comment # check type of sqlalchemy conns appears to be a leftover reminder that was previously flagged. The type annotation has been added to conn: Connection, so this comment should be removed.
🧹 Proposed fix
-# check type of sqlalchemy conns
 async def query(
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@server/backend/src/cq_server/semsearch/queries.py` at line 108, Remove the
stale TODO comment "# check type of sqlalchemy conns" from the code near the
Connection-typed variable (conn: Connection) in queries.py; since the type
annotation is already present on conn, simply delete that comment line to avoid
misleading TODOs and leave the surrounding code and the conn: Connection
annotation unchanged.
53-56: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Empty embeddings list causes np.average to produce unexpected results.

If results[0]["embeddings"] is an empty list (service returns a result but with no embeddings), np.average on an empty array will issue a warning and return nan, which then propagates to the serialisation and SQL insertion. Consider validating that embeddings are present.
🛡️ Proposed fix
         results = response.json().get("results")
         if not results:
             raise RuntimeError(f"Embedding API returned no embeddings for input: {request_data}")
+        embeddings_list = results[0].get("embeddings", [])
+        if not embeddings_list:
+            raise RuntimeError(f"Embedding API returned empty embeddings for input: {request_data}")
-        return np.average(np.array([embedding.get("embedding") for embedding in results[0]["embeddings"]]), axis=0)
+        return np.average(np.array([embedding.get("embedding") for embedding in embeddings_list]), axis=0)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@server/backend/src/cq_server/semsearch/queries.py` around lines 53 - 56, The
code currently computes np.average over results[0]["embeddings"] which can be an
empty list and yield NaN; update the logic in the function that handles the API
response (look for variables/results usage: response.json(), results,
request_data, and the return line using np.average) to explicitly validate that
results is non-empty and that results[0]["embeddings"] is a non-empty list
before calling np.average; if embeddings is empty, raise a RuntimeError
(including request_data and the raw response/results for debugging) so
downstream serialization/SQL insertion never receives NaN.
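
The validation can be sketched without NumPy so that it runs anywhere. `pooled_embedding` is a hypothetical name; the payload shape (`results[0]["embeddings"][i]["embedding"]`) is taken from the snippet quoted in this finding, and a plain-Python column mean stands in for `np.average(..., axis=0)`.

```python
def pooled_embedding(payload: dict) -> list[float]:
    """Mean-pool per-token embeddings, rejecting empty or ragged input."""
    results = payload.get("results")
    if not results:
        raise RuntimeError("Embedding API returned no results")

    tokens = results[0].get("embeddings") or []
    vectors = [token.get("embedding") for token in tokens]
    # Reject a missing list, and any token whose vector is absent or empty.
    if not vectors or any(not vector for vector in vectors):
        raise RuntimeError("Embedding API returned empty embeddings")

    dim = len(vectors[0])
    if any(len(vector) != dim for vector in vectors):
        raise RuntimeError("Embedding API returned vectors of differing length")

    # Column-wise mean: zip(*vectors) yields one tuple per dimension.
    return [sum(column) / len(vectors) for column in zip(*vectors)]
```

Raising before the mean is taken means a malformed response fails at the API boundary rather than surfacing later as a NaN in the serialised vector.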
server/backend/tests/test_semsearch.py (1)
218-221: ⚠️ Potential issue | 🟡 Minor

Narrow the skip exception scope for _get_embeddings to embedding/service failures only.

Catching Exception here can mask genuine semsearch regressions (e.g., unexpected response/JSON/embedding structure issues). _get_embeddings raises RuntimeError for “not enabled”/“no embeddings”, and otherwise relies on httpx to raise its own request/HTTP status errors—so skip only on those instead of all exceptions.
import httpx

try:
    await semsearch_queries._get_embeddings(["connectivity check"])
except (RuntimeError, httpx.RequestError, httpx.HTTPStatusError) as exc:
    pytest.skip(f"embedding server unavailable: {exc}")
Also consider aligning server/backend/tests/test_app.py—it currently catches built-in ConnectionError/TimeoutError, which won’t match httpx’s exception hierarchy.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@server/backend/tests/test_semsearch.py` around lines 218 - 221, Narrow the
broad except in the test that calls semsearch_queries._get_embeddings so it only
skips when the embedding/service is unavailable: catch RuntimeError and httpx
request/HTTP status errors instead of Exception (e.g., catch RuntimeError,
httpx.RequestError, httpx.HTTPStatusError). Update the related test_app.py
handling (which currently catches ConnectionError/TimeoutError) to use httpx
exception types so both tests correctly distinguish service availability issues
from real semsearch regressions.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@server/backend/tests/conftest.py`:
- Around line 77-85: Fix the mixed tab/space indentation in the fixture
docstrings in conftest.py by converting all leading tabs to spaces (or vice
versa to match project style) so indentation is consistent; specifically update
the docstrings for the fixture that provides the UserRepository (the function
accepting repos (_RepoBundle) and returning UserRepository) and the other nearby
test fixtures to use only spaces for indentation so Ruff E101 is resolved.

In `@server/backend/tests/db_helpers.py`:
- Line 157: Several parameter docstring lines (e.g., the "unit (KnowledgeUnit):
The knowledge unit with updated fields; must correspond to an existing record in
the database." entry and other long lines at the indicated ranges) exceed the
120-char limit and must be wrapped; edit the docstrings in tests/db_helpers.py
(the docstring containing the "unit (KnowledgeUnit)" parameter and the other
long docstrings at the noted ranges) to reflow text to lines under 120
characters, preserving the existing triple-quote block and parameter
formatting/indentation, breaking long sentences into wrapped lines or separate
sentences as needed, and then run the linter (ruff) to confirm E501 is resolved.
- Around line 194-203: The docstring for get_review_status contains mixed tabs
and spaces causing Ruff E101; fix by converting all leading tab characters to
spaces (use 4-space indentation per PEP8) inside the triple-quoted string for
get_review_status and any other nearby docstrings flagged (the docstring blocks
around the later helper functions referenced in the comment), ensuring
consistent spacing throughout so no tabs remain.

In `@server/backend/tests/test_semsearch.py`:
- Line 35: The docstring line describing KnowledgeUnit is too long (Ruff E501);
split the sentence into multiple shorter lines so each stays within the
project's line-length limit. Edit the docstring that contains "KnowledgeUnit: A
newly created KnowledgeUnit configured for tests (private tier, empty Context,
created_by=\"tester\")." and break it into two or more logical segments (for
example after "private tier," or "empty Context,") while preserving punctuation
and meaning so the linter no longer flags E501.

In `@server/backend/tests/test_sqlite_store.py`:
- Around line 21-25: The docstring for the helper that "Builds a KnowledgeUnit
test fixture for the given domain" in server/backend/tests/test_sqlite_store.py
is too long and triggers Ruff E501; reflow it into multiple shorter lines (wrap
sentences so each line is under the configured line-length limit) and preserve
the content (mention KnowledgeUnit, Insight payload summary="s", detail="d",
action="a", empty Context, tier Tier.PRIVATE, created_by "alice") so the helper
docstring stays descriptive but passes linting.

In `@server/backend/tests/test_store.py`:
- Around line 29-34: Wrap the long docstring line in the "Create an Insight with
sensible default values" docstring (the paragraph that describes Parameters and
the overrides (Any) argument) so no single line exceeds the line-length limit;
break the sentence into multiple lines or sentences (e.g., after "defaults." and
before "Parameters:" or within the "overrides (Any):" description) to satisfy
Ruff E501 while keeping the same wording and meaning.

---

Outside diff comments:
In `@server/backend/pyproject.toml`:
- Around line 43-49: In the tests dependency list (the "tests" table entry)
remove the duplicate "jsonschema>=4.23.0" so the dependency appears only once;
locate the tests = [...] block and delete the redundant "jsonschema>=4.23.0"
string, leaving a single instance to avoid duplicate dependency entries and
unnecessary lock churn.

In `@server/backend/src/cq_server/core/db.py`:
- Around line 18-19: Replace direct imports of the private _ENABLED symbol with
the semsearch public API is_enabled() — e.g., remove "from ..semsearch import
_ENABLED" and instead import and call "is_enabled" where enable checks occur
(use "from ..semsearch import is_enabled as semsearch_is_enabled" or call
semsearch.is_enabled()). Update both db.py (references to _ENABLED and any
conditional logic around semsearch_load) and knowledge.py to call
semsearch_is_enabled() (or semsearch.is_enabled()) before using
semsearch_load/other features so you rely on the public API rather than the
private _ENABLED variable.

In `@server/backend/src/cq_server/semsearch/__init__.py`:
- Line 29: The log statement prints the full _TOKEN_EMBEDDING_URL which may
contain credentials; update the logging in __init__.py to sanitize the URL
before logging by parsing _TOKEN_EMBEDDING_URL (use urllib.parse.urlparse) and
log only the safe parts (hostname and port or reconstructed URL without
username/password) or a redacted form, replacing credentials with placeholders;
change the logger.info call that references _TOKEN_EMBEDDING_URL to use the
sanitized/redacted value instead.
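
A minimal sketch of the sanitising step described above, using only `urllib.parse` from the standard library. The helper name `redact_url` is hypothetical and does not appear in the PR. Dropping the query string and fragment as well as the userinfo is an extra precaution assumed here, since tokens are sometimes passed that way.

```python
from urllib.parse import urlsplit, urlunsplit


def redact_url(url: str) -> str:
    """Return `url` without userinfo, query string, or fragment.

    Hypothetical helper for illustration; not taken from the PR.
    """
    parts = urlsplit(url)
    host = parts.hostname
    if host is None:
        # Nothing safe to show, so avoid echoing the raw value.
        return "<redacted>"
    if ":" in host:
        # IPv6 literals need their brackets restored in the netloc.
        host = f"[{host}]"
    netloc = host if parts.port is None else f"{host}:{parts.port}"
    return urlunsplit((parts.scheme, netloc, parts.path, "", ""))


print(redact_url("http://user:secret@embed.local:8080/v1?key=abc"))
# prints: http://embed.local:8080/v1
```

The logger call would then receive `redact_url(_TOKEN_EMBEDDING_URL)` rather than the raw value.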

In `@server/backend/src/cq_server/semsearch/queries.py`:
- Line 108: Remove the stale TODO comment "# check type of sqlalchemy conns"
from the code near the Connection-typed variable (conn: Connection) in
queries.py; since the type annotation is already present on conn, simply delete
that comment line to avoid misleading TODOs and leave the surrounding code and
the conn: Connection annotation unchanged.
- Around line 53-56: The code currently computes np.average over
results[0]["embeddings"] which can be an empty list and yield NaN; update the
logic in the function that handles the API response (look for variables/results
usage: response.json(), results, request_data, and the return line using
np.average) to explicitly validate that results is non-empty and that
results[0]["embeddings"] is a non-empty list before calling np.average; if
embeddings is empty, raise a RuntimeError (including request_data and the raw
response/results for debugging) so downstream serialization/SQL insertion never
receives NaN.

In `@server/backend/tests/test_semsearch.py`:
- Around line 218-221: Narrow the broad except in the test that calls
semsearch_queries._get_embeddings so it only skips when the embedding/service is
unavailable: catch RuntimeError and httpx request/HTTP status errors instead of
Exception (e.g., catch RuntimeError, httpx.RequestError, httpx.HTTPStatusError).
Update the related test_app.py handling (which currently catches
ConnectionError/TimeoutError) to use httpx exception types so both tests
correctly distinguish service availability issues from real semsearch
regressions.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: cdbfde9b-c377-4bcb-99e4-b5e6ff37c277

📥 Commits

Reviewing files that changed from the base of the PR and between 399d7b1 and 20e3360.

⛔ Files ignored due to path filters (1)

server/backend/uv.lock is excluded by !**/*.lock

📒 Files selected for processing (11)

server/backend/pyproject.toml
server/backend/src/cq_server/core/db.py
server/backend/src/cq_server/repositories/knowledge.py
server/backend/src/cq_server/semsearch/__init__.py
server/backend/src/cq_server/semsearch/queries.py
server/backend/tests/conftest.py
server/backend/tests/db_helpers.py
server/backend/tests/test_app.py
server/backend/tests/test_semsearch.py
server/backend/tests/test_sqlite_store.py
server/backend/tests/test_store.py

coderabbitai

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (5)

server/backend/tests/test_semsearch.py (1)

129-146: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Install semsearch monkeypatches before the first store.insert() in mocked-path tests.

These tests currently seed data before patching _SEMSEARCH_ENABLED/sem_insert_unit/sem_query, so in semsearch-enabled environments they can execute real embedding calls and become flaky.

💡 Suggested fix pattern

     async def test_query_calls_sem_query_when_semsearch_enabled(...):
-        unit = _make_unit("astronomy", summary="Exoplanet transit photometry")
-        store = _make_store(db_path)
-        await store.insert(unit)
-        await store.set_review_status(unit.id, "approved", "reviewer")
-
         import cq_server.repositories.knowledge as knowledge_mod

         monkeypatch.setattr(knowledge_mod, "_SEMSEARCH_ENABLED", True)
         mock_sem_query = AsyncMock(return_value=[unit])
         monkeypatch.setattr(knowledge_mod, "sem_query", mock_sem_query)
         monkeypatch.setattr(knowledge_mod, "sem_insert_unit", AsyncMock())
+
+        unit = _make_unit("astronomy", summary="Exoplanet transit photometry")
+        store = _make_store(db_path)
+        await store.insert(unit)
+        await store.set_review_status(unit.id, "approved", "reviewer")

Apply the same ordering in the other mocked-path tests in this class.

Also applies to: 153-172, 178-193

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@server/backend/tests/test_semsearch.py` around lines 129 - 146, The test
seeds data with store.insert before enabling the semsearch monkeypatches, which
can cause real embedding calls; move the monkeypatch.setattr calls that set
knowledge_mod._SEMSEARCH_ENABLED, knowledge_mod.sem_query, and
knowledge_mod.sem_insert_unit so they occur before the first call to
store.insert (and before any other operations that would trigger semsearch),
e.g. set the monkeypatches at the top of
test_query_calls_sem_query_when_semsearch_enabled (and apply the same reordering
to the other mocked-path tests in this file that seed data: the tests around
lines 153-172 and 178-193) so the mocked sem_query/sem_insert_unit are in effect
during store.insert and subsequent operations.

server/backend/src/cq_server/semsearch/queries.py (3)

73-90: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Wire upsert_unit() into the repository update path as well.

This helper is added here, but the changed repository integration only imports insert_unit and combined_query. Edited summaries/details/actions will therefore keep their old embedding indefinitely.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@server/backend/src/cq_server/semsearch/queries.py` around lines 73 - 90, The
repository update flow currently imports insert_unit and combined_query but
never calls upsert_unit, so edited summaries/details/actions never refresh their
embeddings; update the repository code that performs unit inserts/updates to
also import and await upsert_unit(unit) after the DB write (the same places you
call insert_unit or perform updates to a KnowledgeUnit), ensuring you call the
async upsert_unit function for edits (not just creates), and wrap the call in a
short try/except to log failures without breaking the main transaction.

245-262: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

The distance term is rewarding less similar vectors.

vec_distance_cosine is lower for better matches, but combine() adds the normalised distance and then sorts descending. That boosts the farthest units instead of the closest ones.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@server/backend/src/cq_server/semsearch/queries.py` around lines 245 - 262,
The combine() function currently adds normalized distance so lower (better)
vec_distance_cosine values reduce the combined score, which incorrectly rewards
farther vectors; change the distance contribution to reward smaller distances by
using an inverted term (e.g., 1 - (distance / total_distance) when
total_distance > 0) or subtracting the normalized distance from relevance
instead of adding it; update combine to compute normalized_distance = distance /
total_distance if total_distance > 0 else 0 and then use distance_score = 1 -
normalized_distance (or equivalently flip the sign) when weighting with
distance_weight so that closer vectors (smaller vec_distance_cosine) increase
the final score.
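
To make the sign issue concrete, here is a small sketch of the inverted distance term. The real `combine()` signature is not shown in this review, so the parameters below (including the `distance_weight` default) are assumptions chosen for illustration.

```python
def combine(
    relevance: float,
    distance: float,
    total_distance: float,
    distance_weight: float = 0.5,  # assumed default, not from the PR
) -> float:
    """Score a unit so that a smaller cosine distance yields a higher score."""
    normalized_distance = distance / total_distance if total_distance > 0 else 0.0
    distance_score = 1.0 - normalized_distance  # invert: closer is better
    return relevance + distance_weight * distance_score


# Two candidates with equal relevance but different cosine distances.
candidates = {"near": 0.1, "far": 0.9}
total = sum(candidates.values())
ranked = sorted(
    candidates,
    key=lambda name: combine(0.5, candidates[name], total),
    reverse=True,
)
print(ranked)  # prints: ['near', 'far']
```

With the un-inverted term, the same descending sort would place `far` first, which is the behaviour the comment flags.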

55-62: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Average embeddings across every requested input.

The request sends the whole wordlist, but the response handling only reads results[0]["embeddings"]. Any multi-domain query therefore embeds just the first input and silently drops the rest.

Suggested fix

         results = response.json().get("results")
         if not results:
             raise RuntimeError(f"Embedding API returned no embeddings for input: {request_data}")
-        return np.average(np.array([embedding.get("embedding") for embedding in results[0]["embeddings"]]), axis=0)
+        embeddings = [
+            embedding["embedding"]
+            for result in results
+            for embedding in result.get("embeddings", [])
+            if embedding.get("embedding") is not None
+        ]
+        if not embeddings:
+            raise RuntimeError(f"Embedding API returned no embeddings for input: {request_data}")
+        return np.average(np.array(embeddings), axis=0)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@server/backend/src/cq_server/semsearch/queries.py` around lines 55 - 62, The
code posts the full wordlist but only reads results[0]["embeddings"], dropping
embeddings for other inputs; update the response handling to collect embeddings
from every result entry (e.g., iterate over results and within each result over
result["embeddings"], extracting embedding.get("embedding")), validate the
combined list is non-empty, and then compute the mean across all collected
embedding vectors with np.mean(axis=0) before returning; refer to AsyncClient,
_TOKEN_EMBEDDING_URL, request_data, response, and results to locate where to
change the comprehension that currently uses results[0]["embeddings"].

server/backend/src/cq_server/repositories/knowledge.py (1)

76-83: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Avoid returning a failure after the main insert has already committed.

_insert_sync() commits the canonical KnowledgeUnit before the semsearch write runs. If the embedding fetch or vector insert fails afterwards, the caller sees an exception even though the unit already exists, so retries can trip duplicate-key errors and the vector index is left behind.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@server/backend/src/cq_server/repositories/knowledge.py` around lines 76 - 83,
The code currently commits the canonical KnowledgeUnit in _insert_sync via await
self._db.run_sync(self._insert_sync, unit) and then runs sem_insert_unit in a
separate transaction, which can leave the DB in an inconsistent state if
semsearch fails; change the flow so sem_insert_unit runs inside the same
transaction as the main insert or at least does not raise after the main commit:
move or call sem_insert_unit from within _insert_sync (or obtain the same
transaction/connection and call sem_insert_unit before committing) so both
operations share the same transactional scope (use the same
connection/transaction object instead of self._db.engine.begin()), or if
co-transactional execution is impossible, catch/log semsearch errors and
suppress exceptions so the caller is not signaled a failure after the unit has
been committed; reference _insert_sync, sem_insert_unit, _db.run_sync and
_SEMSEARCH_ENABLED when implementing the change.
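
The fallback option mentioned above (log and suppress semsearch errors once the main write has committed) might look roughly like this. Both callables are hypothetical stand-ins for `_insert_sync` and `sem_insert_unit`; the real repository code is not shown in this review.

```python
import asyncio
import logging

logger = logging.getLogger("semsearch.sketch")


async def insert_with_best_effort_index(primary_write, index_write, unit_id: str) -> bool:
    """Run the primary write, then attempt the index write without re-raising.

    Returns True when the index write succeeded and False when it failed
    and was logged. Both callables are hypothetical stand-ins.
    """
    await primary_write(unit_id)  # failures here still propagate to the caller
    try:
        await index_write(unit_id)
    except Exception:
        # The unit is already committed, so report the problem and carry on.
        logger.exception("semsearch index write failed for %s", unit_id)
        return False
    return True
```

The trade-off is that a unit can exist without an embedding row, so some later repair step (for example the upsert path) would be needed to backfill it.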

♻️ Duplicate comments (8)

server/backend/tests/test_app.py (2)
268-271: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Assert result count from data, not envelope keys.

len(results) == 1 validates the response object shape, not the number of returned units.
💡 Suggested fix
-        results = resp.json()
-        assert len(results) == 1
-        result = results["data"][0]
+        results = resp.json()
+        assert len(results["data"]) == 1
+        result = results["data"][0]
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@server/backend/tests/test_app.py` around lines 268 - 271, The test currently
asserts len(results) == 1 which checks the top-level response envelope instead
of the number of returned items; update the assertion to check the length of the
"data" list from resp.json() (i.e., assert the count using results["data"])
before accessing results["data"][0], so the test validates the number of
returned units and then asserts result["domains"] == ["astronomy"].
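
A tiny illustration of why the original assertion can pass for the wrong reason. The envelope below is hypothetical and assumes a single top-level `data` key; the real response may carry additional keys.

```python
# Hypothetical response envelope; the real payload may carry more keys.
envelope = {"data": [{"id": "ku-1"}, {"id": "ku-2"}, {"id": "ku-3"}]}

envelope_keys = len(envelope)  # counts top-level keys
returned_units = len(envelope["data"])  # counts returned units

print(envelope_keys, returned_units)  # prints: 1 3
```

Here `len(envelope) == 1` holds even though three units came back, so only the length of `envelope["data"]` says anything about the result count.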
315-320: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Current check does not validate semantic ranking.

Comparing result["evidence"]["confidence"] between ku1 and ku2 is not a ranking assertion here; both seeded units can share the same default confidence, so this can pass without proving semantic ordering.
💡 Suggested fix
-        assert len(results["data"]) >= 1
+        assert len(results["data"]) >= 2
         result = results["data"][0]
         assert ku1["id"] in result_confidences and ku2["id"] in result_confidences
         assert result["domains"] == ["astronomy"]
-        assert result_confidences[ku1["id"]] >= result_confidences[ku2["id"]]
+        assert result["id"] == ku1["id"]
server/backend/src/cq_server/semsearch/__init__.py (3)
111-114: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Always re-disable extension loading.

If sqlite_vec.load(conn) fails here, the connection stays in enable_load_extension(True) state. Wrap the load/schema path in try/finally so later code cannot load arbitrary extensions on that pooled connection.
Suggested fix
     conn.enable_load_extension(True)
-    sqlite_vec.load(conn)
-    conn.enable_load_extension(False)
-    ensure_schema(conn)
+    try:
+        sqlite_vec.load(conn)
+        ensure_schema(conn)
+    finally:
+        conn.enable_load_extension(False)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@server/backend/src/cq_server/semsearch/__init__.py` around lines 111 - 114,
The connection is left with extension loading enabled if sqlite_vec.load(conn)
raises; wrap the load and ensure_schema calls in a try/finally so
conn.enable_load_extension(False) always runs. Specifically, around the calls to
sqlite_vec.load(conn) and ensure_schema(conn) ensure you call
conn.enable_load_extension(False) in a finally block while keeping
conn.enable_load_extension(True) before the try; this guarantees extension
loading is re-disabled on the connection even if sqlite_vec.load fails.
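
The guarantee that `try/finally` provides can be checked without sqlite-vec installed. In this sketch `FakeConnection` and `load_extension_guarded` are hypothetical names; the fake only records the flag that `enable_load_extension` would set on a real `sqlite3` connection.

```python
class FakeConnection:
    """Stand-in that records the extension-loading flag (hypothetical)."""

    def __init__(self) -> None:
        self.extension_loading = False

    def enable_load_extension(self, enabled: bool) -> None:
        self.extension_loading = enabled


def load_extension_guarded(conn, loader) -> None:
    """Enable extension loading only for the duration of `loader(conn)`."""
    conn.enable_load_extension(True)
    try:
        loader(conn)
    finally:
        # Runs on success and on failure alike.
        conn.enable_load_extension(False)


def failing_loader(conn) -> None:
    raise OSError("extension not found")


conn = FakeConnection()
try:
    load_extension_guarded(conn, failing_loader)
except OSError:
    pass

print(conn.extension_loading)  # prints: False
```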
16-17: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Guard SEMSEARCH_EMBEDDING_DIM parsing.

int() still runs at import time, so a non-numeric env value brings the whole server down before semsearch has a chance to stay disabled.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@server/backend/src/cq_server/semsearch/__init__.py` around lines 16 - 17, The
module currently calls int(os.environ.get("SEMSEARCH_EMBEDDING_DIM", 768)) at
import time (symbol _DIM), which will raise on non-numeric env values even when
semsearch is disabled; change it to safely parse the environment variable (e.g.,
read raw value then try/except ValueError or only parse when _ENABLED is True),
falling back to 768 on parse failure or missing value, and ensure the
fallback/default is used without throwing during import.
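
One possible shape for the guarded parse, falling back to 768 whenever the variable is missing or non-numeric. The function name `read_embedding_dim` and the injectable `env` mapping are assumptions made so the sketch can be exercised without touching the real environment.

```python
import os
from typing import Mapping, Optional

DEFAULT_DIM = 768


def read_embedding_dim(env: Optional[Mapping[str, str]] = None) -> int:
    """Parse SEMSEARCH_EMBEDDING_DIM, returning DEFAULT_DIM on any problem."""
    source = os.environ if env is None else env
    raw = source.get("SEMSEARCH_EMBEDDING_DIM")
    if raw is None:
        return DEFAULT_DIM
    try:
        return int(raw)
    except ValueError:
        # A non-numeric value must not take the server down at import time.
        return DEFAULT_DIM


print(read_embedding_dim({"SEMSEARCH_EMBEDDING_DIM": "not-a-number"}))  # prints: 768
```

Whether a bad value should also be logged is a separate choice; this sketch stays silent.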
31-34: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fix the install hint in the warning.

This message still tells operators to install cq-sdk[embedding], but this feature is exposed via the server's semsearch extra, so the recovery path is wrong.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@server/backend/src/cq_server/semsearch/__init__.py` around lines 31 - 34, The
warning logged in semsearch/__init__.py (the logger.warning call referencing
TOKEN_EMBEDDING_URL) has the wrong recovery hint; update the message text to
tell operators to install the server package with the semsearch extra (e.g.
suggest "pip install cq-server[semsearch]" or the appropriate server package
name with the 'semsearch' extra) instead of "cq-sdk[embedding]"; keep the rest
of the message intact and only replace the install hint.
server/backend/src/cq_server/repositories/knowledge.py (1)
81-83: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Keep semsearch I/O off borrowed DB connections and out of the event-loop thread.

Both branches open a synchronous SQLAlchemy connection inside an async method and then await a helper that performs remote I/O before issuing sync conn.execute() calls. That blocks the event loop on DB work and keeps the pooled connection/transaction pinned across the embedding round-trip.
#!/bin/bash
set -e
rg -n -C2 'with self\._db\.engine\.(begin|connect)\(\) as conn|await sem_(insert_unit|query)\(' server/backend/src/cq_server/repositories/knowledge.py
Expected result: both insert() and query() show direct Engine usage inside async methods, bypassing Database.run_sync(...).

Also applies to: 109-113
server/backend/src/cq_server/semsearch/queries.py (2)
227-233: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use the normalised domains for the query embedding too.

The SQL filter uses normalized, but the embedding still uses raw domains; case/whitespace variants can therefore search one concept and filter another.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@server/backend/src/cq_server/semsearch/queries.py` around lines 227 - 233,
The code normalizes domains via normalize_domains() but still passes the
original domains into _get_embeddings, causing mismatch between embedding and
SQL filter; change the call to _get_embeddings to use the normalized variable
(and propagate that normalized value into any subsequent embedding
serialization/_serialize_embedding and args construction) so both the
query_embedding and the "domains" filter use the same normalized domains (keep
references: normalize_domains, normalized, _get_embeddings,
_serialize_embedding, args).
333-336: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Clamp zero ratios before taking log().

A valid 0 value makes ratio == 0 when the field mean is positive, so math.log(ratio) raises. Zero cosine distance is a normal best-match case, so this helper can crash on perfect matches.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@server/backend/src/cq_server/semsearch/queries.py` around lines 333 - 336,
The loop computing logit can raise on ratio==0 (perfect match); clamp the
computed ratio to a tiny positive epsilon before calling math.log. In the block
iterating over values (variables: values, mean_val, ratio, logit, invert,
field_logits, field_idx), compute ratio as before but then set clamped_ratio =
max(ratio, 1e-12) (or similar small constant) and use clamped_ratio in the
math.log call (respecting the invert flag) before assigning into
field_logits[field_idx][v].
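
The clamp can be shown in isolation. The helper below is a simplified stand-in for the loop body, and it assumes `mean_val > 0`. How `invert` is applied in the real code is not visible in this review, so negating the result is an assumption.

```python
import math

EPSILON = 1e-12  # same order as the constant suggested in the review


def log_ratio(value: float, mean_val: float, invert: bool = False) -> float:
    """Log of value / mean_val with the ratio clamped away from zero.

    Assumes mean_val > 0. Negating on `invert` is an assumption.
    """
    ratio = value / mean_val
    clamped_ratio = max(ratio, EPSILON)  # a zero ratio would make math.log raise
    logit = math.log(clamped_ratio)
    return -logit if invert else logit


# A perfect match (cosine distance 0) no longer raises ValueError.
print(math.isfinite(log_ratio(0.0, 0.4)))  # prints: True
```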

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/workflows/ci-server-embeddings.yaml:
- Line 7: Update the self-trigger path filters in the GitHub Actions workflow so
they match this workflow's filename: in
.github/workflows/ci-server-embeddings.yaml replace any occurrences of
".github/workflows/ci-server.yaml" with
".github/workflows/ci-server-embeddings.yaml" (these are the path filter entries
that control self-triggering of this workflow).
- Line 25: Replace every tag-pinned GitHub Action reference (e.g., the uses:
actions/checkout@v6 entry and the other uses: lines at the noted locations) with
the corresponding full commit SHA for the action repository; for each uses:
value (actions/checkout, and the other actions referenced at lines called out)
look up the desired release tag’s commit SHA on the action's GitHub repo and
substitute `@vX` or `@main` with the full 40-character commit SHA so all uses:
entries are SHA-pinned.
- Line 25: The actions/checkout usage in the workflow currently omits disabling
credential persistence; update every checkout step that uses
"actions/checkout@v6" (all occurrences) to include a with block setting
persist-credentials: false so the step becomes actions/checkout@v6 with
persist-credentials: false to prevent token exposure to downstream steps or
artifacts.
- Around line 102-104: Start the encoder in background (the
modernbert.x86_64-unknown-linux-gnu.encoderfile serve command) then poll the
TOKEN_EMBEDDING_URL (http://127.0.0.1:8080) with a short loop and timeout to
ensure it returns a successful response (HTTP 200 or health/readiness endpoint)
before invoking make setup-server test-server; fail the job if the service
doesn't become ready within the timeout so tests don't run against an unready
encoder.
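
The readiness wait described in the last item could be written in shell or, as sketched here, in Python. The names `http_probe` and `wait_until_ready` are hypothetical. Which path on the encoder returns HTTP 200 is not stated in the review, so probing the base URL is an assumption.

```python
import http.client
import time
import urllib.error
import urllib.request
from typing import Callable


def http_probe(url: str) -> bool:
    """Return True if a GET to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=2) as response:
            return response.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeouts, and non-2xx statuses all mean "not ready".
        return False


def wait_until_ready(
    probe: Callable[[], bool],
    timeout: float = 30.0,
    interval: float = 0.5,
) -> bool:
    """Call `probe` until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if probe():
            return True
        time.sleep(interval)
    return False
```

A CI step would call `wait_until_ready(lambda: http_probe("http://127.0.0.1:8080"))` and exit non-zero on `False`, so the test target never runs against an encoder that has not come up.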

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: b2539dd6-dfb1-488f-81df-41535f850133

📥 Commits

Reviewing files that changed from the base of the PR and between 20e3360 and 3cbe64f.

📒 Files selected for processing (11)

.github/workflows/ci-server-embeddings.yaml
server/backend/src/cq_server/core/db.py
server/backend/src/cq_server/repositories/knowledge.py
server/backend/src/cq_server/semsearch/__init__.py
server/backend/src/cq_server/semsearch/queries.py
server/backend/tests/conftest.py
server/backend/tests/db_helpers.py
server/backend/tests/test_app.py
server/backend/tests/test_semsearch.py
server/backend/tests/test_sqlite_store.py
server/backend/tests/test_store.py

coderabbitai · 2026-06-11T15:33:57Z

+    name: Validate schemas
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v6


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Pin all GitHub Actions to full commit SHAs.

All uses: entries are tag-based (@v*) rather than SHA-pinned. This violates the stated policy and weakens supply-chain guarantees for CI execution.

Also applies to: 27-27, 36-36, 38-38, 42-42, 46-46, 63-63, 65-65, 70-70, 74-74, 82-82

🧰 Tools

🪛 zizmor (1.25.2)

[warning] 25-25: credential persistence through GitHub Actions artifacts (artipacked): does not set persist-credentials: false

(artipacked)

[error] 25-25: unpinned action reference (unpinned-uses): action is not pinned to a hash (required by blanket policy)

(unpinned-uses)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.github/workflows/ci-server-embeddings.yaml at line 25, Replace every tag-pinned GitHub Action reference (e.g., the uses: actions/checkout@v6 entry and the other uses: lines at the noted locations) with the corresponding full commit SHA for the action repository; for each uses: value (actions/checkout, and the other actions referenced at lines called out) look up the desired release tag’s commit SHA on the action's GitHub repo and substitute `@vX` or `@main` with the full 40-character commit SHA so all uses: entries are SHA-pinned.

Source: Linters/SAST tools
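
As a rough illustration of what the unpinned-uses finding looks for, the sketch below scans workflow text for uses: entries whose ref is not a full 40-character commit SHA. The helper name find_unpinned_uses and the sample workflow (including example/action and its SHA) are hypothetical, and this is not how zizmor itself is implemented.

```python
import re

# Matches "uses: owner/repo@ref", optionally preceded by a list dash.
USES_RE = re.compile(r"^\s*(?:-\s*)?uses:\s*(?P<action>[^@\s]+)@(?P<ref>\S+)")
# A full-length Git commit SHA is 40 lowercase hex characters.
SHA_RE = re.compile(r"^[0-9a-f]{40}$")


def find_unpinned_uses(workflow_text: str) -> list[tuple[int, str, str]]:
    """Return (line_number, action, ref) for each uses: entry not pinned to a full SHA."""
    findings = []
    for lineno, line in enumerate(workflow_text.splitlines(), start=1):
        match = USES_RE.match(line)
        if match and not SHA_RE.match(match["ref"]):
            findings.append((lineno, match["action"], match["ref"]))
    return findings


# Hypothetical workflow fragment: one tag-pinned step, one SHA-pinned step.
SAMPLE = """\
steps:
  - uses: actions/checkout@v6
  - uses: example/action@0123456789abcdef0123456789abcdef01234567  # v1.2.3
"""

print(find_unpinned_uses(SAMPLE))
```

Only the tag-pinned checkout step is reported; the SHA-pinned step passes even though it carries a trailing version comment.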

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Disable checkout credential persistence in all jobs.

Each actions/checkout step should set persist-credentials: false to reduce token exposure risk in downstream steps/artifacts.

Suggested patch pattern

-      - uses: actions/checkout@v6
+      - uses: actions/checkout@v6
+        with:
+          persist-credentials: false

Also applies to: 36-36, 63-63

🧰 Tools

🪛 zizmor (1.25.2)

[warning] 25-25: credential persistence through GitHub Actions artifacts (artipacked): does not set persist-credentials: false

(artipacked)

[error] 25-25: unpinned action reference (unpinned-uses): action is not pinned to a hash (required by blanket policy)

(unpinned-uses)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.github/workflows/ci-server-embeddings.yaml at line 25, The actions/checkout usage in the workflow currently omits disabling credential persistence; update every checkout step that uses "actions/checkout@v6" (all occurrences) to include a with block setting persist-credentials: false so the step becomes actions/checkout@v6 with persist-credentials: false to prevent token exposure to downstream steps or artifacts.

Source: Linters/SAST tools

javiermtorres · 2026-06-11T17:22:20Z

Thanks @javiermtorres

I had a look at this PR but I think it's got some things that need tweaking and some things that are still missing (from previous feedback). Don't we still need an updated Dockerfile for cq server to provide encoderfile (and docs)?

Added a GH action to check the encoderfile. There's no need to dockerize it in the actions themselves. The user can later decide the specific deployment.

make lint/make lint-server-backend is also failing.

Every check passes now :)

peteski22

Thanks for sticking with this @javiermtorres; the refactor onto core/db.py and repositories/ is looking good and resolves most of the earlier feedback (the confidence mutation is gone, the combined query now has a LIMIT, the deps are pinned, the env vars are documented, and the conditional imports fixed the ImportError-without-the-extra problem). 🙏🏼

Since the restructure moved everything around, the old inline feedback comments are hard to follow, so here's a single consolidated checklist of what's left. All of it stems from this PR, so I'd like to land it complete rather than split out ... genuine future enhancements can be tracked separately (see the footer).

Correctness/error handling around the embedding call

Domain-only fallback. combined_query now raises on an embedding or SQL error (queries.py:236-241) and the caller doesn't catch it (knowledge.py:109-113), so a transient embedding-service hiccup returns a 500 instead of degrading to the SQL-only path.
Partial failure on insert. insert() commits the unit, then embeds in a separate transaction (knowledge.py:80-83). If embedding fails, the unit is persisted but has no vector row: it is invisible to semantic query() (which JOINs knowledge_units_vec), there is no re-index path, and the caller still gets a 500. The unit and its embedding should land or roll back together. (This also answers the :76-79 FIXME, and that comment has a couple of typos worth fixing while it's touched.)
update() doesn't refresh the embedding. It only calls _update_sync (knowledge.py:123-125); upsert_unit exists but is never wired in, so an edited unit keeps a stale vector. README.md:97 already claims insert and update write an embedding row — so docs and behaviour need reconciling.
Domain normalization mismatch. combined_query embeds the raw domains (queries.py:231) but filters on normalized (:233), unlike query() which embeds normalized. Use the normalized form for both.

Dead code (all introduced in this PR)

build_field_logits, compute_combined_relevance, and queries.query are never called outside tests — the live path uses the inline combine(). Remove them (or wire them in). Note compute_combined_relevance's docstring claims it "keeps the combined score positive," which isn't true (combined *= 1.0 + logit flips sign for logit < -1).
upsert_unit is currently unused; either wire it into update() (above) or remove it.
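
The sign flip noted for compute_combined_relevance is easy to reproduce in isolation. The helper below is a hypothetical stand-in that only mirrors the combined *= 1.0 + logit update quoted above; it is not the PR's function.

```python
def apply_logit_factors(base: float, logits: list[float]) -> float:
    """Multiply base by (1 + logit) for each logit, mirroring `combined *= 1.0 + logit`."""
    combined = base
    for logit in logits:
        combined *= 1.0 + logit
    return combined


boosted = apply_logit_factors(0.8, [0.5])    # factor 1.5: score grows
damped = apply_logit_factors(0.8, [-0.5])    # factor 0.5: score shrinks, stays positive
flipped = apply_logit_factors(0.8, [-1.5])   # factor -0.5: score becomes negative

print(boosted > 0, damped > 0, flipped < 0)
```

Any logit below -1 makes the factor negative, so a positive score turns negative, and two such logits flip it back to positive.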

Tests

The new test_semsearch.py exercises the dead helpers, but the live scoring path (combined_query / combine()) only has the skip-gated e2e test, so it has no coverage in normal CI. Once the dead helpers are removed their tests should go too — please add an embedding-free test that mocks _get_embeddings and asserts the real ranking.
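
A minimal sketch of what such an embedding-free test could look like, using unittest.mock from the standard library. The Ranker class, its rank method, and cosine_distance are hypothetical stand-ins for the live path; only the idea of patching _get_embeddings comes from the comment above.

```python
import math
from unittest import mock


def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 - cosine similarity; smaller means semantically closer."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


class Ranker:
    """Hypothetical stand-in for the live scoring path (not the PR's code)."""

    def _get_embeddings(self, texts: list[str]) -> list[list[float]]:
        raise RuntimeError("would call the remote embedding service")

    def rank(self, query: str, units: dict[str, str]) -> list[str]:
        """Order unit ids by cosine distance between the query and each unit's text."""
        ids = list(units)
        vectors = self._get_embeddings([query, *units.values()])
        query_vec, unit_vecs = vectors[0], vectors[1:]
        distances = {uid: cosine_distance(query_vec, vec) for uid, vec in zip(ids, unit_vecs)}
        return sorted(ids, key=distances.__getitem__)


def test_rank_prefers_semantically_closer_unit() -> None:
    # Canned vectors, in call order: the query, the "far" unit, the "near" unit.
    fake_vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
    ranker = Ranker()
    with mock.patch.object(Ranker, "_get_embeddings", return_value=fake_vectors) as fake:
        order = ranker.rank("sqlite", {"far": "unrelated", "near": "sqlite tips"})
    fake.assert_called_once()  # the embedding service is never contacted
    assert order == ["near", "far"]


test_rank_prefers_semantically_closer_unit()
```

Because the vectors are canned, the assertion pins down the ranking logic itself and runs in normal CI without an embedding server.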

Cleanup (all introduced in this PR)

Remove server/backend/.DS_Store (committed) and add it to .gitignore.
Drop the per-insert logger.info (knowledge.py:75).
Fix the ImportError message at semsearch/__init__.py:34 — it says pip install cq-sdk[embedding], but the package is cq-server and the extra is semsearch (cq-server[semsearch]).
Revert the pytest logging config — log_cli = true now prints DEBUG on every run, which contradicts the "only with -vv" intent from earlier; jsonschema is also pinned twice in the tests group (pyproject.toml:44,48,65-67).

Docs

Turn the encoderfile link into a recommendation: a concrete model whose output dimension matches SEMSEARCH_EMBEDDING_DIM (default 768) and the minimal command to run it, so TOKEN_EMBEDDING_URL has something to point at.
Document (or implement) the backfill story: units inserted while semsearch was disabled have no embedding and won't appear in semantic results; there's no reindex path today.
Add TOKEN_EMBEDDING_URL and SEMSEARCH_EMBEDDING_DIM to the self-hosted env-var table in DEVELOPMENT.md; README.md:71-72 points there as the canonical reference, but they're missing.
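
For the backfill item, a first step would be finding units that have no vector row. The sketch below shows one way to do that with a LEFT JOIN. Only the knowledge_units_vec table name comes from the PR; the knowledge_units table, all column names, and the helper are assumptions, and an ordinary table stands in for the sqlite-vec virtual table so the example runs with the standard library alone.

```python
import sqlite3


def find_units_missing_embeddings(conn: sqlite3.Connection) -> list[str]:
    """Return ids of units with no matching row in the vector table."""
    rows = conn.execute(
        """
        SELECT u.id
        FROM knowledge_units AS u
        LEFT JOIN knowledge_units_vec AS v ON v.unit_id = u.id
        WHERE v.unit_id IS NULL
        ORDER BY u.id
        """
    ).fetchall()
    return [row[0] for row in rows]


# Hypothetical schema and data: ku_1 was stored while semsearch was disabled.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE knowledge_units (id TEXT PRIMARY KEY, summary TEXT);
    CREATE TABLE knowledge_units_vec (unit_id TEXT PRIMARY KEY, embedding BLOB);
    INSERT INTO knowledge_units VALUES ('ku_1', 'inserted while semsearch was off');
    INSERT INTO knowledge_units VALUES ('ku_2', 'inserted while semsearch was on');
    INSERT INTO knowledge_units_vec VALUES ('ku_2', x'00000000');
    """
)

print(find_units_missing_embeddings(conn))
```

A reindex command could feed the returned ids through the same embed-and-write path that insert uses.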

Genuine future work: a configurable candidate limit and relevance/distance weights (queries.py:258-259) are out of scope here and should be tracked in new issues that are referenced on the PR.

Happy to pair on the error-handling items if that's quicker than another round.

javiermtorres · 2026-06-15T13:16:49Z

@peteski22 Re: domain only fallback. Do you want any error to fall back, or only those directly related to the semantic query machinery? Note that the current query can potentially fail for similar reasons (e.g. db connectivity). Also, a fallback provides not a degraded service but a different one, since the ranking will not respect embeddings. A feature indicator flag in both requests and responses to mark the fallback to standard query may be needed here.

javiermtorres · 2026-06-15T13:26:25Z

@peteski22 Re: partial failure on insert. The operations are not composable now at this level, since the sql commit is not visible. To avoid modifying _insert_sync, I will probably copy the operations there into the embedding insert.

javiermtorres · 2026-06-15T13:29:41Z

@peteski22 Re: domain normalization mismatch. I will probably remove query in favour of combined_query, since domain filtering needs to be done anyway.

javiermtorres · 2026-06-15T13:33:41Z

Re: e2e tests, notice that there's an additional CI action that runs all tests enabling embeddings. I could fetch the embedding encoderfile, start it, run a specific suite, and then finish, but I supposed that you'd prefer having the embedding tests separate. Please let me know which approach you'd prefer. I will check whether there is appropriate coverage mocking the embeddings in the current setup.

peteski22 · 2026-06-15T13:57:34Z

@peteski22 Re: domain only fallback. Do you want any error to fall back, or only those directly related to the semantic query machinery? Note that the current query can potentially fail for similar reasons (e.g. db connectivity). Also, a fallback provides not a degraded service but a different one, since the ranking will not respect embeddings. A feature indicator flag in both requests and responses to mark the fallback to standard query may be needed here.

on any error to fall back:

I'd say scope the catch to the semantic path; don't blanket-catch/except everything.

on the current query can potentially fail:

Try: semantic -> fallback to standard query if semantic stuff fails ... if all that fails, surface the error

on different service:

Yeah, you're right that swapping things quietly isn't great 🤔 probably just have to log a warning as changing the wire format for things is a bigger change and impacts more things.
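
The three answers above can be sketched together: a catch scoped to the semantic path, a fallback to the standard query, and a logged warning instead of a wire-format change. SemanticSearchError, query_with_fallback, and the two callables are hypothetical names used only for illustration.

```python
import logging

logger = logging.getLogger("semsearch_fallback_sketch")


class SemanticSearchError(Exception):
    """Hypothetical error type for embedding or vector-query failures."""


def query_with_fallback(domains, semantic_query, standard_query):
    """Try the semantic path; on a semantic-path failure, warn and use the standard query."""
    try:
        return semantic_query(domains)
    except SemanticSearchError as exc:
        # Scoped catch: only semantic-path failures trigger the fallback.
        logger.warning("semantic query failed, falling back to standard query: %s", exc)
    # Anything raised by the standard query propagates to the caller unchanged.
    return standard_query(domains)


def failing_semantic(domains):
    raise SemanticSearchError("embedding service unavailable")


def standard(domains):
    return [f"unit-for-{d}" for d in domains]


print(query_with_fallback(["python"], failing_semantic, standard))
```

Catching a dedicated exception type rather than Exception keeps unrelated bugs in the semantic path from being silently masked by the fallback.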

peteski22 · 2026-06-15T14:07:21Z

@peteski22 Re: partial failure on insert. The operations are not composable now at this level, since the sql commit is not visible. To avoid modifying _insert_sync, I will probably copy the operations there into the embedding insert.

Do you mean move the logic so you compute the embedding first (when enabled) then do the SQL in a single transaction?

javiermtorres · 2026-06-15T14:31:14Z

@peteski22 Re: partial failure on insert. The operations are not composable now at this level, since the sql commit is not visible. To avoid modifying _insert_sync, I will probably copy the operations there into the embedding insert.

Do you mean move the logic so you compute the embedding first (when enabled) then do the SQL in a single transaction?

Yep, but the embeddings are not needed in the case of non-semantic search, so the condition would be used in a couple of places, because we also want the embeddings to be out of the sql queries. It's... awkward anyway.

peteski22 · 2026-06-15T14:44:44Z

@peteski22 Re: partial failure on insert. The operations are not composable now at this level, since the sql commit is not visible. To avoid modifying _insert_sync, I will probably copy the operations there into the embedding insert.

Do you mean move the logic so you compute the embedding first (when enabled) then do the SQL in a single transaction?

Yep, but the embeddings are not needed in the case of non-semantic search, so the condition would be used in a couple of places, because we also want the embeddings to be out of the sql queries. It's... awkward anyway.

Do you mean like:

# semsearch owns the enabled check:
async def compute_embedding(unit) -> bytes | None # None when disabled
def write_embedding(conn, unit_id, vector) -> None # sync, in-txn

# repository.insert() — no _SEMSEARCH_ENABLED branching:
vector = await semsearch.compute_embedding(unit) # HTTP, before the txn
await run_sync(self._insert_sync, unit, vector) # one txn: unit + domains, and write_embedding if vector

javiermtorres · 2026-06-19T08:26:18Z

Do you mean like:

# semsearch owns the enabled check:
async def compute_embedding(unit) -> bytes | None # None when disabled
def write_embedding(conn, unit_id, vector) -> None # sync, in-txn

# repository.insert() — no _SEMSEARCH_ENABLED branching:
vector = await semsearch.compute_embedding(unit) # HTTP, before the txn
await run_sync(self._insert_sync, unit, vector) # one txn: unit + domains, and write_embedding if vector

Probably. But take into account that this needs to be general enough that it works without semantic search, is compatible with other kinds of search (BM25, for example), and may need to be compatible with mechanisms in postgres (e.g. pgvector), since we've already merged #431. So it's not going to be trivial.

javiermtorres · 2026-06-22T07:48:38Z

@copilot check if the comments in #247 (review) are now addressed

peteski22

❤️

javiermtorres · 2026-06-24T16:18:49Z

Thanks for taking the time to review @peteski22 ❤️

Copilot started work on behalf of javiermtorres April 7, 2026 13:53

Copilot AI changed the title ~~Add sqlite-vec plugin~~ Add sqlite-vec plugin to Python SDK store Apr 7, 2026

Copilot finished work on behalf of javiermtorres April 7, 2026 14:14

Copilot started work on behalf of peteski22 April 9, 2026 19:19

Copilot finished work on behalf of peteski22 April 9, 2026 19:26

Copilot AI requested a review from peteski22 April 9, 2026 19:26

peteski22 added do-not-merge labels Apr 9, 2026

javiermtorres changed the title ~~Add sqlite-vec plugin to Python SDK store~~ Add sqlite-vec plugin to Python server store Apr 17, 2026

javiermtorres force-pushed the 21-semantic-similarity-queries branch from 45f715e to 588ae9a Compare April 24, 2026 07:19

peteski22 requested a review from Copilot April 28, 2026 09:53

Copilot started reviewing on behalf of peteski22 April 28, 2026 09:53

Copilot AI reviewed Apr 28, 2026

peteski22 mentioned this pull request Apr 28, 2026

Implement SqliteStore against Store protocol via Core helpers #308

Closed

peteski22 added team-api Team API / Docker component backend and removed do-not-merge labels Apr 28, 2026

peteski22 changed the title ~~Add sqlite-vec plugin to Python server store~~ server: semantic similarity queries via sqlite-vec + canonical scoring via cq-schema Apr 29, 2026

peteski22 reviewed Apr 29, 2026

Comment thread server/backend/src/cq_server/store/_sqlite.py Outdated

peteski22 reviewed Apr 29, 2026

Comment thread server/backend/src/cq_server/semsearch.py Outdated

peteski22 reviewed Apr 29, 2026

Comment thread server/backend/tests/test_app.py

peteski22 reviewed Apr 29, 2026

Comment thread server/backend/src/cq_server/semsearch.py Outdated

peteski22 reviewed Apr 29, 2026

Comment thread server/backend/pyproject.toml

peteski22 reviewed Apr 29, 2026

Comment thread server/backend/pyproject.toml Outdated

coderabbitai Bot reviewed Jun 10, 2026

Comment thread server/backend/tests/test_app.py

coderabbitai Bot reviewed Jun 11, 2026

peteski22 reviewed Jun 15, 2026

Copilot started work on behalf of javiermtorres June 22, 2026 07:49

Copilot finished work on behalf of javiermtorres June 22, 2026 07:57

javiermtorres force-pushed the 21-semantic-similarity-queries branch 2 times, most recently from 79fc4ef to 0745ad5 Compare June 22, 2026 08:49

peteski22 approved these changes Jun 24, 2026

feat: semantic similarity queries

494f9ec

javiermtorres force-pushed the 21-semantic-similarity-queries branch from e10a53b to 494f9ec Compare June 24, 2026 16:15

peteski22 merged commit f8fc95c into main Jun 24, 2026
13 checks passed

peteski22 deleted the 21-semantic-similarity-queries branch June 24, 2026 16:19

coderabbitai Bot mentioned this pull request Jul 7, 2026

feat(server): implement PostgreSQL backend for Database + repositories #490

Open
