Add hybrid vector + full-text search#75
Conversation
Extend the search action with optional text search that combines vector and full-text search results using Reciprocal Rank Fusion (hybridRank). - Add textSearch internal query using the searchableText search index - Add getRangesOfChunkIds and getChunkIdsByEmbeddingIds queries - Extract shared buildRanges helper from getRangesOfChunks - Add textSearch, textWeight, vectorWeight options to client API - Default searchableText to chunk content in createChunkArgsBatch
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds a selectable search mode (vector|text|hybrid) across UI, client, and server; threads searchType and weighting parameters through RAG search calls; implements text-only, vector-only, and hybrid ranking paths; centralizes range-building; and adds hybrid search tests. Changes
Sequence Diagram(s)sequenceDiagram
participant UI as Client UI
participant ClientLib as Client Library
participant SearchAPI as Search Action
participant TextPath as Text Search Worker
participant VectorPath as Vector Search Worker
participant RRF as Reciprocal Rank Fusion
participant DB as Database
UI->>ClientLib: search(query, searchType, textWeight?, vectorWeight?)
ClientLib->>SearchAPI: convex.action.search(payload with searchType/textQuery/weights)
alt searchType == "text" or "hybrid"
SearchAPI->>TextPath: textSearch(textQuery, filters)
TextPath->>DB: full-text lookup
DB-->>TextPath: text results (chunk ids + scores)
end
alt searchType == "vector" or "hybrid"
SearchAPI->>VectorPath: vectorSearch(embedding/dimension)
VectorPath->>DB: similarity lookup
DB-->>VectorPath: vector results (chunk ids + scores)
end
alt searchType == "hybrid"
VectorPath-->>RRF: vector results
TextPath-->>RRF: text results
RRF->>RRF: merge using weights (textWeight/vectorWeight) -> ranked ids
RRF->>SearchAPI: merged ranked results
else single-path
VectorPath-->>SearchAPI: vector results
TextPath-->>SearchAPI: text results
end
SearchAPI->>DB: buildRanges / getRangesOfChunkIds(final ids)
DB-->>SearchAPI: chunk contents & metadata
SearchAPI-->>ClientLib: final results
ClientLib-->>UI: render results
Estimated Code Review Effort🎯 4 (Complex) | ⏱️ ~45 minutes Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Pass the textSearch option through all search and askQuestion actions. Add a toggle in the advanced options panel to enable hybrid search.
- Fix typo: "Recriprocal" → "Reciprocal" in hybridRank - Apply user filters (OR semantics) to text search, matching vector search behavior - Document that scores become position-based when hybrid search is enabled - Use proper Id<"chunks"> types instead of string casts in hybrid merge
… path - Add explicit namespaceId filter in text search when user filters are present - Guard against empty merge results in hybrid path - Hoist vector search above the branch point to avoid duplication
Use Doc<"chunks"> instead of inline type in textSearch toResults. Add 9 tests covering textSearch query, namespace scoping, filters, getChunkIdsByEmbeddingIds, getRangesOfChunkIds, hybrid search with textQuery, deduplication, vector-only path, and weight parameters.
commit: |
ianmacartney
left a comment
There was a problem hiding this comment.
Looking really good!
Would love, in addition to the feedback below, for you to try out the pkg-pr-new link and test it with your app to make sure it behaves well for you.
Big ask here is around reducing the number of queries in search, which might affect the factoring of various pieces of code.
My bandwidth is limited, so if I don't get back to you soon, apologies! Hoping to get more folks on the team to help (we're hiring!)
…eries - Replace `textSearch?: boolean` with `searchType?: "vector" | "text" | "hybrid"` to support text-only search mode (no embedding needed) - Combine 3 separate queries in hybrid path into single `textAndRanges` query (vectorSearch remains separate as it requires action context) - Export shared `vSearchType` validator and `SearchType` type from shared.ts - Update example app to use shared types instead of inline unions
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@src/client/index.ts`:
- Around line 394-427: The current handling permits array queries to silently
fall back to vector-only even when searchType === "text", which results in no
text query and empty results; update the logic so that if searchType === "text"
and Array.isArray(args.query) you throw a descriptive error (referencing
searchType and args.query) instead of only warning, while preserving the
existing warning-and-fallback behavior for hybrid/hybrid-compatible cases
(searchType === "hybrid" or vector). Ensure the change is applied around the
existing variables/flow: searchType, needsEmbedding, needsTextQuery, args.query,
the embed() call and textQuery assignment so that embedding still occurs for
array queries when allowed and textQuery remains undefined only when legitimate.
🧹 Nitpick comments (1)
src/component/chunks.ts (1)
314-444: Avoid repeated entry reads inbuildRanges.You already load all entries up front; reusing them avoids extra DB gets per chunk.
♻️ Suggested refactor
- const entries = ( - await Promise.all( - Array.from( - new Set(chunks.filter((c) => c !== null).map((c) => c.entryId)), - ).map((id) => ctx.db.get(id)), - ) - ) - .filter((d) => d !== null) - .map(publicEntry); + const entryDocs = ( + await Promise.all( + Array.from( + new Set(chunks.filter((c) => c !== null).map((c) => c.entryId)), + ).map((id) => ctx.db.get(id)), + ) + ).filter((d): d is Doc<"entries"> => d !== null); + const entries = entryDocs.map(publicEntry); + const entryDocById = new Map(entryDocs.map((d) => [d._id, d])); @@ - const entry = await ctx.db.get(entryId); - assert(entry, `Entry ${entryId} not found`); + const entry = entryDocById.get(entryId); + assert(entry, `Entry ${entryId} not found`);
|
@ianmacartney All requested changes resolved. I've been using the pkg-pr-new link for a few days now and it's been super stable. Could you approve a new build of the package so I can test with the latest changes too please? |
ianmacartney
left a comment
There was a problem hiding this comment.
really close!
Only changes necessary are formatting
| namespace: v.string(), | ||
| embedding: v.array(v.number()), | ||
| embedding: v.optional(v.array(v.number())), | ||
| dimension: v.optional(v.number()), |
There was a problem hiding this comment.
ah yeah this is an interesting nuance - I wonder if finding the namespace should happen higher up and pass in a namespaceId, since doing a query from the parent isn't any more expensive than doing a query first thing in this action.. the encapsulation is nice, but exposing that dimension is part of the namespace identifier is a bit funky. We can keep it like this for now and could in the future have a separate API possible to pass in a namespaceId instead of namespace
- Rename local SearchType to SearchScope in example to avoid confusion with the RAG package's SearchType - Always pass searchType verbatim instead of conditionally omitting it - Throw error when neither embedding nor textQuery is provided - Add explicit vSearchType validator to server search action args - Run prettier formatting
78740e2 to
febe896
Compare
|
@ianmacartney should all be sorted out now 🤞 |
|
FYI I'll be in the backcountry for a few days so may not get to this until next week. |
ianmacartney
left a comment
There was a problem hiding this comment.
Actually I'll just push this out now
|
0.7.1 |
|
thanks again! |
Summary
searchaction with optional full-text search that combines vector and text search results using Reciprocal Rank Fusion (hybridRank)textSearch,textWeight, andvectorWeightoptions to the clientSearchOptionstypesearchableTextto chunk text content so hybrid search works out of the box for new documentsChanges
src/component/search.ts: AddtextSearchinternal query using thesearchableTextsearch index; extendsearchaction with hybrid path (vector + text → RRF merge); scope text search to namespace; apply user filters (OR semantics) to text search matching vector search behavior; guard against empty merge resultssrc/component/chunks.ts: Extract sharedbuildRangeshelper; addgetRangesOfChunkIdsandgetChunkIdsByEmbeddingIdsinternal queriessrc/client/index.ts: AddtextSearch/textWeight/vectorWeighttoSearchOptions; passtextQuerythrough inRAG.search(); defaultsearchableTextincreateChunkArgsBatchsrc/client/hybridRank.ts: Fix typo "Recriprocal" → "Reciprocal"src/component/schema.ts: Remove TODO commentsrc/component/search.test.ts: Add 9 tests covering text search query, namespace scoping, filters, chunk ID lookups, hybrid search with textQuery, deduplication, vector-only path, and weight parametersexample/: Add hybrid search toggle to example app, passingtextSearchoption through search and askQuestion actionsBackwards compatibility
searchableTextwon't appear in text search but still work for vector searchsearchableTextpopulatedscoresemantics change only whentextSearchis enabled (scores become position-based via RRF)Test plan
npm test)textSearchparam)textSearch: truetextWeight/vectorWeightto confirm ranking changesgetChunkIdsByEmbeddingIdsandgetRangesOfChunkIdshelpersSummary by CodeRabbit
New Features
Bug Fixes
Tests
Chores