-
Notifications
You must be signed in to change notification settings - Fork 52
Fix tag OR syntax in hybrid queries and add filter expression tests #452
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
- Add comprehensive test suite (test_filter_expressions.py) covering:
* Tag filters with OR syntax (@field:{tag1|tag2|tag3})
* Numeric ranges (9 variants: inclusive/exclusive bounds, infinity)
* Logical negation, operator precedence, hybrid queries
* 23 test cases validating parser correctness
- Fix tag OR separator: use ‘|' instead of index separator
Signed-off-by: Zvi Schneider <[email protected]>
f4ede02 to
af91639
Compare
| // separator used when the index was created. This allows users to specify | ||
| // multiple tags using the syntax: @field:{tag1|tag2|tag3} | ||
| return indexes::Tag::ParseSearchTags(tag_string, '|'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While this is correct. It leaves open the situation of parsing a tag that itself contains a pipe character. I think the tag parsing code should be updated to handle escaping of the separator character.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@allenss-amazon , Do we have somewhere definition of valid\invalid characters for tags? Can "@country:{USA|GBR|CAN}=>[KNN 5 @embedding $vec]" be a tag?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're trying to stay aligned with RediSearch. Browsing their documentation, I found the following.
| query_vec = struct.pack('3f', 0.9, 0.1, 0.0) | ||
|
|
||
| # Test hybrid query with tag OR syntax: @country:{USA|GBR|CAN}=>[KNN 5 @embedding $vec] | ||
| result = client.execute_command( | ||
| "FT.SEARCH", "hybrid_idx", | ||
| "@country:{USA|GBR|CAN}=>[KNN 5 @embedding $vec]", | ||
| "PARAMS", "2", "vec", query_vec, | ||
| "RETURN", "1", "country" | ||
| ) | ||
|
|
||
| # Should return up to 3 results (filtered by country tag) | ||
| assert result[0] >= 1 and result[0] <= 3 | ||
|
|
||
| # Verify all results match the country filter | ||
| for i in range(1, len(result), 2): | ||
| key = result[i].decode('utf-8') | ||
| assert key in ["doc:1", "doc:2", "doc:3"], f"Unexpected key {key}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yair pointed out that there are two paths for hybrid queries. I believe we can force the logic to use one way or the other configuring the decision threshold (and there are counters to validate which way it went). Can we update the tests that have hybrid queries to force both paths?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would also be good to repeat the vector tests with both HNSW and FLAT, again different code paths.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another issue that's been reported is that the order of the results is unexpected, it's possible that we're getting different ordering of the results for vector queries.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what do you mean? I have these 2:
- "@Country:{USA|GBR|CAN}=>[KNN 5 @Embedding $vec]"
- "@Country:{USA} | @Country:{GBR} | @Country:{CAN}=>[KNN 5 @Embedding $vec]"
Anything else?
My assumption was that the higher levels of parsing is done correctly, and the change does not affect those, but if I can extend the testing to check logically composed queries, like query1 | query2 | query3, where query-i is a valid filter by itself, is this your intent?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm trying to get alignment in capabilities between the ingestion and query subsystems. Assuming that this PR is merged more or less as-is, I believe we're in a state where a tag is forbidden from containing two characters: | and }. I note that this isn't an ingestion problem as we allow the redefinition of the tag separator character. This means that as long as your tags don't contain all 256 bytes, you can ingest all of them. However, you can't query for a tag that contains a pipe or a right brace -- because the parser has no way to treat these as part of a tag, they are hard-coded as delimiters.
I'm going to open an enhancement issue to specify that the query tag parsing logic be updated to support escaping of these two characters. We might roll the fix for this into the future Unicoding of the query logic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@allenss-amazon, Our documentation explicitly specifies the syntax query operator and these have nothing todo with the tag delimiter. I agree we need to have complete query syntax, I don't think we should connect the 2. This is a fix for a bug with respect to current declared usage.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that we can split this. I've opened a separate issue for the future enhancement to do the escape parsing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is captured in the following issue #454
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.
Fixes #449
Summary
Fix tag OR syntax in hybrid queries and add comprehensive filter expression tests.
Changes
Tag Filter Parsing Fix
'|'separator for tag search queries instead of the index-defined separator@field:{tag1|tag2|tag3}syntax in all queries regardless of index configurationNew Test Coverage
Add
test_filter_expressions.pywith 23 test cases covering:@field:{tag1|tag2|tag3})-@field:{value})This ensures the tag OR syntax works correctly across all query patterns documented in COMMANDS.md.