Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add python binding for vectorstore #573

Open
wants to merge 22 commits into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from 12 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
16 changes: 16 additions & 0 deletions gel/ai/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -19,6 +19,15 @@
from .types import AIOptions, ChatParticipantRole, Prompt, QueryContext
from .core import create_ai, EdgeDBAI
from .core import create_async_ai, AsyncEdgeDBAI
from .vectorstore import (
RecordToInsert,
ItemToInsert,
IdRecord,
Record,
SearchResult,
BaseEmbeddingModel,
GelVectorstore,
)

__all__ = [
"AIOptions",
Expand All @@ -29,4 +38,11 @@
"EdgeDBAI",
"create_async_ai",
"AsyncEdgeDBAI",
"RecordToInsert",
"ItemToInsert",
"IdRecord",
"Record",
"SearchResult",
"BaseEmbeddingModel",
"GelVectorstore",
]
132 changes: 132 additions & 0 deletions gel/ai/metadata_filter.py
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, I don't like this query filter builder API - it's neither easier nor clearer than writing raw EdgeQL segment in a string. And this deviates from the JS-style query builder, at the same time, it's not Pythonic. Why did we want to add a filter API like this in the first place?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, it looks a bit complex, this is how LamaIndex did it, I think. @deepbuzin

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@elprans can you pls take a look at this too

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did we want to add a filter API like this in the first place?

Same global reason as for building the entire vectorstore - to enable users to use Gel in LLM-related applications and not have to learn EdgeQL (or anything really). It's meant to be dumb and familiar enough to justify using it in place of other vectorstores.

We spent some time discussing this way of implementing filters on a call about a month ago (FWIW I'm not thrilled with it either), and couldn't think of a better way. That's not to say there isn't one, but we had to pick something and move on.

Copy link

@anbuzin anbuzin Feb 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although now that I think about it, maybe raw EdgeQL in our own binding isn’t the worst idea… If someone wants a LlamaIndex-style filter constructor they might as well use the LlamaIndex binding.

But downsides of the raw string are 1. no help from LSP and 2. EdgeQL for filters is not very pretty or concise with all the json fiddling.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I see. I've also left some comments in my proposal, but feel free to continue with what the team had agreed on.

Original file line number Diff line number Diff line change
@@ -0,0 +1,132 @@
from typing import Union, List
from enum import Enum


class FilterOperator(str, Enum):
EQ = "="
NE = "!="
GT = ">"
LT = "<"
GTE = ">="
LTE = "<="
IN = "in"
NOT_IN = "not in"
LIKE = "like"
ILIKE = "ilike"
ANY = "any"
ALL = "all"
CONTAINS = "contains"
EXISTS = "exists"


class FilterCondition(str, Enum):
AND = "and"
OR = "or"


class MetadataFilter:
"""Represents a single metadata filter condition."""

def __init__(
self,
key: str,
value: Union[int, float, str],
operator: FilterOperator = FilterOperator.EQ,
):
self.key = key
self.value = value
self.operator = operator

def __repr__(self):
value = f'"{self.value}"' if isinstance(self.value, str) else self.value
return (
f'MetadataFilter(key="{self.key}", '
f"value={value}, "
f'operator="{self.operator.value}")'
)


class CompositeFilter:
"""
Allows grouping multiple MetadataFilter instances using AND/OR conditions.
"""

def __init__(
self,
filters: List[Union["CompositeFilter", MetadataFilter]],
condition: FilterCondition = FilterCondition.AND,
):
self.filters = filters
self.condition = condition

def __repr__(self):
return (
f'CompositeFilter(condition="{self.condition.value}", '
f"filters={self.filters})"
)


def get_filter_clause(filters: CompositeFilter) -> str:
"""
Get the filter clause for a given CompositeFilter.
"""
subclauses = []
for filter in filters.filters:
subclause = ""

if isinstance(filter, CompositeFilter):
subclause = get_filter_clause(filter)
elif isinstance(filter, MetadataFilter):
formatted_value = (
f'"{filter.value}"'
if isinstance(filter.value, str)
else filter.value
)

match filter.operator:
case (
FilterOperator.EQ
| FilterOperator.NE
| FilterOperator.GT
| FilterOperator.GTE
| FilterOperator.LT
| FilterOperator.LTE
| FilterOperator.LIKE
| FilterOperator.ILIKE
):
subclause = (
f'<str>json_get(.metadata, "{filter.key}") '
f"{filter.operator.value} {formatted_value}"
)

case FilterOperator.IN | FilterOperator.NOT_IN:
subclause = (
f'<str>json_get(.metadata, "{filter.key}") '
f"{filter.operator.value} "
f"array_unpack({formatted_value})"
)

case FilterOperator.ANY | FilterOperator.ALL:
subclause = (
f"{filter.operator.value}"
f'(<str>json_get(.metadata, "{filter.key}") = '
f"array_unpack({formatted_value}))"
)

case FilterOperator.CONTAINS | FilterOperator.EXISTS:
subclause = (
f'contains(<str>json_get(.metadata, "{filter.key}"), '
f"{formatted_value})"
)
case _:
raise ValueError(f"Unknown operator: {filter.operator}")

subclauses.append(subclause)

if filters.condition in {FilterCondition.AND, FilterCondition.OR}:
filter_clause = f" {filters.condition.value} ".join(subclauses)
return (
"(" + filter_clause + ")" if len(subclauses) > 1 else filter_clause
)
else:
raise ValueError(f"Unknown condition: {filters.condition}")
Loading