The in operator in ResultSetCollection.append causes problems with numpy arrays #1049

Open
wants to merge 7 commits into
base: master

@kesmit13 kesmit13 commented Feb 20, 2025

Describe your changes

Using the `in` operator to test for equal results that contain numpy arrays raises the following error:

The truth value of an array with more than one element is ambiguous

This happens because `np.array == np.array` returns an `np.array`, not a `bool`. The SingleStoreDB database uses numpy arrays for vector values.
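The failure mode described above can be reproduced without numpy. The following is a minimal sketch, not the project's code: `FakeArray` is a hypothetical stand-in that mimics only the two relevant numpy behaviours (elementwise `==` and a refusal to convert a multi-element result to a bool), and `membership_error` is an illustrative helper.

```python
class FakeArray:
    """Hypothetical stand-in for a NumPy array: == compares elementwise."""

    def __init__(self, values):
        self.values = list(values)

    def __eq__(self, other):
        # Return per-element results instead of a single bool.
        return FakeArray(a == b for a, b in zip(self.values, other.values))

    def __bool__(self):
        # Mimic NumPy for multi-element arrays: no single truth value.
        if len(self.values) != 1:
            raise ValueError(
                "The truth value of an array with more than one element is ambiguous"
            )
        return bool(self.values[0])


def membership_error(item, container):
    """Return the message raised by `item in container`, or None."""
    try:
        item in container
    except ValueError as exc:
        return str(exc)
    return None


stored = [(FakeArray([1, 2]),)]
equal_copy = (FakeArray([1, 2]),)

# `in` evaluates == on each element and needs a bool, which the array refuses.
error = membership_error(equal_copy, stored)

# The very same object is found by identity, so == is never evaluated.
same_object_found = stored[0] in stored
```

Python's membership test checks identity before equality, which is why looking up the identical object succeeds while an equal but distinct result raises.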

Issue number

None

Checklist before requesting a review


📚 Documentation preview 📚: https://jupysql--1049.org.readthedocs.build/en/1049/

@kesmit13 kesmit13 requested a review from edublancas as a code owner February 20, 2025 19:18
@kesmit13 kesmit13 changed the title The in operator in ResultSetCollection causes problems with numpy arrays [Draft] The in operator in ResultSetCollection causes problems with numpy arrays Feb 20, 2025
@kesmit13 kesmit13 marked this pull request as draft February 21, 2025 15:20
@kesmit13 kesmit13 changed the title [Draft] The in operator in ResultSetCollection causes problems with numpy arrays The in operator in ResultSetCollection causes problems with numpy arrays Feb 21, 2025
@kesmit13 kesmit13 marked this pull request as ready for review February 21, 2025 16:20
@kesmit13 kesmit13 changed the title The in operator in ResultSetCollection causes problems with numpy arrays The in operator in ResultSetCollection.append causes problems with numpy arrays Feb 21, 2025
Comment on lines +133 to +140
for idx in reversed(
    [
        i
        for i, item in enumerate(self._result_sets)
        if all(starmap(_eq, zip(_results(result), _results(item))))
    ]
):
    self._result_sets.pop(idx)

can you add some comments to explain what this is doing?

Author

I'll explain here and let you comment on it first. This replaces `if result in self._result_sets: self._result_sets.remove(result)`, which removes the first item equal to `result` from the result set list. I'm not actually clear on why this is done, but the `in` operator requires `==` to return a bool, and that doesn't happen with `np.array`, `pd.Series`, etc.

The new code iterates over `result` and `item` in parallel and compares each pair of elements with the `_eq` function, which returns a bool for both atomic values and array comparisons. If every `_eq` call returns true, the index of that item is added to the resulting list. The final list of indexes is reversed so that the items can be popped from `self._result_sets` without shifting the indexes of the remaining items.

Comment on lines +1076 to +1092
def test_result_set_collection_append_numpy():
    try:
        import numpy as np

        a1 = (np.array([1, 2]),)
        a2 = (np.array([3, 4]),)

        collection = ResultSetCollection()
        collection.append(a1)
        collection.append(a2)

        assert len(collection._result_sets) == 2
        assert collection._result_sets[0] is a1
        assert collection._result_sets[1] is a2

    except ImportError:
        pass

let's add numpy as a dev dependency and get rid of the ImportError:

DEV = [

Comment on lines 1068 to 1072
 def test_result_set_collection_append():
     collection = ResultSetCollection()
-    collection.append(1)
-    collection.append(2)
+    collection.append((1,))
+    collection.append((2,))

let's keep the original test and add a new one

Author

This was changed because the code in `append` needs the values to be iterable objects. That is also more consistent with the result objects that actually get appended from the connection. I'm not sure there is a real-world case where the value is atomic, as in the original test.
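A short sketch of why atomic values are a problem for a pairwise comparison. It assumes the check is `zip`-based, as in the diff excerpt earlier in this thread. `rows_equal` and `try_compare` are hypothetical names, and the real code passes values through `_results(...)` first, which is not shown and may behave differently.

```python
def rows_equal(left, right):
    """Compare two results element by element, as a zip-based check would."""
    return all(a == b for a, b in zip(left, right))


def try_compare(left, right):
    """Return the comparison result, or the name of the error it raised."""
    try:
        return rows_equal(left, right)
    except TypeError:
        return "TypeError"


# Plain integers are not iterable, so zip cannot pair them up.
atomic_outcome = try_compare(1, 1)

# One-element tuples, as in the updated tests, compare without error.
tuple_outcome = try_compare((1,), (1,))
```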

Comment on lines 1095 to 1108
 def test_result_set_collection_iterate():
     collection = ResultSetCollection()
-    collection.append(1)
-    collection.append(2)
+    collection.append((1,))
+    collection.append((2,))

-    assert list(collection) == [1, 2]
+    assert list(collection) == [(1,), (2,)]


 def test_result_set_collection_is_last():
     collection = ResultSetCollection()
-    first, second = object(), object()
+    first, second = (object(),), (object(),)
     collection.append(first)

     assert len(collection) == 1

same, keep old tests, add new ones

2 participants