The `in` operator in `ResultSetCollection.append` causes problems with numpy arrays #1049

kesmit13 · 2025-02-20T19:18:58Z

Describe your changes

When the `in` operator is used to test results that contain numpy arrays for equality, it raises the following error:

The truth value of an array with more than one element is ambiguous

This happens because `np.array == np.array` returns an element-wise `np.array`, not a `bool`. SingleStoreDB uses numpy arrays for vector values.
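A minimal reproduction of the behaviour described above, assuming numpy is installed. The row tuples are illustrative stand-ins for result rows and are not taken from the jupysql code:

```python
import numpy as np

# Two "rows", each holding one vector value (illustrative stand-ins).
row_a = (np.array([1, 2]),)
row_b = (np.array([3, 4]),)

# == on arrays is element-wise: the result is an array, not a single bool.
elementwise = row_a[0] == row_b[0]
print(type(elementwise).__name__, elementwise.tolist())

# `in` needs one truth value per comparison, so it fails on multi-element arrays.
try:
    row_b in [row_a]
    raised = False
except ValueError as exc:
    raised = True
    print(f"ValueError: {exc}")
```

The membership test compares the tuples item by item, and that is where the ambiguous truth value is raised.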

Issue number

None

Checklist before requesting a review

Performed a self-review of my code
Formatted my code with pkgmt format
Added tests (when necessary).
Added docstring documentation and update the changelog (when needed)

📚 Documentation preview 📚: https://jupysql--1049.org.readthedocs.build/en/1049/

edublancas · 2025-02-22T00:36:19Z

src/sql/connection/connection.py

+        for idx in reversed(
+            [
+                i
+                for i, item in enumerate(self._result_sets)
+                if all(starmap(_eq, zip(_results(result), _results(item))))
+            ]
+        ):
+            self._result_sets.pop(idx)


can you add some comments to explain what this is doing?

I'll explain here and let you comment on it first. This replaces `if result in self._result_sets: self._result_sets.remove(result)`, which removes the first item equal to `result` from the result set list. I'm not actually clear on why this is done, but the `in` operator needs the `==` comparison to produce a value that can be used as a bool. That doesn't happen with `np.array`, `pd.Series`, etc.

The new code iterates over `result` and `item` in parallel and compares each pair of values using the `_eq` function, which returns a bool for both atomic values and array comparisons. If every `_eq` call returns true, the index of that item is added to the resulting list. The final list of indexes is reversed so that the items can be popped from `self._result_sets` without shifting the indexes of the remaining items.

Just to clarify, this isn't a completely backwards-compatible change. It requires that `_result_sets` contain only iterable rows of data. I believe this is the only use of this object in the code, except in the unit tests, where scalar values are used, which is why I also changed the unit tests.

Another option, which would make this change backwards compatible, would be to wrap the original version of the code in a `try`/`except` block and only switch to the new version if the exception caused by numpy arrays occurs.
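A rough sketch of that fallback idea, assuming the ambiguity error surfaces as a `ValueError` (which is what numpy raises for multi-element arrays). The names `remove_existing` and `_rows_equal` are hypothetical and operate on a plain list rather than the real `ResultSetCollection`:

```python
import numpy as np


def _rows_equal(left, right):
    """Compare two rows value by value, always yielding a plain bool."""
    return len(left) == len(right) and all(
        np.array_equal(x, y) for x, y in zip(left, right)
    )


def remove_existing(result_sets, result):
    """Drop entries equal to `result`, falling back only when == is ambiguous."""
    try:
        # Original behaviour: fine whenever == yields something usable as a bool.
        if result in result_sets:
            result_sets.remove(result)
    except ValueError:
        # Ambiguous truth value from an array comparison: compare value by value.
        result_sets[:] = [
            item for item in result_sets if not _rows_equal(item, result)
        ]


scalars = [1, 2]
remove_existing(scalars, 1)

rows = [(np.array([1, 2]),), (np.array([3, 4]),)]
remove_existing(rows, (np.array([1, 2]),))
```

With this shape, scalar entries keep working through the original path, while the fallback still assumes that the entries are sized iterables whenever arrays are involved.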

edublancas · 2025-02-22T00:36:59Z

src/tests/test_connection.py

+def test_result_set_collection_append_numpy():
+    try:
+        import numpy as np
+
+        a1 = (np.array([1, 2]),)
+        a2 = (np.array([3, 4]),)
+
+        collection = ResultSetCollection()
+        collection.append(a1)
+        collection.append(a2)
+
+        assert len(collection._result_sets) == 2
+        assert collection._result_sets[0] is a1
+        assert collection._result_sets[1] is a2
+
+    except ImportError:
+        pass


let's add numpy as a dev dependency and get rid of the ImportError:

jupysql/setup.py

Line 33 in 7e02910

DEV = [

edublancas · 2025-02-22T00:37:15Z

src/tests/test_connection.py

 def test_result_set_collection_append():
     collection = ResultSetCollection()
-    collection.append(1)
-    collection.append(2)
+    collection.append((1,))
+    collection.append((2,))



let's keep the original test and add a new one

This was changed because the code in `append` needs the values to be iterable objects. That is more consistent with the result objects that actually get appended from the connection. I'm not sure there is a case in real-world code where the value is atomic, as in the original test.

edublancas · 2025-02-22T00:37:49Z

src/tests/test_connection.py

 def test_result_set_collection_iterate():
     collection = ResultSetCollection()
-    collection.append(1)
-    collection.append(2)
+    collection.append((1,))
+    collection.append((2,))

-    assert list(collection) == [1, 2]
+    assert list(collection) == [(1,), (2,)]


 def test_result_set_collection_is_last():
     collection = ResultSetCollection()
-    first, second = object(), object()
+    first, second = (object(),), (object(),)
     collection.append(first)

     assert len(collection) == 1


same, keep old tests, add new ones

edublancas · 2025-03-10T15:47:09Z

I'm closing this because it's taking too long, I don't fully understand the changes, and because some of my comments are >2 weeks old and have not been addressed. I don't have the resources to keep checking on this

kesmit13 added 2 commits February 20, 2025 12:48

Do not use the in operator; it causes problems with numpy arrays

9344243

Fix formatting; add tests

fe03429

kesmit13 requested a review from edublancas as a code owner February 20, 2025 19:18

kesmit13 added 3 commits February 20, 2025 13:57

Add changelog

c3a81f8

Fix append test

d29916e

Fix iterate test

3346657

kesmit13 changed the title ~~The in operator in ResultSetCollection causes problems with numpy arrays~~ [Draft] The in operator in ResultSetCollection causes problems with numpy arrays Feb 20, 2025

kesmit13 marked this pull request as draft February 21, 2025 15:20

kesmit13 changed the title ~~[Draft] The in operator in ResultSetCollection causes problems with numpy arrays~~ The in operator in ResultSetCollection causes problems with numpy arrays Feb 21, 2025

kesmit13 added 2 commits February 21, 2025 09:42

Fix removal loop

450ee64

Do not iterate over ResultSet directly

7a4e622

kesmit13 marked this pull request as ready for review February 21, 2025 16:20

kesmit13 changed the title ~~The in operator in ResultSetCollection causes problems with numpy arrays~~ The in operator in ResultSetCollection.append causes problems with numpy arrays Feb 21, 2025

edublancas requested changes Feb 22, 2025

kesmit13 and others added 2 commits March 10, 2025 09:20

Make numpy test only trigger as needed

00269f9

Merge branch 'ploomber:master' into numpy-support

8e3b089

edublancas closed this Mar 10, 2025
