[Python] Creating a table from a sliced struct array drops the slice #44731

Open
joseph-isaacs opened this issue Nov 14, 2024 · 1 comment

@joseph-isaacs

Describe the bug, including details regarding any error messages, version, and platform.

On pyarrow 17.0.0, creating a table from a sliced struct array ignores the slice bounds:

>>> import pyarrow as pa
>>> pa.table(pa.array([{'a': 0}, {'a': 1}, {'a': 2}]).slice(0, 1))
pyarrow.Table
a: int64
----
a: [[0,1,2]]

I expect

a: [[0]]

Component(s)

Python

@raulcd raulcd changed the title Creating a table from a sliced struct array drops the slice [Python] Creating a table from a sliced struct array drops the slice Nov 14, 2024
@raulcd
Member

raulcd commented Nov 14, 2024

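This is the current behavior on main too. slice() computes a zero-copy slice of the array by updating the length and/or offset where necessary. As the inspect() output shows, only the struct array itself is updated; its child arrays keep their full length: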

>>> import pyarrow as pa
>>> import nanoarrow as na
>>> original_array = pa.array([{'a': 0}, {'a': 1}, {'a': 2}])
>>> sliced_array = original_array.slice(0,1)
>>> sliced_array
<pyarrow.lib.StructArray object at 0x764b3cf5d2a0>
-- is_valid: all not null
-- child 0 type: int64
  [
    0
  ]
>>> na.array(sliced_array).inspect()
<ArrowArray struct<a: int64>>
- length: 1
- offset: 0
- null_count: 0
- buffers[1]:
  - validity <bool[0 b] >
- dictionary: NULL
- children[1]:
  'a': <ArrowArray int64>
    - length: 3
    - offset: 0
    - null_count: 0
    - buffers[2]:
      - validity <bool[0 b] >
      - data <int64[24 b] 0 1 2>
    - dictionary: NULL
    - children[0]:
>>> na.array(original_array).inspect()
<ArrowArray struct<a: int64>>
- length: 3
- offset: 0
- null_count: 0
- buffers[1]:
  - validity <bool[0 b] >
- dictionary: NULL
- children[1]:
  'a': <ArrowArray int64>
    - length: 3
    - offset: 0
    - null_count: 0
    - buffers[2]:
      - validity <bool[0 b] >
      - data <int64[24 b] 0 1 2>
    - dictionary: NULL
    - children[0]:
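
The two inspect() dumps above differ only in the parent's length: the child 'a' still reports length 3 in both. A toy model in plain Python may help show why reading the children directly loses the slice. This is only a sketch, not Arrow's actual data structures; `ToyArray`, `logical_values` and `naive_child_values` are hypothetical names made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToyArray:
    """Toy stand-in for an Arrow array: an (offset, length) window over shared storage."""
    values: list                 # backing data, shared between slices and never copied
    length: int
    offset: int = 0
    children: dict = field(default_factory=dict)

    def slice(self, offset, length):
        # Zero-copy: only the parent's window changes; values and children are reused.
        return ToyArray(self.values, length, self.offset + offset, self.children)


def naive_child_values(struct, name):
    """Read a child while ignoring the parent's window (the behavior reported here)."""
    child = struct.children[name]
    return child.values[child.offset:child.offset + child.length]


def logical_values(struct, name):
    """Read a child through the parent's window (what the reporter expects)."""
    child = struct.children[name]
    start = child.offset + struct.offset
    return child.values[start:start + struct.length]


child = ToyArray(values=[0, 1, 2], length=3)
original = ToyArray(values=[], length=3, children={"a": child})
sliced = original.slice(0, 1)

print(naive_child_values(sliced, "a"))   # [0, 1, 2]
print(logical_values(sliced, "a"))       # [0]
```

In this model the slice lives entirely on the parent, so any consumer that takes the children as columns without applying the parent's offset and length sees all three values.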

I understand the use case, but I am unsure what the behavior should be when generating the RecordBatch from an array whose offset has been changed by the slice. For example:

>>> pa.table(pa.array([{'a': 0}, {'a': 1}, {'a': 2}]).slice(1,2))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow/table.pxi", line 6172, in pyarrow.lib.table
    batch = record_batch(data, schema)
  File "pyarrow/table.pxi", line 5991, in pyarrow.lib.record_batch
    batch = RecordBatch._import_from_c_device_capsule(schema_capsule, array_capsule)
  File "pyarrow/table.pxi", line 4002, in pyarrow.lib.RecordBatch._import_from_c_device_capsule
    batch = GetResultValue(ImportDeviceRecordBatch(c_array, c_schema))
  File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
    return check_status(status)
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
    raise convert_status(status)
pyarrow.lib.ArrowInvalid: ArrowArray struct has non-zero offset, cannot be imported as RecordBatch
>>> pa.array([{'a': 0}, {'a': 1}, {'a': 2}]).slice(1,2)
<pyarrow.lib.StructArray object at 0x764b3cf5d8a0>
-- is_valid: all not null
-- child 0 type: int64
  [
    1,
    2
  ]
>>>

@jorisvandenbossche @pitrou
