[Python] Efficient way to iterate over groups #44676

Open
MarcoGorelli opened this issue Nov 7, 2024 · 0 comments

Comments

@MarcoGorelli
Contributor

Describe the enhancement requested

In Polars (or pandas, where the method is spelled groupby), I can do:

dict(df.group_by(['a', 'b', 'c']).__iter__())

There doesn't seem to be a built-in way to do this in PyArrow, so I'm opening this as a feature request.

Concretely, if I have

import pyarrow as pa
tbl = pa.table({'a': [1,1,3], 'b': [4, 4, 4], 'c': [1, 3, 2]})

then, grouping by a and b, I'd like a way to end up with

{(3,
  4): pyarrow.Table
 a: int64
 b: int64
 c: int64
 ----
 a: [[3]]
 b: [[4]]
 c: [[2]],
 (1,
  4): pyarrow.Table
 a: int64
 b: int64
 c: int64
 ----
 a: [[1,1]]
 b: [[4,4]]
 c: [[1,3]]}

For context, this would be for use in Narwhals, where we have tried to come up with a workaround. That workaround slows down noticeably as the number of grouping keys grows: 3 keys is enough for it to be slower than pandas.

Component(s)

Python

@raulcd raulcd changed the title Efficient way to iterate over groups [Python] Efficient way to iterate over groups Nov 12, 2024