[Python] Efficient way to iterate over groups #44676

Open
@MarcoGorelli

Description

Describe the enhancement requested

In Polars (and in pandas, where the method is spelled `groupby`), I can do:

dict(df.group_by(['a', 'b', 'c']).__iter__())

There doesn't seem to be a built-in way to do this in PyArrow, hence I'm opening this as a feature request.

Concretely, if I have

import pyarrow as pa
tbl = pa.table({'a': [1,1,3], 'b': [4, 4, 4], 'c': [1, 3, 2]})

then I'd like a way to end up with the following (here grouping by `a` and `b`):

{(3,
  4): pyarrow.Table
 a: int64
 b: int64
 c: int64
 ----
 a: [[3]]
 b: [[4]]
 c: [[2]],
 (1,
  4): pyarrow.Table
 a: int64
 b: int64
 c: int64
 ----
 a: [[1,1]]
 b: [[4,4]]
 c: [[1,3]]}

For context, this would be for use in Narwhals, where we have tried to come up with a workaround. That workaround slows down noticeably as the number of grouping keys grows: 3 keys are enough for it to be slower than pandas.

Component(s)

Python
