GH-29642: [R] Support for .keep_all = TRUE with distinct() #44652

nealrichardson · 2024-11-05T21:39:27Z

Rationale for this change

Supports a missing feature: `.keep_all = TRUE` in `distinct()`. This mostly wires up some existing pieces from R to Acero, then adds docs and tests.

This is mostly picking up where #13934 started and finishing it out. Thanks @mopcup for the initial lift.

What changes are included in this PR?

An aggregation binding, some symbol manipulation, and tests. I also cleaned up some dplyr test shims from 2022.

Are these changes tested?

Yes, though if anyone knows of odd corners in distinct() that aren't covered by this, we can add more.

Are there any user-facing changes?

Yes indeed.

GitHub Issue: [R] Support for .keep_all = TRUE with distinct() #29642

github-actions · 2024-11-05T21:39:56Z

⚠️ GitHub issue #29642 has been automatically assigned in GitHub to PR creator.

jonkeane

Thanks for this! Mostly questions about messaging + conveying some of the nuances

jonkeane · 2024-11-10T13:14:35Z

r/tests/testthat/test-dplyr-distinct.R

+    # Drop factor because of #44661:
+    # NotImplemented: Function 'hash_one' has no kernel matching input types
+    #   (dictionary<values=string, indices=int8, ordered=0>, uint8)


Is the error on lines 110-111 what someone would get if they tried distinct(..., .keep_all = TRUE) with a factor in the table/data.frame?

We might want to make that a bit nicer / more grokable for folks who might not have the dictionary -> factor knowledge top of mind
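
One possible shape for a friendlier message, as a minimal sketch: check for factor columns up front and fail with an R-level error instead of surfacing the `hash_one` kernel message. `stop_on_factor_columns()` is a hypothetical helper, not an existing arrow function, and it assumes that any factor column in the data can trigger the error.

```r
# Hypothetical helper for illustration only; not part of the arrow package.
# Assumes every factor column in `data` would hit the hash_one limitation.
stop_on_factor_columns <- function(data) {
  # Find which columns are factors (Arrow dictionary type on the C++ side).
  is_fct <- vapply(data, is.factor, logical(1))
  if (any(is_fct)) {
    stop(
      "distinct(.keep_all = TRUE) is not yet supported for factor columns: ",
      paste(names(data)[is_fct], collapse = ", "),
      ". Convert them to character first.",
      call. = FALSE
    )
  }
  # No factor columns: hand the data back unchanged.
  invisible(data)
}
```

Calling it on a data.frame with a factor column `a` would raise an error naming `a`; a data.frame without factors passes through untouched.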

jonkeane · 2024-11-10T13:18:11Z

r/R/dplyr-distinct.R

+    # Note: in regular dplyr, `.keep_all = TRUE` returns the first row's value.
+    # However, Acero's `hash_one` function prefers returning non-null values.
+    # So, you'll get the same shape of data, but the values may differ.


This behavior change is probably either not impactful or, if folks are relying on it, actually a bug in their code. It does seem like something we should mention, though (in docs at least?).

Or maybe with a one-time warning?
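
For illustration, a small sketch of how the two behaviors can differ. The first result uses plain dplyr; the second only emulates a "prefer non-null" pick with `summarise()` and is an assumption about the outcome, not Acero's actual `hash_one` implementation.

```r
library(dplyr)

df <- tibble(
  g = c("a", "a", "b"),
  x = c(NA, 1, 2)
)

# Regular dplyr: keep the first row of each group, NA included.
first_row <- df %>% distinct(g, .keep_all = TRUE)
# first_row$x is NA, 2

# Assumed emulation of "prefer non-null": take the first non-NA value in
# each group, falling back to the first value if the whole group is NA.
prefer_non_null <- df %>%
  group_by(g) %>%
  summarise(x = if (all(is.na(x))) x[1] else x[!is.na(x)][1])
# prefer_non_null$x is 1, 2
```

Both results have one row per value of `g` and the same columns; only the value of `x` for group "a" differs.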

nealrichardson added 4 commits November 5, 2024 16:03

Remove dplyr test shims from 2022

7a93029

Bring in logic from apache#13934 and add a basic test

f36da90

A couple more tests

ac69fba

Update doc note and comments

1c470cc

nealrichardson requested review from jonkeane and thisisnic as code owners November 5, 2024 21:39

github-actions bot added the Component: R and awaiting review labels Nov 5, 2024

💅

151db3a

nealrichardson mentioned this pull request Nov 6, 2024

[C++][Acero] hash_one not implemented for dictionary and other types #44661

Open

Add issue link to factor issue

a89c1fe

jonkeane reviewed Nov 10, 2024

github-actions bot added the awaiting changes label and removed the awaiting review label Nov 10, 2024
