[data/proprocessors] Support flattening vector features in concatenator #51757

Open
rclough opened this issue Mar 27, 2025 · 3 comments
Labels
bug: Something that is supposed to be working; but isn't
data: Ray Data-related issues
triage: Needs triage (eg: priority, bug/not-bug, and owning component)

Comments

rclough commented Mar 27, 2025

Description

When you use Concatenator in combination with preprocessors that create vector feature columns (such as OneHotEncoder or MultiHotEncoder), the output of Concatenator is not flattened. This is arguably correct behavior, but it is not really documented. However, the goal of Concatenator is typically to provide tensor inputs to models, which in many cases expect flat tensors of floats.

Based on offline discussions in the Ray Slack, I'd like to propose a flatten flag for Concatenator that optionally flattens any vector columns in place within the output vector. I will follow up soon with an implementation and tests in a PR.
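
To make the proposal concrete, here is a minimal sketch of what flattening could look like, written against plain pandas and NumPy rather than Ray. The helper name `concat_flatten` and the column names are hypothetical and only for illustration; this is not the Concatenator implementation and not necessarily how the eventual PR will do it.

```python
import numpy as np
import pandas as pd


def concat_flatten(df: pd.DataFrame, columns: list[str], dtype=np.float32) -> np.ndarray:
    """Hypothetical helper (not Ray's Concatenator): one flat vector per row."""
    parts = []
    for name in columns:
        # Stack the cells of this column into a 2-D block. Scalar cells give a
        # (n_rows, 1) block; vector cells of length k give a (n_rows, k) block.
        block = np.stack([np.atleast_1d(np.asarray(v, dtype=dtype)) for v in df[name]])
        parts.append(block)
    # Join the blocks side by side so that every row is a single flat vector.
    return np.concatenate(parts, axis=1)


# Made-up example data: one scalar column and one one-hot style vector column.
df = pd.DataFrame({
    "age": [31.0, 45.0],
    "color_onehot": [[1, 0, 0], [0, 0, 1]],
})
flat = concat_flatten(df, ["age", "color_onehot"])
print(flat.shape)  # (2, 4)
```

The sketch assumes every cell in a vector column has the same length, which holds for one-hot style encodings with a fixed category set.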

Use case

When using encoder preprocessors that output a vector column, we want to flatten the columns in the final concatenate step for input to the model.

@rclough rclough added enhancement Request for new feature and/or capability triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Mar 27, 2025
@richardliaw richardliaw changed the title [Ray Data: Preprocessors] Support flattening vector features in concatenator [data/proprocessors] Support flattening vector features in concatenator Mar 28, 2025
@jcotant1 jcotant1 added the data Ray Data-related issues label Mar 29, 2025
@martinbomio
Contributor

The Concatenator now errors out when used on a column encoded with OneHotEncoder:

00:25:04.629      concatenated = df[self.columns].to_numpy(dtype=self.dtype)
00:25:04.629    File "/home/ray/.venv/lib/python3.10/site-packages/pandas/core/frame.py", line 1993, in to_numpy
00:25:04.629      result = self._mgr.as_array(dtype=dtype, copy=copy, na_value=na_value)
00:25:04.629    File "/home/ray/.venv/lib/python3.10/site-packages/pandas/core/internals/managers.py", line 1694, in as_array
00:25:04.629      arr = self._interleave(dtype=dtype, na_value=na_value)
00:25:04.629    File "/home/ray/.venv/lib/python3.10/site-packages/pandas/core/internals/managers.py", line 1753, in _interleave
00:25:04.629      result[rl.indexer] = arr
00:25:04.629  ValueError: setting an array element with a sequence.
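
The failing call in the traceback is `df[self.columns].to_numpy(dtype=self.dtype)`, so the error can likely be reproduced with pandas alone. The sketch below is an assumption about the shape of the data (a scalar column next to a column whose cells are lists); the column names are made up and Ray is not involved.

```python
import numpy as np
import pandas as pd

# Made-up frame: "color_onehot" holds one list per row, as an encoder that
# emits vector features would produce.
df = pd.DataFrame({
    "age": [31.0, 45.0],
    "color_onehot": [[1, 0, 0], [0, 0, 1]],
})

try:
    # Same call pattern as in the traceback: coerce mixed columns to one dtype.
    df[["age", "color_onehot"]].to_numpy(dtype=np.float32)
    error = None
except (ValueError, TypeError) as exc:
    # The traceback above shows a ValueError; TypeError is caught as well in
    # case another pandas/NumPy version reports the failure differently.
    error = f"{type(exc).__name__}: {exc}"

print(error)
```

Each list cell cannot be written into a single float slot of the output array, which is why the conversion fails before any concatenation happens.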

@rclough
Author

rclough commented Mar 31, 2025

+1, Concatenator cannot handle vector output from encoders, so this should probably be tagged as a bug (cc @jcotant1 ?)

@jcotant1
Member

Thanks for the flag, updated this to a bug cc: @richardliaw

@jcotant1 jcotant1 added bug Something that is supposed to be working; but isn't and removed enhancement Request for new feature and/or capability labels Mar 31, 2025