Reindex categorical codes from Polars #387
Changelog?
Does this need a test (maybe @stanmart can contribute one)? Seems like a subtle bug that's easy to re-introduce.
AFAICT |
Here's a mwe to reproduce the non-consecutive codes without loading/writing data:

```python
import polars as pl

with pl.StringCache():
    s1 = pl.Series(["beagle", "poodle", "labrador"], dtype=pl.Categorical)
    s2 = pl.Series(["labrador", "boxer", "beagle"], dtype=pl.Categorical)
    print("s1 categories: ", s1.cat.get_categories().to_numpy())
    print("s1 codes: ", s1.to_physical().to_numpy())
    print("s2 categories: ", s2.cat.get_categories().to_numpy())
    print("s2 codes: ", s2.to_physical().to_numpy())
```

Output:
It seems that |
It will be an issue, though: _extract_codes_and_categories(s2)
If we reconstruct the categorical from this, it will be |
Need to handle `polars.Series.cat.get_categories()` returning category labels whose codes are not necessarily in increasing order.
I implemented a different fix, which gets around the string cache by localising the series. It allows us to reconstitute both arrays in Martin's example. (My hunch is that Polars is using some sort of string caching when reading data from disk, which is why I thought that there might be gaps in the first place, but I did naively expect the categories to be in the right order. Thanks for catching that, @stanmart.)
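To make the idea of "localising" concrete, here is a minimal pure-Python sketch of rebuilding codes from the labels themselves, so that nothing depends on a shared cache. The helper name `localise` is hypothetical and this is not the Polars API or the code in this PR; it only illustrates the principle that codes derived from a series' own labels are consecutive and consistent with its category list.

```python
def localise(labels):
    """Hypothetical helper: derive codes and categories from the labels alone.

    Categories are listed in order of first appearance, and each code is the
    position of its label in that list, so codes always run 0..k-1.
    """
    categories = []   # labels in order of first appearance
    index = {}        # label -> position in `categories`
    codes = []
    for label in labels:
        if label not in index:
            index[label] = len(categories)
            categories.append(label)
        codes.append(index[label])
    return codes, categories


codes, categories = localise(["labrador", "boxer", "beagle", "boxer"])
# Every code is a valid index into `categories`.
roundtrip = [categories[c] for c in codes]
```

With codes built this way, `categories[code]` recovers the original label for every element, which is the property needed to reconstitute both arrays in the example.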
Polars support hasn't made it into any release yet. The existing log entry (under new features) should be enough. |
* Add dependencies to pixi.toml
* Pixi-ize pre-commit
* Add pixi tasks
* Update CI
* Fix build dependencies
* update lockfile
* Fix doctest
* Try to fix readthedocs
* Use latest pixi on conda-forge
* Find some minimum versions
* Bump minimum formulaic version
* Find minimum numpy version
* Make polars a test dependency
* Update lockfile
* Fix typing issues
* Fix benchmarks
* Update contributing docs
* Make ruff happy
* Remove unnecessary pre-commit option from CI
* first try
* Added deprecation, docstring
* replace from_pandas and from_polars
* keep sorting
* add narwhals to conda recipe
* bump minimum narwhals version
* added narwhals to setup.py
* Changelog
* Fix categoricals with non-numpy-or-pandas input
* Fix categoricals from numpy/list input
* Remove unnecessary import
* Merge fix from #387
* Bump minimum narwhals version
* Update tests
* Remove unnecessary argument
* Simplify `_extract_codes_and_categories`
* Make the check work with the new changes
* Import narwhals' stable v1 API

---------

Co-authored-by: Martin Stancsics <[email protected]>
When loading Parquet data, I realised that Polars doesn't necessarily start categorical codes at zero, which causes the kernel to crash in, e.g., `transpose_matvec_complex`. This PR adds reindexing for safety.
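As a rough illustration of what reindexing means here, the following pure-Python sketch maps arbitrary integer codes onto the consecutive range 0..k-1. The function name `reindex_codes` and the sample input are assumptions made for illustration; the actual change in this PR operates on arrays and may be implemented differently.

```python
def reindex_codes(codes):
    """Hypothetical helper: map arbitrary integer codes onto 0..k-1.

    Returns the dense codes together with the sorted original codes, so that
    originals[dense[i]] == codes[i] for every position i.
    """
    originals = sorted(set(codes))
    mapping = {old: new for new, old in enumerate(originals)}
    dense = [mapping[c] for c in codes]
    return dense, originals


# Illustrative input only: codes with a gap that do not start in order.
dense, originals = reindex_codes([2, 3, 0])
```

After reindexing, the largest code is always one less than the number of distinct codes, so downstream routines that use codes as array offsets cannot read out of bounds.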