BUG: fix error in write_dataframe when writing an empty or all-None object column with use_arrow #512



theroggy
Member

@theroggy theroggy commented Dec 21, 2024

When a dataframe is written with use_arrow and it contains an object column that has no rows or only None values, that column is converted to a null-type Arrow column. GDAL does not support the null type, so an error is raised.

Fixes #513

@brendan-ward brendan-ward changed the title TST: add test to show error when an empty object column is written uing arrow TST: add test to show error when an empty object column is written using arrow Dec 23, 2024
@brendan-ward
Member

I'm not clear on what the proper fix is going to be in this case. Should we instead raise our own error when writing an empty dataframe with one or more object dtype columns present using arrow, and direct the user to the non-arrow interface? Or should we fall back to using non-arrow ourselves when detecting an empty dataframe? There is no benefit to using arrow in this case.

@theroggy
Member Author

theroggy commented Dec 24, 2024

> I'm not clear on what the proper fix is going to be in this case. Should we instead raise our own error when writing an empty dataframe with one or more object dtype columns present using arrow, and direct the user to the non-arrow interface? Or should we fall back to using non-arrow ourselves when detecting an empty dataframe? There is no benefit to using arrow in this case.

Yes, I didn't have a clear idea yet either... The thing I was wondering about (as indicated in #513) was whether it was a very conscious choice in pyarrow.Table.from_pandas to convert an object column to the null datatype, as all other datatypes (int, ...) are retained as such. object is obviously a very special case, so I understand that it is treated differently from int, ..., but the null datatype doesn't seem super useful to me (I might be wrong)...

It is a good point, however, that arrow doesn't add much value if there is no data to be written, so an easy fix could be to detect up front that the dataframe is empty and disable the use of arrow...
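
The fallback idea discussed here could look roughly like the sketch below. `resolve_use_arrow` is a hypothetical helper invented for illustration; it is not part of pyogrio, and it only covers the zero-row case.

```python
import pandas as pd


def resolve_use_arrow(df: pd.DataFrame, use_arrow: bool) -> bool:
    """Hypothetical helper: only keep the Arrow write path when there are rows."""
    # With zero rows Arrow offers no speed-up, so fall back to the non-Arrow path.
    return use_arrow and len(df) > 0


empty_df = pd.DataFrame({"col": pd.Series([], dtype="object")})
filled_df = pd.DataFrame({"col": ["a", "b"]})

print(resolve_use_arrow(empty_df, use_arrow=True))
print(resolve_use_arrow(filled_df, use_arrow=True))
```

A check like this leaves any non-empty dataframe on the Arrow path, whatever its columns contain.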

…w-error-when-an-empty-object-column-is-written-using-arrow
@theroggy theroggy changed the title TST: add test to show error when an empty object column is written using arrow BUG: fix error when an empty object column is written with use_arrow Jan 23, 2025
@theroggy
Member Author

theroggy commented Jan 23, 2025

I found an extra, related problem. The same error occurs with object type columns that contain only None values: these are converted to a null type column as well by pyarrow.Table.from_pandas.

I also found a fix that solves both issues: convert all null-type columns to string type.

@theroggy theroggy self-assigned this Jan 26, 2025
@theroggy theroggy changed the title BUG: fix error when an empty object column is written with use_arrow BUG: fix error when an empty or all-None object column is written with use_arrow Jan 26, 2025
@theroggy theroggy changed the title BUG: fix error when an empty or all-None object column is written with use_arrow BUG: fix error in write_dataframe when writing an empty or all-None object column with use_arrow Jan 26, 2025
@theroggy theroggy requested a review from brendan-ward January 26, 2025 20:33
@theroggy theroggy added this to the 0.11.0 milestone Apr 10, 2025
@jorisvandenbossche
Member

> The thing I was wondering about was whether it was a very conscious choice in pyarrow.Table.from_pandas to convert an object column to the null datatype, as all other datatypes (int, ...) are retained as such. object is obviously a very special case, so I understand that it is treated differently from int, ..., but the null datatype doesn't seem super useful to me (I might be wrong)...

This is a conscious choice, yes, AFAIK (although it was already like that before my involvement in pyarrow). For other data types in pandas like int, there is a clear equivalent in Arrow, so the type can be retained even for empty dataframes. But "object" dtype has no equivalent in Arrow, so the resulting Arrow type always has to be inferred from the data when converting from pandas to Arrow. However, if the column is empty or all-None, there is no data to infer the type from. At that point there is no ideal choice, but the benefit of using the "null" type is that it essentially does not make a choice, and the type is then not "viral": if you inferred it as string when it should actually have been something else, you could no longer combine the string column with an int column, while with a null column this can still work.

Now in practice, given that object dtype in pandas is often used for strings, and given that GDAL does not support null so we have to do some conversion to make it work, I think casting to string as you now do in the PR is indeed the best solution. In the non-arrow write path we also convert the values of an object dtype column to their string representation anyway, so this makes both write paths more consistent.

@jorisvandenbossche jorisvandenbossche merged commit 98bb7cd into geopandas:main Apr 27, 2025
25 checks passed
Development

Successfully merging this pull request may close these issues.

BUG: writing an empty dataframe with an object column with use_arrow fails
3 participants