BUG: fix error in write_dataframe when writing an empty or all-None object column with use_arrow #512
I'm not clear on what the proper fix should be in this case. Should we raise our own error when writing an empty dataframe that has one or more object dtype columns using arrow, and direct the user to the non-arrow interface? Or should we fall back to the non-arrow path ourselves when we detect an empty dataframe? There is no benefit to using arrow in this case.
Yes, I didn't have a clear idea yet either... The thing I was wondering about (as indicated in #513) was whether it was a very conscious choice in […]. It is a good point, however, that arrow doesn't add much value if there is no data to be written, so an easy fix could be to detect upfront that the dataframe is empty and disable the use of arrow...
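The fallback idea floated above could be sketched roughly as follows. This is only an illustration of the "detect empty upfront and disable arrow" option, not what the PR ends up doing: `should_use_arrow` is a hypothetical helper name, and a plain pandas `DataFrame` stands in for the GeoDataFrame that would be written.

```python
import pandas as pd


def should_use_arrow(df: pd.DataFrame, use_arrow: bool) -> bool:
    """Hypothetical helper: decide whether the arrow write path is safe to use.

    Falls back to the non-arrow path when the frame has no rows and contains
    at least one object dtype column, the case reported in this thread.
    """
    if not use_arrow:
        return False
    has_object_column = any(
        pd.api.types.is_object_dtype(dtype) for dtype in df.dtypes
    )
    # An empty frame gains nothing from arrow, so falling back costs nothing.
    return not (len(df) == 0 and has_object_column)


empty = pd.DataFrame({"name": pd.Series([], dtype="object")})
filled = pd.DataFrame({"name": pd.Series(["a", "b"], dtype="object")})

print(should_use_arrow(empty, use_arrow=True))
print(should_use_arrow(filled, use_arrow=True))
```

A check like this only looks at the row count, so it would not catch a non-empty frame whose object column holds nothing but None values.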
…w-error-when-an-empty-object-column-is-written-using-arrow
I found an extra, related problem. The same error occurs with object type columns that contain only None values: these are converted to a null type column as well by […]. I also found a fix that solves both issues: convert all null-type columns to string type.
This is a conscious choice, yes, AFAIK (although it was already like that before my involvement in pyarrow). For other data types in pandas, like int, there is a clear equivalent in Arrow, so the type can be retained even for empty dataframes. But "object" dtype has no equivalent in Arrow, and thus the resulting arrow type always has to be "inferred" from the data when converting from pandas to arrow. However, if the column is empty or all-None, there is no data to infer from. At that point there is no ideal choice, but the benefit of using the "null" type is that it essentially does not make a choice, and the type is then not "viral" (if you would infer it as […]).

Now in practice, given that object dtype in pandas is often used for strings, and given that GDAL does not support null and so we have to do some conversion to make it work, I think casting to string as you did now in the PR is indeed the best solution. In practice, if you have an object dtype column, we also convert it to its string representation in the non-arrow write path anyway, so this makes the two write paths more consistent.
When a dataframe is written with an object column that has no rows or contains only None values, the object column is converted to a null type arrow column, which is not supported by GDAL and causes an error to be raised.
Fixes #513