kyrre changed the title from "Force casting dynamic types to string when using read_json with an explicit schema" to "[Python] Force casting dynamic types to string when using read_json with an explicit schema" on Feb 19, 2025
Describe the usage question you have. Please include as many useful details as possible.
We want to use PyArrow for ETL jobs where JSON files are periodically read from Azure Blob Storage and inserted into Delta Lake tables. While the schemas are available, some of the columns have a "dynamic type"; e.g., we could have two rows in which the ActivityObjects column has these values:
ActivityObjects -> [{"TargetUser": 1, "OperationType": "NetworkShareCreation"}, ..., ]
ActivityObjects -> [{"MachineId": "05-10-15"}, ..., ]
The way we have dealt with this in Spark is just to treat ActivityObjects as array<string> (or even string) and do any additional parsing at query time.

However, if we try to do the same with PyArrow:
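The failing call looks roughly like the sketch below (a minimal reproduction, not the original snippet; the inline sample data and the single-field schema are assumptions made for illustration):

```python
import io

import pyarrow as pa
from pyarrow import json as pa_json

# Two rows in which ActivityObjects has a different "dynamic" shape.
data = io.BytesIO(
    b'{"ActivityObjects": [{"TargetUser": 1, "OperationType": "NetworkShareCreation"}]}\n'
    b'{"ActivityObjects": [{"MachineId": "05-10-15"}]}\n'
)

# Explicit schema that forces ActivityObjects to a plain string.
schema = pa.schema([pa.field("ActivityObjects", pa.string())])

# Fails: the parser encounters a JSON list where the explicit schema
# declares a string, instead of passing the raw value through.
table = pa_json.read_json(
    data,
    parse_options=pa_json.ParseOptions(explicit_schema=schema),
)
```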
it throws an exception complaining it encountered a list instead of a string.
Is there a way to force this behaviour? As I understand it, this will eventually be solved by the introduction of VariantType.
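For context, one interim approach we could imagine (a sketch only, assuming schema inference succeeds for the column; the file name is hypothetical) is to let read_json infer ActivityObjects and then re-serialize that column to JSON strings before writing to the Delta table:

```python
import json

import pyarrow as pa
from pyarrow import json as pa_json

# Let the reader infer ActivityObjects (a list of structs) instead of
# forcing a string type up front.
table = pa_json.read_json("activities.json")  # hypothetical file name

# Re-serialize the inferred column to JSON strings in Python. Inference
# unifies struct fields across rows, so keys missing from a given row
# come back as nulls in the round-tripped strings.
idx = table.schema.get_field_index("ActivityObjects")
as_strings = pa.array(
    [None if v is None else json.dumps(v) for v in table.column(idx).to_pylist()],
    type=pa.string(),
)
table = table.set_column(idx, "ActivityObjects", as_strings)
```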
Component(s)
Python