kyrre changed the title from "Force casting dynamic types to string when using read_json with an explicit schema" to "[Python] Force casting dynamic types to string when using read_json with an explicit schema" on Feb 19, 2025
Describe the usage question you have. Please include as many useful details as possible.
We want to use PyArrow for ETL jobs where JSON files are periodically read from Azure Blob Storage and inserted into Delta Lake tables. While the schemas are available, some of the columns have a "dynamic type"; e.g., we could have two rows in which the ActivityObjects column has these values:
ActivityObjects -> [{"TargetUser": 1, "OperationType": "NetworkShareCreation"}, ..., ]
ActivityObjects -> [{"MachineId": "05-10-15"}, ..., ]
The way we have dealt with this in Spark is just to treat ActivityObjects as array<string> (or even string) and do any additional parsing at query time.

However, if we try to do the same with PyArrow:
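The failing call looks roughly like the sketch below (a minimal reproduction, not the original snippet; the inline sample data and the single-field schema are assumptions made for illustration):

```python
import io

import pyarrow as pa
from pyarrow import json as pa_json

# Two rows in which ActivityObjects has a different "dynamic" shape.
data = io.BytesIO(
    b'{"ActivityObjects": [{"TargetUser": 1, "OperationType": "NetworkShareCreation"}]}\n'
    b'{"ActivityObjects": [{"MachineId": "05-10-15"}]}\n'
)

# Explicit schema that forces ActivityObjects to a plain string.
schema = pa.schema([pa.field("ActivityObjects", pa.string())])

# Fails: the parser encounters a JSON list where the explicit schema
# declares a string, instead of passing the raw value through.
table = pa_json.read_json(
    data,
    parse_options=pa_json.ParseOptions(explicit_schema=schema),
)
```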
it throws an exception complaining it encountered a list instead of a string.
Is there a way to force this behaviour? As I understand it, this will eventually be solved by the introduction of VariantType.
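For context, one interim approach we could imagine (a sketch only, assuming schema inference succeeds for the column; the file name is hypothetical) is to let read_json infer ActivityObjects and then re-serialize that column to JSON strings before writing to the Delta table:

```python
import json

import pyarrow as pa
from pyarrow import json as pa_json

# Let the reader infer ActivityObjects (a list of structs) instead of
# forcing a string type up front.
table = pa_json.read_json("activities.json")  # hypothetical file name

# Re-serialize the inferred column to JSON strings in Python. Inference
# unifies struct fields across rows, so keys missing from a given row
# come back as nulls in the round-tripped strings.
idx = table.schema.get_field_index("ActivityObjects")
as_strings = pa.array(
    [None if v is None else json.dumps(v) for v in table.column(idx).to_pylist()],
    type=pa.string(),
)
table = table.set_column(idx, "ActivityObjects", as_strings)
```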
Component(s)
Python