[Python] Force casting dynamic types to string when using read_json with an explicit schema #45574

Open
kyrre opened this issue Feb 19, 2025 · 0 comments
Labels
Component: Python Type: usage Issue is a user question

kyrre commented Feb 19, 2025

Describe the usage question you have. Please include as many useful details as possible.

We want to use PyArrow for ETL jobs in which JSON files are periodically read from Azure Blob Storage and inserted into Delta Lake tables. While the schemas are available, some of the columns have a "dynamic type", e.g., two rows could have these values in the ActivityObjects column:

ActivityObjects -> [{"TargetUser": 1, "OperationType": "NetworkShareCreation"}, ..., ]
ActivityObjects -> [{"MachineId": "05-10-15"}, ..., ]

The way we have dealt with this in Spark is just to treat ActivityObjects as array<string> (or even string) and do any additional parsing at query time.
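To make the "parse at query time" idea concrete, here is a minimal sketch using only the standard library. It assumes each element of ActivityObjects was stored as the JSON text of one object; the `stored_rows` sample and the `extract_field` helper are hypothetical names introduced for illustration, not part of Spark or PyArrow.

```python
import json

# Hypothetical stored values: each row is a list of JSON-encoded objects,
# i.e. what an array<string> column would hold.
stored_rows = [
    ['{"TargetUser": 1, "OperationType": "NetworkShareCreation"}'],
    ['{"MachineId": "05-10-15"}'],
]


def extract_field(elements, key):
    """Return the values of `key` found in a list of JSON-encoded objects."""
    found = []
    for text in elements:
        obj = json.loads(text)  # decode only when the query needs it
        if isinstance(obj, dict) and key in obj:
            found.append(obj[key])
    return found


# One result list per row; rows without the key yield an empty list.
machine_ids = [extract_field(row, "MachineId") for row in stored_rows]
```

The storage schema stays fixed at array<string>, and only the queries need to know which keys a given event type carries.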

However, if we try to do the same with PyArrow:

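import ibis
import pyarrow.json as pj

# `schema` (a pyarrow.Schema) and `jsonl_stream` (the JSONL input) are defined elsewhere
parse_options = pj.ParseOptions(explicit_schema=schema)
events = ibis.memtable(
    pj.read_json(jsonl_stream, parse_options=parse_options)
)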

it throws an exception complaining it encountered a list instead of a string.

Is there a way to force this behaviour? As I understand it, this will eventually be solved by the introduction of VariantType.
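One possible workaround, sketched here under assumptions rather than offered as a PyArrow feature, is to pre-process the JSONL in Python so that every element of the dynamic column is already a JSON string before the reader sees it. `stringify_columns` and `read_with_schema` are hypothetical helper names. The schema in the usage note assumes the column is declared as a list of strings.

```python
import io
import json


def stringify_columns(jsonl_bytes, list_columns):
    """Re-encode each element of the named list columns as a JSON string."""
    out = []
    for line in jsonl_bytes.splitlines():
        if not line.strip():
            continue  # skip blank lines
        row = json.loads(line)
        for name in list_columns:
            value = row.get(name)
            if isinstance(value, list):
                # Each nested object becomes its own JSON text.
                row[name] = [json.dumps(item) for item in value]
        out.append(json.dumps(row))
    return ("\n".join(out) + "\n").encode("utf-8")


def read_with_schema(jsonl_bytes, schema):
    """Parse pre-processed JSONL with an explicit schema (requires pyarrow)."""
    import pyarrow.json as pj  # lazy import: the helper above needs only the stdlib

    parse_options = pj.ParseOptions(explicit_schema=schema)
    return pj.read_json(io.BytesIO(jsonl_bytes), parse_options=parse_options)


raw = (
    b'{"ActivityObjects": [{"TargetUser": 1, "OperationType": "NetworkShareCreation"}]}\n'
    b'{"ActivityObjects": [{"MachineId": "05-10-15"}]}\n'
)
cleaned = stringify_columns(raw, ["ActivityObjects"])
```

With pyarrow installed, `read_with_schema(cleaned, pa.schema([("ActivityObjects", pa.list_(pa.string()))]))` should then parse, since the reader only ever encounters strings in that column. The trade-off is an extra pass over the data in pure Python, which is likely to be slower than the native JSON reader on large files.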

Component(s)

Python

@kyrre kyrre added the Type: usage Issue is a user question label Feb 19, 2025
@kyrre kyrre changed the title Force casting dynamic types to to string when using read_json with an explicit schema [Python] Force casting dynamic types to to string when using read_json with an explicit schema Feb 19, 2025