Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reading nested struct fields results in an error #213

Open
2 tasks done
dp997 opened this issue Feb 16, 2025 · 0 comments
Open
2 tasks done

Reading nested struct fields results in an error #213

dp997 opened this issue Feb 16, 2025 · 0 comments
Labels
bug Something isn't working feature New feature or request good first issue Good for newcomers priority-high High priority issue user-request This issue was directly requested by a user

Comments

@dp997
Copy link

dp997 commented Feb 16, 2025

What happens?

File I'm reading is parquet with a nested struct field. DuckDB reads it fine, but pg_analytics struggles, throwing errors.
Setting up the connection:

create foreign data wrapper parquet_wrapper
handler parquet_fdw_handler
validator parquet_fdw_validator;

create server parquet_server
foreign data wrapper parquet_wrapper;

create user mapping for public
server parquet_server
options (
  type 'S3',
  key_id 'admin',
  secret 'pass1234',
  endpoint 'minio:9000',
  url_style 'path',
  use_ssl 'false'
);

create foreign table parquet_table ()
server parquet_server
options (files 's3://test/test_data.parquet');

and querying the table:

select * from parquet_table;

ends in following error:

ERROR:  Structs with List(Field { name: "l", data_type: Struct([Field { name: "field", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "change", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "reason", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }]), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }) field types are not yet supported

parquet_describe() function describes the file correctly (same as duckdb and pyarrow):

select * from parquet_describe('parquet_table');

     column_name     |                                       column_type                                       | null | key | default | extra
---------------------+-----------------------------------------------------------------------------------------+------+-----+---------+-------
 test_data           | STRUCT(sync_id BIGINT, changes STRUCT(field VARCHAR, change VARCHAR, reason VARCHAR)[]) | YES  |     |         |
 ok_field            | VARCHAR                                                                                 | YES  |     |         |
 struct_field        | STRUCT(number BIGINT, "text" VARCHAR)                                                   | YES  |     |         |
 nested_struct_field | STRUCT(nested_struct STRUCT(number BIGINT, "text" VARCHAR))                             | YES  |     |         |
(4 rows)

Dataset reads fine if I exclude such fields:

select "ok_field" from parquet_table;

     ok_field
------------------
 this field is ok
(1 row)

or unnest:

create foreign table parquet_table_2 ()
server parquet_server
options (files 's3://test/test_data.parquet', select 'unnest(test_data)');

select * from parquet_table_2;
 sync_id |                                       changes
---------+-------------------------------------------------------------------------------------
    1234 | [{"field": "foo", "change": "NULLED", "reason": "DESTINATION_SERIALIZATION_ERROR"}]
(1 row)

Same thing happens if I try to reproduce it with a json file.

To Reproduce

Try reading following file from object storage:

{
    "test_data": {
        "sync_id": 1234,
        "changes": [
            {
                "field": "foo",
                "change": "NULLED",
                "reason": "DESTINATION_SERIALIZATION_ERROR"
            }
        ]
    },
    "ok_field": "this field is ok",
    "struct_field": {
        "number": 1,
        "text": "this is ok as well"
    },
    "nested_struct_field": {
        "nested_struct": {
            "number": 2,
            "text": "this field is not ok"
        }
    }
}

"test_data" field is actually a standard field from Airbyte export (column "_airbyte_meta").

OS:

Ubuntu 22.04, x86-64

ParadeDB Version:

v0.15.2 on PostgreSQL 17 (image: paradedb/paradedb:latest)

Are you using ParadeDB Docker, Helm, or the extension(s) standalone?

ParadeDB Docker Image

Full Name:

Dominik Pierzga

Affiliation:

None

Did you include all relevant data sets for reproducing the issue?

Yes

Did you include the code required to reproduce the issue?

  • Yes, I have

Did you include all relevant configurations (e.g., CPU architecture, PostgreSQL version, Linux distribution) to reproduce the issue?

  • Yes, I have
@dp997 dp997 added the bug Something isn't working label Feb 16, 2025
@philippemnoel philippemnoel added feature New feature or request good first issue Good for newcomers priority-high High priority issue user-request This issue was directly requested by a user labels Feb 16, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working feature New feature or request good first issue Good for newcomers priority-high High priority issue user-request This issue was directly requested by a user
Projects
None yet
Development

No branches or pull requests

2 participants