@cj-zhukov
Contributor

Which issue does this PR close?

Rationale for this change

This PR consolidates all the data_io examples (parquet, catalog, remote_catalog, json_shredding, query_http_csv) into a single example binary. We have agreed on the pattern and can apply it to the remaining examples.

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@cj-zhukov
Contributor Author

High-Level Overview

This PR consolidates all data_io examples (parquet, catalog, remote_catalog, json_shredding, query_http_csv) into a single example binary.
Previously, each example had its own file, but now they can be executed via subcommands using:

cargo run --example data_io -- [catalog|json_shredding|parquet_adv_idx|parquet_emb_idx|parquet_enc_with_kms|parquet_enc|parquet_exec_visitor|parquet_idx|query_http_csv|remote_catalog]
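
For readers unfamiliar with the pattern, here is a minimal sketch of how a single example binary might map a subcommand name to an example. It is not the code from this PR: `ExampleKind`, `ALL`, and `select_example` are hypothetical names chosen for illustration, it uses only the standard library, and the error text is only modeled on the "Unknown example" message an unrecognized name produces.

```rust
use std::str::FromStr;

/// Hypothetical enum with one variant per subcommand listed in the command above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ExampleKind {
    Catalog,
    JsonShredding,
    ParquetAdvIdx,
    ParquetEmbIdx,
    ParquetEncWithKms,
    ParquetEnc,
    ParquetExecVisitor,
    ParquetIdx,
    QueryHttpCsv,
    RemoteCatalog,
}

impl ExampleKind {
    /// Table of (subcommand name, variant) pairs, used for parsing and for usage text.
    const ALL: [(&'static str, ExampleKind); 10] = [
        ("catalog", ExampleKind::Catalog),
        ("json_shredding", ExampleKind::JsonShredding),
        ("parquet_adv_idx", ExampleKind::ParquetAdvIdx),
        ("parquet_emb_idx", ExampleKind::ParquetEmbIdx),
        ("parquet_enc_with_kms", ExampleKind::ParquetEncWithKms),
        ("parquet_enc", ExampleKind::ParquetEnc),
        ("parquet_exec_visitor", ExampleKind::ParquetExecVisitor),
        ("parquet_idx", ExampleKind::ParquetIdx),
        ("query_http_csv", ExampleKind::QueryHttpCsv),
        ("remote_catalog", ExampleKind::RemoteCatalog),
    ];
}

impl FromStr for ExampleKind {
    type Err = String;

    /// Look the name up in the table; unknown names become an error string.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        Self::ALL
            .iter()
            .find(|(name, _)| *name == s)
            .map(|(_, kind)| *kind)
            .ok_or_else(|| format!("Unknown example: {s}"))
    }
}

/// Pick the example from an argv-like sequence (element 0 is the program name).
fn select_example<I: IntoIterator<Item = String>>(args: I) -> Result<ExampleKind, String> {
    let name = args.into_iter().nth(1).ok_or_else(|| {
        let names: Vec<&str> = ExampleKind::ALL.iter().map(|(n, _)| *n).collect();
        format!("Missing example name. Expected one of: {}", names.join("|"))
    })?;
    name.parse()
}
```

In a real binary, `main` would call something like `select_example(std::env::args())` and then `match` on the returned variant to run the corresponding module's entry point. Keeping the names in one table means the usage message and the parser cannot drift apart.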

Contributor

@alamb alamb left a comment

Another 1GB of disk space saved when doing a full build

Thank you @cj-zhukov

I ran it locally and it seems to work well

Error: Execution("Unknown example: parquet_emb_index")
(venv) andrewlamb@Andrews-MacBook-Pro-3:~/Software/datafusion$ cargo run --example data_io -- parquet_emb_idx
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.24s
     Running `target/debug/examples/data_io parquet_emb_idx`
Writing custom index at offset: 52, length: 7
Finished writing file to /var/folders/1l/tg68jc6550gg8xqf1hr4mlwr0000gn/T/.tmpxB9KyD/a.parquet
Writing custom index at offset: 52, length: 7
Finished writing file to /var/folders/1l/tg68jc6550gg8xqf1hr4mlwr0000gn/T/.tmpxB9KyD/b.parquet
Writing custom index at offset: 53, length: 8
Finished writing file to /var/folders/1l/tg68jc6550gg8xqf1hr4mlwr0000gn/T/.tmpxB9KyD/c.parquet
Reading index from /var/folders/1l/tg68jc6550gg8xqf1hr4mlwr0000gn/T/.tmpxB9KyD/a.parquet (size: 502)
Reading index at offset: 52, length
Read distinct index for a.parquet: "a.parquet"
Reading index from /var/folders/1l/tg68jc6550gg8xqf1hr4mlwr0000gn/T/.tmpxB9KyD/b.parquet (size: 502)
Reading index at offset: 52, length
Read distinct index for b.parquet: "b.parquet"
Reading index from /var/folders/1l/tg68jc6550gg8xqf1hr4mlwr0000gn/T/.tmpxB9KyD/c.parquet (size: 506)
Reading index at offset: 53, length
Read distinct index for c.parquet: "c.parquet"
Filtering for category: foo
Scanning only files: ["a.parquet", "c.parquet"]
+----------+
| category |
+----------+
| foo      |
| foo      |
| foo      |
+----------+
(venv) andrewlamb@Andrews-MacBook-Pro-3:~/Software/datafusion$ cargo run --example data_io -- parquet_enc_with_kms
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.20s
     Running `target/debug/examples/data_io parquet_enc_with_kms`
Encrypted Parquet written to /var/folders/1l/tg68jc6550gg8xqf1hr4mlwr0000gn/T/.tmpt7JlWc/
Reading encrypted Parquet as a RecordBatch stream
Read batch with 4 rows
Finished reading
Encrypted Parquet written to /var/folders/1l/tg68jc6550gg8xqf1hr4mlwr0000gn/T/.tmpbeNqX0/
Reading encrypted Parquet as a RecordBatch stream
Read batch with 4 rows
Finished reading

//! - `query_http_csv` — configure `object_store` and run a query against files via HTTP
//! - `remote_catalog` — interfacing with a remote catalog (e.g. over a network)

mod catalog;
Contributor

😍

@alamb alamb added this pull request to the merge queue Nov 12, 2025
Merged via the queue into apache:main with commit 8226ebf Nov 12, 2025
28 checks passed
2 participants