
Conversation

zhuqi-lucas
Contributor

Which issue does this PR close?

Rationale for this change

Currently, loading the metadata is split into two fetches when the page index is enabled. This causes a heavy S3 request load when Parquet files are stored in S3 without a local cache.

This PR adds support for eagerly loading the page index together with the Parquet metadata, combining the two fetches into one, which matches the original DataFusion implementation.

What changes are included in this PR?

Currently, loading the metadata is split into two fetches when the page index is enabled. This causes a heavy S3 request load when Parquet files are stored in S3 without a local cache.

This PR adds support for eagerly loading the page index together with the Parquet metadata, combining the two fetches into one, which matches the original DataFusion implementation.
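A usage sketch for the option this PR adds (the flag name matches the trace shown later in the thread):

```sql
-- Opt in to fetching the page index together with the footer metadata
set datafusion.execution.parquet.eager_load_page_index = true;
```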

Are these changes tested?

Yes

Are there any user-facing changes?

No

@zhuqi-lucas zhuqi-lucas requested a review from alamb October 17, 2025 07:23
@github-actions github-actions bot added common Related to common crate datasource Changes to the datasource crate labels Oct 17, 2025
@zhuqi-lucas zhuqi-lucas requested a review from xudong963 October 17, 2025 07:23
@github-actions github-actions bot added the proto Related to proto crate label Oct 17, 2025
Contributor

@alamb alamb left a comment

Thank you @zhuqi-lucas

Given that the policy for when to load the page index is a tradeoff between IO and memory/latency (aka eagerly loading the page index may cause more IO but lower latency), I think a tuning knob is appropriate.

However, I am not quite sure what effect this change has on the number of object store requests.

Using the tooling that @BlakeOrth has been working on for instrumenting datafusion-cli, we can see what requests are actually being made.

Here is the trace for main

DataFusion CLI v50.2.0
> \object_store_profiling trace
ObjectStore Profile mode set to Trace
> SELECT COUNT(*) from 'https://datasets.clickhouse.com/hits_compatible/athena_partitioned/hits_1.parquet' where "SearchPhrase" <> '';
+----------+
| count(*) |
+----------+
| 131559   |
+----------+
1 row(s) fetched.
Elapsed 0.606 seconds.

Object Store Profiling
Instrumented Object Store: instrument_mode: Trace, inner: HttpStore
2025-10-17T08:55:22.202041+00:00 operation=Get duration=0.025994s size=8 range: bytes=174965036-174965043 path=hits_compatible/athena_partitioned/hits_1.parquet
2025-10-17T08:55:22.228064+00:00 operation=Get duration=0.028127s size=34322 range: bytes=174930714-174965035 path=hits_compatible/athena_partitioned/hits_1.parquet
2025-10-17T08:55:22.295696+00:00 operation=Get duration=0.032303s size=15503 range: bytes=5120273-5135775 path=hits_compatible/athena_partitioned/hits_1.parquet
2025-10-17T08:55:22.296663+00:00 operation=Get duration=0.060797s size=3895852 range: bytes=145483536-149379387 path=hits_compatible/athena_partitioned/hits_1.parquet
2025-10-17T08:55:22.330266+00:00 operation=Get duration=0.040970s size=61815 range: bytes=46392516-46454330 path=hits_compatible/athena_partitioned/hits_1.parquet

Here is the trace for this PR:

ObjectStore Profile mode set to Trace
> set datafusion.execution.parquet.eager_load_page_index = true;
0 row(s) fetched.
Elapsed 0.011 seconds.

Object Store Profiling
> SELECT COUNT(*) from 'https://datasets.clickhouse.com/hits_compatible/athena_partitioned/hits_1.parquet' where "SearchPhrase" <> '';
+----------+
| count(*) |
+----------+
| 131559   |
+----------+
1 row(s) fetched.
Elapsed 0.776 seconds.

Object Store Profiling
Instrumented Object Store: instrument_mode: Trace, inner: HttpStore
2025-10-17T09:00:12.866861+00:00 operation=Get duration=0.027121s size=8 range: bytes=174965036-174965043 path=hits_compatible/athena_partitioned/hits_1.parquet
2025-10-17T09:00:12.894311+00:00 operation=Get duration=0.030500s size=34322 range: bytes=174930714-174965035 path=hits_compatible/athena_partitioned/hits_1.parquet
2025-10-17T09:00:12.997688+00:00 operation=Get duration=0.027598s size=3895852 range: bytes=145483536-149379387 path=hits_compatible/athena_partitioned/hits_1.parquet
2025-10-17T09:00:12.997693+00:00 operation=Get duration=0.080185s size=15503 range: bytes=5120273-5135775 path=hits_compatible/athena_partitioned/hits_1.parquet
2025-10-17T09:00:13.095884+00:00 operation=Get duration=0.039606s size=61815 range: bytes=46392516-46454330 path=hits_compatible/athena_partitioned/hits_1.parquet

Summaries:
Get
count: 5
duration min: 0.027121s
duration max: 0.080185s
duration avg: 0.041002s
size min: 8 B
size max: 3895852 B
size avg: 801500 B
size sum: 4007500 B

You can see in both cases that 5 (!!) requests are made.

@alamb
Contributor

alamb commented Oct 17, 2025

Actually, I forgot that https://datasets.clickhouse.com/hits_compatible/athena_partitioned/hits_1.parquet does not have page indexes. I rewrote this file so that it does:

copy 'hits_1.parquet' to 'hits_1_index.parquet';

I was having trouble getting the trace thing to work

@zhuqi-lucas
Contributor Author

zhuqi-lucas commented Oct 17, 2025

Actually, I forgot that https://datasets.clickhouse.com/hits_compatible/athena_partitioned/hits_1.parquet does not have page indexes. I rewrote this file so that it does:

copy 'hits_1.parquet' to 'hits_1_index.parquet';

I was having trouble getting the trace thing to work

Thank you @alamb, yeah, the hits file here does not have a page index.

@github-actions github-actions bot added the documentation Improvements or additions to documentation label Oct 17, 2025
@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Oct 17, 2025
@zhuqi-lucas
Contributor Author

zhuqi-lucas commented Oct 18, 2025

It seems we always make at least 2 requests to S3 if we enable the page index; I will investigate more:

/// Attempts to (asynchronously) parse the footer metadata (and optionally page indexes)
/// given a [`MetadataFetch`].
///
/// See [`Self::with_prefetch_hint`] for a discussion of how to reduce the number of fetches
/// performed by this function.
#[cfg(all(feature = "async", feature = "arrow"))]
pub async fn try_load<F: MetadataFetch>(&mut self, mut fetch: F, file_size: u64) -> Result<()> {
    let (metadata, remainder) = self.load_metadata(&mut fetch, file_size).await?;

    self.metadata = Some(metadata);

    // we can return if page indexes aren't requested
    if self.column_index == PageIndexPolicy::Skip && self.offset_index == PageIndexPolicy::Skip {
        return Ok(());
    }

    self.load_page_index_with_remainder(fetch, remainder).await
}

I am not sure if with_prefetch_hint can make the eager loading of the page index and Parquet metadata happen in one request...

From the description here, it seems it will help for the page index:

/// Provide a hint as to the number of bytes needed to fully parse the [`ParquetMetaData`].
/// Only used for the asynchronous [`Self::try_load()`] method.
///
/// By default, the reader will first fetch the last 8 bytes of the input file to obtain the
/// size of the footer metadata. A second fetch will be performed to obtain the needed bytes.
/// After parsing the footer metadata, a third fetch will be performed to obtain the bytes
/// needed to decode the page index structures, if they have been requested. To avoid
/// unnecessary fetches, `prefetch` can be set to an estimate of the number of bytes needed
/// to fully decode the [`ParquetMetaData`], which can reduce the number of fetch requests and
/// reduce latency. Setting `prefetch` too small will not trigger an error, but will result
/// in extra fetches being performed.
pub fn with_prefetch_hint(mut self, prefetch: Option<usize>) -> Self {
    self.prefetch_hint = prefetch;
    self
}

@zhuqi-lucas
Contributor Author

More details: it seems DataFusion by default does not set metadata_size_hint; we can try adding it and testing.

/// (reading) If specified, the parquet reader will try and fetch the last `size_hint`
/// bytes of the parquet file optimistically. If not specified, two reads are required:
/// One read to fetch the 8-byte parquet footer and
/// another to fetch the metadata length encoded in the footer
pub metadata_size_hint: Option<usize>, default = None
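For comparison, this hint can be set as a session option in datafusion-cli; the value below is only an illustrative guess (per the docs above, too small a hint is safe but triggers extra fetches):

```sql
set datafusion.execution.parquet.metadata_size_hint = 524288;
```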

@alamb
Contributor

alamb commented Oct 18, 2025

More details: it seems DataFusion by default does not set metadata_size_hint; we can try adding it and testing.

I filed a potentially related ticket here:

@zhuqi-lucas
Contributor Author

More details: it seems DataFusion by default does not set metadata_size_hint; we can try adding it and testing.

I filed a potentially related ticket here:

Got it, thank you @alamb!

@alamb
Contributor

alamb commented Oct 18, 2025

From what I can tell, to read ParquetMetaData with the default configuration of ListingTable, DataFusion will issue 3 object store requests:

  • 8 bytes for the footer (which has the length of the metadata)
  • N bytes for the metadata (has offsets to the page index, but not the page index structures itself)
  • M bytes for the page index structures (which are typically right before the metadata in the file, but not required to be)
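Those three ranges can be sketched in plain Rust (an illustration, not the actual DataFusion/arrow-rs code; it relies only on the Parquet footer layout: a 4-byte little-endian metadata length followed by the magic bytes PAR1):

```rust
/// Byte range of the fixed 8-byte Parquet footer (request 1).
fn footer_range(file_size: u64) -> (u64, u64) {
    (file_size - 8, file_size)
}

/// Parse the footer: 4-byte little-endian metadata length + "PAR1" magic.
fn parse_footer(footer: &[u8; 8]) -> Option<u32> {
    if &footer[4..] != b"PAR1" {
        return None;
    }
    Some(u32::from_le_bytes([footer[0], footer[1], footer[2], footer[3]]))
}

/// Byte range of the footer metadata (request 2); the page index ranges
/// (request 3) only become known after this metadata is decoded.
fn metadata_range(file_size: u64, metadata_len: u32) -> (u64, u64) {
    let start = file_size - 8 - metadata_len as u64;
    (start, file_size - 8)
}

fn main() {
    let file_size: u64 = 1_000_000;
    // Footer encoding a metadata length of 10_000 bytes (0x2710).
    let footer: [u8; 8] = [0x10, 0x27, 0, 0, b'P', b'A', b'R', b'1'];
    let metadata_len = parse_footer(&footer).expect("valid footer");
    println!("request 1 (footer):   {:?}", footer_range(file_size));
    println!("request 2 (metadata): {:?}", metadata_range(file_size, metadata_len));
}
```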

The first 8-byte request could be avoided by changing the default prefetch_hint, aka #18118

I think you could potentially avoid the third request if you extended the prefetch_hint code to also use the page index when it was fetched in the initial request

So the flow would be: DataFusion makes an initial request for the last prefetch_hint bytes of the file. If that happens to contain enough bytes for the metadata and page index, no more requests are made. If additional data is needed, additional requests are made.
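A minimal standalone helper can illustrate that decision (an assumed simplification, not DataFusion code): after decoding the metadata, the page index byte range is known, and a further fetch is needed only when the prefetched suffix does not already cover it.

```rust
/// Does a prefetch of the last `prefetch` bytes of a `file_size`-byte file
/// already contain the half-open byte range `range`?
fn covered_by_prefetch(file_size: u64, prefetch: u64, range: (u64, u64)) -> bool {
    let prefetch_start = file_size.saturating_sub(prefetch);
    range.0 >= prefetch_start && range.1 <= file_size
}

fn main() {
    // Page index occupying bytes 989_000..990_000 of a 1_000_000-byte file:
    // a 64 KiB prefetch covers it, an 8 KiB prefetch forces another request.
    assert!(covered_by_prefetch(1_000_000, 64 * 1024, (989_000, 990_000)));
    assert!(!covered_by_prefetch(1_000_000, 8 * 1024, (989_000, 990_000)));
    println!("coverage checks passed");
}
```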

The newly added Push metadata decoder likely makes this easier to implement:

(as it will tell you what ranges are needed)

@alamb
Contributor

alamb commented Oct 18, 2025

I am not sure if with_prefetch_hint can make the eager loading of the page index and Parquet metadata happen in one request...

I agree that today this is not how the code works. However, I think it is technically possible and likely not super hard to implement -- not sure if it is best done in DataFusion or arrow-rs.

@zhuqi-lucas
Contributor Author

I've taken the task #18118; at least we can reduce one request.

The newly added Push metadata decoder likely makes this easier to implement:

https://docs.rs/parquet/latest/parquet/file/metadata/struct.ParquetMetaDataPushDecoder.html
(as it will tell you what ranges are needed)

Cool, I will look into this.

@zhuqi-lucas
Contributor Author

I am not sure if with_prefetch_hint can make the eager loading of the page index and Parquet metadata happen in one request...

I agree that today this is not how the code works. However, I think it is technically possible and likely not super hard to implement -- not sure if it is best done in DataFusion or arrow-rs.

Great, I agree this will be a further improvement; I will look into it!

@zhuqi-lucas
Contributor Author

zhuqi-lucas commented Oct 19, 2025

This PR is still needed, because currently we split the metadata loading into two parts when the page index is enabled, and the second part (loading the page index) does not use the remainder. So even if we set prefetch_hint now, we can't reduce the page index request without this PR.

reader.load_page_index(input).await?;

/// Asynchronously fetch the page index structures when a [`ParquetMetaData`] has already
/// been obtained. See [`Self::new_with_metadata()`].
#[cfg(all(feature = "async", feature = "arrow"))]
pub async fn load_page_index<F: MetadataFetch>(&mut self, fetch: F) -> Result<()> {
    self.load_page_index_with_remainder(fetch, None).await
}

@alamb
Contributor

alamb commented Oct 21, 2025

This PR is still needed, because currently we split the metadata loading into two parts when the page index is enabled, and the second part (loading the page index) does not use the remainder. So even if we set prefetch_hint now, we can't reduce the page index request without this PR.

As I understand it, your goal is to reduce the number of object store requests when loading parquet metadata (which is a good goal 👍 )

However, I am not sure this PR achieves this goal -- instead, what I think this PR does is move the object store requests a bit earlier in the processing pipeline, without changing the actual number of them.

I think the best way to ensure we minimize the object store requests is:

  1. Use prefetch_hint to make a single request and read a portion of the end of the file
  2. Try and parse both the metadata and page index from the result (if the prefetch was big enough it will have both structures)

If prefetch fetches enough bytes, this strategy will result in a single object store request to read all required metadata

I realize setting prefetch today may not actually also parse the page index, but I think that is what we should be working towards (rather than adding another flag, unless there is some need I am missing)

I personally suggest starting with an end-to-end test (perhaps in https://github.com/apache/datafusion/blob/main/datafusion/core/tests/parquet_config.rs) that illustrates what is happening:

  1. Runs a SQL query from a parquet file
  2. Uses an instrumented object store (e.g. something like https://github.com/apache/datafusion/blob/f363e382661a4f45dad2912e9988f1703e46939b/datafusion/core/src/datasource/file_format/parquet.rs#L304-L303 or https://github.com/apache/datafusion/blob/93f136c06dcb6d4cb362110ae5a4b2b3b8571bb7/datafusion-cli/src/object_storage/instrumented.rs#L253-L252) to verify what requests are made

Then we can configure various prefetch settings and ensure that only the expected number of requests are made

@zhuqi-lucas
Contributor Author

zhuqi-lucas commented Oct 21, 2025

This PR is still needed, because currently we split the metadata loading into two parts when the page index is enabled, and the second part (loading the page index) does not use the remainder. So even if we set prefetch_hint now, we can't reduce the page index request without this PR.

As I understand it, your goal is to reduce the number of object store requests when loading parquet metadata (which is a good goal 👍 )

However, I am not sure this PR achieves this goal -- instead, what I think this PR does is move the object store requests a bit earlier in the processing pipeline, without changing the actual number of them.

I think the best way to ensure we minimize the object store requests is:

  1. Use prefetch_hint to make a single request and read a portion of the end of the file
  2. Try and parse both the metadata and page index from the result (if the prefetch was big enough it will have both structures)

If prefetch fetches enough bytes, this strategy will result in a single object store request to read all required metadata

I realize setting prefetch today may not actually also parse the page index, but I think that is what we should be working towards (rather than adding another flag, unless there is some need I am missing)

I personally suggest starting with an end-to-end test (perhaps in https://github.com/apache/datafusion/blob/main/datafusion/core/tests/parquet_config.rs) that illustrates what is happening:

  1. Runs a SQL query from a parquet file
  2. Uses an instrumented object store (e.g. something like https://github.com/apache/datafusion/blob/f363e382661a4f45dad2912e9988f1703e46939b/datafusion/core/src/datasource/file_format/parquet.rs#L304-L303 or https://github.com/apache/datafusion/blob/93f136c06dcb6d4cb362110ae5a4b2b3b8571bb7/datafusion-cli/src/object_storage/instrumented.rs#L253-L252) to verify what requests are made

Then we can configure various prefetch settings and ensure that only the expected number of requests are made

Thank you @alamb, this is a good suggestion. I will try to do this!

What I meant is that current DataFusion will not apply prefetch_hint for the page index, because we split the loading into two parts, and the second one does not pass any prefetch_hint, so the result of the first load will not apply to it... here is the code:

reader.load_page_index(input).await?;

It passes None to load_page_index_with_remainder, so the hint will not be used for the second load.

/// Asynchronously fetch the page index structures when a [`ParquetMetaData`] has already
/// been obtained. See [`Self::new_with_metadata()`].
#[cfg(all(feature = "async", feature = "arrow"))]
pub async fn load_page_index<F: MetadataFetch>(&mut self, fetch: F) -> Result<()> {
    self.load_page_index_with_remainder(fetch, None).await
}

In the detailed code, if we don't pass the remainder, we will not reuse already-fetched data:

#[cfg(all(feature = "async", feature = "arrow"))]
async fn load_page_index_with_remainder<F: MetadataFetch>(
    &mut self,
    mut fetch: F,
    remainder: Option<(usize, Bytes)>,
) -> Result<()> {
    if self.metadata.is_none() {
        return Err(general_err!("Footer metadata is not present"));
    }

    // Get bounds needed for page indexes (if any are present in the file).
    let range = self.range_for_page_index();
    let range = match range {
        Some(range) => range,
        None => return Ok(()),
    };

    let bytes = match &remainder {
        Some((remainder_start, remainder)) if *remainder_start as u64 <= range.start => {
            let remainder_start = *remainder_start as u64;
            let offset = usize::try_from(range.start - remainder_start)?;
            let end = usize::try_from(range.end - remainder_start)?;
            assert!(end <= remainder.len());
            remainder.slice(offset..end)
        }
        // Note: this will potentially fetch data already in remainder, this keeps things simple
        _ => fetch.fetch(range.start..range.end).await?,
    };

    // Sanity check
    assert_eq!(bytes.len() as u64, range.end - range.start);

    self.parse_column_index(&bytes, range.start)?;
    self.parse_offset_index(&bytes, range.start)?;

    Ok(())
}
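The remainder branch above can be mimicked by a standalone helper (an assumed simplification with byte slices standing in for Bytes): when no remainder is passed, or it does not cover the requested range, the caller has no choice but to fetch again, which is exactly the extra object store request being discussed.

```rust
/// Return the bytes for the half-open `range` from an already-fetched suffix
/// `remainder = (start_offset, buffer)`, or `None` if a new fetch is needed.
fn bytes_for_range(remainder: Option<(u64, &[u8])>, range: (u64, u64)) -> Option<Vec<u8>> {
    match remainder {
        Some((start, buf)) if start <= range.0 && range.1 - start <= buf.len() as u64 => {
            let offset = (range.0 - start) as usize;
            let end = (range.1 - start) as usize;
            Some(buf[offset..end].to_vec())
        }
        // Passing `None` (as `load_page_index` does) always lands here.
        _ => None,
    }
}

fn main() {
    let buf: Vec<u8> = (0u8..8).collect();
    // Remainder covers bytes 100..108 of the file.
    assert_eq!(bytes_for_range(Some((100, &buf)), (102, 105)), Some(vec![2, 3, 4]));
    assert_eq!(bytes_for_range(None, (102, 105)), None); // forces a refetch
    println!("remainder reuse checks passed");
}
```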

And for the first load, we pass the right remainder, which can use prefetch_hint:

// Begin by loading the metadata from the underlying reader (note
// the returned metadata may actually include page indexes as some
// readers may return page indexes even when not requested -- for
// example when they are cached)
let mut reader_metadata =
    ArrowReaderMetadata::load_async(&mut async_file_reader, options.clone())
        .await?;

/// Attempts to (asynchronously) parse the footer metadata (and optionally page indexes)
/// given a [`MetadataFetch`].
///
/// See [`Self::with_prefetch_hint`] for a discussion of how to reduce the number of fetches
/// performed by this function.
#[cfg(all(feature = "async", feature = "arrow"))]
pub async fn try_load<F: MetadataFetch>(&mut self, mut fetch: F, file_size: u64) -> Result<()> {
    let (metadata, remainder) = self.load_metadata(&mut fetch, file_size).await?;

    self.metadata = Some(metadata);

    // we can return if page indexes aren't requested
    if self.column_index == PageIndexPolicy::Skip && self.offset_index == PageIndexPolicy::Skip {
        return Ok(());
    }

    self.load_page_index_with_remainder(fetch, remainder).await
}

Correct me if I am wrong. I am not sure if I am missing something. Thanks!


Development

Successfully merging this pull request may close these issues.

Support eager loading page index parquet metadata
