Skip to content

Add CachedParquetFileReader new method or constructor #18145

@sidd328

Description

@sidd328

Is your feature request related to a problem or challenge?

As a user I would like the ability to construct my own CachedParquetFileReader. This is so that we can provide our own inner (object reader) with specific index preloads. So for example, with the non-CachedParquetFileReader (for a regular ParquetFileReaderFactory I can do this in the create_reader function):

let store = Arc::clone(&self.object_store);
let mut reader = ParquetObjectReader::new(store, file_meta.object_meta.location)
    .with_file_size(file_meta.object_meta.size)
    .with_preload_column_index(true)
    .with_preload_offset_index(true);

if let Some(hint) = metadata_size_hint {
    reader = reader.with_footer_size_hint(hint)
};
Ok(Box::new(reader))

And for CachedParquetReader I would like the ability to do this:

let store = Arc::clone(&self.store);
let mut inner = ParquetObjectReader::new(store, file_meta.object_meta.location.clone())
    .with_file_size(file_meta.object_meta.size)
    .with_preload_column_index(true)
    .with_preload_offset_index(true);

if let Some(hint) = metadata_size_hint {
    inner = inner.with_footer_size_hint(hint)
};

Ok(Box::new(CachedParquetFileReader {
    file_metrics,
    store,
    inner,
    partitioned_file,
    metadata_cache,
    metadata_size_hint
}))

But since CachedParquetFileReader has private fields and does not have a new method or constructor, I cannot create my own CachedParquetFileReader with preload index.

Describe the solution you'd like

Adjust the CachedParquetFileReader to add a new method or constructor. i.e., to https://github.com/apache/datafusion/blob/main/datafusion/datasource-parquet/src/reader.rs file we can add a method to the CachedParquetReader like so:

impl CachedParquetFileReader {
    pub fn new(
        file_metrics: ParquetFileMetrics,
        store: Arc<dyn ObjectStore>,
        inner: ParquetObjectReader,
        partitioned_file: PartitionedFile,
        metadata_cache: Arc<dyn FileMetadataCache>,
        metadata_size_hint: Option<usize>,
    ) -> Self {
        Self {
            file_metrics,
            store,
            inner,
            partitioned_file,
            metadata_cache,
            metadata_size_hint
        }
    }
}

Other option is to make all fields public but I believe the new method might be cleaner (I don't have context on why some fields are private I'm afraid)

Describe alternatives you've considered

Current workaround is to have our own CachedParquetFileReader with copied AsyncFileReader implementation.

Its not the worst workaround but has us maintaining our own CachedParquetFileReader which is not ideal.

Additional context

I can submit a PR for these changes since its a pretty small change but would like someone to verify if there are any issues with my logic or there are any other concerns with adding a constructor

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions