-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Description
Is your feature request related to a problem or challenge?
As a user I would like the ability to construct my own CachedParquetFileReader. This is so that we can provide our own inner (object reader) with specific index preloads. So for example, with the non-CachedParquetFileReader (for a regular ParquetFileReaderFactory I can do this in the create_reader function):
let store = Arc::clone(&self.object_store);
let mut reader = ParquetObjectReader::new(store, file_meta.object_meta.location)
.with_file_size(file_meta.object_meta.size)
.with_preload_column_index(true)
.with_preload_offset_index(true);
if let Some(hint) = metadata_size_hint {
reader = reader.with_footer_size_hint(hint)
};
Ok(Box::new(reader))
And for CachedParquetReader I would like the ability to do this:
let store = Arc::clone(&self.store);
let mut inner = ParquetObjectReader::new(store, file_meta.object_meta.location.clone())
.with_file_size(file_meta.object_meta.size)
.with_preload_column_index(true)
.with_preload_offset_index(true);
if let Some(hint) = metadata_size_hint {
inner = inner.with_footer_size_hint(hint)
};
Ok(Box::new(CachedParquetFileReader {
file_metrics,
store,
inner,
partitioned_file,
metadata_cache,
metadata_size_hint
}))
But since CachedParquetFileReader has private fields and does not have a new method or constructor, I cannot create my own CachedParquetFileReader with preload index.
Describe the solution you'd like
Adjust the CachedParquetFileReader to add a new method or constructor. i.e., to https://github.com/apache/datafusion/blob/main/datafusion/datasource-parquet/src/reader.rs file we can add a method to the CachedParquetReader like so:
impl CachedParquetFileReader {
pub fn new(
file_metrics: ParquetFileMetrics,
store: Arc<dyn ObjectStore>,
inner: ParquetObjectReader,
partitioned_file: PartitionedFile,
metadata_cache: Arc<dyn FileMetadataCache>,
metadata_size_hint: Option<usize>,
) -> Self {
Self {
file_metrics,
store,
inner,
partitioned_file,
metadata_cache,
metadata_size_hint
}
}
}
Other option is to make all fields public but I believe the new method might be cleaner (I don't have context on why some fields are private I'm afraid)
Describe alternatives you've considered
Current workaround is to have our own CachedParquetFileReader with copied AsyncFileReader implementation.
Its not the worst workaround but has us maintaining our own CachedParquetFileReader which is not ideal.
Additional context
I can submit a PR for these changes since its a pretty small change but would like someone to verify if there are any issues with my logic or there are any other concerns with adding a constructor