Remote Store Parquet Upload and Replication Implementation #19898
base: feature/datafusion
❌ Gradle check result for 7c29aef: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
This looks great, but I'm worried our feature branch diff is getting way too large right now, and IMO this is not required for us to start pulling this back into main from the feature branch. Can this stay on a personal fork, so that our feature branch remains primary writes only?
❌ Gradle check result for 9446a1c: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
❌ Gradle check result for 662880b: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
❌ Gradle check result for 3387048: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
Description
This PR introduces multi-format data storage capabilities to OpenSearch, enabling the system to handle Parquet files alongside traditional Lucene segments in both local and remote storage environments. This architectural enhancement transforms OpenSearch from a single-format (Lucene-only) system to a format-agnostic storage platform.
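As a rough illustration of what format-aware tracking and routing could look like, here is a minimal, self-contained sketch. It is not the code in this PR: the class shapes, the `register`/`resolve`/`defaultDirectory` helpers, the `"lucene"` and `"parquet"` format labels, and the `index` subdirectory are all assumptions made for the example. Only the idea of a file identity that carries its data format, and of a composite directory that dispatches on that format, comes from the description.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical sketch only; none of these types are the PR's actual classes.
public class FormatRoutingSketch {

    // File identity = file name + data format. Value-based equals/hashCode
    // make it usable as a map key, so equal names in different formats stay distinct.
    static final class FileMetadata {
        final String file;
        final String dataFormat;

        FileMetadata(String file, String dataFormat) {
            this.file = Objects.requireNonNull(file);
            this.dataFormat = Objects.requireNonNull(dataFormat);
        }

        @Override
        public boolean equals(Object o) {
            if (this == o) return true;
            if (!(o instanceof FileMetadata)) return false;
            FileMetadata other = (FileMetadata) o;
            return file.equals(other.file) && dataFormat.equals(other.dataFormat);
        }

        @Override
        public int hashCode() {
            return Objects.hash(file, dataFormat);
        }
    }

    // Dispatches each lookup to the root directory registered for the file's format.
    static final class CompositeDirectory {
        private final Map<String, String> rootByFormat = new HashMap<>();

        void register(String dataFormat, String root) {
            rootByFormat.put(dataFormat, root);
        }

        String resolve(FileMetadata meta) {
            String root = rootByFormat.get(meta.dataFormat);
            if (root == null) {
                throw new IllegalArgumentException("no directory registered for format: " + meta.dataFormat);
            }
            return root + "/" + meta.file;
        }
    }

    // Assumed layout: one subdirectory per format under the shard path.
    static CompositeDirectory defaultDirectory(String shardPath) {
        CompositeDirectory dir = new CompositeDirectory();
        dir.register("lucene", shardPath + "/index");
        dir.register("parquet", shardPath + "/parquet");
        return dir;
    }

    public static void main(String[] args) {
        CompositeDirectory dir = defaultDirectory("shard_path");
        FileMetadata lucene = new FileMetadata("data_0", "lucene");
        FileMetadata parquet = new FileMetadata("data_0", "parquet");

        // Keyed by FileMetadata rather than by name, both entries survive.
        Map<FileMetadata, Long> sizes = new HashMap<>();
        sizes.put(lucene, 128L);
        sizes.put(parquet, 512L);

        System.out.println(dir.resolve(lucene));
        System.out.println(dir.resolve(parquet));
        System.out.println(sizes.size());
    }
}
```

Keying by a name-plus-format value rather than a bare `String` is what lets two files with the same name coexist across formats, which is presumably the motivation for the map type change listed under the replication changes.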
Key Features Implemented
Technical Implementation
Core Architecture Changes
1. Format-Aware File Tracking
- `dataFormat` field enabling format identification
- `CompositeStoreDirectory` routes operations to appropriate format-specific directories
- `DataSourcePlugin` interface extended for format-specific directory and blob container creation
2. Remote Storage Multi-Format Support
- `RemoteSegmentMetadata` updated to preserve format information during serialization
3. Replication System Overhaul
- `Map<String, StoreFileMetadata>` to `Map<FileMetadata, StoreFileMetadata>` for format preservation
4. Storage Infrastructure Updates
- `CompositeStoreDirectory` support with plugin-based format discovery
- `RemoteSegmentTransferTracker` to track format-specific upload statistics
Parquet-Specific Implementations
1. Directory Management
- `ParquetDataFormatPlugin` to support `FormatStoreDirectory` and `BlobContainer` creation
- (`shard_path/parquet/`)
2. Remote Operations
Related Issues
Resolves #[Issue number to be closed when this PR is merged]
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.