@ask-kamal-nayan ask-kamal-nayan commented Nov 5, 2025

Description

This PR introduces multi-format data storage capabilities to OpenSearch, enabling the system to handle Parquet files alongside traditional Lucene segments in both local and remote storage environments. This architectural enhancement transforms OpenSearch from a single-format (Lucene-only) system to a format-agnostic storage platform.

Key Features Implemented

  • Format-Aware Storage Architecture: Complete redesign of storage and replication systems to support multiple data formats
  • Parquet Remote Storage: Full integration of Parquet files into OpenSearch's remote store functionality
  • Multi-Format Replication: Enhanced segment replication system that handles format-aware file transfers
  • Plugin-Based Format Discovery: Extensible plugin system for adding new data format support
  • Backward Compatibility: Maintains full compatibility with existing Lucene-based functionality

Technical Implementation

Core Architecture Changes

1. Format-Aware File Tracking

  • Enhanced FileMetadata: Extended to include dataFormat field enabling format identification
  • Multi-Format Directory Routing: New CompositeStoreDirectory routes operations to appropriate format-specific directories
  • Plugin-Based Format Support: DataSourcePlugin interface extended for format-specific directory and blob container creation
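The routing idea in this list can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the record fields, the `CompositeRouter` and `FormatDirectory` names, and the `index` and `parquet` subdirectories are hypothetical stand-ins, not the actual `FileMetadata` or `CompositeStoreDirectory` classes from this PR.

```java
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: names echo the PR description, but all fields and methods are assumed.
final class FormatRoutingSketch {

    // A file identified by its name plus the data format it belongs to.
    record FileMetadata(String dataFormat, String fileName) {}

    // Stand-in for a format-specific directory; here it is just a root path.
    record FormatDirectory(String dataFormat, Path root) {
        Path resolve(String fileName) {
            return root.resolve(fileName);
        }
    }

    // Routes each file to the directory registered for its format.
    static final class CompositeRouter {
        private final Map<String, FormatDirectory> byFormat = new LinkedHashMap<>();

        void register(FormatDirectory dir) {
            byFormat.put(dir.dataFormat(), dir);
        }

        Path route(FileMetadata file) {
            FormatDirectory dir = byFormat.get(file.dataFormat());
            if (dir == null) {
                throw new IllegalArgumentException("No directory registered for format: " + file.dataFormat());
            }
            return dir.resolve(file.fileName());
        }
    }

    public static void main(String[] args) {
        Path shard = Path.of("shard0");
        CompositeRouter router = new CompositeRouter();
        // In the PR these registrations would presumably come from format plugins.
        router.register(new FormatDirectory("lucene", shard.resolve("index")));
        router.register(new FormatDirectory("parquet", shard.resolve("parquet")));

        System.out.println(router.route(new FileMetadata("parquet", "gen_1.parquet")));
        System.out.println(router.route(new FileMetadata("lucene", "_0.cfs")));
    }
}
```

Failing fast on an unregistered format is one possible design choice; falling back to a default Lucene directory would be another.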

2. Remote Storage Multi-Format Support

  • CompositeRemoteDirectory: New remote storage layer with format-specific blob containers
  • Format-Aware Upload/Download: Complete rewrite of segment transfer mechanisms supporting multiple formats
  • Remote Metadata Enhancement: RemoteSegmentMetadata updated to preserve format information during serialization

3. Replication System Overhaul

  • ReplicationCheckpoint Redesign: Migrated from Map<String, StoreFileMetadata> to Map<FileMetadata, StoreFileMetadata> for format preservation
  • CatalogSnapshot Integration: Replaced SegmentInfos-based replication with CatalogSnapshot for better format abstraction
  • Format-Aware Diff Calculation: Enhanced segment diff algorithms to consider both filename and format
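To show why keying on both filename and format matters, here is a rough sketch of a format-aware diff. The record shapes and the `diff` helper are simplified assumptions for illustration and do not reproduce the real `ReplicationCheckpoint` or `StoreFileMetadata` types.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a diff keyed on (format, filename) rather than filename alone.
final class FormatAwareDiffSketch {

    record FileMetadata(String dataFormat, String fileName) {}

    // Simplified stand-in for store-level metadata: length and checksum only.
    record StoreFileMetadata(long length, String checksum) {}

    record Diff(List<FileMetadata> identical, List<FileMetadata> different, List<FileMetadata> missing) {}

    // Classifies every source (primary) file relative to the target (replica).
    static Diff diff(Map<FileMetadata, StoreFileMetadata> source, Map<FileMetadata, StoreFileMetadata> target) {
        List<FileMetadata> identical = new ArrayList<>();
        List<FileMetadata> different = new ArrayList<>();
        List<FileMetadata> missing = new ArrayList<>();
        for (Map.Entry<FileMetadata, StoreFileMetadata> entry : source.entrySet()) {
            StoreFileMetadata onTarget = target.get(entry.getKey());
            if (onTarget == null) {
                missing.add(entry.getKey());       // target has no file with this name and format
            } else if (onTarget.equals(entry.getValue())) {
                identical.add(entry.getKey());     // same length and checksum
            } else {
                different.add(entry.getKey());     // present but stale
            }
        }
        return new Diff(identical, different, missing);
    }

    public static void main(String[] args) {
        // The same filename exists under two formats on the primary.
        FileMetadata lucene = new FileMetadata("lucene", "data_1");
        FileMetadata parquet = new FileMetadata("parquet", "data_1");

        Map<FileMetadata, StoreFileMetadata> primary = new HashMap<>();
        primary.put(lucene, new StoreFileMetadata(10, "a1"));
        primary.put(parquet, new StoreFileMetadata(20, "b2"));

        Map<FileMetadata, StoreFileMetadata> replica = new HashMap<>();
        replica.put(lucene, new StoreFileMetadata(10, "a1"));

        Diff d = diff(primary, replica);
        System.out.println("identical=" + d.identical());
        System.out.println("missing=" + d.missing());
    }
}
```

With a `Map<String, ...>` keyed only on `data_1`, the two primary entries would collide and the Parquet copy could never be reported as missing.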

4. Storage Infrastructure Updates

  • Store Class Enhancement: Added CompositeStoreDirectory support with plugin-based format discovery
  • IndexShard Integration: Enhanced with composite engine initialization and format-aware operation support
  • Transfer Tracking: Updated RemoteSegmentTransferTracker to track format-specific upload statistics
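Per-format upload statistics could be kept along the following lines. This is only a sketch under assumptions: the class and method names are hypothetical, and the real `RemoteSegmentTransferTracker` tracks considerably more than a byte counter.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical sketch: one thread-safe byte counter per data format.
final class FormatUploadStatsSketch {

    private final Map<String, LongAdder> uploadedBytes = new ConcurrentHashMap<>();

    // Adds the size of a completed upload to the counter for its format.
    void recordUpload(String dataFormat, long bytes) {
        uploadedBytes.computeIfAbsent(dataFormat, f -> new LongAdder()).add(bytes);
    }

    // Returns the total uploaded so far for a format, or 0 if none was recorded.
    long uploadedBytes(String dataFormat) {
        LongAdder adder = uploadedBytes.get(dataFormat);
        return adder == null ? 0L : adder.sum();
    }

    public static void main(String[] args) {
        FormatUploadStatsSketch stats = new FormatUploadStatsSketch();
        stats.recordUpload("parquet", 1024);
        stats.recordUpload("parquet", 2048);
        stats.recordUpload("lucene", 512);
        System.out.println("parquet bytes: " + stats.uploadedBytes("parquet"));
        System.out.println("lucene bytes: " + stats.uploadedBytes("lucene"));
    }
}
```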

Parquet-Specific Implementations

1. Directory Management

  • GenericStoreDirectory: New implementation for non-Lucene formats with full IndexInput/OutputStream support
  • Parquet Plugin Enhancement: Extended ParquetDataFormatPlugin to support FormatStoreDirectory and BlobContainer creation
  • Format-Specific Path Handling: Automatic path resolution for Parquet files (shard_path/parquet/)

2. Remote Operations

  • Parquet Blob Containers: Dedicated remote storage containers with format-specific paths
  • Checksum Calculation: Format-appropriate checksum algorithms (CRC32 for Parquet vs Lucene-specific methods)
  • Upload Optimization: Multi-stream upload support for large Parquet files with priority handling
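For the checksum item above, a whole-file CRC32 can be computed with the JDK's `java.util.zip.CRC32`. The sketch below is an assumption about what "CRC32 for Parquet" might look like in practice; the class and method names are hypothetical and the PR's actual checksum code may differ.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.CRC32;

// Hypothetical sketch: CRC32 over raw bytes, with no dependency on Lucene codec footers.
final class ParquetChecksumSketch {

    // Checksum of an in-memory byte array.
    static long crc32(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        return crc.getValue();
    }

    // Checksum of a file, streamed in fixed-size chunks so large files are not loaded whole.
    static long crc32(Path file) throws IOException {
        CRC32 crc = new CRC32();
        byte[] buffer = new byte[8192];
        try (InputStream in = Files.newInputStream(file)) {
            int read;
            while ((read = in.read(buffer)) != -1) {
                crc.update(buffer, 0, read);
            }
        }
        return crc.getValue();
    }

    public static void main(String[] args) throws IOException {
        // Placeholder content; a real caller would point at an actual Parquet file.
        Path tmp = Files.createTempFile("sketch", ".parquet");
        try {
            Files.write(tmp, "123456789".getBytes(StandardCharsets.US_ASCII));
            System.out.println(Long.toHexString(crc32(tmp))); // prints cbf43926, the standard CRC-32 check value
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```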

Related Issues

Resolves #[Issue number to be closed when this PR is merged]

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Contributor

github-actions bot commented Nov 5, 2025

❌ Gradle check result for 7c29aef: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Member

mch2 commented Nov 5, 2025

This looks great, but I'm worried our feature branch diff is getting way too large right now, and imo this is not required for us to start pulling this back into main from the fb. Can this stay on a personal fork, so that our feat branch remains primary writes only?

Contributor

github-actions bot commented Nov 7, 2025

❌ Gradle check result for 9446a1c: FAILURE


@github-actions
Contributor

❌ Gradle check result for 662880b: FAILURE


@ask-kamal-nayan ask-kamal-nayan changed the title from "Remote Store parquet upload and replicaiton implementation." to "Remote Store Parquet Upload and Replication Implementation" on Nov 10, 2025
@github-actions
Contributor

❌ Gradle check result for 3387048: FAILURE

