Skip to content

Feature/data format aware read only engine#21720

Open
vishwasgarg18 wants to merge 12 commits into
opensearch-project:mainfrom
vishwasgarg18:feature/data-format-aware-read-only-engine
Open

Feature/data format aware read only engine#21720
vishwasgarg18 wants to merge 12 commits into
opensearch-project:mainfrom
vishwasgarg18:feature/data-format-aware-read-only-engine

Conversation

@vishwasgarg18
Copy link
Copy Markdown
Contributor

@vishwasgarg18 vishwasgarg18 commented May 18, 2026


Description

This PR introduces a data-format-aware read-only engine for warm (tiered) primaries in OpenSearch, along with the tiering service infrastructure needed to safely transition indices between hot and warm tiers.

Core Engine: DataFormatAwareReadOnlyEngine

  • Introduces a purpose-built read-only engine for primary shards on warm indices (isWarmIndex()). The engine operates on a CatalogSnapshot, has no IndexWriter, uses NoOpTranslogManager, and rejects all write operations.
  • Reads route through TieredSubdirectoryAwareDirectory to remote (warm) storage.
  • DataFormatAwareIndexerFactory updated to create the read-only engine for warm primaries.
  • Integration tests and unit tests added for the read-only engine.

Tiering Service Improvements

  • TransportPrepareTieringAction: new broadcast action that flushes and syncs all primaries before tiering begins.
  • TransportHotToWarmTierAction: adds read-only block, prepare step, retry logic, and DFA-aware tiering flow.
  • TieringService: sets auto_expand_replicas during the tiering transition.
  • HotToWarmTieringService: sets read_only_allow_delete on cancel to protect segments.
  • WarmToHotTieringService: removes the read-only block when transitioning back to hot.
  • IndexShard: fixes isRemoteSegmentStoreInSync for DFA using CatalogSnapshot.
  • Added already-warm guard to prevent redundant hot-to-warm tiering requests.
  • Added no-op deletion policy to protect warm-tier segment files.

TieredObjectStore: Fix for Warm PPL Queries

  • try_head_from_registry: uses prefix matching to detect directory HEAD requests and returns NotFound (matching LocalFileSystem behavior), allowing DataFusion to fall through to list_with_cache. Removes the fragile path_str.contains('.') heuristic.

Related Issues

Resolves #

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.


@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 18, 2026

PR Code Analyzer ❗

AI-powered 'Code-Diff-Analyzer' found issues on commit ab301fd.

PathLineSeverityDescription
server/src/main/java/org/opensearch/storage/action/tiering/TransportPrepareTieringAction.java185mediumcheckRequestBlock deliberately returns null, bypassing all index-level cluster block checks. The comment justifies this as intentional (the action runs after a read_only_allow_delete block is set), but unconditionally skipping index-level authorization checks in a transport action is a security control bypass that warrants review — a malicious caller with access to this internal action name could operate on blocked indices.
sandbox/plugins/native-repository-fs/src/main/resources/META-INF/services/org.opensearch.plugins.NativeRemoteObjectStoreProvider9lowNew SPI registration adding FsNativeObjectStorePlugin as a production NativeRemoteObjectStoreProvider. This is a net-new service provider entry in main/resources (not previously in test or main scope), meaning this plugin will be auto-discovered and loaded via Java SPI in all production deployments of this module.
sandbox/plugins/native-repository-azure/src/main/resources/META-INF/services/org.opensearch.plugins.NativeRemoteObjectStoreProvider1lowService provider files for Azure, GCS, and S3 NativeRemoteObjectStoreProviders are moved from internalClusterTest/resources to main/resources. This broadens these SPI registrations from test-only scope to production scope, meaning these providers will now be auto-loaded in production deployments rather than only during integration tests. Same pattern applies to the GCS and S3 files.
server/src/main/java/org/opensearch/storage/action/tiering/TransportHotToWarmTierAction.java107lowaddReadOnlyBlockAndPrepare sets INDEX_BLOCKS_READ_ONLY_ALLOW_DELETE on arbitrary caller-supplied index names via a cluster state update task. Failure paths attempt cleanup but do so asynchronously and best-effort. If cleanup fails (logged as warn), the index is left in a read_only_allow_delete state permanently, which is a potential availability impact vector for an authenticated user with tiering privileges.

The table above displays the top 10 most important findings.

Total: 4 | Critical: 0 | High: 0 | Medium: 1 | Low: 3


Pull Requests Author(s): Please update your Pull Request according to the report above.

Repository Maintainer(s): You can bypass diff analyzer by adding label skip-diff-analyzer after reviewing the changes carefully, then re-run failed actions. To re-enable the analyzer, remove the label, then re-run all actions.


⚠️ Note: The Code-Diff-Analyzer helps protect against potentially harmful code patterns. Please ensure you have thoroughly reviewed the changes beforehand.

Thanks.

@vishwasgarg18 vishwasgarg18 force-pushed the feature/data-format-aware-read-only-engine branch 8 times, most recently from 2d4cb3c to 0ddec48 Compare May 18, 2026 19:18
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 18, 2026

PR Reviewer Guide 🔍

(Review updated until commit 5165667)

Here are some key observations to aid the review process:

🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 No multiple PR themes
⚡ Recommended focus areas for review

Possible Issue

The permanentSnapshotRef is acquired in the constructor and released in closeNoLock(). However, if the constructor fails after acquiring the ref but before setting success = true, the ref is never released because closeNoLock() is not called in the finally block. This leaks the snapshot reference. The finally block only closes readerManagersRef, catalogSnapshotManagerRef, and constructingCommitter, but not permanentSnapshotRef.

public DataFormatAwareReadOnlyEngine(EngineConfig engineConfig) {
    this.logger = Loggers.getLogger(DataFormatAwareReadOnlyEngine.class, engineConfig.getShardId());
    assert engineConfig.isReadOnlyReplica() == false : "DataFormatAwareReadOnlyEngine must only be created for primary shards; shard "
        + engineConfig.getShardId();
    this.engineConfig = engineConfig;
    this.shardId = engineConfig.getShardId();
    this.store = engineConfig.getStore();

    store.incRef();
    Map<DataFormat, EngineReaderManager<?>> readerManagersRef = null;
    CatalogSnapshotManager catalogSnapshotManagerRef = null;
    Committer constructingCommitter = null;
    boolean success = false;
    try {
        // Create committer: isReplica=true opens the index without acquiring write.lock.
        // LuceneCommitterFactory detects isWarmIndex + !isReadOnlyReplica and returns
        // ReadOnlyCommitter, which throws on commit() — this engine never writes commits.
        this.committer = constructingCommitter = engineConfig.getCommitterFactory()
            .getCommitter(new CommitterConfig(engineConfig, () -> {}, true));
        Map<String, String> userData = committer.getLastCommittedData();

        // Catalog file deleter
        FileDeleter compositeDeleter = buildFileDeleter();

        List<CatalogSnapshot> committed = committer.listCommittedSnapshots();

        // Build per-format reader managers (same pattern as NRT)
        DataFormatRegistry registry = engineConfig.getDataFormatRegistry();
        Map<String, Supplier<DataFormatDescriptor>> allDescriptors = registry.getFormatDescriptors(engineConfig.getIndexSettings());
        Map<DataFormat, EngineReaderManager<?>> aggregated = new HashMap<>();
        for (String formatName : allDescriptors.keySet()) {
            DataFormat format = registry.format(formatName);
            aggregated.putAll(
                registry.getReaderManager(
                    new ReaderManagerConfig(
                        Optional.of((IndexStoreProvider) dataFormat -> () -> store),
                        format,
                        registry,
                        store.shardPath(),
                        store.getDataformatAwareStoreHandles()
                    )
                )
            );
        }
        readerManagersRef = Map.copyOf(aggregated);
        this.readerManagers = readerManagersRef;

        // Register reader managers as catalog snapshot lifecycle listeners so they are notified
        // (via installSnapshot → afterRefresh) BEFORE latestCatalogSnapshot is swapped. This
        // guarantees readers are updated before the snapshot becomes externally visible.
        List<CatalogSnapshotLifecycleListener> snapshotListeners = new ArrayList<>(readerManagersRef.values());

        catalogSnapshotManagerRef = new CatalogSnapshotManager(
            committed,
            new NoOpCatalogSnapshotDeletionPolicy(),
            compositeDeleter,
            Map.of(),
            snapshotListeners,
            store.shardPath(),
            committer
        );
        this.catalogSnapshotManager = catalogSnapshotManagerRef;

        // Initialize sequence number tracking
        final SequenceNumbers.CommitInfo seqNoInfo = SequenceNumbers.loadSeqNoInfoFromLuceneCommit(userData.entrySet());
        this.localCheckpointTracker = new LocalCheckpointTracker(seqNoInfo.maxSeqNo, seqNoInfo.localCheckpoint);

        // Initialize no-op translog
        String translogUUID = userData.get(Translog.TRANSLOG_UUID_KEY);
        this.translogManager = new NoOpTranslogManager(
            shardId,
            readLock,
            this::ensureOpen,
            new TranslogStats(0, 0, 0, 0, 0),
            Translog.EMPTY_TRANSLOG_SNAPSHOT,
            translogUUID != null ? translogUUID : "readonly-" + shardId,
            true
        );

        // History UUID — fabricate if absent (same as NRT engine) for recovery protocol compliance
        this.historyUUID = userData.get(Engine.HISTORY_UUID_KEY);
        if (this.historyUUID == null) {
            this.historyUUID = UUIDs.randomBase64UUID();
        }

        // Acquire permanent snapshot ref — held for engine's lifetime, released in closeNoLock.
        // The snapshot never changes; all queries share this single ref via the cached reader.
        this.permanentSnapshotRef = catalogSnapshotManagerRef.acquireSnapshot();
        Map<DataFormat, Object> readers = new HashMap<>();
        for (Map.Entry<DataFormat, EngineReaderManager<?>> entry : readerManagersRef.entrySet()) {
            Object r = entry.getValue().getReader(permanentSnapshotRef.get());
            if (r != null) {
                readers.put(entry.getKey(), r);
            }
        }
        this.reader = new DataFormatAwareEngine.DataFormatAwareReader(permanentSnapshotRef, readers);
        success = true;
        logger.info("Created DataFormatAwareReadOnlyEngine");
    } catch (IOException e) {
        throw new EngineCreationFailureException(shardId, "failed to create read-only engine", e);
    } finally {
        if (success == false) {
            if (readerManagersRef != null) {
                IOUtils.closeWhileHandlingException(readerManagersRef.values());
            }
            IOUtils.closeWhileHandlingException(catalogSnapshotManagerRef);
            IOUtils.closeWhileHandlingException(constructingCommitter);
            if (isClosed.get() == false) {
                store.decRef();
            }
        }
    }
Race Condition

In addReadOnlyBlockAndPrepare(), the read-only block is added via a cluster state update task. If the index is deleted between the task submission and execution, indexMetadata will be null at line 109, causing execute() to throw. However, clusterStateProcessed() at line 145 will still be called, attempting to run executePrepareTiering() on a deleted index. This can cause unexpected failures or resource leaks if prepare logic does not handle missing indices gracefully.

private void addReadOnlyBlockAndPrepare(
    IndexTieringRequest request,
    ClusterState state,
    ActionListener<AcknowledgedResponse> listener
) {
    clusterService.submitStateUpdateTask(
        "add-read-only-block-for-tiering [" + request.getIndex() + "]",
        new ClusterStateUpdateTask(Priority.URGENT) {
            @Override
            public ClusterState execute(ClusterState currentState) {
                IndexMetadata indexMetadata = currentState.metadata().index(request.getIndex());
                if (indexMetadata == null) {
                    throw new IllegalStateException("Index [" + request.getIndex() + "] not found");
                }
                Settings.Builder indexSettingsBuilder = Settings.builder()
                    .put(indexMetadata.getSettings())
                    .put(IndexMetadata.INDEX_BLOCKS_READ_ONLY_ALLOW_DELETE_SETTING.getKey(), true);

                IndexMetadata.Builder indexMetadataBuilder = IndexMetadata.builder(indexMetadata)
                    .settings(indexSettingsBuilder)
                    .settingsVersion(1 + indexMetadata.getSettingsVersion());

                Metadata.Builder metadataBuilder = Metadata.builder(currentState.metadata()).put(indexMetadataBuilder);
                ClusterBlocks.Builder blocks = ClusterBlocks.builder().blocks(currentState.blocks());
                blocks.addIndexBlock(request.getIndex(), IndexMetadata.INDEX_READ_ONLY_ALLOW_DELETE_BLOCK);

                return ClusterState.builder(currentState).metadata(metadataBuilder).blocks(blocks).build();
            }

            @Override
            public void clusterStateProcessed(String source, ClusterState oldState, ClusterState newState) {
                logger.info("Read-only block added for index [{}], proceeding with pre-tiering sync", request.getIndex());
                executePrepareTiering(request, newState, listener, 1);
            }

            @Override
            public void onFailure(String source, Exception e) {
                logger.error("Failed to add read-only block for index [{}]", request.getIndex());
                listener.onFailure(
                    new IllegalStateException(
                        "Failed to add read-only block for DFA index [" + request.getIndex() + "]. Please retry.",
                        e
                    )
                );
            }
        }
    );
}
Possible Issue

In removeReadOnlyBlock(), if the cluster state update task fails (line 271), the read-only block remains on the index. The warning at line 272 suggests manual removal, but there is no mechanism to retry or alert operators. If this happens during a tiering failure, the index is left in a stuck read-only state, requiring manual intervention. Consider adding a retry mechanism or more prominent alerting.

private void removeReadOnlyBlock(String indexName) {
    clusterService.submitStateUpdateTask(
        "remove-read-only-block-for-tiering [" + indexName + "]",
        new ClusterStateUpdateTask(Priority.URGENT) {
            @Override
            public ClusterState execute(ClusterState currentState) {
                IndexMetadata indexMetadata = currentState.metadata().index(indexName);
                if (indexMetadata == null) {
                    return currentState;
                }
                Settings.Builder indexSettingsBuilder = Settings.builder()
                    .put(indexMetadata.getSettings())
                    .put(IndexMetadata.INDEX_BLOCKS_READ_ONLY_ALLOW_DELETE_SETTING.getKey(), false);

                IndexMetadata.Builder indexMetadataBuilder = IndexMetadata.builder(indexMetadata)
                    .settings(indexSettingsBuilder)
                    .settingsVersion(1 + indexMetadata.getSettingsVersion());

                Metadata.Builder metadataBuilder = Metadata.builder(currentState.metadata()).put(indexMetadataBuilder);
                ClusterBlocks.Builder blocks = ClusterBlocks.builder().blocks(currentState.blocks());
                blocks.removeIndexBlock(indexName, IndexMetadata.INDEX_READ_ONLY_ALLOW_DELETE_BLOCK);

                return ClusterState.builder(currentState).metadata(metadataBuilder).blocks(blocks).build();
            }

            @Override
            public void onFailure(String source, Exception e) {
                logger.warn(
                    "Failed to remove read-only block for index [{}] after tiering failure: {}. "
                        + "Block can be removed manually via index settings.",
                    indexName,
                    e
                );
            }

            @Override
            public void clusterStateProcessed(String source, ClusterState oldState, ClusterState newState) {
                logger.info("Read-only block removed for index [{}] after tiering failure", indexName);
            }
        }
    );
}
Possible Issue

In syncAndFlush(), the code flushes before refreshing (line 154). The comment at line 152 states that flushing first ensures committed segments exist before remote store sync. However, waitForRemoteSync() is called after refresh() (line 164), not after flush(). If refresh() fails between flush and remote sync, the remote store may be out of sync with the local commit, but the method does not detect this. Consider moving the remote sync check immediately after flush or adding error handling for refresh failures.

private void syncAndFlush(IndexShard indexShard, ShardRouting shardRouting) throws IOException {
    logger.trace("Syncing and flushing shard [{}]", shardRouting.shardId());
    indexShard.sync();
    // Flush before refresh: committed segments_N must exist before waitForRemoteStoreSync uploads them.
    // Refreshing first would expose segments not yet committed — the remote store sync would miss them.
    indexShard.flush(new FlushRequest().force(true).waitIfOngoing(true));
    indexShard.refresh("prepare_tiering");
}

@vishwasgarg18 vishwasgarg18 force-pushed the feature/data-format-aware-read-only-engine branch from 0ddec48 to 7bc2965 Compare May 18, 2026 19:20
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 18, 2026

PR Code Suggestions ✨

Latest suggestions up to 5165667

Explore these optional code suggestions:

CategorySuggestion                                                                                                                                    Impact
General
Make permit acquisition timeout configurable

The 30-second timeout for acquiring primary permits may be insufficient for shards
with long-running queries or heavy indexing load. If the timeout is exceeded, the
tiering operation fails and must be retried manually. Consider making this timeout
configurable via a cluster setting to allow operators to tune it based on their
workload characteristics.

server/src/main/java/org/opensearch/storage/action/tiering/TransportPrepareTieringAction.java [56-138]

-private static final TimeValue PERMITS_ACQUIRE_TIMEOUT = TimeValue.timeValueSeconds(30);
-...
-indexShard.acquireAllPrimaryOperationsPermits(permitFuture, PERMITS_ACQUIRE_TIMEOUT);
+// Add a cluster setting for permit acquisition timeout
+public static final Setting<TimeValue> PREPARE_TIERING_PERMITS_TIMEOUT_SETTING = Setting.timeSetting(
+    "cluster.tiering.prepare.permits_timeout",
+    TimeValue.timeValueSeconds(30),
+    TimeValue.timeValueSeconds(1),
+    Setting.Property.NodeScope,
+    Setting.Property.Dynamic
+);
Suggestion importance[1-10]: 7

__

Why: Good suggestion to make the 30-second timeout configurable. Different workloads may need different timeouts, and a fixed value could cause unnecessary failures. The suggested cluster setting approach is appropriate and would improve operational flexibility.

Medium
Avoid blocking transport thread with synchronous validation

The preflight validation runs synchronously in the transport thread before the
cluster state update task. If validation is slow (e.g., checking disk space across
many nodes), it blocks the transport thread. Consider moving validation into the
cluster state update task itself or making it asynchronous to avoid blocking the
transport layer.

server/src/main/java/org/opensearch/storage/action/tiering/TransportHotToWarmTierAction.java [93-101]

-try {
-    Index index = resolveRequestIndex(indexNameExpressionResolver, request.getIndex(), state);
-    hotToWarmTieringService.preflightValidate(state, index);
-} catch (Exception e) {
-    logger.info("Preflight validation failed for DFA index [{}]: {}", request.getIndex(), e.getMessage());
-    listener.onFailure(e);
-    return;
-}
+// Move validation into the cluster state update task's execute() method
+// to avoid blocking the transport thread on potentially slow I/O operations
+addReadOnlyBlockAndPrepare(request, state, listener);
Suggestion importance[1-10]: 6

__

Why: Valid concern about blocking the transport thread with potentially slow validation. However, the validation is already duplicated inside the cluster state task (line 95 comment mentions "validation also runs inside TieringService.tier()"), so moving it entirely would require careful coordination. The suggestion has merit but needs more design consideration.

Low
Make warm replica count configurable or dynamic

The hardcoded replica count of 1 in auto_expand_replicas may not be appropriate for
all deployments. For clusters with many warm nodes, operators may want more replicas
for better query parallelism and fault tolerance. Consider making the max replica
count configurable via a cluster setting or deriving it from the number of available
warm nodes.

server/src/main/java/org/opensearch/storage/tiering/TieringService.java [534]

-indexSettingsBuilder.put(IndexMetadata.SETTING_AUTO_EXPAND_REPLICAS, "0-" + 1);
+int maxReplicas = Math.min(1, clusterState.nodes().getWarmNodes().size() - 1);
+indexSettingsBuilder.put(IndexMetadata.SETTING_AUTO_EXPAND_REPLICAS, "0-" + maxReplicas);
Suggestion importance[1-10]: 6

__

Why: The hardcoded replica count of 1 may not suit all deployments. However, the suggestion to derive from clusterState.nodes().getWarmNodes().size() could be problematic if warm nodes are added/removed dynamically. A configurable setting would be safer than dynamic calculation. The suggestion has merit but the implementation needs refinement.

Low
Clarify committer configuration for warm primary

The isReplica=true parameter passed to CommitterConfig is misleading for a warm
primary engine. This flag causes LuceneCommitterFactory to return ReadOnlyCommitter,
which is correct, but the semantic intent is unclear. Consider adding a dedicated
isReadOnlyPrimary flag to CommitterConfig to explicitly distinguish warm primaries
from replicas, improving code clarity and maintainability.

server/src/main/java/org/opensearch/index/engine/DataFormatAwareReadOnlyEngine.java [139-140]

 this.committer = constructingCommitter = engineConfig.getCommitterFactory()
-    .getCommitter(new CommitterConfig(engineConfig, () -> {}, true));
+    .getCommitter(new CommitterConfig(engineConfig, () -> {}, false, true)); // isReplica=false, isReadOnlyPrimary=true
Suggestion importance[1-10]: 5

__

Why: The suggestion correctly identifies that isReplica=true is semantically misleading for a warm primary. However, adding a new parameter would require changes across multiple classes and the current approach works correctly. The impact is limited to code clarity rather than correctness.

Low

Previous suggestions

Suggestions up to commit 4b088ef
CategorySuggestion                                                                                                                                    Impact
General
Avoid unbounded recursion in retry logic

The retry logic uses unbounded recursion which could cause stack overflow if
MAX_PREPARE_RETRIES is set to a large value. Consider using an iterative approach
with a loop or scheduling retries via the thread pool to avoid stack depth issues.

server/src/main/java/org/opensearch/storage/action/tiering/TransportHotToWarmTierAction.java [165-192]

 private void executePrepareTiering(
     IndexTieringRequest request,
     ClusterState state,
     ActionListener<AcknowledgedResponse> listener,
     int attempt
 ) {
     PrepareTieringRequest prepareTieringRequest = new PrepareTieringRequest(request.getIndex());
     prepareTieringRequest.timeout(request.timeout());
 
     prepareTieringAction.execute(prepareTieringRequest, new ActionListener<BroadcastResponse>() {
         @Override
         public void onResponse(BroadcastResponse broadcastResponse) {
             if (broadcastResponse.getFailedShards() > 0) {
                 if (attempt < MAX_PREPARE_RETRIES) {
                     logger.warn(
                         "Pre-tiering sync attempt [{}/{}] had {} failed shard(s) for index [{}], retrying",
                         attempt,
                         MAX_PREPARE_RETRIES,
                         broadcastResponse.getFailedShards(),
                         request.getIndex()
                     );
-                    executePrepareTiering(request, state, listener, attempt + 1);
+                    // Schedule retry on thread pool to avoid stack overflow
+                    threadPool.executor(ThreadPool.Names.MANAGEMENT).execute(() -> 
+                        executePrepareTiering(request, state, listener, attempt + 1)
+                    );
                     return;
                 }
Suggestion importance[1-10]: 7

__

Why: Valid concern about potential stack overflow with recursive retries. However, with MAX_PREPARE_RETRIES=3, stack depth is minimal. The suggestion to use thread pool scheduling is a good defensive practice for maintainability if the retry count increases.

Medium
Use async pattern for permit acquisition

The actionGet() call blocks the current thread until permits are acquired or timeout
occurs. This can block the transport thread pool. Consider using the async callback
pattern with ActionListener to avoid blocking transport threads during permit
acquisition.

server/src/main/java/org/opensearch/storage/action/tiering/TransportPrepareTieringAction.java [135-143]

-private Releasable acquirePrimaryPermits(IndexShard indexShard, ShardRouting shardRouting) throws IOException {
-    try {
-        PlainActionFuture<Releasable> permitFuture = new PlainActionFuture<>();
-        indexShard.acquireAllPrimaryOperationsPermits(permitFuture, PERMITS_ACQUIRE_TIMEOUT);
-        return permitFuture.actionGet();
-    } catch (Exception e) {
-        throw new IOException("Failed to acquire primary operation permits for shard [" + shardRouting.shardId() + "]", e);
-    }
+private void acquirePrimaryPermitsAsync(
+    IndexShard indexShard, 
+    ShardRouting shardRouting,
+    ActionListener<Releasable> listener
+) {
+    indexShard.acquireAllPrimaryOperationsPermits(
+        ActionListener.wrap(
+            listener::onResponse,
+            e -> listener.onFailure(new IOException(
+                "Failed to acquire primary operation permits for shard [" + shardRouting.shardId() + "]", e
+            ))
+        ),
+        PERMITS_ACQUIRE_TIMEOUT
+    );
 }
Suggestion importance[1-10]: 6

__

Why: The suggestion correctly identifies that blocking actionGet() can tie up transport threads. However, this code runs in ThreadPool.Names.MANAGEMENT (as specified in the action constructor), which is designed for blocking operations. The async pattern would be better but is not critical here.

Low
Extract hardcoded replica count constant

The hardcoded replica count of 1 in "0-" + 1 should be extracted as a named constant
to improve maintainability and make the tiering replica policy explicit. This makes
it easier to adjust the policy in the future.

server/src/main/java/org/opensearch/storage/tiering/TieringService.java [531-534]

+private static final int TIERING_MAX_REPLICAS = 1;
+
 void updateIndexMetadataForTieringStart(
     final Metadata.Builder metadataBuilder,
     final RoutingTable.Builder routingTableBuilder,
     final IndexMetadata indexMetadata,
     final Index index
 ) {
     try {
         // 1. Build settings
         Settings.Builder indexSettingsBuilder = Settings.builder().put(indexMetadata.getSettings()).put(getTieringStartSettingsToAdd());
 
         // 2. Handle replica updates using auto_expand_replicas
-        indexSettingsBuilder.put(IndexMetadata.SETTING_AUTO_EXPAND_REPLICAS, "0-" + 1);
+        indexSettingsBuilder.put(IndexMetadata.SETTING_AUTO_EXPAND_REPLICAS, "0-" + TIERING_MAX_REPLICAS);
Suggestion importance[1-10]: 5

__

Why: Extracting the hardcoded 1 as a named constant improves maintainability. However, the value is used only once in this context and the inline expression "0-" + 1 is clear enough. The improvement is modest.

Low
Clarify assertion intent with comment

The assertion checks isReadOnlyReplica() == false, but this engine is intended for
warm primaries. Consider renaming the config flag or adding a comment explaining
that "read-only replica" in this context means "replica that receives segments" (not
warm primary). The current assertion message is correct but the flag name is
confusing.

server/src/main/java/org/opensearch/index/engine/DataFormatAwareReadOnlyEngine.java [119-123]

 public DataFormatAwareReadOnlyEngine(EngineConfig engineConfig) {
     this.logger = Loggers.getLogger(DataFormatAwareReadOnlyEngine.class, engineConfig.getShardId());
+    // isReadOnlyReplica() == false ensures this is a primary shard (warm primaries use this engine)
     assert engineConfig.isReadOnlyReplica() == false : "DataFormatAwareReadOnlyEngine must only be created for primary shards; shard "
         + engineConfig.getShardId();
     this.engineConfig = engineConfig;
     this.shardId = engineConfig.getShardId();
     this.store = engineConfig.getStore();
Suggestion importance[1-10]: 4

__

Why: The suggestion adds a clarifying comment about the assertion's intent. While helpful for code readability, the assertion message already explains the constraint clearly. The improvement is marginal.

Low
Suggestions up to commit ff0459c
CategorySuggestion                                                                                                                                    Impact
General
Preserve all request parameters on retry

The retry logic creates a new PrepareTieringRequest on each attempt but only sets
the timeout. If the original request had other settings (e.g., indicesOptions), they
would be lost on retry. Ensure all relevant request parameters are preserved across
retries to maintain consistent behavior.

server/src/main/java/org/opensearch/storage/action/tiering/TransportHotToWarmTierAction.java [176-177]

 private void executePrepareTiering(
     IndexTieringRequest request,
     ClusterState state,
     ActionListener<AcknowledgedResponse> listener,
     int attempt
 ) {
     PrepareTieringRequest prepareTieringRequest = new PrepareTieringRequest(request.getIndex());
     prepareTieringRequest.timeout(request.timeout());
+    prepareTieringRequest.indicesOptions(request.indicesOptions());
Suggestion importance[1-10]: 5

__

Why: Valid point about preserving indicesOptions across retries. However, PrepareTieringRequest extends BroadcastRequest, which has default indicesOptions that are appropriate for internal operations. The original IndexTieringRequest may have user-specified options that should be propagated. This is a minor correctness issue worth addressing.

Low
Avoid blocking transport thread with validation

The preflight validation runs synchronously in the transport thread before any
cluster state mutation. If validation is expensive (e.g., disk space checks across
many nodes), this could block the transport thread. Consider moving the validation
into the cluster state update task or making it asynchronous to avoid blocking the
transport layer.

server/src/main/java/org/opensearch/storage/action/tiering/TransportHotToWarmTierAction.java [96-100]

-try {
-    Index index = resolveRequestIndex(indexNameExpressionResolver, request.getIndex(), state);
-    hotToWarmTieringService.preflightValidate(state, index);
-} catch (Exception e) {
-    logger.info("Preflight validation failed for DFA index [{}]: {}", request.getIndex(), e.getMessage());
-    listener.onFailure(e);
-    return;
-}
+// Move validation inside the cluster state update task to avoid blocking transport thread
+clusterService.submitStateUpdateTask(
+    "validate-and-add-read-only-block [" + request.getIndex() + "]",
+    new ClusterStateUpdateTask(Priority.URGENT) {
+        @Override
+        public ClusterState execute(ClusterState currentState) {
+            Index index = resolveRequestIndex(indexNameExpressionResolver, request.getIndex(), currentState);
+            hotToWarmTieringService.preflightValidate(currentState, index);
+            // ... rest of block addition logic
+        }
+        // ... onFailure, clusterStateProcessed
+    }
+);
Suggestion importance[1-10]: 4

__

Why: The suggestion raises a valid concern about blocking the transport thread, but the impact is moderate. The preflight validation is designed to be quick (checking cluster state for warm node capacity), not expensive I/O. Moving it into the cluster state task would complicate error handling and lose the early-rejection benefit. The current design is intentional.

Low
Use async pattern for permit acquisition

The actionGet() call blocks the thread until permits are acquired or timeout occurs.
If multiple shards are being prepared in parallel on the same node, this could lead
to thread pool exhaustion. Consider using the async callback pattern instead of
blocking to allow the thread to handle other work while waiting for permits.

server/src/main/java/org/opensearch/storage/action/tiering/TransportPrepareTieringAction.java [135-143]

-private Releasable acquirePrimaryPermits(IndexShard indexShard, ShardRouting shardRouting) throws IOException {
-    try {
-        PlainActionFuture<Releasable> permitFuture = new PlainActionFuture<>();
-        indexShard.acquireAllPrimaryOperationsPermits(permitFuture, PERMITS_ACQUIRE_TIMEOUT);
-        return permitFuture.actionGet();
-    } catch (Exception e) {
-        throw new IOException("Failed to acquire primary operation permits for shard [" + shardRouting.shardId() + "]", e);
-    }
+private void acquirePrimaryPermitsAsync(IndexShard indexShard, ShardRouting shardRouting, ActionListener<Releasable> listener) {
+    indexShard.acquireAllPrimaryOperationsPermits(
+        ActionListener.wrap(
+            listener::onResponse,
+            e -> listener.onFailure(new IOException("Failed to acquire primary operation permits for shard [" + shardRouting.shardId() + "]", e))
+        ),
+        PERMITS_ACQUIRE_TIMEOUT
+    );
 }
Suggestion importance[1-10]: 3

__

Why: The suggestion correctly identifies a blocking call, but the context matters. This code runs in a broadcast shard operation handler, which already expects blocking I/O (flush, sync). The 30s timeout is generous and thread pool exhaustion is unlikely given the limited concurrency of tiering operations. Refactoring to async would add complexity without clear benefit.

Low
Clarify primary shard assertion logic

The assertion checks isReadOnlyReplica() == false but the engine is named
"ReadOnly". This is confusing because the flag name suggests replica status, but the
check is actually verifying primary status. The assertion message clarifies intent,
but the flag name creates cognitive dissonance. Consider renaming the flag or adding
a helper method like isPrimary() to make the intent clearer at the call site.

server/src/main/java/org/opensearch/index/engine/DataFormatAwareReadOnlyEngine.java [121-122]

 public DataFormatAwareReadOnlyEngine(EngineConfig engineConfig) {
     this.logger = Loggers.getLogger(DataFormatAwareReadOnlyEngine.class, engineConfig.getShardId());
-    assert engineConfig.isReadOnlyReplica() == false
+    assert engineConfig.isPrimary()
         : "DataFormatAwareReadOnlyEngine must only be created for primary shards; shard " + engineConfig.getShardId();
Suggestion importance[1-10]: 2

__

Why: The suggestion points out a naming confusion, but the assertion is correct and the message is clear. The flag isReadOnlyReplica is an existing EngineConfig field used throughout the codebase. Adding a helper method isPrimary() would require changes to EngineConfig and is out of scope for this PR. The cognitive dissonance is minor and mitigated by the assertion message.

Low
Suggestions up to commit a251956
CategorySuggestion                                                                                                                                    Impact
Possible issue
Add timeout to blocking future call

The actionGet() call blocks indefinitely if the future never completes. Although
PERMITS_ACQUIRE_TIMEOUT is passed to acquireAllPrimaryOperationsPermits, the future
itself has no timeout. If the permit acquisition hangs, this thread will block
forever. Use actionGet(timeout) to enforce a timeout on the blocking call.

server/src/main/java/org/opensearch/storage/action/tiering/TransportPrepareTieringAction.java [135-143]

 private Releasable acquirePrimaryPermits(IndexShard indexShard, ShardRouting shardRouting) throws IOException {
     try {
         PlainActionFuture<Releasable> permitFuture = new PlainActionFuture<>();
         indexShard.acquireAllPrimaryOperationsPermits(permitFuture, PERMITS_ACQUIRE_TIMEOUT);
-        return permitFuture.actionGet();
+        return permitFuture.actionGet(PERMITS_ACQUIRE_TIMEOUT);
     } catch (Exception e) {
         throw new IOException("Failed to acquire primary operation permits for shard [" + shardRouting.shardId() + "]", e);
     }
 }
Suggestion importance[1-10]: 7

__

Why: Valid concern. While PERMITS_ACQUIRE_TIMEOUT is passed to acquireAllPrimaryOperationsPermits, the actionGet() call on line 139 blocks indefinitely if the future never completes. Using actionGet(PERMITS_ACQUIRE_TIMEOUT) ensures the blocking call respects the timeout, preventing indefinite hangs.

Medium
General
Add logging for missing index

The removeReadOnlyBlock method is called on prepare failure to clean up the
read-only block. However, if the index is deleted between the failure and this
cleanup attempt, indexMetadata will be null and the method silently returns. This is
correct, but the method should log a warning to aid debugging when the index
disappears unexpectedly during cleanup.

server/src/main/java/org/opensearch/storage/action/tiering/TransportHotToWarmTierAction.java [250-254]

 private void removeReadOnlyBlock(String indexName) {
     clusterService.submitStateUpdateTask(
         "remove-read-only-block-for-tiering [" + indexName + "]",
         new ClusterStateUpdateTask(Priority.URGENT) {
             @Override
             public ClusterState execute(ClusterState currentState) {
                 IndexMetadata indexMetadata = currentState.metadata().index(indexName);
                 if (indexMetadata == null) {
+                    logger.warn("Index [{}] not found during read-only block removal", indexName);
                     return currentState;
                 }
Suggestion importance[1-10]: 4

__

Why: The suggestion improves observability by logging when an index is unexpectedly missing during cleanup. This aids debugging in edge cases (e.g., concurrent index deletion). However, the impact is minor since the method already handles null metadata gracefully.

Low
Guard against excessive recursion depth

The executePrepareTiering method recursively retries on failure but does not guard
against stack overflow if MAX_PREPARE_RETRIES is set to a very large value. Consider
using an iterative approach or adding a safeguard to prevent excessive recursion
depth.

server/src/main/java/org/opensearch/storage/action/tiering/TransportHotToWarmTierAction.java [165-177]

 private void executePrepareTiering(
     IndexTieringRequest request,
     ClusterState state,
     ActionListener<AcknowledgedResponse> listener,
     int attempt
 ) {
+    if (attempt > MAX_PREPARE_RETRIES) {
+        throw new IllegalStateException("Retry limit exceeded");
+    }
     PrepareTieringRequest prepareTieringRequest = new PrepareTieringRequest(request.getIndex());
     prepareTieringRequest.timeout(request.timeout());
Suggestion importance[1-10]: 2

__

Why: The suggestion is technically correct but has minimal impact. MAX_PREPARE_RETRIES is hardcoded to 3 (line 48), so recursion depth is bounded and stack overflow is not a realistic concern. The suggested guard is redundant given the existing retry limit checks in lines 164 and 216.

Low
Suggestions up to commit a526ff4
CategorySuggestion                                                                                                                                    Impact
General
Release lock before notifying listeners

The write lock is held during the entire replication finalization, including
listener notifications. If a refresh listener performs expensive I/O or blocks, this
could delay other operations waiting for the write lock. Consider releasing the lock
before notifying listeners or ensuring listeners are non-blocking.

server/src/main/java/org/opensearch/index/engine/DataFormatAwareReadOnlyEngine.java [324-346]

 public void finalizeReplication(CatalogSnapshot incoming) throws IOException {
+    final long maxSeqNo;
+    final String incomingHistoryUUID;
     try (ReleasableLock lock = writeLock.acquire()) {
         ensureOpen();
-        final long maxSeqNo = Long.parseLong(incoming.getUserData().get(SequenceNumbers.MAX_SEQ_NO));
-
-        notifyRefreshListenersBefore();
-        boolean success = false;
-        try {
-            catalogSnapshotManager.applyReplicationSnapshot(incoming);
-            success = true;
-        } finally {
-            notifyRefreshListenersAfter(success);
+        maxSeqNo = Long.parseLong(incoming.getUserData().get(SequenceNumbers.MAX_SEQ_NO));
+        incomingHistoryUUID = incoming.getUserData().get(Engine.HISTORY_UUID_KEY);
+        
+        catalogSnapshotManager.applyReplicationSnapshot(incoming);
+        if (incomingHistoryUUID != null) {
+            this.historyUUID = incomingHistoryUUID;
         }
-        ...
+        localCheckpointTracker.fastForwardProcessedSeqNo(maxSeqNo);
     }
+    // Notify listeners outside the lock
+    notifyRefreshListenersBefore();
+    notifyRefreshListenersAfter(true);
 }
Suggestion importance[1-10]: 6

__

Why: Valid concern about holding the write lock during listener notifications. However, the current pattern matches DataFormatAwareNRTReplicationEngine for consistency. Listeners are expected to be lightweight (e.g., updating stats). Changing this would require careful analysis of listener contracts and potential race conditions.

Low
Avoid blocking transport thread with validation

The preflight validation runs synchronously in the transport thread before any
cluster state mutation. If validation is expensive (e.g., disk space checks across
many nodes), this could block the transport thread. Consider moving the validation
into the cluster state update task or making it async to avoid blocking the
transport layer.

server/src/main/java/org/opensearch/storage/action/tiering/TransportHotToWarmTierAction.java [89-100]

-try {
-    Index index = resolveRequestIndex(indexNameExpressionResolver, request.getIndex(), state);
-    hotToWarmTieringService.preflightValidate(state, index);
-} catch (Exception e) {
-    logger.info("Preflight validation failed for DFA index [{}]: {}", request.getIndex(), e.getMessage());
-    listener.onFailure(e);
-    return;
-}
+// Move validation into the cluster state update task or use async validation
+clusterService.submitStateUpdateTask("validate-and-add-block [" + request.getIndex() + "]", 
+    new ClusterStateUpdateTask(Priority.URGENT) {
+        @Override
+        public ClusterState execute(ClusterState currentState) {
+            try {
+                Index index = resolveRequestIndex(indexNameExpressionResolver, request.getIndex(), currentState);
+                hotToWarmTieringService.preflightValidate(currentState, index);
+                return addReadOnlyBlock(currentState, request.getIndex());
+            } catch (Exception e) {
+                throw new IllegalStateException("Validation failed", e);
+            }
+        }
+        // ... onFailure, clusterStateProcessed handlers
+    });
Suggestion importance[1-10]: 5

__

Why: The suggestion correctly identifies that synchronous validation in the transport thread could be a bottleneck. However, the PR already includes double-validation (preflight + inside cluster state task) for TOCTOU safety. Moving validation entirely into the cluster state task would eliminate the early-rejection benefit. The current design is a deliberate trade-off.

Low
Use async permit acquisition for concurrency

The actionGet() call blocks the thread until permits are acquired or timeout occurs.
If multiple shards are processed sequentially on the same node, this could cause
head-of-line blocking. Consider using async permit acquisition with callbacks to
allow concurrent processing of multiple shards on the same node.

server/src/main/java/org/opensearch/storage/action/tiering/TransportPrepareTieringAction.java [135-143]

-private Releasable acquirePrimaryPermits(IndexShard indexShard, ShardRouting shardRouting) throws IOException {
-    try {
-        PlainActionFuture<Releasable> permitFuture = new PlainActionFuture<>();
-        indexShard.acquireAllPrimaryOperationsPermits(permitFuture, PERMITS_ACQUIRE_TIMEOUT);
-        return permitFuture.actionGet();
-    } catch (Exception e) {
-        throw new IOException("Failed to acquire primary operation permits for shard [" + shardRouting.shardId() + "]", e);
-    }
+private void acquirePrimaryPermitsAsync(IndexShard indexShard, ShardRouting shardRouting, 
+                                        ActionListener<Releasable> listener) {
+    indexShard.acquireAllPrimaryOperationsPermits(
+        ActionListener.wrap(
+            listener::onResponse,
+            e -> listener.onFailure(new IOException(
+                "Failed to acquire primary operation permits for shard [" + shardRouting.shardId() + "]", e))
+        ),
+        PERMITS_ACQUIRE_TIMEOUT
+    );
 }
Suggestion importance[1-10]: 4

__

Why: The suggestion identifies a potential concurrency improvement, but TransportBroadcastByNodeAction already processes shards in parallel across nodes. The blocking actionGet() only affects sequential processing within a single node's shard batch, which is acceptable for this use case. The added complexity of async callbacks may not justify the marginal gain.

Low
Use iterative retry instead of recursion

The retry logic uses recursive calls to executePrepareTiering, which could lead to
deep call stacks if MAX_PREPARE_RETRIES is increased. Consider using an iterative
approach or a scheduled retry mechanism to avoid potential stack overflow issues and
improve debuggability.

server/src/main/java/org/opensearch/storage/action/tiering/TransportHotToWarmTierAction.java [165-237]

-private void executePrepareTiering(
+private void executePrepareTieringWithRetry(
     IndexTieringRequest request,
     ClusterState state,
-    ActionListener<AcknowledgedResponse> listener,
-    int attempt
+    ActionListener<AcknowledgedResponse> listener
 ) {
-    PrepareTieringRequest prepareTieringRequest = new PrepareTieringRequest(request.getIndex());
-    prepareTieringRequest.timeout(request.timeout());
+    AtomicInteger attempt = new AtomicInteger(1);
+    ActionListener<BroadcastResponse> retryListener = new ActionListener<>() {
+        @Override
+        public void onResponse(BroadcastResponse response) {
+            if (response.getFailedShards() > 0 && attempt.get() < MAX_PREPARE_RETRIES) {
+                attempt.incrementAndGet();
+                prepareTieringAction.execute(new PrepareTieringRequest(request.getIndex()), this);
+            } else if (response.getFailedShards() > 0) {
+                // final failure
+            } else {
+                // success
+            }
+        }
+        // ... onFailure with similar retry logic
+    };
+    prepareTieringAction.execute(new PrepareTieringRequest(request.getIndex()), retryListener);
+}
 
-    prepareTieringAction.execute(prepareTieringRequest, new ActionListener<BroadcastResponse>() {
-        @Override
-        public void onResponse(BroadcastResponse broadcastResponse) {
-            if (broadcastResponse.getFailedShards() > 0) {
-                if (attempt < MAX_PREPARE_RETRIES) {
-                    ...
-                    executePrepareTiering(request, state, listener, attempt + 1);
-                    return;
-                }
-
Suggestion importance[1-10]: 3

__

Why: The suggestion raises a theoretical concern about stack depth with MAX_PREPARE_RETRIES=3. In practice, 3 recursive calls pose no stack overflow risk. The current recursive approach is clearer and more maintainable than the proposed iterative version with AtomicInteger. The improved code also has incomplete error handling logic.

Low
Suggestions up to commit 9035f42
CategorySuggestion                                                                                                                                    Impact
General
Warn when fabricating missing historyUUID

Fabricating a historyUUID when absent may cause issues during peer recovery or
replica promotion if the primary and replica have different fabricated UUIDs.
Consider logging a warning when fabrication occurs, or failing engine creation if
historyUUID is required for correctness in the warm tier recovery protocol.

server/src/main/java/org/opensearch/index/engine/DataFormatAwareReadOnlyEngine.java [196-199]

 if (this.historyUUID == null) {
+    logger.warn("historyUUID missing from commit for shard [{}], fabricating new UUID. This may cause recovery issues.", shardId);
     this.historyUUID = UUIDs.randomBase64UUID();
 }
Suggestion importance[1-10]: 7

__

Why: Valid suggestion to add a warning when historyUUID is fabricated. This matches the pattern used in DataFormatAwareNRTReplicationEngine and helps with debugging recovery issues. The concern about replica promotion is legitimate, though the fabrication logic mirrors the existing NRT engine behavior.

Medium
Add retry for read-only block removal

If removeReadOnlyBlock fails (e.g., cluster state update rejected), the index
remains in a stuck read-only state. The user must manually intervene. Consider
adding a retry mechanism or a background task that periodically attempts to remove
orphaned read-only blocks from failed tiering operations to improve resilience.

server/src/main/java/org/opensearch/storage/action/tiering/TransportHotToWarmTierAction.java [245-286]

-private void removeReadOnlyBlock(String indexName) {
+// Add retry logic or background cleanup task for orphaned read-only blocks
+private void removeReadOnlyBlockWithRetry(String indexName, int attempt) {
+    if (attempt > 3) {
+        logger.error("Failed to remove read-only block after 3 attempts for index [{}]. Manual intervention required.", indexName);
+        return;
+    }
     clusterService.submitStateUpdateTask(
         "remove-read-only-block-for-tiering [" + indexName + "]",
         new ClusterStateUpdateTask(Priority.URGENT) {
             @Override
             public void onFailure(String source, Exception e) {
-                logger.warn(
-                    "Failed to remove read-only block for index [{}] after tiering failure: {}. "
-                        + "Block can be removed manually via index settings.",
-                    indexName,
-                    e
-                );
+                logger.warn("Retry {}/3: Failed to remove read-only block for index [{}]: {}", attempt, indexName, e);
+                removeReadOnlyBlockWithRetry(indexName, attempt + 1);
             }
             // ...
         }
     );
 }
Suggestion importance[1-10]: 6

__

Why: Valid concern about orphaned read-only blocks if removal fails. Adding retry logic would improve resilience. However, the suggestion's implementation is simplistic (no exponential backoff, no state tracking). A more robust solution would involve tracking failed removals and a background cleanup task, but the basic retry idea has merit.

Low
Avoid blocking transport thread with validation

The preflight validation runs synchronously on the transport thread before any
cluster state mutation. If validation is expensive (e.g., disk space checks across
many nodes), this could block the transport thread. Consider moving the validation
into the cluster state update task or making it asynchronous to avoid blocking the
transport layer.

server/src/main/java/org/opensearch/storage/action/tiering/TransportHotToWarmTierAction.java [96-100]

-try {
-    Index index = resolveRequestIndex(indexNameExpressionResolver, request.getIndex(), state);
-    hotToWarmTieringService.preflightValidate(state, index);
-} catch (Exception e) {
-    logger.info("Preflight validation failed for DFA index [{}]: {}", request.getIndex(), e.getMessage());
-    listener.onFailure(e);
-    return;
-}
+// Move validation into the cluster state update task to avoid blocking transport thread
+clusterService.submitStateUpdateTask(
+    "preflight-validate-and-add-block [" + request.getIndex() + "]",
+    new ClusterStateUpdateTask(Priority.URGENT) {
+        @Override
+        public ClusterState execute(ClusterState currentState) {
+            try {
+                Index index = resolveRequestIndex(indexNameExpressionResolver, request.getIndex(), currentState);
+                hotToWarmTieringService.preflightValidate(currentState, index);
+            } catch (Exception e) {
+                throw new IllegalStateException("Preflight validation failed: " + e.getMessage(), e);
+            }
+            // Proceed to add read-only block...
+        }
+        // ...
+    }
+);
Suggestion importance[1-10]: 5

__

Why: The suggestion correctly identifies that synchronous validation on the transport thread could cause blocking. However, the current implementation already runs validation before expensive operations (block addition, prepare), which is the intended design. Moving validation into the cluster state task would complicate error handling and lose the early-rejection benefit. The concern is valid but the proposed solution may not be optimal.

Low
Use async callback for permit acquisition

Using actionGet() blocks the thread until permits are acquired or timeout occurs. If
multiple shards are processed sequentially on the same node, this could cause
cumulative blocking delays. Consider using the async callback pattern to avoid
blocking the thread pool, especially if the action processes many shards
concurrently.

server/src/main/java/org/opensearch/storage/action/tiering/TransportPrepareTieringAction.java [136-140]

-PlainActionFuture<Releasable> permitFuture = new PlainActionFuture<>();
-indexShard.acquireAllPrimaryOperationsPermits(permitFuture, PERMITS_ACQUIRE_TIMEOUT);
-return permitFuture.actionGet();
+// Use async callback to avoid blocking thread pool
+indexShard.acquireAllPrimaryOperationsPermits(new ActionListener<Releasable>() {
+    @Override
+    public void onResponse(Releasable permit) {
+        try {
+            syncAndFlush(indexShard, shardRouting);
+            waitForRemoteSync(indexShard, shardRouting);
+            verifyNoUncommittedOps(indexShard, shardRouting);
+        } catch (IOException e) {
+            // handle error
+        } finally {
+            permit.close();
+        }
+    }
+    @Override
+    public void onFailure(Exception e) {
+        throw new IOException("Failed to acquire permits", e);
+    }
+}, PERMITS_ACQUIRE_TIMEOUT);
Suggestion importance[1-10]: 4

__

Why: The suggestion to use async callbacks is theoretically better, but the current implementation using actionGet() is consistent with similar operations in the codebase (e.g., flush, refresh). The blocking is intentional to ensure sequential execution of sync→flush→refresh→verify. The concern about cumulative delays is valid but the impact is limited since this is a rare administrative operation.

Low

@vishwasgarg18 vishwasgarg18 force-pushed the feature/data-format-aware-read-only-engine branch from 7bc2965 to 890d029 Compare May 18, 2026 19:25
Comment thread server/src/main/java/org/opensearch/index/shard/IndexShard.java Outdated
@vishwasgarg18 vishwasgarg18 force-pushed the feature/data-format-aware-read-only-engine branch from a526ff4 to a251956 Compare May 21, 2026 14:46
@github-actions
Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit a251956

@github-actions
Copy link
Copy Markdown
Contributor

❌ Gradle check result for a251956: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@vishwasgarg18 vishwasgarg18 force-pushed the feature/data-format-aware-read-only-engine branch from a251956 to ff0459c Compare May 21, 2026 16:35
@github-actions
Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit ff0459c

@github-actions
Copy link
Copy Markdown
Contributor

❌ Gradle check result for ff0459c: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@vishwasgarg18 vishwasgarg18 force-pushed the feature/data-format-aware-read-only-engine branch from ff0459c to 4b088ef Compare May 21, 2026 17:00
@github-actions
Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit 4b088ef

@github-actions
Copy link
Copy Markdown
Contributor

❌ Gradle check result for 4b088ef: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Copy link
Copy Markdown
Contributor

❌ Gradle check result for 4b088ef: null

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

nishchay21 and others added 12 commits May 22, 2026 14:34
Introduces a purpose-built read-only engine for primaries on indices
marked with isWarmIndex(). The engine operates on CatalogSnapshot,
has no IndexWriter, uses NoOpTranslogManager, and rejects all write
operations. Reads route through TieredSubdirectoryAwareDirectory to
remote storage.

Key changes:
- DataFormatAwareReadOnlyEngine implementing Indexer interface
- DataFormatAwareIndexerFactory updated to create read-only engine
  for warm primaries
- Unit tests for the read-only engine
- Integration test skeleton for warm engine scenarios
- run.gradle updated for multi-node warm testing
- Test fixes for cherry-picked NRT replication changes

Signed-off-by: nishchay21 <nishcha@amazon.com>
…auto_expand_replicas, IndexShard DFA sync fix

- TransportPrepareTieringAction: broadcast action for flush+sync on primaries
- TransportHotToWarmTierAction: read-only block + prepare + retry + tier flow for DFA
- TieringService: auto_expand_replicas during tiering
- HotToWarmTieringService: read_only_allow_delete on cancel
- WarmToHotTieringService: remove read-only block on warm-to-hot
- IndexShard: fix isRemoteSegmentStoreInSync for DFA using CatalogSnapshot
- NRT engine: null-check on commitResult (from mgodwan fix)
- Unit tests for all changes

Signed-off-by: nishchay21 <nishchay21@users.noreply.github.com>
Signed-off-by: nishchay21 <nishcha@amazon.com>
Signed-off-by: nishchay21 <nishchay21@users.noreply.github.com>
Signed-off-by: nishchay21 <nishcha@amazon.com>
Signed-off-by: nishchay21 <nishcha@amazon.com>
Return NotFound for directory HEAD requests using registry prefix
matching instead of hitting remote store. This matches LocalFileSystem
behavior and allows DataFusion to fall through to list_with_cache.

- try_head_from_registry: prefix match detects directories, returns NotFound
- Removed fragile heuristic that used path_str.contains dot
- Tests updated to expect NotFound for directory HEAD
- list test registers local file in registry (list returns registry only)

Signed-off-by: nishchay21 <nishcha@amazon.com>
Signed-off-by: nishchay21 <nishcha@amazon.com>
Signed-off-by: Vishwas Garg <vishwasgarg14@gmail.com>
Signed-off-by: Vishwas Garg <vishwasgarg14@gmail.com>
…pload to remote

Signed-off-by: Vishwas Garg <vishwasgarg14@gmail.com>
Signed-off-by: Vishwas Garg <vishwasgarg14@gmail.com>
Signed-off-by: Vishwas Garg <vishwasgarg14@gmail.com>
Signed-off-by: Vishwas Garg <vishwasgarg14@gmail.com>
@vishwasgarg18 vishwasgarg18 force-pushed the feature/data-format-aware-read-only-engine branch from 4b088ef to 5165667 Compare May 22, 2026 09:05
@github-actions
Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit 5165667

@github-actions
Copy link
Copy Markdown
Contributor

✅ Gradle check result for 5165667: SUCCESS

@codecov
Copy link
Copy Markdown

codecov Bot commented May 22, 2026

Codecov Report

❌ Patch coverage is 14.40000% with 107 lines in your changes missing coverage. Please review.
✅ Project coverage is 73.49%. Comparing base (afa213c) to head (5165667).
⚠️ Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
...e/action/tiering/TransportHotToWarmTierAction.java 3.84% 100 Missing ⚠️
...in/java/org/opensearch/index/shard/IndexShard.java 44.44% 4 Missing and 1 partial ⚠️
...org/opensearch/storage/tiering/TieringService.java 60.00% 2 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main   #21720      +/-   ##
============================================
+ Coverage     73.46%   73.49%   +0.03%     
- Complexity    75254    75299      +45     
============================================
  Files          6023     6024       +1     
  Lines        341475   341712     +237     
  Branches      49141    49162      +21     
============================================
+ Hits         250855   251143     +288     
+ Misses        70680    70556     -124     
- Partials      19940    20013      +73     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

skip-diff-analyzer Maintainer to skip code-diff-analyzer check, after reviewing issues in AI analysis.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants