[BUG] Evaluate listAll() API implementation for CompositeDirectory. #17527

skumawat2025 opened this issue Mar 6, 2025 · 0 comments
Labels
bug Something isn't working Storage:Durability Issues and PRs related to the durability framework

Describe the bug

Part of this meta issue: #13149

Description:
The current implementation of the listAll() API in CompositeDirectory needs evaluation. CompositeDirectory is a hybrid directory that combines a local directory (localDirectory) with a RemoteSegmentDirectory. Callers rely on listAll() both for file cleanup and for locating the latest commit (SegmentInfos).

Issue:
When listAll() returns both local and remote files, some tests become flaky. One affected test is WarmIndexSegmentReplicationIT.testReplicationPostDeleteAndForceMerge().
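For reference, here is a minimal sketch of a merged listing that is consistent with the log output in this report: local entries carrying a "_block_" suffix are dropped, the remaining local names are unioned with the remote names, and the result is sorted. This is a reconstruction from the logged input and output only, not the actual CompositeDirectory code. The class name, the method name mergedListing, and the BLOCK_MARKER constant are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration; not the real CompositeDirectory implementation.
final class CompositeListingSketch {
    // Assumption: cached block files are named "<file>_block_<n>", as seen in the log.
    static final String BLOCK_MARKER = "_block_";

    // Returns the sorted union of non-block local files and all remote files.
    static String[] mergedListing(List<String> localFiles, List<String> remoteFiles) {
        Set<String> merged = new TreeSet<>(); // sorted, no duplicates
        for (String name : localFiles) {
            if (!name.contains(BLOCK_MARKER)) {
                merged.add(name);
            }
        }
        merged.addAll(remoteFiles);
        return merged.toArray(new String[0]);
    }

    public static void main(String[] args) {
        // A small subset of the file names from the log.
        List<String> local = List.of(
            "_7.si_block_0", "_8.cfe", "_8.cfs", "_8.si",
            "segments_3_block_0", "segments_5", "segments_6", "write.lock");
        List<String> remote = List.of("_7.si", "segments_3");
        // prints [_7.si, _8.cfe, _8.cfs, _8.si, segments_3, segments_5, segments_6, write.lock]
        System.out.println(Arrays.toString(mergedListing(local, remote)));
    }
}
```

With this rule, a listing can contain commit files that exist only locally (segments_5, segments_6) next to one that exists only remotely (segments_3), which is the situation captured in the log.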

Stack trace:

Feb 25, 2025 2:13:43 PM com.carrotsearch.randomizedtesting.RandomizedRunner$QueueUncaughtExceptionsHandler uncaughtException
WARNING: Uncaught exception in thread: Thread[#75,opensearch[node_t2][generic][T#3],5,TGRP-WarmIndexRemoteStoreSegmentReplicationIT]
java.lang.AssertionError: new global checkpoint [-1] is lower than previous one [8]
at __randomizedtesting.SeedInfo.seed([EA432349BB4BCDD6]:0)
at org.opensearch.index.seqno.ReplicationTracker.updateGlobalCheckpointOnPrimary(ReplicationTracker.java:1752)
at org.opensearch.index.seqno.ReplicationTracker.activatePrimaryMode(ReplicationTracker.java:1389)
at org.opensearch.index.shard.IndexShard.lambda$updateShardState$5(IndexShard.java:784)
at org.opensearch.index.shard.IndexShard$5.onResponse(IndexShard.java:4276)
at org.opensearch.index.shard.IndexShard$5.onResponse(IndexShard.java:4246)
at org.opensearch.index.shard.IndexShard.lambda$asyncBlockOperations$37(IndexShard.java:4197)
at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:82)
at org.opensearch.index.shard.IndexShardOperationPermits$1.doRun(IndexShardOperationPermits.java:157)
at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:994)
at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
at java.base/java.lang.Thread.run(Thread.java:1583)
[2025-02-25T14:13:43,384][INFO ][o.o.i.s.CompositeDirectory] [node_t2] listAll() call stack (last 10 methods): [org.apache.lucene.store.FilterDirectory.listAll, org.apache.lucene.store.FilterDirectory.listAll, org.apache.lucene.index.SegmentInfos.getLastCommitGeneration, org.apache.lucene.index.SegmentInfos.getLastCommitSegmentsFileName, org.opensearch.index.shard.RemoteStoreRefreshListener.isRefreshAfterCommit, org.opensearch.index.shard.RemoteStoreRefreshListener.syncSegments, org.opensearch.index.shard.RemoteStoreRefreshListener.performAfterRefreshWithPermit, org.opensearch.index.shard.ReleasableRetryableRefreshListener.runAfterRefreshWithPermit, org.opensearch.index.shard.ReleasableRetryableRefreshListener.afterRefresh, org.apache.lucene.search.ReferenceManager.notifyRefreshListenersRefreshed]
[2025-02-25T14:13:43,385][INFO ][o.o.i.s.CompositeDirectory] [node_t2] listAll Composite Directory[Composite Directory @ c9e9359]: Local Directory files - [_0.cfe_block_0, _0.cfs_block_0, _0.si_block_0, _0_1.fnm_block_0, _0_1_Lucene90_0.dvd_block_0, _0_1_Lucene90_0.dvm_block_0, _1.cfe_block_0, _1.cfs_block_0, _1.si_block_0, _2.cfe_block_0, _2.cfs_block_0, _2.si_block_0, _3.cfe_block_0, _3.cfs_block_0, _3.si_block_0, _4.fdm_block_0, _4.fdt_block_0, _4.fdx_block_0, _4.fnm_block_0, _4.kdd_block_0, _4.kdi_block_0, _4.kdm_block_0, _4.nvd_block_0, _4.nvm_block_0, _4.si_block_0, _4_Lucene101_0.doc_block_0, _4_Lucene101_0.pos_block_0, _4_Lucene101_0.psm_block_0, _4_Lucene101_0.tim_block_0, _4_Lucene101_0.tip_block_0, _4_Lucene101_0.tmd_block_0, _4_Lucene90_0.dvd_block_0, _4_Lucene90_0.dvm_block_0, _5.cfe_block_0, _5.cfs_block_0, _5.si_block_0, _6.cfe_block_0, _6.cfs_block_0, _6.si_block_0, _7.cfe_block_0, _7.cfs_block_0, _7.si_block_0, _8.cfe, _8.cfs, _8.si, segments_3_block_0, segments_5, segments_6, write.lock]
[2025-02-25T14:13:43,385][INFO ][o.o.i.s.CompositeDirectory] [node_t2] Composite Directory[Composite Directory @ c9e9359]: Remote Directory files - [_6.cfe, _4_Lucene101_0.tim, _7.cfs, _4_Lucene101_0.tip, _4_Lucene90_0.dvd, _4.fdx, _6.cfs, _7.cfe, _7.si, _5.si, _4_Lucene101_0.doc, _4.nvm, _4.fnm, _5.cfs, _4.nvd, segments_3, _4_Lucene90_0.dvm, _4.fdt, _4_Lucene101_0.psm, _4_Lucene101_0.tmd, _4.kdm, _5.cfe, _4.kdi, _4.fdm, _4.si, _4_Lucene101_0.pos, _4.kdd, _6.si]
[2025-02-25T14:13:43,385][INFO ][o.o.i.s.CompositeDirectory] [node_t2] Composite Directory[Composite Directory @ c9e9359]: listAll() returns : [_4.fdm, _4.fdt, _4.fdx, _4.fnm, _4.kdd, _4.kdi, _4.kdm, _4.nvd, _4.nvm, _4.si, _4_Lucene101_0.doc, _4_Lucene101_0.pos, _4_Lucene101_0.psm, _4_Lucene101_0.tim, _4_Lucene101_0.tip, _4_Lucene101_0.tmd, _4_Lucene90_0.dvd, _4_Lucene90_0.dvm, _5.cfe, _5.cfs, _5.si, _6.cfe, _6.cfs, _6.si, _7.cfe, _7.cfs, _7.si, _8.cfe, _8.cfs, _8.si, segments_3, segments_5, segments_6, write.lock]
[2025-02-25T14:13:43,390][TRACE][o.o.i.r.c.PublishCheckpointAction] [node_t2] [shardId 0] Publishing replication checkpoint [ReplicationCheckpoint{shardId=[test-idx-1][0], primaryTerm=2, segmentsGen=6, version=31, size=3893, codec=Lucene101}]
[2025-02-25T14:13:43,390][TRACE][o.o.i.r.c.PublishCheckpointAction] [node_t2] [[test-idx-1][0]] op [indices:admin/publishCheckpoint] completed on primary for request [PublishCheckpointRequest{checkpoint=ReplicationCheckpoint{shardId=[test-idx-1][0], primaryTerm=2, segmentsGen=6, version=31, size=3893, codec=Lucene101}}]
[2025-02-25T14:13:43,390][DEBUG][o.o.i.r.c.PublishCheckpointAction] [node_t2] [shardId 0] Completed publishing checkpoint [ReplicationCheckpoint{shardId=[test-idx-1][0], primaryTerm=2, segmentsGen=6, version=31, size=3893, codec=Lucene101}], timing: 0
[2025-02-25T14:13:43,390][INFO ][o.o.i.r.WarmIndexRemoteStoreSegmentReplicationIT] [testReplicationPostDeleteAndForceMerge] Sandeep - 10
Feb 25, 2025 2:13:43 PM com.carrotsearch.randomizedtesting.RandomizedRunner$QueueUncaughtExceptionsHandler uncaughtException
WARNING: Uncaught exception in thread: Thread[#88,opensearch[node_t2][refresh][T#1],5,TGRP-WarmIndexRemoteStoreSegmentReplicationIT]
java.lang.AssertionError: global checkpoint is not up-to-date, expected: -1 but was: 8
at __randomizedtesting.SeedInfo.seed([EA432349BB4BCDD6]:0)
at org.opensearch.index.seqno.ReplicationTracker.invariant(ReplicationTracker.java:920)
at org.opensearch.index.seqno.ReplicationTracker.updateLocalCheckpoint(ReplicationTracker.java:1695)
at org.opensearch.index.shard.IndexShard.updateLocalCheckpointForShard(IndexShard.java:3197)
at org.opensearch.action.support.replication.TransportReplicationAction$PrimaryShardReference.updateLocalCheckpointForShard(TransportReplicationAction.java:1338)
at org.opensearch.action.support.replication.ReplicationOperation.updateCheckPoints(ReplicationOperation.java:341)
at org.opensearch.action.support.replication.ReplicationOperation$1.onResponse(ReplicationOperation.java:184)
at org.opensearch.action.support.replication.ReplicationOperation$1.onResponse(ReplicationOperation.java:178)
at org.opensearch.action.support.replication.TransportReplicationAction$PrimaryResult.runPostReplicationActions(TransportReplicationAction.java:728)
at org.opensearch.action.support.replication.ReplicationOperation.handlePrimaryResult(ReplicationOperation.java:178)
at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:82)
at org.opensearch.core.action.ActionListener$4.onResponse(ActionListener.java:182)
at org.opensearch.core.action.ActionListener.completeWith(ActionListener.java:355)
at org.opensearch.action.admin.indices.refresh.TransportShardRefreshAction.shardOperationOnPrimary(TransportShardRefreshAction.java:100)
at org.opensearch.action.admin.indices.refresh.TransportShardRefreshAction.shardOperationOnPrimary(TransportShardRefreshAction.java:57)
at org.opensearch.action.support.replication.TransportReplicationAction$PrimaryShardReference.perform(TransportReplicationAction.java:1333)
at org.opensearch.action.support.replication.ReplicationOperation.execute(ReplicationOperation.java:150)
at org.opensearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.runWithPrimaryShardReference(TransportReplicationAction.java:654)
at org.opensearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.lambda$doRun$0(TransportReplicationAction.java:547)
at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:82)
at org.opensearch.index.shard.IndexShard.lambda$wrapPrimaryOperationPermitListener$36(IndexShard.java:4185)
at org.opensearch.core.action.ActionListener$3.onResponse(ActionListener.java:132)
at org.opensearch.action.support.ContextPreservingActionListener.onResponse(ContextPreservingActionListener.java:58)
at org.opensearch.index.shard.IndexShardOperationPermits$2.doRun(IndexShardOperationPermits.java:286)
at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:994)
at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
at java.base/java.lang.Thread.run(Thread.java:1583)
Feb 25, 2025 2:13:43 PM com.carrotsearch.randomizedtesting.RandomizedRunner$QueueUncaughtExceptionsHandler uncaughtException
WARNING: Uncaught exception in thread: Thread[#103,opensearch[node_t2][refresh][T#2],5,TGRP-WarmIndexRemoteStoreSegmentReplicationIT]
java.lang.AssertionError: global checkpoint is not up-to-date, expected: -1 but was: 8
at __randomizedtesting.SeedInfo.seed([EA432349BB4BCDD6]:0)
at org.opensearch.index.seqno.ReplicationTracker.invariant(ReplicationTracker.java:920)
at org.opensearch.index.seqno.ReplicationTracker.updateLocalCheckpoint(ReplicationTracker.java:1695)
at org.opensearch.index.shard.IndexShard.updateLocalCheckpointForShard(IndexShard.java:3197)
at org.opensearch.action.support.replication.TransportReplicationAction$PrimaryShardReference.updateLocalCheckpointForShard(TransportReplicationAction.java:1338)
at org.opensearch.action.support.replication.ReplicationOperation.updateCheckPoints(ReplicationOperation.java:341)
at org.opensearch.action.support.replication.ReplicationOperation$1.onResponse(ReplicationOperation.java:184)
at org.opensearch.action.support.replication.ReplicationOperation$1.onResponse(ReplicationOperation.java:178)
at org.opensearch.action.support.replication.TransportReplicationAction$PrimaryResult.runPostReplicationActions(TransportReplicationAction.java:728)
at org.opensearch.action.support.replication.ReplicationOperation.handlePrimaryResult(ReplicationOperation.java:178)
at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:82)
at org.opensearch.core.action.ActionListener$4.onResponse(ActionListener.java:182)
at org.opensearch.core.action.ActionListener.completeWith(ActionListener.java:355)
at org.opensearch.indices.replication.checkpoint.PublishCheckpointAction.shardOperationOnPrimary(PublishCheckpointAction.java:194)
at org.opensearch.indices.replication.checkpoint.PublishCheckpointAction.shardOperationOnPrimary(PublishCheckpointAction.java:52)
at org.opensearch.action.support.replication.TransportReplicationAction$PrimaryShardReference.perform(TransportReplicationAction.java:1333)
at org.opensearch.action.support.replication.ReplicationOperation.execute(ReplicationOperation.java:150)
at org.opensearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.runWithPrimaryShardReference(TransportReplicationAction.java:654)
at org.opensearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.lambda$doRun$0(TransportReplicationAction.java:547)
at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:82)
at org.opensearch.index.shard.IndexShard.lambda$wrapPrimaryOperationPermitListener$36(IndexShard.java:4185)
at org.opensearch.core.action.ActionListener$3.onResponse(ActionListener.java:132)
at org.opensearch.index.shard.IndexShardOperationPermits.acquire(IndexShardOperationPermits.java:310)
at org.opensearch.index.shard.IndexShardOperationPermits.acquire(IndexShardOperationPermits.java:255)
at org.opensearch.index.shard.IndexShard.acquirePrimaryOperationPermit(IndexShard.java:4156)
at org.opensearch.action.support.replication.TransportReplicationAction.acquirePrimaryOperationPermit(TransportReplicationAction.java:1262)
at org.opensearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.doRun(TransportReplicationAction.java:544)
at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52)
at org.opensearch.action.support.replication.TransportReplicationAction.handlePrimaryRequest(TransportReplicationAction.java:483)
at org.opensearch.wlm.WorkloadManagementTransportInterceptor$RequestHandler.messageReceived(WorkloadManagementTransportInterceptor.java:63)
at org.opensearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:108)
at org.opensearch.transport.TransportService$7.doRun(TransportService.java:1048)
at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:994)
at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
at java.base/java.lang.Thread.run(Thread.java:1583)

Additional context:
This issue may have broader implications for warm index functionality and remote store operations, so listAll() needs to return accurate and consistent results across all scenarios.
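To illustrate why the contents of the listing matter, the sketch below derives the last commit generation from a file listing, the way a caller such as SegmentInfos.getLastCommitGeneration (visible in the logged call stack) depends on listAll(). It is a standalone approximation rather than the Lucene code: the class and method names are hypothetical, and it assumes commit files are named "segments_<generation>" with the generation encoded in base 36.

```java
// Hypothetical illustration; a standalone approximation, not the Lucene implementation.
final class LastCommitGenerationSketch {
    static final String SEGMENTS = "segments";

    // Returns the highest commit generation found in the listing, or -1 if there is none.
    // Assumes block files have already been filtered out of the listing.
    static long lastCommitGeneration(String[] files) {
        long max = -1;
        for (String file : files) {
            if (file.startsWith(SEGMENTS + "_")) {
                // Assumption: the suffix after "segments_" is the generation in base 36.
                long gen = Long.parseLong(file.substring(SEGMENTS.length() + 1), Character.MAX_RADIX);
                max = Math.max(max, gen);
            }
        }
        return max;
    }

    public static void main(String[] args) {
        String[] remoteOnly = {"_7.si", "segments_3"};
        String[] merged = {"_7.si", "_8.si", "segments_3", "segments_5", "segments_6", "write.lock"};
        System.out.println(lastCommitGeneration(remoteOnly)); // prints 3
        System.out.println(lastCommitGeneration(merged));     // prints 6
    }
}
```

Using the commit file names from the log, the remote listing alone yields generation 3 while the merged listing yields generation 6, so the answer a caller gets depends directly on which files listAll() chooses to report.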


Expected behavior

All tests should pass with a correct implementation of listAll().


@skumawat2025 skumawat2025 added bug Something isn't working untriaged labels Mar 6, 2025
@gbbafna gbbafna added Storage:Durability Issues and PRs related to the durability framework and removed untriaged labels Mar 6, 2025