Describe the bug
After the cluster nodes are reset, the cluster state becomes abnormal and shards cannot be allocated. Shards stuck in the UNASSIGNED/INITIALIZING state are never recovered; every recovery attempt fails with a ShardNotInPrimaryModeException("Shard is not in primary mode") (see the sketch after the shard listing below).
security-auditlog-2025.08.26 0 p STARTED 30 145.7kb 192.168.6.2 opensearch-node-base2
security-auditlog-2025.08.26 0 r STARTED 30 258.3kb 192.168.6.3 opensearch-node-base1
.plugins-ml-config 0 p STARTED 1 4kb 192.168.6.2 opensearch-node-base2
.plugins-ml-config 0 r STARTED 1 4kb 192.168.6.3 opensearch-node-base1
.plugins-ml-config 0 r UNASSIGNED
.opensearch-observability 0 p STARTED 0 208b 192.168.6.2 opensearch-node-base2
.opensearch-observability 0 r STARTED 0 208b 192.168.6.3 opensearch-node-base1
.opensearch-observability 0 r UNASSIGNED
.ql-datasources 0 p STARTED 0 208b 192.168.6.2 opensearch-node-base2
.ql-datasources 0 r UNASSIGNED
.ql-datasources 0 r UNASSIGNED
.opensearch-sap-log-types-config 0 p STARTED 192.168.6.2 opensearch-node-base2
.opensearch-sap-log-types-config 0 r INITIALIZING 192.168.6.3 opensearch-node-base1
.opensearch-sap-log-types-config 0 r UNASSIGNED
.kibana_92668751_admin_1 0 r STARTED 1 5.3kb 192.168.6.2 opensearch-node-base2
.kibana_92668751_admin_1 0 p STARTED 1 5.3kb 192.168.6.3 opensearch-node-base1
top_queries-2025.08.26-70653 0 p STARTED 27 59.9kb 192.168.6.2 opensearch-node-base2
top_queries-2025.08.26-70653 0 r STARTED 27 59.8kb 192.168.6.3 opensearch-node-base1
.opendistro_security 0 p STARTED 10 83.1kb 192.168.6.2 opensearch-node-base2
.opendistro_security 0 r STARTED 10 55.9kb 192.168.6.3 opensearch-node-base1
.opendistro_security 0 r UNASSIGNED
.kibana_1 0 p STARTED 0 208b 192.168.6.2 opensearch-node-base2
.kibana_1 0 r INITIALIZING 192.168.6.3 opensearch-node-base1
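For context, the error in the bug description comes from a primary-mode check in the shard's replication layer. A minimal, self-contained sketch of that kind of guard (hypothetical names; not the actual OpenSearch implementation) is shown below:

// Minimal sketch (hypothetical names, not OpenSearch source): a shard that has
// not entered primary mode rejects primary-level operations such as peer
// recovery, which is the failure reported in this issue.
class PrimaryModeGuardSketch {
    private final boolean primaryMode;

    PrimaryModeGuardSketch(boolean primaryMode) {
        this.primaryMode = primaryMode;
    }

    void acquirePrimaryOperationPermit() {
        if (!primaryMode) {
            // OpenSearch surfaces this as ShardNotInPrimaryModeException.
            throw new IllegalStateException("Shard is not in primary mode");
        }
        // ... otherwise the operation proceeds.
    }
}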
Related component
Cluster Manager
To Reproduce
Step 1: Create a three-node cluster (Node1, Node2, Node3).
Step 2: Stop Node1 and wait for the cluster state to become green.
Step 3: Stop Node2 and wait for 1 minute.
Step 4: Start Node2 and wait for 1 minute.
Step 5: Start Node1.
The issue reproduces when Node3 ends up as the cluster manager (master) node after Node2 is started in Step 4.
Expected behavior
The cluster state returns to green.
Additional Details
The root cause is that JoinTaskExecutor uses currentState.metadata to overwrite newState.metadata, which makes the Metadata inconsistent with the RoutingTable: the primary-term increment applied by AllocationService.disassociateDeadNodes is discarded, so the promoted shard never observes a higher primary term. The detailed process is shown in the sequence diagram below, followed by a simplified sketch of the promotion check:
@startuml
participant JoinTaskExecutor
participant AllocationService
participant ReplicationTracker
database ClusterState
JoinTaskExecutor <-- ClusterState : currentState
JoinTaskExecutor --> AllocationService: disassociateDeadNodes
AllocationService --> AllocationService : 1. update <font color="red">RoutingTable & IndexMetadata</font>
note left
Primary shard changes and PrimaryTerm += 1.
endnote
JoinTaskExecutor <-- AllocationService : newState
JoinTaskExecutor --> JoinTaskExecutor : <font color="red">2. newState.metadata = updateMetadataWithRepositoriesMetadata(currentState)</font>
note left
PrimaryTerm is reset to the value from currentState.
endnote
JoinTaskExecutor --> ClusterState : newState
ReplicationTracker <-- ClusterState : newState
ReplicationTracker --> ReplicationTracker : <font color="red">3. newPrimaryTerm is equal to pendingPrimaryTerm</font>
note left
The promoted replica shard is not activated via activatePrimaryMode since the PrimaryTerm is the same as the local one.
endnote
@enduml
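Step 3 in the diagram corresponds to the promotion path in the shard layer. A simplified, self-contained sketch (illustrative names only, not the real IndexShard/ReplicationTracker code) of why an unchanged primary term blocks the promotion:

// Simplified illustration (not OpenSearch source): primary promotion is keyed
// off a primary-term increase, so a cluster state whose metadata was reset to
// the old term never triggers primary-mode activation.
class PrimaryPromotionSketch {
    private long pendingPrimaryTerm = 1;   // term the shard already knows locally
    private boolean primaryMode = false;   // what the replication tracker tracks

    // Called when a new cluster state promotes this shard to primary.
    void applyPromotion(long newPrimaryTerm) {
        if (newPrimaryTerm > pendingPrimaryTerm) {
            pendingPrimaryTerm = newPrimaryTerm;
            primaryMode = true;            // stands in for activatePrimaryMode()
        }
        // Buggy case from the diagram: the metadata overwrite resets the term,
        // so newPrimaryTerm == pendingPrimaryTerm and the shard never enters
        // primary mode; later operations then hit ShardNotInPrimaryModeException.
    }

    boolean isPrimaryMode() {
        return primaryMode;
    }
}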
Problem code:
return results.build(
allocationService.adaptAutoExpandReplicas(
newState.nodes(nodesBuilder)
.metadata(updateMetadataWithRepositoriesMetadata(currentState.metadata(), repositoriesMetadata)) // this line
.build()
)
);
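One possible direction, shown only as a rough sketch and not a verified patch, is to layer the repositories metadata on top of the metadata already produced by disassociateDeadNodes (i.e. the metadata carried by newState), so the incremented primary terms are not discarded:

// Rough sketch of a possible fix direction (not a verified patch): build the
// intermediate state first, then apply the repositories metadata on top of the
// metadata that already contains the primary-term bump.
ClusterState updatedState = newState.nodes(nodesBuilder).build();
return results.build(
    allocationService.adaptAutoExpandReplicas(
        ClusterState.builder(updatedState)
            .metadata(updateMetadataWithRepositoriesMetadata(updatedState.metadata(), repositoriesMetadata))
            .build()
    )
);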