[BUG] After the nodes are reset, the cluster state becomes abnormal, and shards cannot be allocated. #19150

Description

@for-hck

Describe the bug

After the cluster nodes are reset, the cluster state becomes abnormal and shards cannot be allocated. Shards in the UNASSIGNED/INITIALIZING state are never recovered; their recovery fails with ShardNotInPrimaryModeException("Shard is not in primary mode"). The _cat/shards output below shows the stuck shards:

security-auditlog-2025.08.26     0 p STARTED      30 145.7kb 192.168.6.2 opensearch-node-base2
security-auditlog-2025.08.26     0 r STARTED      30 258.3kb 192.168.6.3 opensearch-node-base1
.plugins-ml-config               0 p STARTED       1     4kb 192.168.6.2 opensearch-node-base2
.plugins-ml-config               0 r STARTED       1     4kb 192.168.6.3 opensearch-node-base1
.plugins-ml-config               0 r UNASSIGNED
.opensearch-observability        0 p STARTED       0    208b 192.168.6.2 opensearch-node-base2
.opensearch-observability        0 r STARTED       0    208b 192.168.6.3 opensearch-node-base1
.opensearch-observability        0 r UNASSIGNED
.ql-datasources                  0 p STARTED       0    208b 192.168.6.2 opensearch-node-base2
.ql-datasources                  0 r UNASSIGNED
.ql-datasources                  0 r UNASSIGNED
.opensearch-sap-log-types-config 0 p STARTED                 192.168.6.2 opensearch-node-base2
.opensearch-sap-log-types-config 0 r INITIALIZING            192.168.6.3 opensearch-node-base1
.opensearch-sap-log-types-config 0 r UNASSIGNED
.kibana_92668751_admin_1         0 r STARTED       1   5.3kb 192.168.6.2 opensearch-node-base2
.kibana_92668751_admin_1         0 p STARTED       1   5.3kb 192.168.6.3 opensearch-node-base1
top_queries-2025.08.26-70653     0 p STARTED      27  59.9kb 192.168.6.2 opensearch-node-base2
top_queries-2025.08.26-70653     0 r STARTED      27  59.8kb 192.168.6.3 opensearch-node-base1
.opendistro_security             0 p STARTED      10  83.1kb 192.168.6.2 opensearch-node-base2
.opendistro_security             0 r STARTED      10  55.9kb 192.168.6.3 opensearch-node-base1
.opendistro_security             0 r UNASSIGNED
.kibana_1                        0 p STARTED       0    208b 192.168.6.2 opensearch-node-base2
.kibana_1                        0 r INITIALIZING            192.168.6.3 opensearch-node-base1

Related component

Cluster Manager

To Reproduce

Step 1: Create a three-node cluster (Node1, Node2, Node3).
Step 2: Stop Node1 and wait for the cluster state to become green.
Step 3: Stop Node2 and wait for 1 minute.
Step 4: Start Node2 and wait for 1 minute.
Step 5: Start Node1.

The issue reproduces when Node3 ends up as the cluster manager (master) node after Node2 is started in Step 4.

Expected behavior

The cluster health returns to green and all shards are assigned.

Additional Details

The root cause is that JoinTaskExecutor overwrites newState.metadata with currentState.metadata, which makes the Metadata inconsistent with the RoutingTable. The detailed sequence is as follows:

@startuml
participant JoinTaskExecutor
participant AllocationService
participant ReplicationTracker
database ClusterState

JoinTaskExecutor <-- ClusterState : currentState
JoinTaskExecutor --> AllocationService: disassociateDeadNodes 
AllocationService --> AllocationService : 1. update <font color="red">RoutingTable & IndexMetadata</font>
note left
     Primary shard changes and PrimaryTerm += 1.
end note
JoinTaskExecutor <-- AllocationService : newState
JoinTaskExecutor --> JoinTaskExecutor : <font color="red">2. newState.metadata = updateMetadataWithRepositoriesMetadata(currentState)</font>
note left
     PrimaryTerm is reset to the value from currentState.
end note
JoinTaskExecutor --> ClusterState : newState

ReplicationTracker <-- ClusterState : newState
ReplicationTracker --> ReplicationTracker : <font color="red">3. newPrimaryTerm is equal to pendingPrimaryTerm</font>
note left
    The promoted replica never enters primary mode (activatePrimaryMode is not called) because the published PrimaryTerm equals its local pendingPrimaryTerm.
end note
@enduml
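
To make the ordering concrete, below is a minimal, self-contained sketch. The classes (PrimaryTermLossDemo, IndexMeta, State) are illustrative stand-ins invented for this issue, not the real OpenSearch ClusterState/IndexMetadata/ReplicationTracker types; it only demonstrates how rebuilding the metadata from currentState drops the primary-term bump, so the later term comparison sees no change.

    import java.util.HashMap;
    import java.util.Map;

    // Minimal stand-ins; not the real OpenSearch types.
    final class PrimaryTermLossDemo {

        record IndexMeta(long primaryTerm) {}                             // stand-in for IndexMetadata
        record State(Map<String, IndexMeta> metadata, String routing) {}  // stand-in for ClusterState

        public static void main(String[] args) {
            // currentState before the join task runs: primary term 3, primary on node2.
            State currentState = new State(Map.of("idx", new IndexMeta(3)), "primary on node2");

            // 1. AllocationService.disassociateDeadNodes: the primary fails over,
            //    so the routing table AND the index metadata change (term 3 -> 4).
            Map<String, IndexMeta> bumped = new HashMap<>();
            bumped.put("idx", new IndexMeta(currentState.metadata().get("idx").primaryTerm() + 1));
            State newState = new State(bumped, "primary failed over to node3");

            // 2. JoinTaskExecutor: the metadata is rebuilt from currentState.metadata,
            //    silently reverting the primary-term bump while keeping the new routing.
            State published = new State(currentState.metadata(), newState.routing());

            // 3. ReplicationTracker: the promoted shard compares the published term with
            //    its local pendingPrimaryTerm, sees no increase, and never activates
            //    primary mode -> ShardNotInPrimaryModeException for recoveries.
            long pendingPrimaryTerm = 3;
            long publishedPrimaryTerm = published.metadata().get("idx").primaryTerm();
            System.out.println("routing: " + published.routing());
            System.out.println("published primary term: " + publishedPrimaryTerm
                + " (expected 4), activates primary mode: " + (publishedPrimaryTerm > pendingPrimaryTerm));
        }
    }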

Problem code (JoinTaskExecutor):

    return results.build(
        allocationService.adaptAutoExpandReplicas(
            newState.nodes(nodesBuilder)
                // Overwrites the metadata (including the bumped primary terms) with currentState's metadata.
                .metadata(updateMetadataWithRepositoriesMetadata(currentState.metadata(), repositoriesMetadata))
                .build()
        )
    );
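
For discussion, a hedged sketch of one possible direction (not a tested patch; stateAfterDisassociation is a hypothetical local holding the ClusterState returned by allocationService.disassociateDeadNodes): merge the repositories metadata into the post-allocation metadata rather than into currentState.metadata(), so the primary-term increment is preserved.

    // Hypothetical sketch only: stateAfterDisassociation is assumed to hold the
    // ClusterState produced by allocationService.disassociateDeadNodes(...), whose
    // IndexMetadata already carries the incremented primary term.
    return results.build(
        allocationService.adaptAutoExpandReplicas(
            newState.nodes(nodesBuilder)
                .metadata(updateMetadataWithRepositoriesMetadata(stateAfterDisassociation.metadata(), repositoriesMetadata))
                .build()
        )
    );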
