Skip to content

Fix auto_expand_replicas creating more replicas than awareness allows#21752

Open
cuonghm2809 wants to merge 1 commit into
opensearch-project:mainfrom
cuonghm2809:fix/auto-expand-replicas-awareness-2984
Open

Fix auto_expand_replicas creating more replicas than awareness allows#21752
cuonghm2809 wants to merge 1 commit into
opensearch-project:mainfrom
cuonghm2809:fix/auto-expand-replicas-awareness-2984

Conversation

@cuonghm2809
Copy link
Copy Markdown
Contributor

Summary

Fixes #2984

When auto_expand_replicas is enabled on indices (e.g., system indices with 1-20) and zone/rack awareness is configured, the replica count expands beyond what awareness constraints allow. This causes replicas to remain permanently unassigned, resulting in a yellow cluster that cannot self-heal.

Root Cause

AutoExpandReplicas.getDesiredNumberOfReplicas() calls shouldAutoExpandToNode() on each decider to count eligible nodes. However, AwarenessAllocationDecider did not override this method — it returned the default Decision.ALWAYS for every node. This meant awareness constraints were completely ignored when computing the desired replica count, even though those same constraints later blocked the allocation in canAllocate().

Fix

Override shouldAutoExpandToNode() in AwarenessAllocationDecider to limit the number of nodes counted per awareness attribute value (zone/rack) based on:

  1. Awareness capacity — computes the maximum achievable shard copies across all attribute values using computeMaxAchievableCopies(), which finds the largest S where sum(min(nodesPerValue_i, ceil(S/K))) >= S
  2. Auto-expand max setting — uses the auto_expand_replicas max as an upper bound so the computed node count is never reduced by the max setting to a value that awareness can't handle
  3. Existing behavior preservedauto_expand_replicas: 0-all continues to bypass awareness (tested and intentional), and indices without auto_expand_replicas are not affected at all

Examples

Scenario Before After
3 zones × 3 nodes, 0-20 YELLOW (awareness blocks) GREEN (8 replicas)
7 rack_ids, 9 pods, 1-20 (issue #2984) YELLOW GREEN (8 replicas)
3 forced zones (1 empty), 6 nodes, 0-20 YELLOW GREEN (3 replicas)
3 racks (5,1,1 nodes), 0-5 YELLOW (2 unassigned) GREEN (3 replicas)
7 racks, 40 nodes (uneven), 1-20 YELLOW (1 unassigned) GREEN (19 replicas)
Fixed number_of_replicas: 10 (no auto-expand) No change No change

Test plan

  • testAutoExpandReplicasWithAwarenessEqualZones — 3 equal zones, auto-expand 0-20
  • testAutoExpandReplicasWithAwarenessUnequalZones — 2 unequal zones (2 vs 5 nodes)
  • testAutoExpandReplicasWithForcedAwarenessAndEmptyZone — forced zone with no nodes
  • testAutoExpandReplicasWithManyRackIds — 7 rack_ids across 9 pods (issue [BUG] auto_expand_replicas creating more replicas than shard allocation awareness allows #2984 scenario)
  • testAutoExpandReplicasWithUnevenRacksAndExplicitMax — uneven racks with explicit auto-expand max
  • testAutoExpandReplicasWithAwarenessComputeMaxCopies — unit tests for computeMaxAchievableCopies helper
  • testIgnoredByAutoExpandReplicasToAll — existing test, 0-all still bypasses awareness
  • All existing AwarenessAllocationTests pass (no regressions)
  • AutoExpandReplicasTests pass
  • AllocationDecidersTests pass

Override shouldAutoExpandToNode() in AwarenessAllocationDecider to limit
the number of nodes counted during auto-expand replica calculation based
on awareness constraints. Previously, awareness was not consulted when
computing the desired replica count, causing auto_expand_replicas to
create more replicas than zone/rack awareness could place, resulting in
permanently yellow clusters.

The fix computes the maximum achievable shard copies considering both the
awareness attribute distribution and the auto_expand max setting, then
caps the per-attribute-value node count accordingly.

Resolves opensearch-project#2984

Signed-off-by: Cuong Ha <cuong.ha@optimizely.com>
@cuonghm2809 cuonghm2809 requested a review from a team as a code owner May 20, 2026 02:58
@github-actions github-actions Bot added bug Something isn't working distributed framework labels May 20, 2026
@github-actions
Copy link
Copy Markdown
Contributor

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 No multiple PR themes
⚡ Recommended focus areas for review

Possible Issue

When forcedAwarenessAttributes contains an attribute not in awarenessAttributes, the code iterates over awarenessAttributes but accesses forcedAwarenessAttributes using the current attribute. If the forced attribute differs, forcedValues will be null and the forced values are silently ignored. This can cause incorrect numberOfAttributeValues and maxCopies calculations, leading to over-allocation or under-allocation of replicas.

for (String awarenessAttribute : awarenessAttributes) {
    String nodeAttributeValue = node.getAttributes().get(awarenessAttribute);
    if (nodeAttributeValue == null) {
        return allocation.decision(
            Decision.NO,
            NAME,
            "node does not contain the awareness attribute [%s]; required attributes cluster setting [%s=%s]",
            awarenessAttribute,
            CLUSTER_ROUTING_ALLOCATION_AWARENESS_ATTRIBUTE_SETTING.getKey(),
            allocation.debugDecision() ? Strings.collectionToCommaDelimitedString(awarenessAttributes) : null
        );
    }

    Map<String, Integer> nodeCountPerAttributeValue = new HashMap<>();
    for (DiscoveryNode dataNode : allocation.nodes().getDataNodes().values()) {
        if (dataNode.isSearchNode()) {
            continue;
        }
        String attrValue = dataNode.getAttributes().get(awarenessAttribute);
        if (attrValue != null) {
            nodeCountPerAttributeValue.merge(attrValue, 1, Integer::sum);
        }
    }

    Set<String> allAttributeValues = new HashSet<>(nodeCountPerAttributeValue.keySet());
    List<String> forcedValues = forcedAwarenessAttributes.get(awarenessAttribute);
    if (forcedValues != null) {
        allAttributeValues.addAll(forcedValues);
    }
    int numberOfAttributeValues = allAttributeValues.size();

    if (numberOfAttributeValues <= 1) {
        continue;
    }
Possible Issue

The position calculation uses compareTo on node IDs to determine ordering. If node IDs are not lexicographically ordered in the same way as insertion order or cluster state order, nodes may be incorrectly excluded. This can cause replicas to be rejected even when capacity exists, or accepted when they should be rejected, leading to unbalanced allocation or unassigned shards.

long position = 0;
for (DiscoveryNode dataNode : allocation.nodes().getDataNodes().values()) {
    if (dataNode.isSearchNode()) {
        continue;
    }
    String attrValue = dataNode.getAttributes().get(awarenessAttribute);
    if (nodeAttributeValue.equals(attrValue) && dataNode.getId().compareTo(node.getId()) < 0) {
        position++;
    }
}

@github-actions
Copy link
Copy Markdown
Contributor

PR Code Suggestions ✨

Explore these optional code suggestions:

CategorySuggestion                                                                                                                                    Impact
General
Optimize duplicate node iteration

The loop iterates over all data nodes twice (once for counting, once for position
calculation), which is inefficient for large clusters. Consider caching the filtered
data nodes list or combining the operations to reduce iteration overhead.

server/src/main/java/org/opensearch/cluster/routing/allocation/decider/AwarenessAllocationDecider.java [174-182]

+List<DiscoveryNode> eligibleDataNodes = new ArrayList<>();
 for (DiscoveryNode dataNode : allocation.nodes().getDataNodes().values()) {
-    if (dataNode.isSearchNode()) {
-        continue;
-    }
-    String attrValue = dataNode.getAttributes().get(awarenessAttribute);
-    if (attrValue != null) {
-        nodeCountPerAttributeValue.merge(attrValue, 1, Integer::sum);
+    if (!dataNode.isSearchNode()) {
+        eligibleDataNodes.add(dataNode);
+        String attrValue = dataNode.getAttributes().get(awarenessAttribute);
+        if (attrValue != null) {
+            nodeCountPerAttributeValue.merge(attrValue, 1, Integer::sum);
+        }
     }
 }
Suggestion importance[1-10]: 5

__

Why: The suggestion correctly identifies that data nodes are iterated twice in the method, which could be optimized by caching. However, the impact is moderate as this is only called during allocation decisions, not in hot paths. The optimization would improve performance in large clusters but is not critical.

Low
Ensure stable node position ordering

The position calculation uses string comparison on node IDs which may not provide
stable ordering across cluster restarts if node IDs change. Consider using a more
stable ordering mechanism or document this assumption clearly.

server/src/main/java/org/opensearch/cluster/routing/allocation/decider/AwarenessAllocationDecider.java [203-212]

 long position = 0;
 for (DiscoveryNode dataNode : allocation.nodes().getDataNodes().values()) {
     if (dataNode.isSearchNode()) {
         continue;
     }
     String attrValue = dataNode.getAttributes().get(awarenessAttribute);
-    if (nodeAttributeValue.equals(attrValue) && dataNode.getId().compareTo(node.getId()) < 0) {
-        position++;
+    if (nodeAttributeValue.equals(attrValue)) {
+        int comparison = dataNode.getId().compareTo(node.getId());
+        if (comparison < 0 || (comparison == 0 && !dataNode.equals(node))) {
+            position++;
+        }
     }
 }
Suggestion importance[1-10]: 2

__

Why: The suggestion misunderstands the code. The improved_code adds an unnecessary check comparison == 0 && !dataNode.equals(node) which would never be true since if comparison == 0, the IDs are equal and the nodes are the same. The original code using compareTo is correct for establishing a stable ordering based on node IDs.

Low

@github-actions
Copy link
Copy Markdown
Contributor

❌ Gradle check result for f8996cc: null

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working distributed framework

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] auto_expand_replicas creating more replicas than shard allocation awareness allows

1 participant