Fix auto_expand_replicas creating more replicas than awareness allows by cuonghm2809 · Pull Request #21752 · opensearch-project/OpenSearch

cuonghm2809 · 2026-05-20T02:58:41Z

Summary

When auto_expand_replicas is enabled on indices (e.g., system indices with 1-20) and zone/rack awareness is configured, the replica count expands beyond what awareness constraints allow. This causes replicas to remain permanently unassigned, resulting in a yellow cluster that cannot self-heal.

Root Cause

AutoExpandReplicas.getDesiredNumberOfReplicas() calls shouldAutoExpandToNode() on each decider to count eligible nodes. However, AwarenessAllocationDecider did not override this method — it returned the default Decision.ALWAYS for every node. This meant awareness constraints were completely ignored when computing the desired replica count, even though those same constraints later blocked the allocation in canAllocate().

Fix

Override shouldAutoExpandToNode() in AwarenessAllocationDecider to limit the number of nodes counted per awareness attribute value (zone/rack) based on:

Awareness capacity — computes the maximum achievable shard copies across all attribute values using computeMaxAchievableCopies(), which finds the largest S where sum(min(nodesPerValue_i, ceil(S/K))) >= S
Auto-expand max setting — uses the auto_expand_replicas max as an upper bound so the computed node count is never reduced by the max setting to a value that awareness can't handle
Existing behavior preserved — auto_expand_replicas: 0-all continues to bypass awareness (tested and intentional), and indices without auto_expand_replicas are not affected at all

Examples

Scenario	Before	After
3 zones × 3 nodes, `0-20`	YELLOW (awareness blocks)	GREEN (8 replicas)
7 rack_ids, 9 pods, `1-20` (issue #2984)	YELLOW	GREEN (8 replicas)
3 forced zones (1 empty), 6 nodes, `0-20`	YELLOW	GREEN (3 replicas)
3 racks (5,1,1 nodes), `0-5`	YELLOW (2 unassigned)	GREEN (3 replicas)
7 racks, 40 nodes (uneven), `1-20`	YELLOW (1 unassigned)	GREEN (19 replicas)
Fixed `number_of_replicas: 10` (no auto-expand)	No change	No change

Test plan

Override shouldAutoExpandToNode() in AwarenessAllocationDecider to limit the number of nodes counted during auto-expand replica calculation based on awareness constraints. Previously, awareness was not consulted when computing the desired replica count, causing auto_expand_replicas to create more replicas than zone/rack awareness could place, resulting in permanently yellow clusters. The fix computes the maximum achievable shard copies considering both the awareness attribute distribution and the auto_expand max setting, then caps the per-attribute-value node count accordingly. Resolves opensearch-project#2984 Signed-off-by: Cuong Ha <cuong.ha@optimizely.com>

github-actions · 2026-05-20T02:59:48Z

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 No multiple PR themes
⚡ Recommended focus areas for review Possible Issue When `forcedAwarenessAttributes` contains an attribute not in `awarenessAttributes`, the code iterates over `awarenessAttributes` but accesses `forcedAwarenessAttributes` using the current attribute. If the forced attribute differs, `forcedValues` will be null and the forced values are silently ignored. This can cause incorrect `numberOfAttributeValues` and `maxCopies` calculations, leading to over-allocation or under-allocation of replicas. for (String awarenessAttribute : awarenessAttributes) { String nodeAttributeValue = node.getAttributes().get(awarenessAttribute); if (nodeAttributeValue == null) { return allocation.decision( Decision.NO, NAME, "node does not contain the awareness attribute [%s]; required attributes cluster setting [%s=%s]", awarenessAttribute, CLUSTER_ROUTING_ALLOCATION_AWARENESS_ATTRIBUTE_SETTING.getKey(), allocation.debugDecision() ? Strings.collectionToCommaDelimitedString(awarenessAttributes) : null ); } Map<String, Integer> nodeCountPerAttributeValue = new HashMap<>(); for (DiscoveryNode dataNode : allocation.nodes().getDataNodes().values()) { if (dataNode.isSearchNode()) { continue; } String attrValue = dataNode.getAttributes().get(awarenessAttribute); if (attrValue != null) { nodeCountPerAttributeValue.merge(attrValue, 1, Integer::sum); } } Set<String> allAttributeValues = new HashSet<>(nodeCountPerAttributeValue.keySet()); List<String> forcedValues = forcedAwarenessAttributes.get(awarenessAttribute); if (forcedValues != null) { allAttributeValues.addAll(forcedValues); } int numberOfAttributeValues = allAttributeValues.size(); if (numberOfAttributeValues <= 1) { continue; } Possible Issue The position calculation uses `compareTo` on node IDs to determine ordering. If node IDs are not lexicographically ordered in the same way as insertion order or cluster state order, nodes may be incorrectly excluded. This can cause replicas to be rejected even when capacity exists, or accepted when they should be rejected, leading to unbalanced allocation or unassigned shards. long position = 0; for (DiscoveryNode dataNode : allocation.nodes().getDataNodes().values()) { if (dataNode.isSearchNode()) { continue; } String attrValue = dataNode.getAttributes().get(awarenessAttribute); if (nodeAttributeValue.equals(attrValue) && dataNode.getId().compareTo(node.getId()) < 0) { position++; } }

github-actions · 2026-05-20T03:00:17Z

PR Code Suggestions ✨

Explore these optional code suggestions:

Category	Suggestion	Impact
General	Optimize duplicate node iteration The loop iterates over all data nodes twice (once for counting, once for position calculation), which is inefficient for large clusters. Consider caching the filtered data nodes list or combining the operations to reduce iteration overhead. server/src/main/java/org/opensearch/cluster/routing/allocation/decider/AwarenessAllocationDecider.java [174-182] +List<DiscoveryNode> eligibleDataNodes = new ArrayList<>(); for (DiscoveryNode dataNode : allocation.nodes().getDataNodes().values()) { - if (dataNode.isSearchNode()) { - continue; - } - String attrValue = dataNode.getAttributes().get(awarenessAttribute); - if (attrValue != null) { - nodeCountPerAttributeValue.merge(attrValue, 1, Integer::sum); + if (!dataNode.isSearchNode()) { + eligibleDataNodes.add(dataNode); + String attrValue = dataNode.getAttributes().get(awarenessAttribute); + if (attrValue != null) { + nodeCountPerAttributeValue.merge(attrValue, 1, Integer::sum); + } } } Suggestion importance[1-10]: 5 __ Why: The suggestion correctly identifies that data nodes are iterated twice in the method, which could be optimized by caching. However, the impact is moderate as this is only called during allocation decisions, not in hot paths. The optimization would improve performance in large clusters but is not critical.	Low
General	Ensure stable node position ordering The position calculation uses string comparison on node IDs which may not provide stable ordering across cluster restarts if node IDs change. Consider using a more stable ordering mechanism or document this assumption clearly. server/src/main/java/org/opensearch/cluster/routing/allocation/decider/AwarenessAllocationDecider.java [203-212] long position = 0; for (DiscoveryNode dataNode : allocation.nodes().getDataNodes().values()) { if (dataNode.isSearchNode()) { continue; } String attrValue = dataNode.getAttributes().get(awarenessAttribute); - if (nodeAttributeValue.equals(attrValue) && dataNode.getId().compareTo(node.getId()) < 0) { - position++; + if (nodeAttributeValue.equals(attrValue)) { + int comparison = dataNode.getId().compareTo(node.getId()); + if (comparison < 0 \|\| (comparison == 0 && !dataNode.equals(node))) { + position++; + } } } Suggestion importance[1-10]: 2 __ Why: The suggestion misunderstands the code. The `improved_code` adds an unnecessary check `comparison == 0 && !dataNode.equals(node)` which would never be true since if `comparison == 0`, the IDs are equal and the nodes are the same. The original code using `compareTo` is correct for establishing a stable ordering based on node IDs.	Low

github-actions · 2026-05-20T05:01:03Z

❌ Gradle check result for f8996cc: null

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

cuonghm2809 requested a review from a team as a code owner May 20, 2026 02:58

github-actions Bot added bug Something isn't working distributed framework labels May 20, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix auto_expand_replicas creating more replicas than awareness allows#21752

Fix auto_expand_replicas creating more replicas than awareness allows#21752
cuonghm2809 wants to merge 1 commit into
opensearch-project:mainfrom
cuonghm2809:fix/auto-expand-replicas-awareness-2984

cuonghm2809 commented May 20, 2026

Uh oh!

github-actions Bot commented May 20, 2026

Uh oh!

github-actions Bot commented May 20, 2026

Uh oh!

github-actions Bot commented May 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

cuonghm2809 commented May 20, 2026

Summary

Root Cause

Fix

Examples

Test plan

Uh oh!

github-actions Bot commented May 20, 2026

PR Reviewer Guide 🔍

Uh oh!

github-actions Bot commented May 20, 2026

PR Code Suggestions ✨

Uh oh!

github-actions Bot commented May 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant