Fix auto_expand_replicas creating more replicas than awareness allows #21752
cuonghm2809 wants to merge 1 commit into
Override shouldAutoExpandToNode() in AwarenessAllocationDecider to limit the number of nodes counted during auto-expand replica calculation based on awareness constraints.

Previously, awareness was not consulted when computing the desired replica count, causing auto_expand_replicas to create more replicas than zone/rack awareness could place, resulting in permanently yellow clusters.

The fix computes the maximum achievable shard copies considering both the awareness attribute distribution and the auto_expand max setting, then caps the per-attribute-value node count accordingly.

Resolves opensearch-project#2984
Signed-off-by: Cuong Ha <cuong.ha@optimizely.com>
❌ Gradle check result for f8996cc: null Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
Summary
Fixes #2984
When `auto_expand_replicas` is enabled on indices (e.g., system indices with `1-20`) and zone/rack awareness is configured, the replica count expands beyond what awareness constraints allow. This causes replicas to remain permanently unassigned, resulting in a yellow cluster that cannot self-heal.

Root Cause
`AutoExpandReplicas.getDesiredNumberOfReplicas()` calls `shouldAutoExpandToNode()` on each decider to count eligible nodes. However, `AwarenessAllocationDecider` did not override this method, so it returned the default `Decision.ALWAYS` for every node. This meant awareness constraints were completely ignored when computing the desired replica count, even though those same constraints later blocked the allocation in `canAllocate()`.

Fix
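As a rough illustration of the counting behavior described under Root Cause, the Java sketch below models how a decider that accepts every node inflates the replica count. `Decider`, `desiredReplicas`, and the string node IDs are hypothetical stand-ins, not OpenSearch classes; the real method returns `Decision` objects rather than booleans, and clamping `eligible - 1` into `[min, max]` is a simplifying assumption about how the range is applied.

```java
import java.util.List;

final class AutoExpandCountSketch {
    /** Hypothetical stand-in for a decider's auto-expand vote on a single node. */
    interface Decider {
        boolean shouldAutoExpandToNode(String nodeId);
    }

    /**
     * Counts the nodes that every decider accepts, then clamps
     * (eligible - 1) replicas into the configured [min, max] range.
     */
    static int desiredReplicas(List<String> nodes, List<Decider> deciders, int min, int max) {
        int eligible = 0;
        for (String node : nodes) {
            boolean accepted = true;
            for (Decider decider : deciders) {
                accepted &= decider.shouldAutoExpandToNode(node);
            }
            if (accepted) {
                eligible++;
            }
        }
        // One copy is the primary, so replicas = eligible nodes - 1.
        return Math.max(min, Math.min(max, eligible - 1));
    }
}
```

In this model, a decider that always returns `true` lets every node count toward the replica total, while a decider that rejects some nodes lowers the total before any allocation is attempted.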
Override `shouldAutoExpandToNode()` in `AwarenessAllocationDecider` to limit the number of nodes counted per awareness attribute value (zone/rack) based on:

- `computeMaxAchievableCopies()`, which finds the largest S where `sum(min(nodesPerValue_i, ceil(S/K))) >= S`
- the `auto_expand_replicas` max as an upper bound, so the computed node count is never reduced by the max setting to a value that awareness can't handle

`auto_expand_replicas: 0-all` continues to bypass awareness (tested and intentional), and indices without `auto_expand_replicas` are not affected at all.

Examples
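The largest-S condition from the Fix section can be sketched in Java as follows. This is a minimal sketch, not the PR's implementation: it assumes K is the number of awareness attribute values (the length of `nodesPerValue`), and the `upperBound` parameter is a hypothetical way of passing in the auto-expand max as a cap on total copies.

```java
final class MaxCopiesSketch {
    /**
     * Returns the largest S in [0, upperBound] such that
     * sum_i min(nodesPerValue[i], ceil(S / K)) >= S,
     * where K = nodesPerValue.length (assumed meaning of K).
     */
    static int computeMaxAchievableCopies(int[] nodesPerValue, int upperBound) {
        int k = nodesPerValue.length;
        if (k == 0) {
            return 0;
        }
        int totalNodes = 0;
        for (int n : nodesPerValue) {
            totalNodes += n;
        }
        // S can never exceed the number of nodes, so stop the search there.
        int limit = Math.min(upperBound, totalNodes);
        int best = 0;
        for (int s = 1; s <= limit; s++) {
            int perValueCap = (s + k - 1) / k; // integer ceil(s / k)
            int placeable = 0;
            for (int n : nodesPerValue) {
                placeable += Math.min(n, perValueCap);
            }
            if (placeable >= s) {
                best = s; // keep the largest S that satisfies the condition
            }
        }
        return best;
    }
}
```

For three zones with two nodes each and an upper bound of 20, `computeMaxAchievableCopies(new int[] {2, 2, 2}, 20)` returns 6 under these assumptions. The linear scan checks every candidate S rather than assuming the condition is monotonic in S.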
- `0-20`
- `1-20` (issue #2984)
- `0-20`
- `0-5`
- `1-20`
- `number_of_replicas: 10` (no auto-expand)

Test plan
- `testAutoExpandReplicasWithAwarenessEqualZones`: 3 equal zones, auto-expand `0-20`
- `testAutoExpandReplicasWithAwarenessUnequalZones`: 2 unequal zones (2 vs 5 nodes)
- `testAutoExpandReplicasWithForcedAwarenessAndEmptyZone`: forced zone with no nodes
- `testAutoExpandReplicasWithManyRackIds`: 7 rack_ids across 9 pods (issue #2984 scenario)
- `testAutoExpandReplicasWithUnevenRacksAndExplicitMax`: uneven racks with explicit auto-expand max
- `testAutoExpandReplicasWithAwarenessComputeMaxCopies`: unit tests for `computeMaxAchievableCopies` helper
- `testIgnoredByAutoExpandReplicasToAll`: existing test, `0-all` still bypasses awareness
- `AwarenessAllocationTests` pass (no regressions)
- `AutoExpandReplicasTests` pass
- `AllocationDecidersTests` pass