[Feature Request] Paginating _wlm/stats API #17592
Is your feature request related to a problem? Please describe
The current _wlm/stats API in OpenSearch provides query group statistics across nodes in a single response, which scales poorly as cluster size increases. Similar to _cat APIs (e.g., _cat/indices, _cat/shards), this API suffers from large response sizes, high latency, and increased CPU/memory consumption. This makes it difficult for users to efficiently retrieve and process query group statistics, especially in large clusters.
The need for pagination arises to:
The issues and approaches discussed in the following OpenSearch GitHub issues are particularly relevant:
OpenSearch Issue #14257: Discusses pagination for _cat APIs, highlighting the impact of large responses on cluster performance.
OpenSearch Issue #15014: Tracks the introduction of _list APIs to replace _cat APIs, ensuring efficient pagination with next_token.
OpenSearch Issue #14258: Discusses pagination strategies, emphasizing deterministic sorting keys for stable pagination behavior.
Describe the solution you'd like
To address the issues of large response sizes and high resource consumption in _wlm/stats, we propose introducing a new API endpoint (/_list/wlm_stats) with token-based pagination. This follows the approach used in OpenSearch Issue #14257 and OpenSearch Issue #15014, where _list APIs were introduced for paginating large _cat responses.
Key Features
Token-based pagination (next_token): Users can fetch query group statistics in smaller chunks, reducing resource consumption.
… _cat APIs, making it easy to read and process.
Sorting Options
Since CPU and memory usage fluctuate frequently, sorting by these values is not supported because it would cause inconsistent pagination results. Instead, sorting will be restricted to stable attributes such as query group and node ID, as in the example calls below.
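To illustrate why a stable sort key matters, here is a minimal in-memory sketch of token-based pagination. It is not the proposed implementation: the row shape, the `list_wlm_stats` function, and the base64/JSON token encoding are all assumptions made for illustration. The token records the last (query group, node ID) pair returned, so the next page resumes strictly after it even if CPU values changed in between.

```python
import base64
import json

# Hypothetical rows: one entry per (query group, node) pair.
ROWS = [
    {"query_group": "default", "node_id": "node-b", "cpu": 0.10},
    {"query_group": "analytics", "node_id": "node-b", "cpu": 0.42},
    {"query_group": "reporting", "node_id": "node-a", "cpu": 0.05},
    {"query_group": "analytics", "node_id": "node-a", "cpu": 0.37},
    {"query_group": "default", "node_id": "node-a", "cpu": 0.22},
]


def sort_key(row):
    # Stable attributes only; fluctuating values such as cpu are never used.
    return (row["query_group"], row["node_id"])


def encode_token(key):
    # Assumed token format: URL-safe base64 of the JSON-encoded sort key.
    return base64.urlsafe_b64encode(json.dumps(key).encode()).decode()


def decode_token(token):
    return tuple(json.loads(base64.urlsafe_b64decode(token.encode())))


def list_wlm_stats(rows, size, next_token=None):
    """Return one page of rows and the token for the following page (or None)."""
    ordered = sorted(rows, key=sort_key)
    if next_token is not None:
        last_seen = decode_token(next_token)
        # Resume strictly after the last key handed out on the previous page.
        ordered = [r for r in ordered if sort_key(r) > last_seen]
    page = ordered[:size]
    has_more = len(ordered) > size
    token = encode_token(sort_key(page[-1])) if has_more else None
    return page, token
```

A caller would loop until the returned token is `None`. Because the token depends only on the stable key, each (query group, node) pair is returned exactly once across the pages as long as the set of pairs itself does not change.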
Example API Calls
Fetch First Page (Sorted by Query Group)
Returns results grouped by Query Group, making it easier to analyze workload performance.
Fetch First Page (Sorted by Node ID)
Sorts results by Node ID, providing a stable, structured overview.
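As a rough sketch of how a client might build such requests, the helper below assembles a request path for the proposed endpoint. Only `/_list/wlm_stats` and `next_token` come from this proposal; the `size` and `sort` parameter names, the sort values, and the helper itself are hypothetical and may differ from whatever is eventually implemented.

```python
from urllib.parse import urlencode


def wlm_stats_list_path(size=50, sort=None, next_token=None):
    """Build a request path for the proposed endpoint (parameter names assumed)."""
    params = {"size": size}  # hypothetical page-size parameter
    if sort is not None:
        params["sort"] = sort  # hypothetical: "query_group" or "node_id"
    if next_token is not None:
        params["next_token"] = next_token  # token returned by the previous page
    return "/_list/wlm_stats?" + urlencode(params)


first_by_group = wlm_stats_list_path(sort="query_group")
first_by_node = wlm_stats_list_path(sort="node_id")
```

Restricting `sort` to these two values on the client side mirrors the restriction to stable attributes described under Sorting Options.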
Fetch Next Page
Uses next_token to fetch the next 50 results in a stable order.
Related component
Search
Describe alternatives you've considered
An alternative solution is to enhance the existing _wlm/stats API with filtering options, ensuring that only the most relevant statistics are retrieved.
Key Features
Example API Calls
Fetch Nodes with CPU Usage Above 50%
GET /_wlm/stats?cpu_threshold=50
Returns only nodes consuming more than 50% CPU.
Fetch Nodes with High Memory Usage
GET /_wlm/stats?memory_threshold=70
Retrieves only nodes using more than 70% memory.
Fetch Query Groups for a Specific Node
GET /_wlm/stats?node_id=jPPwGjW-TA2NZB6Gn7RZtg
Returns query group statistics for the given node.
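The filtering semantics of this alternative can be sketched as follows. This is only an illustration under assumptions: the per-node record shape (`node_id`, `cpu_percent`, `memory_percent`) and the `filter_wlm_stats` function are hypothetical, while the three filter names match the query parameters in the examples above.

```python
def filter_wlm_stats(node_stats, cpu_threshold=None, memory_threshold=None, node_id=None):
    """Keep only the node entries that pass every filter that was supplied."""
    result = []
    for entry in node_stats:
        if node_id is not None and entry["node_id"] != node_id:
            continue
        # "More than N%" is read here as a strict comparison.
        if cpu_threshold is not None and entry["cpu_percent"] <= cpu_threshold:
            continue
        if memory_threshold is not None and entry["memory_percent"] <= memory_threshold:
            continue
        result.append(entry)
    return result


# Hypothetical sample data; the first node ID is taken from the example above.
sample = [
    {"node_id": "jPPwGjW-TA2NZB6Gn7RZtg", "cpu_percent": 63, "memory_percent": 40},
    {"node_id": "node-2", "cpu_percent": 50, "memory_percent": 82},
    {"node_id": "node-3", "cpu_percent": 12, "memory_percent": 71},
]
busy_cpu = filter_wlm_stats(sample, cpu_threshold=50)
busy_memory = filter_wlm_stats(sample, memory_threshold=70)
```

Note that filtering reduces the response only when few nodes exceed the thresholds; unlike pagination, it places no upper bound on response size.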
Additional context
No response