[RFC] Neural Search Stats API
Problem
The neural search plugin provides users a wide variety of features to power full-text semantic search, such as neural search, hybrid search, and a variety of other ML applications.
To understand how the neural search plugin is being used, cluster managers may want additional insight into things like which processors have the most executions or which hybrid search techniques are employed most frequently. However, the only way to gather this information today is through the OpenSearch core node info API, which lacks granular, plugin-specific information.
Proposal
In this RFC we propose creating a /stats API for neural search to provide the foundation to allow cluster managers to monitor adoption and operational info specific to the neural search plugin. This includes an optionally enabled backend to record and manage stat information, and an API to retrieve the information on demand. The purpose of this API is to provide high level information that cluster managers can use to observe usage trends over time.
This will be an opt-in feature. By default, stat collection will be disabled, and it can be configured via a cluster setting.
After the initial implementation is complete, our goal is to create documentation and drive new and existing features to onboard the statistics they want to track onto this system. This may include:
Existing features
Upcoming features
Requirements
Stats should track simple events and state inside the code
Event stats: counter stats that increase with code events. Single-valued, numeric, node-level counters.
e.g. how many RRFProcessor executions have there been in the last 5 minutes?
e.g. have there been RRFProcessor executions in the past 24 hours?
State stats: stats on existing cluster info that are determined on demand. Cluster-level counters.
e.g. How many NormalizationProcessors using arithmetic_mean does the cluster have in search pipelines?
Recording stats should be declarative and easy to implement
Define a stat name/path, call a method to increment it, and it should be available in the API automatically
Stats should have minimal effect on performance
Recording and storing stats should have minimal memory and CPU footprint
Stats should have a single unified API for retrieval
A single API provides all stat info, eliminating the need to call multiple APIs and aggregate data.
Stats can have “stat metadata” with more information about stat values, such as value type or last active event
Users should be able to filter by nodes or stat names, or return the response as flat JSON
Cluster managers should have flexibility to configure stat collection for their clusters
For users who don’t require stats, there should be an option to disable stats entirely to prevent additional resource consumption
High level flow
Event-based stats
Event stats are recorded in code at a node level (processor executions, documents ingested, etc.)
When an API call is made, all node-level maps are fetched via transport action and returned in the response.
State stats
State stats are defined by helper functions that populate state stat values
When an API call is made, the functions are invoked and the information is added to the response on demand
Event information is collected as it occurs
Stats API call fetches and returns stats
API Model
API: /_plugins/_neural/stats
Method: GET
Status: New
Mutating or Non-Mutating: Non-Mutating
Functionality: Retrieves stat counters from nodes and returns them in the response
Path Parameters
nodes: specify node ids to retrieve stats from (default all)
stats: specify stat names to retrieve (default all)
Query Parameters
include_metadata: boolean, include recent_interval/stat_type/minutes_since_last_event (default false)
flatten: boolean, flatten the JSON response (default false)
Example calls
GET /_plugins/_neural/stats
GET /_plugins/_neural/stats?include_metadata=false
GET /_plugins/_neural/<node_id>/stats/<stat_name>?include_metadata=true&flatten=true
Cluster level setting to disable Stats API/Collection
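As a sketch, disabling or enabling collection could go through the cluster settings API; the setting name below is illustrative only and is not finalized in this RFC:
PUT /_cluster/settings
{
  "persistent": {
    "plugins.neural_search.stats_enabled": false
  }
}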
Example response:
{
  // Event stats are calculated on the node level
  "nodes": {
    "data_node_1_asdjfalksdjfjasf": {
      "processors": {
        "normalization": {
          "normalization_processor_executions": <value>
        }
      }
    }
  },
  // State stats are calculated on the cluster level
  "state": {
    "processors": {
      "normalization": {
        "normalization_processors_in_pipelines": <value>
      }
    }
  }
}
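When include_metadata=true is passed, each stat value could instead be returned as an object carrying the metadata fields listed under the include_metadata parameter above (recent_interval, stat_type, minutes_since_last_event); the exact shape here is illustrative:
{
  "nodes": {
    "data_node_1_asdjfalksdjfjasf": {
      "processors": {
        "normalization": {
          "normalization_processor_executions": {
            "value": <value>,
            "stat_type": <stat_type>,
            "recent_interval": <value>,
            "minutes_since_last_event": <value>
          }
        }
      }
    }
  }
}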
Event Stats
Stat Model
Value: the cumulative count of the recorded event since the node started
Storing the total allows the caller to manage delta calculations on their own
This is the value format used by other OpenSearch stats APIs
Type: a type hint of the kind of value stored, used to help the Auto BR policy parse the response dynamically
Recent trailing time interval: a trailing count of the event in the past 5 minutes
This value will not be a perfectly accurate count. The goal is to have a window into the overall trend of event statistics in a rolling time window, which can be used to observe broader usage patterns.
Last event time: a timestamp of the last recorded event
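As a sketch, a per-node event stat could bundle these fields as follows; the class and member names here are assumptions for illustration, not the final design:
import java.util.concurrent.atomic.LongAdder;

// Illustrative sketch only: class and member names are assumptions, not the final design.
public class TimestampedEventStat {
    private final LongAdder totalValue = new LongAdder(); // cumulative count since the node started
    private volatile long lastEventTimestampMillis;       // timestamp of the most recent event
    // The rolling time buckets backing the recent trailing interval are covered in "Event Stat Flow" below.

    public void increment() {
        totalValue.increment();
        lastEventTimestampMillis = System.currentTimeMillis();
    }

    public long getValue() {
        return totalValue.sum();
    }
}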
State Stats
Calculated on demand at call time, so less stat metadata is needed
Event Stat Flow
Overall increment flow
Recording Stats
Static method increment() call using StatName as a parameter
Finds Stat by StatName in the map
Calls eventStat.increment()
Stat object handles updating count, updating time interval, and timestamp
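A minimal sketch of this path, reusing the TimestampedEventStat sketch above and assuming stats are registered in a node-level map keyed by StatName (only StatName, increment(), and EventStatsManager are named in this RFC; the rest is illustrative):
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class EventStatsManager {
    // Node-level registry of event stats, keyed by stat name
    private static final Map<StatName, TimestampedEventStat> STATS = new ConcurrentHashMap<>();

    // Called from feature code, e.g. EventStatsManager.increment(StatName.NORMALIZATION_PROCESSOR_EXECUTIONS)
    public static void increment(StatName statName) {
        // Find the stat by StatName and delegate to the stat object, which updates the
        // count, the time interval buckets, and the last event timestamp
        TimestampedEventStat stat = STATS.get(statName);
        if (stat != null) {
            stat.increment();
        }
    }
}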
Getting Stats
After an API call, a transport action is sent to each node
On each node, EventStatsManager.getStatsData() is called to get the stat data based on the current store
This map is returned from each node
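Extending the EventStatsManager sketch above, the node-level collection step could look roughly like the following; the transport action plumbing is omitted and the StatName.getName() accessor is an assumption:
// Inside the EventStatsManager sketch above; builds the node-level stat data map
// that the transport action returns to the coordinator.
public static Map<String, Long> getStatsData() {
    Map<String, Long> statsData = new HashMap<>();
    for (Map.Entry<StatName, TimestampedEventStat> entry : STATS.entrySet()) {
        // Flatten each registered stat into a name -> current value pair
        statsData.put(entry.getKey().getName(), entry.getValue().getValue());
    }
    return statsData;
}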
Tracking recent time interval
High level diagram:
Implementation
(see below for example)
Option 1: Array with time buckets approach (preferred)
Hold a fixed length array with all buckets labelled by time (e.g. for 5 minutes, have a 6 element array, 5 past minutes + 1 current minute)
Current bucket to increment is determined by system time in minutes % number of buckets
At increment time, the “current” bucket is accessed; if it is up to date, it is incremented
If it is out of date, it is reset with the new timestamp and overwritten
To get values, iterate through all buckets, exclude expired buckets, and sum
Pros:
Constant time reads/writes
Bucket rotation is determined at increment time; if there are no increments, there is no performance overhead for events
Cons:
Difficult to extend with configurability, need to reinitialize fixed length arrays
Option 2: Scheduled rotating queue approach
Current bucket to increment is maintained by atomic reference
Hold a queue store for past buckets
At scheduled time intervals the current bucket is automatically rotated into the queue and the last bucket is popped
To get values, iterate through all buckets and sum
Pros:
Simpler mental model
Easier to add configurability for interval size
Cons:
Need to manage scheduling system, leads to time desync issues
Cannot guarantee perfectly time-aligned scheduled executions, which can lead to bucket time drift
Performance overhead for buckets that do not need rotation, e.g. if there is a stat with no events, the scheduled executions would still run
Need to manage atomic reference to current bucket
The following design will assume using option 1. Each event stat manages its own set of buckets.
Update existing bucket happy case
Increment Call
Increments the total value
Updates the timestamp for timestamped stats
Determine which time bucket to access
// Get the current time
long now = System.currentTimeMillis();
// Round down to the start of the current minute bucket
long currentBucketTime = now - (now % BUCKET_INTERVAL_MS);
// Determine the bucket index (integer division by the bucket interval, modulo the bucket count)
int bucketIndex = (int) ((now / BUCKET_INTERVAL_MS) % BUCKETS_COUNT);
// Get the bucket
Bucket bucket = buckets[bucketIndex];
// Check whether the bucket's timestamp still matches the current interval
if (bucket.getTimestamp() == currentBucketTime) {
    // Increment the bucket
    bucket.increment();
}
Get recent interval call
Get all buckets
Sum all buckets where the bucket time satisfies the following conditions:
Bucket time >= 5 minutes ago (rounded down)
Bucket time < 1 minute ago (rounded down)
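A sketch of this read path, using the bucket accessors assumed earlier and interpreting the conditions above as: include the completed minute buckets inside the trailing window and exclude the current in-progress bucket and anything older:
public long getRecentInterval() {
    long now = System.currentTimeMillis();
    long currentMinute = now - (now % BUCKET_INTERVAL_MS);     // start of the current, in-progress minute
    long windowStart = currentMinute - 5 * BUCKET_INTERVAL_MS; // 5 minutes ago, rounded down
    long total = 0;
    for (Bucket bucket : buckets) {
        long bucketTime = bucket.getTimestamp();
        // Skip expired buckets and the current in-progress bucket, then sum the rest
        if (bucketTime >= windowStart && bucketTime < currentMinute) {
            total += bucket.getValue();
        }
    }
    return total;
}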
Update new bucket
Before Update
After Update
Increment call is the same as above, except:
Determine which time bucket to access
Same as the above case, except:
If the bucket time is out of date, overwrite the bucket time with the current time and reset the counter
then increment
getRecentInterval() call is the same as above
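Continuing the increment sketch from the happy case, the out-of-date branch described here might look like the following (Bucket.reset() is an assumed helper):
// If the bucket found at bucketIndex belongs to an older interval, overwrite its
// timestamp with the current interval start and reset its counter, then increment.
if (bucket.getTimestamp() != currentBucketTime) {
    bucket.reset(currentBucketTime);
}
bucket.increment();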
Updating latest event timestamp
What format should the timestamp be returned in?
Option 1: Relative timestamp (preferred)
e.g. “minutes since last event”
Each stat on the node stores the most recent event timestamp locally
When serialization call is made, the node uses its current system time to calculate a relative time (minutes_since_last_event) based on the most recent event
Pros:
Calculating node relative time minimizes impact of cluster time desync issues. Each node only compares to itself.
Cons:
The timestamp will be relative to the API call time rather than absolute
Option 2: Unix timestamp
Each stat on the node stores the most recent event unix timestamp
When serialization call is made, that timestamp is returned as is
Pros:
Simple to return a single timestamp, absolute time value has more flexibility to be used by caller
Cons:
Nodes can have clock drift and face time sync issues; this requires additional cluster-level clock synchronization to ensure system time on a node is consistent across the cluster
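A sketch of the preferred option 1 calculation, performed on the node at serialization time so that only the node's own clock is involved (field name as assumed in the earlier stat sketch):
// Relative "minutes since last event" computed against this node's own clock,
// so cluster-wide clock drift does not affect the value.
public long getMinutesSinceLastEvent() {
    return (System.currentTimeMillis() - lastEventTimestampMillis) / (60L * 1000L);
}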
State Stat Flow
On the coordinator node, an API request to get state stats is received
Create a new map to store state stat info
Call helper functions to get state stats
Helper functions fetch info and update the map with calculated values
... more helper calls
The map is formatted to match the user request
Return the map with the info in the response
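A rough sketch of this coordinator-side flow; the manager class, helper names, and pipeline lookup are assumptions for illustration, with only the stat name taken from the example response above:
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of the state stat flow; names are assumptions, not the final design.
public class StateStatsManager {
    public Map<String, Object> getStateStats() {
        // New map to store state stat info, populated on demand
        Map<String, Object> stateStats = new HashMap<>();
        // Each helper inspects existing cluster info (e.g. search pipeline configs)
        // and writes its calculated values into the map
        addNormalizationProcessorStats(stateStats);
        // ... more helper calls
        return stateStats; // formatted to match the user request before being returned in the response
    }

    private void addNormalizationProcessorStats(Map<String, Object> stats) {
        stats.put("normalization_processors_in_pipelines", countNormalizationProcessorsInPipelines());
    }

    private long countNormalizationProcessorsInPipelines() {
        // Placeholder: in practice this would read the cluster's search pipeline configurations
        return 0L;
    }
}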
Next steps
As mentioned above, our next step after the framework is implemented is to onboard existing and new feature use cases onto the system.
In the future, as we add more complex stat use cases such as plugin health and operational monitoring, we will also consider the following future enhancements:
Configurable time interval via cluster level setting
Custom time interval length (e.g. past 5 minutes, past 10 minutes, past 30 minutes)
Note: if this setting is changed at runtime, it would require resetting all event stats (reinitializing the time bucket array), leading to data loss.
State stat caching
If there are any expensive on-demand calculations to compute state stats in the future, we can implement a caching system to reduce the impact of repeated calls
Detailed operational statistics
More detailed information for cluster health diagnostics, such as processor