
[RFC] Neural Plugin Stats API #1196

Open
q-andy opened this issue Feb 25, 2025 · 0 comments · Fixed by #1208
Labels
v3.0.0
q-andy commented Feb 25, 2025

[RFC] Neural Search Stats API

Problem

The neural search plugin provides users with a wide variety of features to power full-text and semantic search, such as neural search, hybrid search, and a variety of other ML applications.

To understand how the neural plugin is being used, cluster managers may want additional insight on things like which processors have the most executions or which hybrid search techniques are being employed most frequently. However, the only way to gather this information currently is through the OpenSearch core node info API which lacks granular, plugin-specific information.

Proposal

In this RFC we propose creating a /stats API for neural search to provide the foundation to allow cluster managers to monitor adoption and operational info specific to the neural search plugin. This includes an optionally enabled backend to record and manage stat information, and an API to retrieve the information on demand. The purpose of this API is to provide high level information that cluster managers can use to observe usage trends over time.

This will be an opt-in feature. By default, stat collection will be disabled, and it can be configured via a cluster setting.

After the initial implementation is complete, our goal is to create documentation and drive new and existing features to onboard statistics they want to track onto this system. This may include:

Existing features

Upcoming features

Requirements

  1. Stats should track simple events and state inside the code
    1. Event stats: Counter stats that increase with code events. Single-valued, numeric, node-level counters.
      1. e.g. how many RRFProcessor executions have there been in the last 5 minutes?
      2. e.g. have there been RRFProcessor executions in the past 24 hours?
    2. State stats: Stats on existing cluster info that are determined on demand. Cluster level counters.
      1. e.g. How many NormalizationProcessors using arithmetic_mean does the cluster have in search pipelines?
  2. Recording stats should be declarative and easy to implement
    1. Define a stat name/path, call a method to increment it, and it should be available in the API automatically
  3. Stats should have minimal effect on performance
    1. Recording and storing stats should have minimal memory and CPU footprint
  4. Stats should have single unified API for retrieval
    1. A single API provides all stat info, eliminating the need to call multiple APIs and aggregate data.
    2. Stats can have “stat metadata” with more information about stat values, such as value type or last active event
    3. Users should be able to filter nodes or stat names, or return the response as flat JSON
  5. Cluster managers should have flexibility to configure stat collection for their clusters
    1. For users who don’t require stats, there should be an option to disable stats entirely to prevent additional resource consumption

High level flow

  • Event-based stats
    • Event stats are recorded in code at a node level (processor executions, documents ingested, etc)
    • When an API call is made, all node-level maps are fetched via transport action and returned in the response.
  • State stats
    • State stats are defined by helper functions that populate state stat values
    • When an API call is made, the functions are invoked and the information is added to the response on demand

Event information is collected as it occurs

Stats API call fetches and returns stats

Image
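The event-stat side of this flow can be sketched as a node-local map of lock-free counters that the transport action snapshots on demand. This is a minimal illustration, not the plugin's actual implementation; the class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical sketch: node-local event counters keyed by stat name.
class NodeEventStats {
    private final Map<String, LongAdder> counters = new ConcurrentHashMap<>();

    // Called from hot paths (e.g. processor executions); LongAdder keeps
    // increments cheap under contention.
    void increment(String statName) {
        counters.computeIfAbsent(statName, k -> new LongAdder()).increment();
    }

    // Called per node when the stats transport action runs: snapshot the
    // current values into a plain map for the response.
    Map<String, Long> snapshot() {
        Map<String, Long> out = new HashMap<>();
        counters.forEach((name, adder) -> out.put(name, adder.sum()));
        return out;
    }
}
```

The coordinator would then merge each node's snapshot under its node id in the response.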

API Model

API                      Method  Status  Mutating or Non-Mutating  Functionality
/_plugins/_neural/stats  GET     New     Non-Mutating              Retrieves stat counters from nodes and returns them in the response

Path Parameters

  • nodes: specify node ids to retrieve stats from (default all)
  • stats: specify stat names to retrieve (default all)

Query Parameters

  • include_metadata: boolean, include recent_interval/stat_type/minutes_since_last_event (default false)
  • flatten: boolean, flatten the JSON response (default false)

Example calls

GET /_plugins/_neural/stats
GET /_plugins/_neural/stats?include_metadata=false
GET /_plugins/_neural/<node_id>/stats/<stat_name>?include_metadata=true&flatten=true

Cluster level setting to disable Stats API/Collection

PUT /_cluster/settings
{
    "persistent" : {
        "plugins.neural_search.stats_enabled" : "false" // default false
    }
}

Get stats without metadata

GET /_plugins/_neural/stats/

Example Response:

{
  "cluster_name": "integTest",
  "cluster_version": "3.0.0",
  "nodes": {
    "node_1": {
      "normalization_processor": {
        "normalization_processor_executions": 100,
        "normalization_technique_l2_executions": 80,
        "normalization_technique_z_score_executions": 20
      }
    }
  },
  "all_nodes": { ... },
  "state": {
    "normalization_processor": {
      "normalization_processors_in_pipelines_count": 3,
      "normalization_technique_z_score_in_pipelines_count": 1,
      "normalization_technique_l2_in_pipelines_count": 2
    }
  }
}

Get stats with metadata

GET /_plugins/_neural/stats?include_metadata=true

Example Response:

{
    "cluster_name": "integTest",
    "cluster_version": "3.0.0",
    "stats_enabled": true,
    "nodes": {
        "node_1": {
            "processors": {
                "normalization": {
                    "normalization_processor_executions": {
                        "value": 100,
                        "type": "event_counter",
                        "trailing_interval_value": 9,
                        "minutes_since_last_event": 5
                    },
                    "normalization_technique_l2_executions": {
                        "value": 80,
                        "type": "event_counter",
                        "trailing_interval_value": 9,
                        "minutes_since_last_event": 5
                    },
                    "normalization_technique_z_score_executions": {
                        "value": 20,
                        "type": "event_counter",
                        "trailing_interval_value": 0,
                        "minutes_since_last_event": 9999
                    }
                }
            }
        }
    },
    "all_nodes": {
        // ...
    },
    "state": {
        "processors": {
            "normalization": {
                "normalization_processors_in_pipelines_count": {
                    "value": 3,
                    "type": "state_counter"
                },
                "normalization_technique_z_score_in_pipelines_count": {
                    "value": 1,
                    "type": "state_counter"
                },
                "normalization_technique_l2_in_pipelines_count": {
                    "value": 2,
                    "type": "state_counter"
                }
            }
        }
    }
}

Low Level Design

  • StatName enums are used to uniquely identify stats. Each stat name maps directly to a single stat
    • Stat Name - uniquely identifies the stat; used for filtering in user requests
    • Response Path - the location in the JSON response where the stat value will be placed
    • StatType is another enum used by StatNames to define what kind of stat it is (e.g. event counter, state counter)
  • Stat objects store information about a stat, including the stat value itself and its metadata

Image

// Declare enums
enum EventStatName {
   NORMALIZATION_PROCESSOR_EXECUTIONS(
        "normalization_processor_executions", // stat name
        "processors.normalization", // response path
        EventStatType.TIMESTAMPED_EVENT_COUNTER)
   ;
    
   // ...
}

enum StateStatName {
   NORMALIZATION_PROCESSORS_IN_PIPELINES(
        "normalization_processors_in_pipelines", 
        "processors.normalization",
        StateStatType.STATE_COUNTER)
   ;
    
   // ...
}

Response

{
    // Event stats are calculated on node level
    "nodes": {
        "data_node_1_asdjfalksdjfjasf": {
            "processors": {
                "normalization": {
                    "normalization_processor_executions": <value>
                }
            }
        }
    },
    // State stats are calculated on cluster level
    "state": {
        "processors": {
            "normalization": {
                "normalization_processors_in_pipelines": <value>
            }
        }
    }
}

Event Stats

Image

Stat Model

  • Value: the cumulative count of the recorded event since nodes started
    • Storing the total allows the caller to manage delta calculations on their own
    • This is the value format used by other OpenSearch stat APIs
  • Type: a type hint of the kind of value stored, used to help the Auto BR policy parse the response dynamically
  • Recent trailing time interval: A trailing count of the event in the past 5 minutes
    • This value will not be a perfectly accurate count. The goal is to have a window into the overall trend of event statistics in a rolling time window, which can be used to observe broader usage patterns.
  • Last event time: a timestamp of the last recorded event

State Stats

Image

  • Calculated on demand at call time, so less stat metadata is needed

Event Stat Flow

Overall increment flow

Image

Recording Stats

  1. Static method increment() is called using a StatName as a parameter
  2. Finds the Stat by StatName in the map
  3. Calls eventStat.increment()
    1. The Stat object handles updating the count, the time interval, and the timestamp
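A minimal sketch of this increment path, assuming an enum-keyed map of counters; `DemoEventStatName` and `DemoEventStatsManager` are hypothetical stand-ins for the real StatName enum and EventStatsManager:

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.concurrent.atomic.LongAdder;

enum DemoEventStatName { NORMALIZATION_PROCESSOR_EXECUTIONS }

final class DemoEventStatsManager {
    // One counter per stat name, pre-populated so increments are a plain lookup.
    private static final Map<DemoEventStatName, LongAdder> STATS = new EnumMap<>(DemoEventStatName.class);
    static {
        for (DemoEventStatName name : DemoEventStatName.values()) STATS.put(name, new LongAdder());
    }

    // Steps 1-2: static call, find the stat by its StatName in the map.
    static void increment(DemoEventStatName name) {
        // Step 3: delegate to the stat object, which updates its own count.
        STATS.get(name).increment();
    }

    static long value(DemoEventStatName name) {
        return STATS.get(name).sum();
    }

    private DemoEventStatsManager() {}
}
```

Call sites then reduce to a single line, e.g. `DemoEventStatsManager.increment(DemoEventStatName.NORMALIZATION_PROCESSOR_EXECUTIONS);`.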

Getting Stats

  1. After API call, transport action is sent to each node
  2. On each node, EventStatsManager.getStatsData() is called to get the stat data based on the current store
  3. This map is returned from each node

Tracking recent time interval

High level diagram:

Image

Implementation
(see below for example)

  1. Option 1: Array with time buckets approach (preferred)
    1. Hold a fixed length array with all buckets labelled by time (e.g. for 5 minutes, have a 6 element array, 5 past minutes + 1 current minute)
    2. Current bucket to increment is determined by system time in minutes % number of buckets
      1. At increment time, “current” bucket is accessed, if up to date, it’s incremented
      2. If it’s out of date, it will be reset with the new timestamp and overwritten
    3. To get values, iterate through all buckets, exclude expired buckets, and sum
    4. Pros:
      1. Constant time reads/writes
      2. Bucket rotation is determined at increment time, if there are no increments, there is no performance overhead for events
    5. Cons
      1. Difficult to extend with configurability, need to reinitialize fixed length arrays
  2. Option 2: Scheduled rotating queue approach
    1. Current bucket to increment is maintained by atomic reference
    2. Hold a queue store for past buckets
    3. At scheduled time intervals the current bucket is automatically rotated into the queue and the last bucket is popped
    4. To get values, iterate through all buckets and sum
    5. Pros:
      1. Simpler mental model
      2. Easier to add configurability for interval size
    6. Cons:
      1. Need to manage a scheduling system, which leads to time desync issues
        1. Cannot guarantee perfectly time-aligned scheduled executions, which can lead to bucket time drift
        2. Performance overhead for buckets that do not need rotation, e.g. if there is a stat with no events, the scheduled executions would still run
      2. Need to manage atomic reference to current bucket

The following design will assume using option 1. Each event stat manages its own set of buckets.
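Option 1 can be sketched end to end as follows. This is an illustrative implementation under stated assumptions: `Bucket`, `BUCKET_INTERVAL_MS`, and `BUCKETS_COUNT` are names reused from the pseudocode below, time is passed in explicitly for testability, and the simple `synchronized` guard stands in for whatever concurrency control the real implementation uses:

```java
import java.util.concurrent.atomic.LongAdder;

// Sketch of Option 1: fixed-length array of time-labelled buckets,
// rotated lazily at increment time.
class TimeBucketedCounter {
    static final long BUCKET_INTERVAL_MS = 60_000L; // 1-minute buckets
    static final int BUCKETS_COUNT = 6;             // 5 past minutes + 1 current minute

    private static final class Bucket {
        volatile long bucketTime;                   // start of the minute this bucket covers
        final LongAdder count = new LongAdder();
    }

    private final Bucket[] buckets = new Bucket[BUCKETS_COUNT];

    TimeBucketedCounter() {
        for (int i = 0; i < BUCKETS_COUNT; i++) buckets[i] = new Bucket();
    }

    // Rotation happens at increment time: a stale bucket is reset and reused,
    // so idle stats incur no background work.
    void increment(long nowMillis) {
        long bucketTime = nowMillis - (nowMillis % BUCKET_INTERVAL_MS);
        Bucket bucket = buckets[(int) ((nowMillis / BUCKET_INTERVAL_MS) % BUCKETS_COUNT)];
        synchronized (bucket) {
            if (bucket.bucketTime != bucketTime) {  // out of date: overwrite and reset
                bucket.bucketTime = bucketTime;
                bucket.count.reset();
            }
        }
        bucket.count.increment();
    }

    // Sum buckets in the trailing window, excluding expired buckets and the
    // partial current-minute bucket.
    long recentInterval(long nowMillis) {
        long currentMinute = nowMillis - (nowMillis % BUCKET_INTERVAL_MS);
        long windowStart = currentMinute - (BUCKETS_COUNT - 1) * BUCKET_INTERVAL_MS;
        long sum = 0;
        for (Bucket b : buckets) {
            if (b.bucketTime >= windowStart && b.bucketTime < currentMinute) sum += b.count.sum();
        }
        return sum;
    }
}
```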

Update existing bucket happy case

Image
Increment Call

  1. Increments the total value
  2. Updates timestamp for time stamped stats
  3. Determine which time bucket to access
// Get current time
long now = System.currentTimeMillis();

// Round down to the start of the current minute
long currentBucketTime = now - (now % BUCKET_INTERVAL_MS);

// Determine bucket index (integer division)
int bucketIndex = (int) ((now / BUCKET_INTERVAL_MS) % BUCKETS_COUNT);

// Get bucket
Bucket bucket = buckets[bucketIndex];

// Check timestamp
if (bucket.getBucketTime() == currentBucketTime) {
    // Increment bucket
    bucket.increment();
}

Get recent interval call

  1. Get all buckets
  2. Sum all buckets where bucket time follows conditions:
    1. Bucket time >= 5 minutes ago (rounded down)
    2. Bucket time < 1 minute ago (rounded down)

Update new bucket

Before Update

Image

After Update

Image

Increment call is same as above, except

  • Determine which time bucket to access
    • Same as the above case, except:
    • If the bucket time is out of date, overwrite the bucket time with the current time and reset the counter
    • Then increment
  • getRecentInterval() call is the same as above

Updating latest event timestamp

What format to return timestamp?

  1. Option 1: Relative timestamp (preferred)
    1. e.g. “minutes since last event”
    2. Each stat on the node stores the most recent event timestamp locally
    3. When serialization call is made, the node uses its current system time to calculate a relative time (minutes_since_last_event) based on the most recent event
    4. Pros:
      1. Calculating node relative time minimizes impact of cluster time desync issues. Each node only compares to itself.
    5. Cons
      1. Timestamp will be relative to API call time rather than absolute
  2. Option 2: Unix timestamp
    1. Each stat on the node stores the most recent event unix timestamp
    2. When serialization call is made, that timestamp is returned as is
    3. Pros:
      1. Simple to return a single timestamp, absolute time value has more flexibility to be used by caller
    4. Cons
      1. Nodes can have clock drift and face time sync issues, requires additional cluster level clock sync to ensure system time on a node is consistent across cluster
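A sketch of Option 1, assuming each stat stores its last event time locally and converts it to a relative value only at serialization time (class and method names are illustrative):

```java
import java.util.concurrent.TimeUnit;

// Sketch: store the last event timestamp on the node; serialize a relative
// "minutes since last event" computed against the same node's clock, so
// cross-node clock drift cannot affect the value.
class LastEventTracker {
    private volatile long lastEventMillis = -1;

    void recordEvent(long nowMillis) {
        lastEventMillis = nowMillis;
    }

    // Computed at serialization time; a sentinel covers "no event yet".
    long minutesSinceLastEvent(long nowMillis) {
        if (lastEventMillis < 0) return Long.MAX_VALUE;
        return TimeUnit.MILLISECONDS.toMinutes(nowMillis - lastEventMillis);
    }
}
```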

State Stat Flow

Image

  1. On the coordinator node, an API request to get state stats is received
  2. Create a new map to store state stat info
  3. Call helper functions to get state stats
    1. Helper functions fetch info and update the map with calculated values
  4. ... more helper calls
  5. ... more helper calls
  6. ... more helper calls
  7. Map is formatted to match the user request
  8. Return the map with the info in the response
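The flow above can be sketched as a coordinator-side collector that passes one shared map through a list of helper functions. All names and the hard-coded value here are illustrative; in the real plugin each helper would inspect cluster state (e.g. search pipeline configurations) rather than return a constant:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical sketch of the state stat flow.
class StateStatsCollector {
    // A helper computes the state stats it owns and writes them into the map.
    static void addNormalizationPipelineCounts(Map<String, Object> stats) {
        // Stand-in for reading search pipeline configs from cluster state.
        stats.put("normalization_processors_in_pipelines_count", 3);
    }

    // Steps 2-8: build one map, invoke each helper in turn, return the result.
    static Map<String, Object> collect(List<Consumer<Map<String, Object>>> helpers) {
        Map<String, Object> stats = new LinkedHashMap<>();
        for (Consumer<Map<String, Object>> helper : helpers) {
            helper.accept(stats);
        }
        return stats;
    }
}
```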

Next steps

As mentioned above, our next step after the framework is implemented is to onboard existing and new feature use cases onto the system.

In the future, as we add more complex stat use cases such as plugin health and operational monitoring, we will also consider the following future enhancements:

  • Configurable time interval via cluster level setting
    • Custom time interval length (e.g. past 5 minutes, past 10 minutes, past 30 minutes)
    • Custom time bucket length (e.g. 1 minute granularity, 5 minute granularity)
    • Note: if this setting is changed at runtime, it would require resetting all event stats (reinitializing the time bucket array), leading to data loss.
  • State stat caching
    • If there are any expensive on-demand calculations to compute state stats in the future, we can implement a caching system to reduce the impact of repeated calls
  • Detailed operational statistics
    • More detailed information for cluster health diagnostics, such as processor