
[RFC] Neural Plugin Stats API #1196

Open
q-andy opened this issue Feb 25, 2025 · 0 comments · Fixed by #1208
Labels
v3.0.0
q-andy commented Feb 25, 2025

[RFC] Neural Search Stats API

Problem

The neural search plugin provides users with a wide variety of features to power full-text and semantic search, such as neural search, hybrid search, and a variety of other ML applications.

To understand how the neural plugin is being used, cluster managers may want additional insight on things like which processors have the most executions or which hybrid search techniques are being employed most frequently. However, the only way to gather this information currently is through the OpenSearch core node info API which lacks granular, plugin-specific information.

Proposal

In this RFC we propose creating a /stats API for neural search to provide the foundation to allow cluster managers to monitor adoption and operational info specific to the neural search plugin. This includes an optionally enabled backend to record and manage stat information, and an API to retrieve the information on demand. The purpose of this API is to provide high level information that cluster managers can use to observe usage trends over time.

This will be an opt-in feature. By default, stat collection will be disabled, and it can be configured via a cluster setting.

After the initial implementation is complete, our goal is to create documentation and drive new and existing features to onboard statistics they want to track onto this system. This may include:

Existing features

Upcoming features

Requirements

  1. Stats should track simple events and state inside the code
    1. Event stats: Counter stats that increase with code events. Single-valued, numeric, node-level counters.
      1. e.g. how many RRFProcessor executions have there been in the last 5 minutes?
      2. e.g. have there been RRFProcessor executions in the past 24 hours?
    2. State stats: Stats on existing cluster info that are determined on demand. Cluster level counters.
      1. e.g. How many NormalizationProcessors using arithmetic_mean does the cluster have in search pipelines?
  2. Recording stats should be declarative and easy to implement
    1. Define a stat name/path, call a method to increment it, and it should be available in the API automatically
  3. Stats should have minimal effect on performance
    1. Recording and storing stats should have minimal memory and CPU footprint
  4. Stats should have single unified API for retrieval
    1. A single API provides all stat info, eliminating the need to call multiple APIs and aggregate data.
    2. Stats can have “stat metadata” with more information about stat values, such as value type or last active event
    3. Users should be able to filter nodes or stat names, or return the response as flat JSON
  5. Cluster managers should have flexibility to configure stat collection for their clusters
    1. For users who don’t require stats, there should be an option to disable stats entirely to prevent additional resource consumption

High level flow

  • Event-based stats
    • Event stats are recorded in code at a node level (processor executions, documents ingested, etc)
    • When an API call is made, all node-level maps are fetched via transport action and returned in the response.
  • State stats
    • State stats are defined by helper functions that populate state stat values
    • When an API call is made, the functions are invoked and the information is added to the response on demand

Event information is collected as it occurs

Stats API call fetches and returns stats

Image
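The event-stat side of this flow can be sketched as a node-local map of lock-free counters that the transport action snapshots on demand. This is a minimal illustration, not the plugin's actual implementation; the class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical sketch: node-local event counters keyed by stat name.
class NodeEventStats {
    private final Map<String, LongAdder> counters = new ConcurrentHashMap<>();

    // Called from hot paths (e.g. processor executions); LongAdder keeps
    // increments cheap under contention.
    void increment(String statName) {
        counters.computeIfAbsent(statName, k -> new LongAdder()).increment();
    }

    // Called per node when the stats transport action runs: snapshot the
    // current values into a plain map for the response.
    Map<String, Long> snapshot() {
        Map<String, Long> out = new HashMap<>();
        counters.forEach((name, adder) -> out.put(name, adder.sum()));
        return out;
    }
}
```

The coordinator would then merge each node's snapshot under its node id in the response.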

API Model

API                      Method  Status  Mutating or Non-Mutating  Functionality
/_plugins/_neural/stats  GET     New     Non-Mutating              Retrieves stat counters from nodes and returns them in the response

Path Parameters

  • nodes: specify node ids to retrieve stats from (default all)
  • stats: specify stat names to retrieve (default all)

Query Parameters

  • include_metadata: boolean, include recent_interval/stat_type/minutes_since_last_event (default false)
  • flatten: boolean, flatten the JSON response (default false)

Example calls

GET /_plugins/_neural/stats
GET /_plugins/_neural/stats?include_metadata=false
GET /_plugins/_neural/<node_id>/stats/<stat_name>?include_metadata=true&flatten=true

Cluster level setting to disable Stats API/Collection

PUT /_cluster/settings
{
    "persistent" : {
        "plugins.neural_search.stats_enabled" : "false" // default false
    }
}

Get stats without metadata

GET /_plugins/_neural/stats/

Example Response:

{
  "cluster_name": "integTest",
  "cluster_version": "3.0.0",
  "nodes": {
    "node_1": {
      "normalization_processor": {
        "normalization_processor_executions": 100,
        "normalization_technique_l2_executions": 80,
        "normalization_technique_z_score_executions": 20
      }
    }
  },
  "all_nodes": { ... },
  "state": {
    "normalization_processor": {
      "normalization_processors_in_pipelines_count": 3,
      "normalization_technique_z_score_in_pipelines_count": 1,
      "normalization_technique_l2_in_pipelines_count": 2
    }
  }
}

Get stats with metadata

GET /_plugins/_neural/stats?include_metadata=true

Example Response:

{
    "cluster_name": "integTest",
    "cluster_version": "3.0.0",
    "stats_enabled": true,
    "nodes": {
        "node_1": {
            "processors": {
                "normalization": {
                    "normalization_processor_executions": {
                        "value": 100,
                        "type": "event_counter",
                        "trailing_interval_value": 9,
                        "minutes_since_last_event": 5
                    },
                    "normalization_technique_l2_executions": {
                        "value": 80,
                        "type": "event_counter",
                        "trailing_interval_value": 9,
                        "minutes_since_last_event": 5
                    },
                    "normalization_technique_z_score_executions": {
                        "value": 20,
                        "type": "event_counter",
                        "trailing_interval_value": 0,
                        "minutes_since_last_event": 9999
                    }
                }
            }
        }
    },
    "all_nodes": {
        // ...
    },
    "state": {
        "processors": {
            "normalization": {
                "normalization_processors_in_pipelines_count": {
                    "value": 3,
                    "type": "state_counter"
                },
                "normalization_technique_z_score_in_pipelines_count": {
                    "value": 1,
                    "type": "state_counter"
                },
                "normalization_technique_l2_in_pipelines_count": {
                    "value": 2,
                    "type": "state_counter"
                }
            }
        }
    }
}

Low Level Design

  • StatName enums are used to uniquely identify stats. Each stat name maps directly to a single stat
    • Stat Name - uniquely identifies the stat; used for filtering in user requests
    • Response Path - the location in the JSON response where the stat value will be placed
    • StatType is another enum used by StatNames to define what kind of stat it is (e.g. event counter, state counter)
  • Stat objects store information about a stat, including the stat value itself and its metadata

Image

// Declare enums
enum EventStatName {
   NORMALIZATION_PROCESSOR_EXECUTIONS(
        "normalization_processor_executions", // stat name
        "processors.normalization", // response path
        EventStatType.TIMESTAMPED_EVENT_COUNTER)
   ;
    
   // ...
}

enum StateStatName {
   NORMALIZATION_PROCESSORS_IN_PIPELINES(
        "normalization_processors_in_pipelines", 
        "processors.normalization",
        StateStatType.STATE_COUNTER)
   ;
    
   // ...
}

Response

{
    // Event stats are calculated on node level
    "nodes": {
        "data_node_1_asdjfalksdjfjasf": {
            "processors": {
                "normalization": {
                    "normalization_processor_executions": <value>
                }
            }
        }
    },
    // State stats are calculated on cluster level
    "state": {
        "processors": {
            "normalization": {
                "normalization_processors_in_pipelines": <value>
            }
        }
    }
}

Event Stats

Image

Stat Model

  • Value: the cumulative count of the recorded event since nodes started
    • Storing the total allows the caller to manage delta calculations on their own
    • This is the value format used by other OpenSearch stat APIs
  • Type: a type hint of the kind of value stored, used to help the Auto BR policy parse the response dynamically
  • Recent trailing time interval: A trailing count of the event in the past 5 minutes
    • This value will not be a perfectly accurate count. The goal is to have a window into the overall trend of event statistics in a rolling time window, which can be used to observe broader usage patterns.
  • Last event time: a timestamp of the last recorded event

State Stats

Image

  • Calculated on demand at call time, so less stat metadata is needed

Event Stat Flow

Overall increment flow

Image

Recording Stats

  1. Static method increment() is called using a StatName as a parameter
  2. Finds the Stat by StatName in the map
  3. Calls eventStat.increment()
    1. The Stat object handles updating the count, the time interval, and the timestamp
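A minimal sketch of this increment path, assuming an enum-keyed map of counters; `DemoEventStatName` and `DemoEventStatsManager` are hypothetical stand-ins for the real StatName enum and EventStatsManager:

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.concurrent.atomic.LongAdder;

enum DemoEventStatName { NORMALIZATION_PROCESSOR_EXECUTIONS }

final class DemoEventStatsManager {
    // One counter per stat name, pre-populated so increments are a plain lookup.
    private static final Map<DemoEventStatName, LongAdder> STATS = new EnumMap<>(DemoEventStatName.class);
    static {
        for (DemoEventStatName name : DemoEventStatName.values()) STATS.put(name, new LongAdder());
    }

    // Steps 1-2: static call, find the stat by its StatName in the map.
    static void increment(DemoEventStatName name) {
        // Step 3: delegate to the stat object, which updates its own count.
        STATS.get(name).increment();
    }

    static long value(DemoEventStatName name) {
        return STATS.get(name).sum();
    }

    private DemoEventStatsManager() {}
}
```

Call sites then reduce to a single line, e.g. `DemoEventStatsManager.increment(DemoEventStatName.NORMALIZATION_PROCESSOR_EXECUTIONS);`.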

Getting Stats

  1. After API call, transport action is sent to each node
  2. On each node, EventStatsManager.getStatsData() is called to get the stat data based on the current store
  3. This map is returned from each node

Tracking recent time interval

High level diagram:

Image

Implementation
(see below for example)

  1. Option 1: Array with time buckets approach (preferred)
    1. Hold a fixed length array with all buckets labelled by time (e.g. for 5 minutes, have a 6 element array, 5 past minutes + 1 current minute)
    2. Current bucket to increment is determined by system time in minutes % number of buckets
      1. At increment time, “current” bucket is accessed, if up to date, it’s incremented
      2. If it’s out of date, it will be reset with the new timestamp and overwritten
    3. To get values, iterate through all buckets, exclude expired buckets, and sum
    4. Pros:
      1. Constant time reads/writes
      2. Bucket rotation is determined at increment time, if there are no increments, there is no performance overhead for events
    5. Cons
      1. Difficult to extend with configurability, need to reinitialize fixed length arrays
  2. Option 2: Scheduled rotating queue approach
    1. Current bucket to increment is maintained by atomic reference
    2. Hold a queue store for past buckets
    3. At scheduled time intervals the current bucket is automatically rotated into the queue and the last bucket is popped
    4. To get values, iterate through all buckets and sum
    5. Pros:
      1. Simpler mental model
      2. Easier to add configurability for interval size
    6. Cons:
      1. Need to manage a scheduling system, which leads to time desync issues
        1. Cannot guarantee perfectly time-aligned scheduled executions, which can lead to bucket time drift
        2. Performance overhead for buckets that do not need rotation, e.g. if there is a stat with no events, the scheduled executions would still run
      2. Need to manage atomic reference to current bucket

The following design will assume using option 1. Each event stat manages its own set of buckets.
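Option 1 can be sketched end to end as follows. This is an illustrative implementation under stated assumptions: `Bucket`, `BUCKET_INTERVAL_MS`, and `BUCKETS_COUNT` are names reused from the pseudocode below, time is passed in explicitly for testability, and the simple `synchronized` guard stands in for whatever concurrency control the real implementation uses:

```java
import java.util.concurrent.atomic.LongAdder;

// Sketch of Option 1: fixed-length array of time-labelled buckets,
// rotated lazily at increment time.
class TimeBucketedCounter {
    static final long BUCKET_INTERVAL_MS = 60_000L; // 1-minute buckets
    static final int BUCKETS_COUNT = 6;             // 5 past minutes + 1 current minute

    private static final class Bucket {
        volatile long bucketTime;                   // start of the minute this bucket covers
        final LongAdder count = new LongAdder();
    }

    private final Bucket[] buckets = new Bucket[BUCKETS_COUNT];

    TimeBucketedCounter() {
        for (int i = 0; i < BUCKETS_COUNT; i++) buckets[i] = new Bucket();
    }

    // Rotation happens at increment time: a stale bucket is reset and reused,
    // so idle stats incur no background work.
    void increment(long nowMillis) {
        long bucketTime = nowMillis - (nowMillis % BUCKET_INTERVAL_MS);
        Bucket bucket = buckets[(int) ((nowMillis / BUCKET_INTERVAL_MS) % BUCKETS_COUNT)];
        synchronized (bucket) {
            if (bucket.bucketTime != bucketTime) {  // out of date: overwrite and reset
                bucket.bucketTime = bucketTime;
                bucket.count.reset();
            }
        }
        bucket.count.increment();
    }

    // Sum buckets in the trailing window, excluding expired buckets and the
    // partial current-minute bucket.
    long recentInterval(long nowMillis) {
        long currentMinute = nowMillis - (nowMillis % BUCKET_INTERVAL_MS);
        long windowStart = currentMinute - (BUCKETS_COUNT - 1) * BUCKET_INTERVAL_MS;
        long sum = 0;
        for (Bucket b : buckets) {
            if (b.bucketTime >= windowStart && b.bucketTime < currentMinute) sum += b.count.sum();
        }
        return sum;
    }
}
```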

Update existing bucket happy case

Image
Increment Call

  1. Increments the total value
  2. Updates timestamp for time stamped stats
  3. Determine which time bucket to access
// Get current time
long now = System.currentTimeMillis();

// Round down to the start of the current minute
long currentBucketTime = now - (now % BUCKET_INTERVAL_MS);

// Determine bucket index (integer division)
int bucketIndex = (int) ((now / BUCKET_INTERVAL_MS) % BUCKETS_COUNT);

// Get bucket
Bucket bucket = buckets[bucketIndex];

// Check timestamp
if (bucket.getBucketTime() == currentBucketTime) {
    // Increment bucket
    bucket.increment();
}

Get recent interval call

  1. Get all buckets
  2. Sum all buckets where bucket time follows conditions:
    1. Bucket time >= 5 minutes ago (rounded down)
    2. Bucket time < 1 minute ago (rounded down)

Update new bucket

Before Update

Image

After Update

Image

Increment call is same as above, except

  • Determine which time bucket to access
    • Same as the above case, except:
    • If the bucket time is out of date, overwrite the bucket time with the current time and reset the counter
    • Then increment
  • getRecentInterval() call is the same as above

Updating latest event timestamp

What format to return timestamp?

  1. Option 1: Relative timestamp (preferred)
    1. e.g. “minutes since last event”
    2. Each stat on the node stores the most recent event timestamp locally
    3. When serialization call is made, the node uses its current system time to calculate a relative time (minutes_since_last_event) based on the most recent event
    4. Pros:
      1. Calculating node relative time minimizes impact of cluster time desync issues. Each node only compares to itself.
    5. Cons
      1. Timestamp will be relative to API call time rather than absolute
  2. Option 2: Unix timestamp
    1. Each stat on the node stores the most recent event unix timestamp
    2. When serialization call is made, that timestamp is returned as is
    3. Pros:
      1. Simple to return a single timestamp, absolute time value has more flexibility to be used by caller
    4. Cons
      1. Nodes can have clock drift and face time sync issues, requires additional cluster level clock sync to ensure system time on a node is consistent across cluster
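A sketch of Option 1, assuming each stat stores its last event time locally and converts it to a relative value only at serialization time (class and method names are illustrative):

```java
import java.util.concurrent.TimeUnit;

// Sketch: store the last event timestamp on the node; serialize a relative
// "minutes since last event" computed against the same node's clock, so
// cross-node clock drift cannot affect the value.
class LastEventTracker {
    private volatile long lastEventMillis = -1;

    void recordEvent(long nowMillis) {
        lastEventMillis = nowMillis;
    }

    // Computed at serialization time; a sentinel covers "no event yet".
    long minutesSinceLastEvent(long nowMillis) {
        if (lastEventMillis < 0) return Long.MAX_VALUE;
        return TimeUnit.MILLISECONDS.toMinutes(nowMillis - lastEventMillis);
    }
}
```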

State Stat Flow

Image

  1. On the coordinator node, an API request to get state stats is received
  2. Create a new map to store state stat info
  3. Call helper functions to get state stats
    1. Helper functions fetch info and update the map with calculated values
  4. ... more helper calls
  5. ... more helper calls
  6. ... more helper calls
  7. Map is formatted to match the user request
  8. Return the map with the info in the response
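The flow above can be sketched as a coordinator-side collector that passes one shared map through a list of helper functions. All names and the hard-coded value here are illustrative; in the real plugin each helper would inspect cluster state (e.g. search pipeline configurations) rather than return a constant:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical sketch of the state stat flow.
class StateStatsCollector {
    // A helper computes the state stats it owns and writes them into the map.
    static void addNormalizationPipelineCounts(Map<String, Object> stats) {
        // Stand-in for reading search pipeline configs from cluster state.
        stats.put("normalization_processors_in_pipelines_count", 3);
    }

    // Steps 2-8: build one map, invoke each helper in turn, return the result.
    static Map<String, Object> collect(List<Consumer<Map<String, Object>>> helpers) {
        Map<String, Object> stats = new LinkedHashMap<>();
        for (Consumer<Map<String, Object>> helper : helpers) {
            helper.accept(stats);
        }
        return stats;
    }
}
```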

Next steps

As mentioned above, our next step after the framework is implemented is to onboard existing and new feature use cases onto the system.

In the future, as we add more complex stat use cases such as plugin health and operational monitoring, we will also consider the following future enhancements:

  • Configurable time interval via cluster level setting
    • Custom time interval length (e.g. past 5 minutes, past 10 minutes, past 30 minutes)
    • Custom time bucket length (e.g. 1 minute granularity, 5 minute granularity)
    • Note: if this setting is changed at runtime, it would require resetting all event stats (reinitializing the time bucket array), leading to data loss.
  • State stat caching
    • If there are any expensive on-demand calculations to compute state stats in the future, we can implement a caching system to reduce the impact of repeated calls
  • Detailed operational statistics
    • More detailed information for cluster health diagnostics, such as processor