Documentation for the Log-Pattern-Analysis Tool and the Data-Distribution Tool #11038
Merged: natebower merged 11 commits into opensearch-project:main from PauiC:logPatternAnalysis-dataDistribution on Oct 1, 2025
Changes from 3 commits
Commits
- baf0385 add logPatternAnalysis and dataDistribution (PauiC)
- 0850dea fix time format (PauiC)
- 056c41e fix:display error (PauiC)
- 35fbdb9 Doc review (kolchfa-aws)
- b4c1faf More rewrites (kolchfa-aws)
- f77508e add:limitation (PauiC)
- f56bb96 fix: log insights example (PauiC)
- 81730d4 fix: timeField is required (PauiC)
- 1a7b720 fix: timeField (PauiC)
- 9f41bdc fix: delete baseline (PauiC)
- 1b81f4b Apply suggestions from code review (natebower)
208 changes: 208 additions & 0 deletions
_ml-commons-plugin/agents-tools/tools/data-distribution-tool.md
@@ -0,0 +1,208 @@
---
layout: default
title: Data Distribution tool
has_children: false
has_toc: false
nav_order: 39
parent: Tools
grand_parent: Agents and tools
---

<!-- vale off -->
# Data Distribution tool
**Introduced 3.3.0**
{: .label .label-purple }
<!-- vale on -->

The `DataDistributionTool` analyzes data distribution patterns within datasets and compares distributions between different time periods. It supports both single dataset analysis and comparative analysis to identify significant changes in field value distributions, helping detect anomalies, trends, and data quality issues.

The tool supports both [query domain-specific language (DSL)]({{site.url}}{{site.baseurl}}/query-dsl/) and [Piped Processing Language (PPL)]({{site.url}}{{site.baseurl}}/search-plugins/sql/ppl/index/) queries for flexible data retrieval and filtering.

## Analysis modes

The tool automatically selects the appropriate analysis mode based on the provided parameters:

- **Comparative analysis**: When both the baseline and selection time ranges are provided, the tool compares field distributions between the two periods to identify significant changes and divergences.
- **Single dataset analysis**: When only the selection time range is provided, the tool analyzes distribution patterns within the dataset to provide insights into field value frequencies and characteristics.

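The mode selection rule described in this section can be sketched in a few lines of Python. This is a hypothetical illustration, not the tool's internal code: the function name `select_analysis_mode` is invented here, and the sketch assumes that a baseline counts only when both of its bounds are present.

```python
from typing import Mapping

SELECTION_KEYS = ("selectionTimeRangeStart", "selectionTimeRangeEnd")
BASELINE_KEYS = ("baselineTimeRangeStart", "baselineTimeRangeEnd")


def select_analysis_mode(parameters: Mapping[str, object]) -> str:
    """Return "comparative" or "single" based on which time ranges are set.

    Hypothetical helper that mirrors the documented rule only.
    """
    missing = [key for key in SELECTION_KEYS if not parameters.get(key)]
    if missing:
        raise ValueError(f"missing selection time range parameters: {missing}")
    # Assumption: a baseline is used only when both of its bounds are present.
    has_baseline = all(parameters.get(key) for key in BASELINE_KEYS)
    return "comparative" if has_baseline else "single"
```

How the tool treats a request that supplies only one baseline bound is not stated here, so the fallback to single dataset analysis in this sketch is a guess.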
## Step 1: Register a flow agent that will run the DataDistributionTool

A flow agent runs a sequence of tools in order, returning the last tool's output. To create a flow agent, send the following register agent request:

```json
POST /_plugins/_ml/agents/_register
{
  "name": "Test_Agent_For_Data_Distribution_Tool",
  "type": "flow",
  "description": "this is a test agent for the DataDistributionTool",
  "memory": {
    "type": "demo"
  },
  "tools": [
    {
      "type": "DataDistributionTool",
      "parameters": {}
    }
  ]
}
```
{% include copy-curl.html %}

For parameter descriptions, see [Register parameters](#register-parameters).

OpenSearch responds with an agent ID:

```json
{
  "agent_id": "OQutgJYBAc35E4_KvI1q"
}
```

## Step 2: Run the agent

### Comparative analysis example

Run the agent for comparative distribution analysis between two time periods:

```json
POST /_plugins/_ml/agents/OQutgJYBAc35E4_KvI1q/_execute
{
  "parameters": {
    "index": "logs-2025.01.15",
    "timeField": "@timestamp",
    "selectionTimeRangeStart": "2025-01-15 10:00:00",
    "selectionTimeRangeEnd": "2025-01-15 11:00:00",
    "baselineTimeRangeStart": "2025-01-15 08:00:00",
    "baselineTimeRangeEnd": "2025-01-15 09:00:00",
    "size": 1000,
    "queryType": "dsl",
    "filter": ["{\"term\": {\"status\": \"error\"}}", "{\"range\": {\"response_time\": {\"gte\": 100}}}"]
  }
}
```
{% include copy-curl.html %}

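Because each `filter` entry is a JSON string nested inside the JSON request body, escaping the quotation marks by hand is error prone. The following Python sketch builds the same request body from plain dictionaries. The helper name `build_execute_body` and its arguments are hypothetical and are not part of any OpenSearch client.

```python
import json


def build_execute_body(index, time_field, selection, baseline=None, filters=None, size=1000):
    """Build the JSON body for the agent _execute request (hypothetical helper).

    `selection` and `baseline` are (start, end) tuples of date strings;
    `filters` is a list of query DSL clauses given as dictionaries.
    """
    parameters = {
        "index": index,
        "timeField": time_field,
        "selectionTimeRangeStart": selection[0],
        "selectionTimeRangeEnd": selection[1],
        "size": size,
        "queryType": "dsl",
    }
    if baseline is not None:
        parameters["baselineTimeRangeStart"] = baseline[0]
        parameters["baselineTimeRangeEnd"] = baseline[1]
    if filters:
        # Serialize each clause separately so the array holds JSON strings.
        parameters["filter"] = [json.dumps(clause) for clause in filters]
    return {"parameters": parameters}


body = build_execute_body(
    index="logs-2025.01.15",
    time_field="@timestamp",
    selection=("2025-01-15 10:00:00", "2025-01-15 11:00:00"),
    baseline=("2025-01-15 08:00:00", "2025-01-15 09:00:00"),
    filters=[
        {"term": {"status": "error"}},
        {"range": {"response_time": {"gte": 100}}},
    ],
)
# Serializing the whole body adds the outer layer of escaping.
print(json.dumps(body, indent=2))
```

The same approach should apply to a `dsl` value, which is a single JSON string rather than an array of strings.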
### Single dataset analysis example

Run the agent for single dataset distribution analysis:

```json
POST /_plugins/_ml/agents/OQutgJYBAc35E4_KvI1q/_execute
{
  "parameters": {
    "index": "application_logs",
    "timeField": "@timestamp",
    "selectionTimeRangeStart": "2025-01-15 10:00:00",
    "selectionTimeRangeEnd": "2025-01-15 11:00:00",
    "size": 1000,
    "queryType": "dsl"
  }
}
```
{% include copy-curl.html %}

### PPL query example

Run the agent using PPL for data retrieval:

```json
POST /_plugins/_ml/agents/OQutgJYBAc35E4_KvI1q/_execute
{
  "parameters": {
    "index": "logs-2025.01.15",
    "timeField": "@timestamp",
    "selectionTimeRangeStart": "2025-01-15 10:00:00",
    "selectionTimeRangeEnd": "2025-01-15 11:00:00",
    "size": 1000,
    "queryType": "ppl",
    "ppl": "source=logs-2025.01.15 | where status='error'"
  }
}
```
{% include copy-curl.html %}

### Custom DSL query example

Run the agent with a complete custom DSL query:

```json
POST /_plugins/_ml/agents/OQutgJYBAc35E4_KvI1q/_execute
{
  "parameters": {
    "index": "logs-2025.01.15",
    "timeField": "@timestamp",
    "selectionTimeRangeStart": "2025-01-15 10:00:00",
    "selectionTimeRangeEnd": "2025-01-15 11:00:00",
    "size": 1000,
    "queryType": "dsl",
    "dsl": "{\"bool\": {\"must\": [{\"term\": {\"status\": \"error\"}}], \"filter\": [{\"range\": {\"response_time\": {\"gte\": 100}}}]}}"
  }
}
```
{% include copy-curl.html %}

## Response examples

### Comparative analysis response

OpenSearch returns a field-by-field comparison showing how value distributions changed between the two time periods:

```json
{
  "inference_results": [
    {
      "output": [
        {
          "name": "response",
          "result": "{\"comparisonAnalysis\": [{\"field\": \"status\", \"divergence\": 0.2, \"topChanges\": [{\"value\": \"error\", \"selectionPercentage\": 0.3, \"baselinePercentage\": 0.1}, {\"value\": \"success\", \"selectionPercentage\": 0.7, \"baselinePercentage\": 0.9}]}]}"
        }
      ]
    }
  ]
}
```

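In the example response, `result` is itself a JSON-encoded string, so a client needs to decode it a second time before reading the analysis. The following Python sketch parses the example response and lists the difference between the selection and baseline shares for each value. The function name `summarize_changes` is hypothetical, and the sketch assumes the response has exactly the shape shown in the example.

```python
import json

# The comparative analysis response from the example, as a Python dictionary.
response = {
    "inference_results": [
        {
            "output": [
                {
                    "name": "response",
                    "result": (
                        '{"comparisonAnalysis": [{"field": "status", "divergence": 0.2, '
                        '"topChanges": ['
                        '{"value": "error", "selectionPercentage": 0.3, "baselinePercentage": 0.1}, '
                        '{"value": "success", "selectionPercentage": 0.7, "baselinePercentage": 0.9}]}]}'
                    ),
                }
            ]
        }
    ]
}


def summarize_changes(response: dict) -> list:
    """Return (field, value, change) tuples, where change = selection - baseline."""
    rows = []
    for inference in response["inference_results"]:
        for output in inference["output"]:
            if output["name"] != "response":
                continue
            # Second decode: the result is a JSON string, not a nested object.
            analysis = json.loads(output["result"])
            for field in analysis.get("comparisonAnalysis", []):
                for change in field["topChanges"]:
                    delta = change["selectionPercentage"] - change["baselinePercentage"]
                    rows.append((field["field"], change["value"], round(delta, 4)))
    return rows


for field, value, delta in summarize_changes(response):
    print(f"{field}={value}: {delta:+.2f}")
```

For the example data, the share of `error` values rises by 0.2 and the share of `success` values falls by 0.2 between the baseline and the selection.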
### Single dataset analysis response

OpenSearch returns distribution patterns for the analyzed dataset:

```json
{
  "inference_results": [
    {
      "output": [
        {
          "name": "response",
          "result": "{\"singleAnalysis\": [{\"field\": \"status\", \"divergence\": 0.7, \"topChanges\": [{\"value\": \"error\", \"selectionPercentage\": 0.3, \"baselinePercentage\": 0.0}, {\"value\": \"success\", \"selectionPercentage\": 0.7, \"baselinePercentage\": 0.0}]}]}"
        }
      ]
    }
  ]
}
```

## Register parameters

The `DataDistributionTool` requires no parameters at registration. The tool validates its parameters dynamically at execution time.

## Execute parameters

The following table lists the available tool parameters for running the agent.

| Parameter | Type | Required/Optional | Description |
|:----------|:-----|:------------------|:------------|
| `index` | String | Required | The name of the OpenSearch index containing the data to analyze. |
| `timeField` | String | Optional | The date/time field used for time-based filtering. Default is `@timestamp`. |
| `selectionTimeRangeStart` | String | Required | The start time of the analysis period, as a UTC date string (for example, `2025-01-15 10:00:00`). |
| `selectionTimeRangeEnd` | String | Required | The end time of the analysis period, as a UTC date string (for example, `2025-01-15 11:00:00`). |
| `baselineTimeRangeStart` | String | Optional | The start time of the baseline comparison period. Required for comparative analysis mode. |
| `baselineTimeRangeEnd` | String | Optional | The end time of the baseline comparison period. Required for comparative analysis mode. |
| `size` | Integer | Optional | The maximum number of documents to analyze. Default is `1000`. Maximum is `10000`. |
| `queryType` | String | Optional | The query type. Valid values are `ppl` and `dsl`. Default is `dsl`. |
| `filter` | Array | Optional | Additional DSL query conditions for filtering, provided as JSON strings (for example, `["{\"term\": {\"status\": \"error\"}}", "{\"range\": {\"level\": {\"gte\": 3}}}"]`). |
| `dsl` | String | Optional | A complete raw DSL query provided as a JSON string. Takes precedence over the `filter` parameter when provided. |
| `ppl` | String | Optional | A complete PPL statement without time information. Used when `queryType` is `ppl`. |
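
The constraints in the preceding table can be checked on the client before a request is sent. The following Python sketch is a hypothetical pre-flight check, not the tool's own validation. It assumes that time values use the `YYYY-MM-DD HH:MM:SS` format shown in the examples and that `size` must be at least 1, neither of which is stated explicitly in the table.

```python
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumption: format taken from the examples
MAX_SIZE = 10000
REQUIRED_KEYS = ("index", "selectionTimeRangeStart", "selectionTimeRangeEnd")
TIME_KEYS = (
    "selectionTimeRangeStart",
    "selectionTimeRangeEnd",
    "baselineTimeRangeStart",
    "baselineTimeRangeEnd",
)


def check_execute_parameters(parameters: dict) -> list:
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    for key in REQUIRED_KEYS:
        if not parameters.get(key):
            problems.append(f"{key} is required")
    for key in TIME_KEYS:
        value = parameters.get(key)
        if not value:
            continue  # absent values are handled by the required check above
        try:
            datetime.strptime(value, TIME_FORMAT)
        except (TypeError, ValueError):
            problems.append(f"{key} does not match {TIME_FORMAT}")
    size = parameters.get("size", 1000)
    # Assumption: the lower bound of 1 is not documented.
    if isinstance(size, bool) or not isinstance(size, int) or not 1 <= size <= MAX_SIZE:
        problems.append(f"size must be an integer between 1 and {MAX_SIZE}")
    if parameters.get("queryType", "dsl") not in ("dsl", "ppl"):
        problems.append("queryType must be 'dsl' or 'ppl'")
    return problems
```

Returning a list of problems instead of raising on the first one lets a caller report every issue in a single pass.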