diff --git a/_data-prepper/pipelines/configuration/processors/grok.md b/_data-prepper/pipelines/configuration/processors/grok.md
index f54ad09c921..b774d612722 100644
--- a/_data-prepper/pipelines/configuration/processors/grok.md
+++ b/_data-prepper/pipelines/configuration/processors/grok.md
@@ -24,7 +24,7 @@ This table is autogenerated. Do not edit it.
 Option | Required | Type | Description
 :--- | :--- |:--- | :---
 `break_on_match` | No | Boolean | Specifies whether to match all patterns (`true`) or stop once the first successful match is found (`false`). Default is `true`.
-`grok_when` | No | String | Specifies under what condition the `grok` processor should perform matching. Default is no condition.
+`grok_when` | No | String | Specifies the condition under which the `grok` processor should perform matching. For information about the conditional expression syntax, see [Expression syntax]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/expression-syntax/). Default is no condition.
 `keep_empty_captures` | No | Boolean | Enables the preservation of `null` captures from the processed output. Default is `false`.
 `keys_to_overwrite` | No | List | Specifies which existing keys will be overwritten if there is a capture with the same key value. Default is `[]`.
 `match` | No | Map | Specifies which keys should match specific patterns. Default is an empty response body.
@@ -36,26 +36,6 @@ Option | Required | Type | Description
 `timeout_millis` | No | Integer | The maximum amount of time during which matching occurs. Setting to `0` prevents any matching from occurring. Default is `30,000`.
 `performance_metadata` | No | Boolean | Whether or not to add the performance metadata to events. Default is `false`. For more information, see [Grok performance metadata](#grok-performance-metadata).
 
-
-## Conditional grok
-
-The `grok` processor can be configured to run conditionally by using the `grok_when` option. The following is an example Grok processor configuration that uses `grok_when`:
-
-```
-processor:
-  - grok:
-      grok_when: '/type == "ipv4"'
-      match:
-        message: ['%{IPV4:clientip} %{WORD:request} %{POSINT:bytes}']
-  - grok:
-      grok_when: '/type == "ipv6"'
-      match:
-        message: ['%{IPV6:clientip} %{WORD:request} %{POSINT:bytes}']
-```
-{% include copy.html %}
-
-The `grok_when` option can take a conditional expression. This expression is detailed in the [Expression syntax]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/expression-syntax/) documentation.
-
 ## Grok performance metadata
 
 When the `performance_metadata` option is set to `true`, the `grok` processor adds the following metadata keys to each event:
@@ -67,11 +47,15 @@ To include Grok performance metadata when the event is sent to the sink inside t
 ```yaml
-processor:
+  processor:
     - grok:
         performance_metadata: true
         match:
-          log: "%{COMMONAPACHELOG"}
+          log: ["%{COMMONAPACHELOG}"]
+        break_on_match: true
+        named_captures_only: true
+        target_key: "parsed"
+
     - add_entries:
         entries:
           - add_when: 'getMetadata("_total_grok_patterns_attempted") != null'
@@ -82,6 +66,398 @@ processor:
           value_expression: 'getMetadata("_total_grok_processing_time")'
 ```
 
+## Examples
+
+The following examples demonstrate different ways to configure the `grok` processor.
+
+The examples don't use security and are for demonstration purposes only. We strongly recommend configuring SSL before using these examples in production.
+{: .warning}
+
+### Parse Apache access logs
+
+This example parses standard Apache HTTP access logs and extracts the client IP address, timestamp, HTTP method, URL, status code, and response size:
+
+```yaml
+apache-access-logs-pipeline:
+  source:
+    http:
+      path: /logs
+      ssl: false
+
+  processor:
+    - grok:
+        match:
+          message: ['%{COMBINEDAPACHELOG}']
+        break_on_match: true
+        named_captures_only: true
+        keep_empty_captures: false
+        target_key: "parsed"
+
+    - date:
+        match:
+          - key: "/parsed/timestamp"  # JSON pointer to the timestamp captured by the grok processor
+            patterns: ["dd/MMM/yyyy:HH:mm:ss Z"]
+        destination: "@timestamp"
+        source_timezone: "UTC"
+
+  sink:
+    - opensearch:
+        hosts: ["https://opensearch:9200"]
+        insecure: true
+        username: admin
+        password: "admin_pass"
+        index_type: custom
+        index: "apache-logs-%{yyyy.MM.dd}"
+```
+{% include copy.html %}
+
+You can test this pipeline using the following command:
+
+```bash
+curl -sS -X POST "http://localhost:2021/logs" \
+  -H "Content-Type: application/json" \
+  -d '[
+    {"message":"127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] \"GET /apache.gif HTTP/1.0\" 200 2326 \"http://www.example.com/start.html\" \"Mozilla/4.08 [en] (Win98; I ;Nav)\""},
+    {"message":"192.168.1.5 - - [13/Oct/2025:17:42:10 +0000] \"POST /login HTTP/1.1\" 302 512 \"-\" \"curl/8.5.0\""}
+  ]'
+```
+{% include copy.html %}
+
+The documents stored in OpenSearch contain the following information:
+
+```json
+{
+  ...
+  "hits": {
+    "total": {
+      "value": 2,
+      "relation": "eq"
+    },
+    "max_score": 1,
+    "hits": [
+      {
+        "_index": "apache-logs-2025.10.13",
+        "_id": "gLO73pkBpMIC6s6zUMMX",
+        "_score": 1,
+        "_source": {
+          "message": "127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] \"GET /apache.gif HTTP/1.0\" 200 2326 \"http://www.example.com/start.html\" \"Mozilla/4.08 [en] (Win98; I ;Nav)\"",
+          "parsed": {
+            "request": "/apache.gif",
+            "referrer": "http://www.example.com/start.html",
+            "agent": "Mozilla/4.08 [en] (Win98; I ;Nav)",
+            "auth": "frank",
+            "ident": "-",
+            "response": "200",
+            "bytes": "2326",
+            "clientip": "127.0.0.1",
+            "verb": "GET",
+            "httpversion": "1.0",
+            "timestamp": "10/Oct/2000:13:55:36 -0700"
+          },
+          "@timestamp": "2000-10-10T20:55:36.000Z"
+        }
+      },
+      {
+        "_index": "apache-logs-2025.10.13",
+        "_id": "gbO73pkBpMIC6s6zUMMX",
+        "_score": 1,
+        "_source": {
+          "message": "192.168.1.5 - - [13/Oct/2025:17:42:10 +0000] \"POST /login HTTP/1.1\" 302 512 \"-\" \"curl/8.5.0\"",
+          "parsed": {
+            "request": "/login",
+            "referrer": "-",
+            "agent": "curl/8.5.0",
+            "auth": "-",
+            "ident": "-",
+            "response": "302",
+            "bytes": "512",
+            "clientip": "192.168.1.5",
+            "verb": "POST",
+            "httpversion": "1.1",
+            "timestamp": "13/Oct/2025:17:42:10 +0000"
+          },
+          "@timestamp": "2025-10-13T17:42:10.000Z"
+        }
+      }
+    ]
+  }
+}
+```
+
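+To verify what was indexed, you can query the target index directly. The following command assumes that the OpenSearch cluster referenced in the `sink` configuration is reachable at `https://localhost:9200`; adjust the host and credentials to match your environment:
+
+```bash
+curl -k -u admin:admin_pass "https://localhost:9200/apache-logs-*/_search?pretty"
+```
+{% include copy.html %}
+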
["https://opensearch:9200"] + insecure: true + username: admin + password: "admin_pass" + index_type: custom + index: "application-logs-%{yyyy.MM.dd}" +``` +{% include copy.html %} + +You can test this pipeline using the following command: + +```bash +curl -sS -X POST "http://localhost:2021/logs" \ + -H "Content-Type: application/json" \ + -d '[ + {"message": "2025-10-13 14:30:45 [INFO] UserService - User login successful"}, + {"message": "2025-10-13 14:31:15 [ERROR] DatabaseConnection - Connection timeout"}, + {"message": "2025-10-13 14:32:30 [DEBUG] CacheManager - Cache hit"}, + {"message": "2025-10-13 14:33:05 [WARN] MetricsCollector - High memory usage detected"} + ]' +``` +{% include copy.html %} + +The documents stored in OpenSearch contain the following information: + +```json +{ + ... + "hits": { + "total": { + "value": 4, + "relation": "eq" + }, + "max_score": 1, + "hits": [ + { + "_index": "application-logs-2025.10.13", + "_id": "i7O83pkBpMIC6s6zhsNK", + "_score": 1, + "_source": { + "message": "2025-10-13 14:30:45 [INFO] UserService - User login successful", + "parsed": { + "component": "UserService", + "level": "INFO", + "details": "User login successful", + "timestamp": "2025-10-13 14:30:45" + }, + "@timestamp": "2025-10-13T14:30:45.000Z" + } + }, + { + "_index": "application-logs-2025.10.13", + "_id": "jLO83pkBpMIC6s6zhsNK", + "_score": 1, + "_source": { + "message": "2025-10-13 14:31:15 [ERROR] DatabaseConnection - Connection timeout", + "parsed": { + "component": "DatabaseConnection", + "level": "ERROR", + "details": "Connection timeout", + "timestamp": "2025-10-13 14:31:15" + }, + "@timestamp": "2025-10-13T14:31:15.000Z" + } + }, + { + "_index": "application-logs-2025.10.13", + "_id": "jbO83pkBpMIC6s6zhsNK", + "_score": 1, + "_source": { + "message": "2025-10-13 14:32:30 [DEBUG] CacheManager - Cache hit", + "parsed": { + "component": "CacheManager", + "level": "DEBUG", + "details": "Cache hit", + "timestamp": "2025-10-13 14:32:30" + }, + "@timestamp": "2025-10-13T14:32:30.000Z" + } + }, + { + "_index": "application-logs-2025.10.13", + "_id": "jrO83pkBpMIC6s6zhsNK", + "_score": 1, + "_source": { + "message": "2025-10-13 14:33:05 [WARN] MetricsCollector - High memory usage detected", + "parsed": { + "component": "MetricsCollector", + "level": "WARN", + "details": "High memory usage detected", + "timestamp": "2025-10-13 14:33:05" + }, + "@timestamp": "2025-10-13T14:33:05.000Z" + } + } + ] + } +} +``` + +### Parse network device logs with multiple patterns + +This example demonstrates using multiple `grok` patterns to handle different log formats from network devices, with conditional processing based on log type: + +```yaml +network-device-logs-pipeline: + source: + http: + path: /logs + ssl: false + + processor: + - grok: + match: + message: [ + # syslog-like + '%{SYSLOGTIMESTAMP:timestamp} %{SYSLOGHOST:host} %{DATA:program}(?:\[%{POSINT:pid}\])?: %{GREEDYDATA:message}', + # ISO8601 + IP + '%{TIMESTAMP_ISO8601:timestamp} %{IP:host} %{DATA:program}: %{GREEDYDATA:message}', + # Cisco style + '%{CISCO_TIMESTAMP:timestamp}: %{DATA:facility}-%{INT:severity}-%{DATA:mnemonic}: %{GREEDYDATA:message}' + ] + break_on_match: true + named_captures_only: true + pattern_definitions: + CISCO_TIMESTAMP: '%{MONTH} %{MONTHDAY} %{TIME}' + target_key: "parsed" + timeout_millis: 5000 + + # Extract login info from the parsed message text + - grok: + match: + /parsed/message: ['User %{USERNAME:user} logged in from %{IP:source_ip}'] + break_on_match: true + target_key: "login_info" + + - date: + 
+### Parse network device logs with multiple patterns
+
+This example uses multiple `grok` patterns to handle different log formats from network devices. Because `break_on_match` is set to `true`, the processor applies the first pattern that matches each log line. A second `grok` processor then extracts login information from the parsed message:
+
+```yaml
+network-device-logs-pipeline:
+  source:
+    http:
+      path: /logs
+      ssl: false
+
+  processor:
+    - grok:
+        match:
+          message: [
+            # Syslog style
+            '%{SYSLOGTIMESTAMP:timestamp} %{SYSLOGHOST:host} %{DATA:program}(?:\[%{POSINT:pid}\])?: %{GREEDYDATA:message}',
+            # ISO 8601 timestamp and IP address
+            '%{TIMESTAMP_ISO8601:timestamp} %{IP:host} %{DATA:program}: %{GREEDYDATA:message}',
+            # Cisco style
+            '%{CISCO_TIMESTAMP:timestamp}: %{DATA:facility}-%{INT:severity}-%{DATA:mnemonic}: %{GREEDYDATA:message}'
+          ]
+        break_on_match: true
+        named_captures_only: true
+        pattern_definitions:
+          CISCO_TIMESTAMP: '%{MONTH} %{MONTHDAY} %{TIME}'
+        target_key: "parsed"
+        timeout_millis: 5000
+
+    # Extract login information from the parsed message text
+    - grok:
+        match:
+          /parsed/message: ['User %{USERNAME:user} logged in from %{IP:source_ip}']
+        break_on_match: true
+        target_key: "login_info"
+
+    - date:
+        match:
+          - key: "/parsed/timestamp"
+            patterns:
+              # Syslog style (no year)
+              - "MMM d HH:mm:ss"
+              # ISO 8601 with and without milliseconds
+              - "yyyy-MM-dd'T'HH:mm:ssXXX"
+              - "yyyy-MM-dd'T'HH:mm:ss.SSSXXX"
+        destination: "@timestamp"
+        source_timezone: "UTC"
+
+  sink:
+    - opensearch:
+        hosts: ["https://opensearch:9200"]
+        insecure: true
+        username: admin
+        password: "admin_pass"
+        index_type: custom
+        index: "network-logs-%{yyyy.MM.dd}"
+```
+{% include copy.html %}
+
+You can test this pipeline using the following command:
+
+```bash
+curl -sS -X POST "http://localhost:2021/logs" \
+  -H "Content-Type: application/json" \
+  -d '[
+    {"message":"Oct 13 14:30:45 router1 sshd[1234]: User alice logged in from 10.0.0.5"},
+    {"message":"2025-10-13T16:01:22Z 192.168.0.10 dhcpd: Lease granted to 192.168.0.55"},
+    {"message":"Oct 13 16:30:45: LOCAL4-3-LINK_UPDOWN: Interface Gi0/1 changed state to up"}
+  ]'
+```
+{% include copy.html %}
+
+The documents stored in OpenSearch contain the following information:
+
+```json
+{
+  ...
+  "hits": {
+    "total": {
+      "value": 3,
+      "relation": "eq"
+    },
+    "max_score": 1,
+    "hits": [
+      {
+        "_index": "network-logs-2025.10.13",
+        "_id": "-kzC3pkBl88jNjkRQ1TJ",
+        "_score": 1,
+        "_source": {
+          "message": "Oct 13 14:30:45 router1 sshd[1234]: User alice logged in from 10.0.0.5",
+          "parsed": {
+            "host": "router1",
+            "pid": "1234",
+            "program": "sshd",
+            "message": "User alice logged in from 10.0.0.5",
+            "timestamp": "Oct 13 14:30:45"
+          },
+          "login_info": {
+            "user": "alice",
+            "source_ip": "10.0.0.5"
+          },
+          "@timestamp": "2025-10-13T14:30:45.000Z"
+        }
+      },
+      {
+        "_index": "network-logs-2025.10.13",
+        "_id": "-0zC3pkBl88jNjkRQ1TJ",
+        "_score": 1,
+        "_source": {
+          "message": "2025-10-13T16:01:22Z 192.168.0.10 dhcpd: Lease granted to 192.168.0.55",
+          "parsed": {
+            "host": "192.168.0.10",
+            "program": "dhcpd",
+            "message": "Lease granted to 192.168.0.55",
+            "timestamp": "2025-10-13T16:01:22Z"
+          },
+          "login_info": {},
+          "@timestamp": "2025-10-13T16:01:22.000Z"
+        }
+      },
+      {
+        "_index": "network-logs-2025.10.13",
+        "_id": "_EzC3pkBl88jNjkRQ1TJ",
+        "_score": 1,
+        "_source": {
+          "message": "Oct 13 16:30:45: LOCAL4-3-LINK_UPDOWN: Interface Gi0/1 changed state to up",
+          "parsed": {
+            "severity": "3",
+            "mnemonic": "LINK_UPDOWN",
+            "message": "Interface Gi0/1 changed state to up",
+            "facility": "LOCAL4",
+            "timestamp": "Oct 13 16:30:45"
+          },
+          "login_info": {},
+          "@timestamp": "2025-10-13T16:30:45.000Z"
+        }
+      }
+    ]
+  }
+}
+```
+
 ## Metrics
 
 The following table describes common [Abstract processor](https://github.com/opensearch-project/data-prepper/blob/main/data-prepper-api/src/main/java/org/opensearch/dataprepper/model/processor/AbstractProcessor.java) metrics.