diff --git a/_data-prepper/common-use-cases/log-analytics.md b/_data-prepper/common-use-cases/log-analytics.md index 242e16dfe94..715200ea72a 100644 --- a/_data-prepper/common-use-cases/log-analytics.md +++ b/_data-prepper/common-use-cases/log-analytics.md @@ -67,7 +67,7 @@ log-pipeline: # Change to your credentials username: "admin" password: "admin" - # Add a certificate file if you are accessing an OpenSearch cluster with a self-signed certificate + # Add a certificate file if you are accessing an OpenSearch cluster with a self-signed certificate #cert: /path/to/cert # If you are connecting to an Amazon OpenSearch Service domain without # Fine-Grained Access Control, enable these settings. Comment out the @@ -78,6 +78,7 @@ log-pipeline: # You should change this to correspond with how your OpenSearch indexes are set up. index: apache_logs ``` +{% include copy.html %} This pipeline configuration is an example of Apache log ingestion. Don't forget that you can easily configure the Grok Processor for your own custom logs. You will need to modify the configuration for your OpenSearch cluster. @@ -100,7 +101,7 @@ Note that you should adjust the file `path`, output `Host`, and `Port` according The following is an example `fluent-bit.conf` file without SSL and basic authentication enabled on the HTTP source: -``` +```text [INPUT] name tail refresh_interval 5 @@ -115,6 +116,7 @@ The following is an example `fluent-bit.conf` file without SSL and basic authent URI /log/ingest Format json ``` +{% include copy.html %} If your HTTP source has SSL and basic authentication enabled, you will need to add the details of `http_User`, `http_Passwd`, `tls.crt_file`, and `tls.key_file` to the `fluent-bit.conf` file, as shown in the following example. @@ -122,7 +124,7 @@ If your HTTP source has SSL and basic authentication enabled, you will need to a The following is an example `fluent-bit.conf` file with SSL and basic authentication enabled on the HTTP source: -``` +```text [INPUT] name tail refresh_interval 5 @@ -142,6 +144,7 @@ The following is an example `fluent-bit.conf` file with SSL and basic authentica URI /log/ingest Format json ``` +{% include copy.html %} # Next steps diff --git a/_data-prepper/common-use-cases/trace-analytics.md b/_data-prepper/common-use-cases/trace-analytics.md index 2c1351d4ee8..47c0f2fe051 100644 --- a/_data-prepper/common-use-cases/trace-analytics.md +++ b/_data-prepper/common-use-cases/trace-analytics.md @@ -116,7 +116,7 @@ The following example demonstrates how to build a pipeline that supports the [Op Starting with Data Prepper version 2.0, Data Prepper no longer supports the `otel_traces_prepper` processor. The `otel_traces` processor replaces the `otel_traces_prepper` processor and supports some of Data Prepper's recent data model changes. Instead, you should use the `otel_traces` processor. See the following YAML file example: -```yml +```yaml entry-pipeline: delay: "100" source: @@ -167,6 +167,7 @@ service-map-pipeline: password: admin index_type: trace-analytics-service-map ``` +{% include copy.html %} To maintain similar ingestion throughput and latency, scale the `buffer_size` and `batch_size` by the estimated maximum batch size in the client request payload. 
{: .tip} @@ -186,6 +187,7 @@ source: username: "my-user" password: "my_s3cr3t" ``` +{% include copy.html %} #### Example: pipeline.yaml @@ -193,14 +195,14 @@ The following is an example `pipeline.yaml` file without SSL and basic authentic ```yaml otel-trace-pipeline: - # workers is the number of threads processing data in each pipeline. + # workers is the number of threads processing data in each pipeline. # We recommend same value for all pipelines. # default value is 1, set a value based on the machine you are running Data Prepper - workers: 8 + workers: 8 # delay in milliseconds is how often the worker threads should process data. # Recommend not to change this config as we want the entry-pipeline to process as quick as possible # default value is 3_000 ms - delay: "100" + delay: "100" source: otel_trace_source: #record_type: event # Add this when using Data Prepper 1.x. This option is removed in 2.0 @@ -209,8 +211,8 @@ otel-trace-pipeline: unauthenticated: buffer: bounded_blocking: - # buffer_size is the number of ExportTraceRequest from otel-collector the data prepper should hold in memeory. - # We recommend to keep the same buffer_size for all pipelines. + # buffer_size is the number of ExportTraceRequest from otel-collector the data prepper should hold in memory. + # We recommend keeping the same buffer_size for all pipelines. # Make sure you configure sufficient heap # default value is 512 buffer_size: 512 @@ -225,9 +227,9 @@ otel-trace-pipeline: name: "entry-pipeline" raw-trace-pipeline: # Configure same as the otel-trace-pipeline - workers: 8 + workers: 8 # We recommend using the default value for the raw-trace-pipeline. - delay: "3000" + delay: "3000" source: pipeline: name: "entry-pipeline" @@ -248,7 +250,7 @@ raw-trace-pipeline: # Change to your credentials username: "admin" password: "admin" - # Add a certificate file if you are accessing an OpenSearch cluster with a self-signed certificate + # Add a certificate file if you are accessing an OpenSearch cluster with a self-signed certificate #cert: /path/to/cert # If you are connecting to an Amazon OpenSearch Service domain without # Fine-Grained Access Control, enable these settings. Comment out the @@ -262,7 +264,7 @@ raw-trace-pipeline: # Change to your credentials username: "admin" password: "admin" - # Add a certificate file if you are accessing an OpenSearch cluster with a self-signed certificate + # Add a certificate file if you are accessing an OpenSearch cluster with a self-signed certificate #cert: /path/to/cert # If you are connecting to an Amazon OpenSearch Service domain without # Fine-Grained Access Control, enable these settings. Comment out the @@ -277,14 +279,14 @@ service-map-pipeline: name: "entry-pipeline" processor: - service_map: - # The window duration is the maximum length of time the data prepper stores the most recent trace data to evaluvate service-map relationships. + # The window duration is the maximum length of time the data prepper stores the most recent trace data to evaluate service-map relationships. # The default is 3 minutes, this means we can detect relationships between services from spans reported in last 3 minutes. - # Set higher value if your applications have higher latency. - window_duration: 180 + # Set a higher value if your applications have higher latency. + window_duration: 180 buffer: bounded_blocking: - # buffer_size is the number of ExportTraceRequest from otel-collector the data prepper should hold in memeory. - # We recommend to keep the same buffer_size for all pipelines.
+ # buffer_size is the number of ExportTraceRequest from otel-collector the data prepper should hold in memory. + # We recommend keeping the same buffer_size for all pipelines. # Make sure you configure sufficient heap # default value is 512 buffer_size: 512 @@ -299,7 +301,7 @@ service-map-pipeline: # Change to your credentials username: "admin" password: "admin" - # Add a certificate file if you are accessing an OpenSearch cluster with a self-signed certificate + # Add a certificate file if you are accessing an OpenSearch cluster with a self-signed certificate #cert: /path/to/cert # If you are connecting to an Amazon OpenSearch Service domain without # Fine-Grained Access Control, enable these settings. Comment out the @@ -307,6 +309,7 @@ service-map-pipeline: #aws_sigv4: true #aws_region: us-east-1 ``` +{% include copy.html %} You need to modify the preceding configuration for your OpenSearch cluster so that the configuration matches your environment. Note that it has two `opensearch` sinks that need to be modified. {: .note} @@ -328,7 +331,7 @@ You need to run OpenTelemetry Collector in your service environment. Follow [Get The following is an example `otel-collector-config.yaml` file: -``` +```yaml receivers: jaeger: protocols: @@ -356,6 +359,7 @@ service: processors: [batch/traces] exporters: [otlp/data-prepper] ``` +{% include copy.html %} After you run OpenTelemetry in your service environment, you must configure your application to use the OpenTelemetry Collector. The OpenTelemetry Collector typically runs alongside your application. diff --git a/_data-prepper/getting-started.md b/_data-prepper/getting-started.md index 92a38adafed..533fff8a3c4 100644 --- a/_data-prepper/getting-started.md +++ b/_data-prepper/getting-started.md @@ -19,7 +19,7 @@ There are two ways to install Data Prepper: you can run the Docker image or buil The easiest way to use Data Prepper is by running the Docker image. We suggest that you use this approach if you have [Docker](https://www.docker.com) available. Run the following command: -``` +```bash docker pull opensearchproject/data-prepper:latest ``` {% include copy.html %} @@ -36,27 +36,30 @@ Two configuration files are required to run a Data Prepper instance. Optionally, For Data Prepper versions earlier than 2.0, the `.jar` file expects the pipeline configuration file path to be followed by the server configuration file path. See the following configuration path example: -``` +```bash java -jar data-prepper-core-$VERSION.jar pipelines.yaml data-prepper-config.yaml ``` +{% include copy.html %} Optionally, you can add `"-Dlog4j.configurationFile=config/log4j2.properties"` to the command to pass a custom Log4j 2 configuration file. If you don't provide a properties file, Data Prepper defaults to the `log4j2.properties` file in the `shared-config` directory. Starting with Data Prepper 2.0, you can launch Data Prepper by using the following `data-prepper` script that does not require any additional command line arguments: -``` +```bash bin/data-prepper ``` +{% include copy.html %} Configuration files are read from specific subdirectories in the application's home directory: 1. `pipelines/`: Used for pipeline configurations. Pipeline configurations can be written in one or more YAML files. 2. `config/data-prepper-config.yaml`: Used for the Data Prepper server configuration. You can supply your own pipeline configuration file path followed by the server configuration file path. However, this method will not be supported in a future release.
See the following example: -``` +```bash bin/data-prepper pipelines.yaml data-prepper-config.yaml ``` +{% include copy.html %} The Log4j 2 configuration file is read from the `config/log4j2.properties` file located in the application's home directory. @@ -69,7 +72,7 @@ To configure Data Prepper, see the following information for each use case: Create a Data Prepper pipeline file named `pipelines.yaml` using the following configuration: -```yml +```yaml simple-sample-pipeline: workers: 2 delay: "5000" @@ -96,7 +99,7 @@ The example pipeline configuration above demonstrates a simple pipeline with a s After starting Data Prepper, you should see log output and some UUIDs after a few seconds: -```yml +```text 2021-09-30T20:19:44,147 [main] INFO com.amazon.dataprepper.pipeline.server.DataPrepperServer - Data Prepper server running at :4900 2021-09-30T20:19:44,681 [random-source-pool-0] INFO com.amazon.dataprepper.plugins.source.RandomStringSource - Writing to buffer 2021-09-30T20:19:45,183 [random-source-pool-0] INFO com.amazon.dataprepper.plugins.source.RandomStringSource - Writing to buffer @@ -120,21 +123,21 @@ image and modify both the `pipelines.yaml` and `data-prepper-config.yaml` files. For Data Prepper 2.0 or later, use this command: -``` +```bash docker run --name data-prepper -p 4900:4900 -v ${PWD}/pipelines.yaml:/usr/share/data-prepper/pipelines/pipelines.yaml -v ${PWD}/data-prepper-config.yaml:/usr/share/data-prepper/config/data-prepper-config.yaml opensearchproject/data-prepper:latest ``` {% include copy.html %} For Data Prepper versions earlier than 2.0, use this command: -``` +```bash docker run --name data-prepper -p 4900:4900 -v ${PWD}/pipelines.yaml:/usr/share/data-prepper/pipelines.yaml -v ${PWD}/data-prepper-config.yaml:/usr/share/data-prepper/data-prepper-config.yaml opensearchproject/data-prepper:1.x ``` {% include copy.html %} Once Data Prepper is running, it processes data until it is shut down. Once you are done, shut it down with the following command: -``` +```bash POST /shutdown ``` {% include copy-curl.html %} diff --git a/_data-prepper/pipelines/configuration/processors/aggregate.md b/_data-prepper/pipelines/configuration/processors/aggregate.md index 781ce61a3fa..2296ed58c59 100644 --- a/_data-prepper/pipelines/configuration/processors/aggregate.md +++ b/_data-prepper/pipelines/configuration/processors/aggregate.md @@ -38,7 +38,7 @@ The `remove_duplicates` action processes the first event for a group immediately The `put_all` action combines events belonging to the same group by overwriting existing keys and adding new keys, similarly to the Java `Map.putAll`. The action drops all events that make up the combined event. For example, when using `identification_keys: ["sourceIp", "destination_ip"]`, the `put_all` action processes the following three events: -``` +```json { "sourceIp": "127.0.0.1", "destinationIp": "192.168.0.1", "status": 200 } { "sourceIp": "127.0.0.1", "destinationIp": "192.168.0.1", "bytes": 1000 } { "sourceIp": "127.0.0.1", "destinationIp": "192.168.0.1", "http_verb": "GET" } @@ -46,7 +46,7 @@ The `put_all` action combines events belonging to the same group by overwriting Then the action combines the events into one. 
The pipeline then uses the following combined event: -``` +```json { "sourceIp": "127.0.0.1", "destinationIp": "192.168.0.1", "status": 200, "bytes": 1000, "http_verb": "GET" } ``` @@ -93,7 +93,7 @@ You can customize the processor with the following configuration options: For example, when using `identification_keys: ["sourceIp", "destination_ip", "request"]`, `key: latency`, and `buckets: [0.0, 0.25, 0.5]`, the `histogram` action processes the following events: -``` +```json { "sourceIp": "127.0.0.1", "destinationIp": "192.168.0.1", "request" : "/index.html", "latency": 0.2 } { "sourceIp": "127.0.0.1", "destinationIp": "192.168.0.1", "request" : "/index.html", "latency": 0.55} { "sourceIp": "127.0.0.1", "destinationIp": "192.168.0.1", "request" : "/index.html", "latency": 0.25 } @@ -139,7 +139,7 @@ You can set the percentage of events using the `percent` configuration, which in For example, if percent is set to `50`, the action tries to process the following events in the one-second interval: -``` +```json { "sourceIp": "127.0.0.1", "destinationIp": "192.168.0.1", "bytes": 2500 } { "sourceIp": "127.0.0.1", "destinationIp": "192.168.0.1", "bytes": 500 } { "sourceIp": "127.0.0.1", "destinationIp": "192.168.0.1", "bytes": 1000 } @@ -148,7 +148,7 @@ For example, if percent is set to `50`, the action tries to process the followin The pipeline processes 50% of the events, drops the other events, and does not generate a new event: -``` +```json { "sourceIp": "127.0.0.1", "destinationIp": "192.168.0.1", "bytes": 500 } { "sourceIp": "127.0.0.1", "destinationIp": "192.168.0.1", "bytes": 3100 } ``` diff --git a/_data-prepper/pipelines/configuration/processors/anomaly-detector.md b/_data-prepper/pipelines/configuration/processors/anomaly-detector.md index 0b35e8387c6..8bbeeb3ead9 100644 --- a/_data-prepper/pipelines/configuration/processors/anomaly-detector.md +++ b/_data-prepper/pipelines/configuration/processors/anomaly-detector.md @@ -64,13 +64,14 @@ To get started, create the following `pipeline.yaml` file. You can use the follo ad-pipeline: source: ... - .... + .... processor: - anomaly_detector: keys: ["latency"] - mode: + mode: random_cut_forest: ``` +{% include copy.html %} When you run the `anomaly_detector` processor, the processor extracts the value for the `latency` key and then passes the value through the RCF ML algorithm. You can configure any key that comprises integers or real numbers as values. In the following example, you can configure `bytes` or `latency` as the key for an anomaly detector. diff --git a/_data-prepper/pipelines/configuration/processors/aws-lambda.md b/_data-prepper/pipelines/configuration/processors/aws-lambda.md index 65a2f0a1855..77cf6f05159 100644 --- a/_data-prepper/pipelines/configuration/processors/aws-lambda.md +++ b/_data-prepper/pipelines/configuration/processors/aws-lambda.md @@ -42,7 +42,7 @@ Field | Type | Required | Description #### Example configuration -``` +```yaml processors: - aws_lambda: function_name: "my-lambda-function" @@ -62,7 +62,6 @@ processors: maximum_size: "5mb" event_collect_timeout: PT10S lambda_when: "event['status'] == 'process'" - ``` {% include copy.html %} @@ -98,7 +97,7 @@ Note the following limitations: Integration tests for this plugin are executed separately from the main Data Prepper build process. 
Use the following Gradle command to run these tests: -``` +```bash ./gradlew :data-prepper-plugins:aws-lambda:integrationTest -Dtests.processor.lambda.region="us-east-1" -Dtests.processor.lambda.functionName="lambda_test_function" -Dtests.processor.lambda.sts_role_arn="arn:aws:iam::123456789012:role/dataprepper-role ``` {% include copy.html %} diff --git a/_data-prepper/pipelines/configuration/processors/csv.md b/_data-prepper/pipelines/configuration/processors/csv.md index fb9fc6f9d6e..1b6f5b08ec9 100644 --- a/_data-prepper/pipelines/configuration/processors/csv.md +++ b/_data-prepper/pipelines/configuration/processors/csv.md @@ -48,9 +48,10 @@ csv-pipeline: When run, the processor will parse the message. Although only two column names are specified in processor settings, a third column name is automatically generated because the data contained in `ingest.csv` includes three columns, `1,2,3`: -``` +```json {"message": "1,2,3", "col1": "1", "col2": "2", "column3": "3"} ``` + ### Automatically detect column names The following configuration automatically detects the header of a CSV file ingested through an [`s3 source`]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/sources/s3/): @@ -80,7 +81,7 @@ csv-s3-pipeline: For example, if the `ingest.csv` file in the Amazon Simple Storage Service (Amazon S3) bucket that the Amazon Simple Queue Service (SQS) queue is attached to contains the following data: -``` +```text Should,skip,this,line a,b,c 1,2,3 diff --git a/_data-prepper/pipelines/configuration/processors/date.md b/_data-prepper/pipelines/configuration/processors/date.md index 1b442618ea6..559604c5301 100644 --- a/_data-prepper/pipelines/configuration/processors/date.md +++ b/_data-prepper/pipelines/configuration/processors/date.md @@ -66,6 +66,7 @@ The following `date` processor configuration can be used to add a default timest from_time_received: true destination: "@timestamp" ``` +{% include copy.html %} ## Example: Parse a timestamp to convert its format and time zone The following `date` processor configuration can be used to parse the value of the timestamp applied to `dd/MMM/yyyy:HH:mm:ss` and write it in `yyyy-MM-dd'T'HH:mm:ss.SSSXXX` format: @@ -74,10 +75,11 @@ The following `date` processor configuration can be used to parse the value of t - date: match: - key: timestamp - patterns: ["dd/MMM/yyyy:HH:mm:ss"] + patterns: ["dd/MMM/yyyy:HH:mm:ss"] destination: "@timestamp" output_format: "yyyy-MM-dd'T'HH:mm:ss.SSSXXX" source_timezone: "America/Los_Angeles" destination_timezone: "America/Chicago" locale: "en_US" ``` +{% include copy.html %} diff --git a/_data-prepper/pipelines/configuration/processors/decompress.md b/_data-prepper/pipelines/configuration/processors/decompress.md index 1dc44222bf4..030e8733bb3 100644 --- a/_data-prepper/pipelines/configuration/processors/decompress.md +++ b/_data-prepper/pipelines/configuration/processors/decompress.md @@ -30,6 +30,7 @@ processor: keys: [ "base_64_gzip_key" ] type: gzip ``` +{% include copy.html %} ## Metrics diff --git a/_data-prepper/pipelines/configuration/processors/delay.md b/_data-prepper/pipelines/configuration/processors/delay.md index a2b80bbace1..c7e716d85f9 100644 --- a/_data-prepper/pipelines/configuration/processors/delay.md +++ b/_data-prepper/pipelines/configuration/processors/delay.md @@ -25,3 +25,4 @@ processor: - delay: for: 2s ``` +{% include copy.html %} diff --git a/_data-prepper/pipelines/configuration/processors/dissect.md b/_data-prepper/pipelines/configuration/processors/dissect.md index 
c0a776c6b2e..22cbb792586 100644 --- a/_data-prepper/pipelines/configuration/processors/dissect.md +++ b/_data-prepper/pipelines/configuration/processors/dissect.md @@ -28,10 +28,11 @@ dissect-pipeline: sink: - stdout: ``` +{% include copy.html %} Then create the following file named `logs_json.log` and replace the `path` in the file source of your `pipeline.yaml` file with the path of a file containing the following JSON data: -``` +```json {"log": "07-25-2023 10:00:00 ERROR: error message"} ``` @@ -39,7 +40,7 @@ The `dissect` processor will retrieve the fields (`Date`, `Time`, `Log_Type`, an After running the pipeline, you should receive the following standard output: -``` +```json { "log" : "07-25-2023 10:00:00 ERROR: Some error", "Date" : "07-25-2023" diff --git a/_data-prepper/pipelines/configuration/processors/flatten.md b/_data-prepper/pipelines/configuration/processors/flatten.md index e3c589d63a0..ddb3fe0dfc5 100644 --- a/_data-prepper/pipelines/configuration/processors/flatten.md +++ b/_data-prepper/pipelines/configuration/processors/flatten.md @@ -84,6 +84,7 @@ Use the `remove_processed_fields` option when flattening all of an event's neste remove_processed_fields: true ... ``` +{% include copy.html %} For example, when the input event contains the following nested objects: @@ -140,6 +141,7 @@ Use the `exclude_keys` option to prevent specific keys from being flattened in t exclude_keys: ["key2"] ... ``` +{% include copy.html %} For example, when the input event contains the following nested objects: @@ -199,6 +201,7 @@ Use the `remove_list_indices` option to convert the fields from the source map i remove_list_indices: true ... ``` +{% include copy.html %} For example, when the input event contains the following nested objects: diff --git a/_data-prepper/pipelines/configuration/processors/geoip.md b/_data-prepper/pipelines/configuration/processors/geoip.md index dcc4e9fa8e3..125101b5f5e 100644 --- a/_data-prepper/pipelines/configuration/processors/geoip.md +++ b/_data-prepper/pipelines/configuration/processors/geoip.md @@ -21,17 +21,18 @@ The minimal configuration requires at least one entry, and each entry at least o The following configuration extracts all available geolocation data from the IP address provided in the field named `clientip`. It will write the geolocation data to a new field named `geo`, the default source when none is configured: -``` +```yaml my-pipeline: processor: - geoip: entries: - source: clientip ``` +{% include copy.html %} The following example excludes Autonomous System Number (ASN) fields and puts the geolocation data into a field named `clientlocation`: -``` +```yaml my-pipeline: processor: - geoip: @@ -40,6 +41,7 @@ my-pipeline: target: clientlocation include_fields: [asn, asn_organization, network] ``` +{% include copy.html %} ## Configuration