9 changes: 6 additions & 3 deletions _data-prepper/common-use-cases/log-analytics.md
@@ -67,7 +67,7 @@ log-pipeline:
# Change to your credentials
username: "admin"
password: "admin"
# Add a certificate file if you are accessing an OpenSearch cluster with a self-signed certificate
# Add a certificate file if you are accessing an OpenSearch cluster with a self-signed certificate
#cert: /path/to/cert
# If you are connecting to an Amazon OpenSearch Service domain without
# Fine-Grained Access Control, enable these settings. Comment out the
@@ -78,6 +78,7 @@ log-pipeline:
# You should change this to correspond with how your OpenSearch indexes are set up.
index: apache_logs
```
{% include copy.html %}

This pipeline configuration is an example of Apache log ingestion. Don't forget that you can easily configure the Grok Processor for your own custom logs. You will need to modify the configuration for your OpenSearch cluster.
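
For example, a minimal `grok` processor sketch for a hypothetical custom log format might look like the following; the `log` field name and the pattern are illustrative placeholders, so replace them with the field and format your logs actually use:

```yaml
processor:
  - grok:
      match:
        # Hypothetical field name and pattern; adjust them to your own log format
        log: ['%{IPORHOST:client_ip} %{NUMBER:response_time:float} %{GREEDYDATA:request_body}']
```
{% include copy.html %}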

@@ -100,7 +101,7 @@ Note that you should adjust the file `path`, output `Host`, and `Port` according

The following is an example `fluent-bit.conf` file without SSL and basic authentication enabled on the HTTP source:

```
```text
[INPUT]
name tail
refresh_interval 5
@@ -115,14 +116,15 @@ The following is an example `fluent-bit.conf` file without SSL and basic authent
URI /log/ingest
Format json
```
{% include copy.html %}

If your HTTP source has SSL and basic authentication enabled, you will need to add the details of `http_User`, `http_Passwd`, `tls.crt_file`, and `tls.key_file` to the `fluent-bit.conf` file, as shown in the following example.

### Example: Fluent Bit file with SSL and basic authentication enabled

The following is an example `fluent-bit.conf` file with SSL and basic authentication enabled on the HTTP source:

```
```text
[INPUT]
name tail
refresh_interval 5
@@ -142,6 +144,7 @@ The following is an example `fluent-bit.conf` file with SSL and basic authentica
URI /log/ingest
Format json
```
{% include copy.html %}

# Next steps

38 changes: 21 additions & 17 deletions _data-prepper/common-use-cases/trace-analytics.md
@@ -116,7 +116,7 @@ The following example demonstrates how to build a pipeline that supports the [Op

Starting with Data Prepper version 2.0, the `otel_traces_prepper` processor is no longer supported. Use the `otel_traces` processor instead; it replaces `otel_traces_prepper` and supports some of Data Prepper's recent data model changes. See the following YAML file example:

```yml
```yaml
entry-pipeline:
delay: "100"
source:
@@ -167,6 +167,7 @@ service-map-pipeline:
password: admin
index_type: trace-analytics-service-map
```
{% include copy.html %}

To maintain similar ingestion throughput and latency, scale the `buffer_size` and `batch_size` by the estimated maximum batch size in the client request payload.
{: .tip}
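
As a purely hypothetical illustration, if your client request payloads are roughly four times larger than those the example values above were sized for, you might scale both settings by the same factor:

```yaml
buffer:
  bounded_blocking:
    buffer_size: 2048   # hypothetical: 4 x the example value of 512
    batch_size: 32      # hypothetical: scaled by the same factor as buffer_size
```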

@@ -186,21 +187,22 @@ source:
username: "my-user"
password: "my_s3cr3t"
```
{% include copy.html %}

#### Example: pipeline.yaml

The following is an example `pipeline.yaml` file without SSL and basic authentication enabled for the `otel-trace-pipeline` pipeline:

```yaml
otel-trace-pipeline:
# workers is the number of threads processing data in each pipeline.
# workers is the number of threads processing data in each pipeline.
  # We recommend the same value for all pipelines.
  # The default value is 1. Set a value based on the machine on which you are running Data Prepper.
workers: 8
workers: 8
# delay in milliseconds is how often the worker threads should process data.
  # We recommend not changing this config so that the entry-pipeline can process data as quickly as possible
# default value is 3_000 ms
delay: "100"
delay: "100"
source:
otel_trace_source:
#record_type: event # Add this when using Data Prepper 1.x. This option is removed in 2.0
@@ -209,8 +211,8 @@ otel-trace-pipeline:
unauthenticated:
buffer:
bounded_blocking:
      # buffer_size is the number of ExportTraceRequest from otel-collector that Data Prepper should hold in memory.
      # We recommend keeping the same buffer_size for all pipelines.
      # buffer_size is the number of ExportTraceRequest from otel-collector that Data Prepper should hold in memory.
      # We recommend keeping the same buffer_size for all pipelines.
# Make sure you configure sufficient heap
# default value is 512
buffer_size: 512
@@ -225,9 +227,9 @@ otel-trace-pipeline:
name: "entry-pipeline"
raw-trace-pipeline:
# Configure same as the otel-trace-pipeline
workers: 8
workers: 8
# We recommend using the default value for the raw-trace-pipeline.
delay: "3000"
delay: "3000"
source:
pipeline:
name: "entry-pipeline"
@@ -248,7 +250,7 @@ raw-trace-pipeline:
# Change to your credentials
username: "admin"
password: "admin"
# Add a certificate file if you are accessing an OpenSearch cluster with a self-signed certificate
# Add a certificate file if you are accessing an OpenSearch cluster with a self-signed certificate
#cert: /path/to/cert
# If you are connecting to an Amazon OpenSearch Service domain without
# Fine-Grained Access Control, enable these settings. Comment out the
@@ -262,7 +264,7 @@ raw-trace-pipeline:
# Change to your credentials
username: "admin"
password: "admin"
# Add a certificate file if you are accessing an OpenSearch cluster with a self-signed certificate
# Add a certificate file if you are accessing an OpenSearch cluster with a self-signed certificate
#cert: /path/to/cert
# If you are connecting to an Amazon OpenSearch Service domain without
# Fine-Grained Access Control, enable these settings. Comment out the
@@ -277,14 +279,14 @@ service-map-pipeline:
name: "entry-pipeline"
processor:
- service_map:
        # The window duration is the maximum length of time Data Prepper stores the most recent trace data to evaluate service-map relationships.
        # The window duration is the maximum length of time Data Prepper stores the most recent trace data to evaluate service-map relationships.
        # The default is 3 minutes, which means relationships between services can be detected from spans reported in the last 3 minutes.
# Set higher value if your applications have higher latency.
window_duration: 180
# Set higher value if your applications have higher latency.
window_duration: 180
buffer:
bounded_blocking:
      # buffer_size is the number of ExportTraceRequest from otel-collector that Data Prepper should hold in memory.
      # We recommend keeping the same buffer_size for all pipelines.
      # buffer_size is the number of ExportTraceRequest from otel-collector that Data Prepper should hold in memory.
      # We recommend keeping the same buffer_size for all pipelines.
# Make sure you configure sufficient heap
# default value is 512
buffer_size: 512
@@ -299,14 +301,15 @@ service-map-pipeline:
# Change to your credentials
username: "admin"
password: "admin"
# Add a certificate file if you are accessing an OpenSearch cluster with a self-signed certificate
# Add a certificate file if you are accessing an OpenSearch cluster with a self-signed certificate
#cert: /path/to/cert
# If you are connecting to an Amazon OpenSearch Service domain without
# Fine-Grained Access Control, enable these settings. Comment out the
# username and password above.
#aws_sigv4: true
#aws_region: us-east-1
```
{% include copy.html %}

You need to modify the preceding configuration for your OpenSearch cluster so that the configuration matches your environment. Note that it has two `opensearch` sinks that need to be modified.
{: .note}
@@ -328,7 +331,7 @@ You need to run OpenTelemetry Collector in your service environment. Follow [Get

The following is an example `otel-collector-config.yaml` file:

```
```yaml
receivers:
jaeger:
protocols:
@@ -356,6 +359,7 @@ service:
processors: [batch/traces]
exporters: [otlp/data-prepper]
```
{% include copy.html %}

After you run the OpenTelemetry Collector in your service environment, you must configure your application to use it. The OpenTelemetry Collector typically runs alongside your application.
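
One minimal sketch, assuming your application is instrumented with an OpenTelemetry SDK and the Collector listens on the default OTLP gRPC port on the same host, is to point the SDK at the Collector through environment variables (the endpoint and service name shown here are placeholders):

```bash
# Hypothetical values; adjust the endpoint and service name for your environment
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
export OTEL_TRACES_EXPORTER=otlp
export OTEL_SERVICE_NAME=my-service
```
{% include copy.html %}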

21 changes: 12 additions & 9 deletions _data-prepper/getting-started.md
@@ -19,7 +19,7 @@ There are two ways to install Data Prepper: you can run the Docker image or buil

The easiest way to use Data Prepper is by running the Docker image. We suggest that you use this approach if you have [Docker](https://www.docker.com) available. Run the following command:

```
```bash
docker pull opensearchproject/data-prepper:latest
```
{% include copy.html %}
@@ -36,27 +36,30 @@ Two configuration files are required to run a Data Prepper instance. Optionally,

For Data Prepper versions earlier than 2.0, the `.jar` file expects the pipeline configuration file path to be followed by the server configuration file path. See the following configuration path example:

```
```bash
java -jar data-prepper-core-$VERSION.jar pipelines.yaml data-prepper-config.yaml
```
{% include copy.html %}

Optionally, you can add `"-Dlog4j.configurationFile=config/log4j2.properties"` to the command to pass a custom Log4j 2 configuration file. If you don't provide a properties file, Data Prepper defaults to the `log4j2.properties` file in the `shared-config` directory.
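
For example, a pre-2.0 launch command that passes a custom Log4j 2 configuration might look like the following sketch; it simply combines the flag with the earlier command, and `$VERSION` and the file paths are placeholders from that example:

```bash
java "-Dlog4j.configurationFile=config/log4j2.properties" -jar data-prepper-core-$VERSION.jar pipelines.yaml data-prepper-config.yaml
```
{% include copy.html %}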


Starting with Data Prepper 2.0, you can launch Data Prepper by using the following `data-prepper` script that does not require any additional command line arguments:

```
```bash
bin/data-prepper
```
{% include copy.html %}

Configuration files are read from specific subdirectories in the application's home directory:
1. `pipelines/`: Used for pipeline configurations. Pipeline configurations can be written in one or more YAML files.
2. `config/data-prepper-config.yaml`: Used for the Data Prepper server configuration.
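
Assuming the default layout described above, the application's home directory might therefore look similar to the following sketch (the top-level directory name is only an illustration):

```text
data-prepper/                        # hypothetical home directory name
├── bin/data-prepper                 # launch script
├── config/data-prepper-config.yaml  # server configuration
├── config/log4j2.properties         # optional Log4j 2 configuration
└── pipelines/pipelines.yaml         # one or more pipeline YAML files
```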

You can supply your own pipeline configuration file path followed by the server configuration file path. However, this method will not be supported in a future release. See the following example:
```
```bash
bin/data-prepper pipelines.yaml data-prepper-config.yaml
```
{% include copy.html %}

The Log4j 2 configuration file is read from the `config/log4j2.properties` file located in the application's home directory.

@@ -69,7 +72,7 @@ To configure Data Prepper, see the following information for each use case:

Create a Data Prepper pipeline file named `pipelines.yaml` using the following configuration:

```yml
```yaml
simple-sample-pipeline:
workers: 2
delay: "5000"
@@ -96,7 +99,7 @@ The example pipeline configuration above demonstrates a simple pipeline with a s

After starting Data Prepper, you should see log output and some UUIDs after a few seconds:

```yml
```text
2021-09-30T20:19:44,147 [main] INFO com.amazon.dataprepper.pipeline.server.DataPrepperServer - Data Prepper server running at :4900
2021-09-30T20:19:44,681 [random-source-pool-0] INFO com.amazon.dataprepper.plugins.source.RandomStringSource - Writing to buffer
2021-09-30T20:19:45,183 [random-source-pool-0] INFO com.amazon.dataprepper.plugins.source.RandomStringSource - Writing to buffer
@@ -120,21 +123,21 @@ image and modify both the `pipelines.yaml` and `data-prepper-config.yaml` files.

For Data Prepper 2.0 or later, use this command:

```
```bash
docker run --name data-prepper -p 4900:4900 -v ${PWD}/pipelines.yaml:/usr/share/data-prepper/pipelines/pipelines.yaml -v ${PWD}/data-prepper-config.yaml:/usr/share/data-prepper/config/data-prepper-config.yaml opensearchproject/data-prepper:latest
```
{% include copy.html %}

For Data Prepper versions earlier than 2.0, use this command:

```
```bash
docker run --name data-prepper -p 4900:4900 -v ${PWD}/pipelines.yaml:/usr/share/data-prepper/pipelines.yaml -v ${PWD}/data-prepper-config.yaml:/usr/share/data-prepper/data-prepper-config.yaml opensearchproject/data-prepper:1.x
```
{% include copy.html %}

Once Data Prepper is running, it processes data until it is shut down. When you are finished, shut it down with the following command:

```
```bash
POST /shutdown
```
{% include copy-curl.html %}
10 changes: 5 additions & 5 deletions _data-prepper/pipelines/configuration/processors/aggregate.md
@@ -38,15 +38,15 @@ The `remove_duplicates` action processes the first event for a group immediately

The `put_all` action combines events belonging to the same group by overwriting existing keys and adding new keys, similarly to the Java `Map.putAll`. The action drops all events that make up the combined event. For example, when using `identification_keys: ["sourceIp", "destination_ip"]`, the `put_all` action processes the following three events:

```
```json
{ "sourceIp": "127.0.0.1", "destinationIp": "192.168.0.1", "status": 200 }
{ "sourceIp": "127.0.0.1", "destinationIp": "192.168.0.1", "bytes": 1000 }
{ "sourceIp": "127.0.0.1", "destinationIp": "192.168.0.1", "http_verb": "GET" }
```

Then the action combines the events into one. The pipeline then uses the following combined event:

```
```json
{ "sourceIp": "127.0.0.1", "destinationIp": "192.168.0.1", "status": 200, "bytes": 1000, "http_verb": "GET" }
```

@@ -93,7 +93,7 @@ You can customize the processor with the following configuration options:

For example, when using `identification_keys: ["sourceIp", "destination_ip", "request"]`, `key: latency`, and `buckets: [0.0, 0.25, 0.5]`, the `histogram` action processes the following events:

```
```json
{ "sourceIp": "127.0.0.1", "destinationIp": "192.168.0.1", "request" : "/index.html", "latency": 0.2 }
{ "sourceIp": "127.0.0.1", "destinationIp": "192.168.0.1", "request" : "/index.html", "latency": 0.55}
{ "sourceIp": "127.0.0.1", "destinationIp": "192.168.0.1", "request" : "/index.html", "latency": 0.25 }
@@ -139,7 +139,7 @@ You can set the percentage of events using the `percent` configuration, which in

For example, if `percent` is set to `50`, the action tries to process the following events in the one-second interval:

```
```json
{ "sourceIp": "127.0.0.1", "destinationIp": "192.168.0.1", "bytes": 2500 }
{ "sourceIp": "127.0.0.1", "destinationIp": "192.168.0.1", "bytes": 500 }
{ "sourceIp": "127.0.0.1", "destinationIp": "192.168.0.1", "bytes": 1000 }
@@ -148,7 +148,7 @@ For example, if percent is set to `50`, the action tries to process the followin

The pipeline processes 50% of the events, drops the other events, and does not generate a new event:

```
```json
{ "sourceIp": "127.0.0.1", "destinationIp": "192.168.0.1", "bytes": 500 }
{ "sourceIp": "127.0.0.1", "destinationIp": "192.168.0.1", "bytes": 3100 }
```
@@ -64,13 +64,14 @@ To get started, create the following `pipeline.yaml` file. You can use the follo
ad-pipeline:
source:
...
....
....
processor:
- anomaly_detector:
keys: ["latency"]
mode:
mode:
random_cut_forest:
```
{% include copy.html %}

When you run the `anomaly_detector` processor, the processor extracts the value for the `latency` key and then passes the value through the RCF ML algorithm. You can configure any key whose values are integers or real numbers. In the following example, you can configure `bytes` or `latency` as the key for an anomaly detector.
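
For instance, a minimal sketch that uses `bytes` as the monitored key, following the same structure as the file above, might look like this:

```yaml
processor:
  - anomaly_detector:
      keys: ["bytes"]
      mode:
        random_cut_forest:
```
{% include copy.html %}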

@@ -42,7 +42,7 @@ Field | Type | Required | Description

#### Example configuration

```
```yaml
processors:
- aws_lambda:
function_name: "my-lambda-function"
@@ -62,7 +62,6 @@ processors:
maximum_size: "5mb"
event_collect_timeout: PT10S
lambda_when: "event['status'] == 'process'"

```
{% include copy.html %}

@@ -98,7 +97,7 @@ Note the following limitations:

Integration tests for this plugin are executed separately from the main Data Prepper build process. Use the following Gradle command to run these tests:

```
```bash
./gradlew :data-prepper-plugins:aws-lambda:integrationTest -Dtests.processor.lambda.region="us-east-1" -Dtests.processor.lambda.functionName="lambda_test_function" -Dtests.processor.lambda.sts_role_arn="arn:aws:iam::123456789012:role/dataprepper-role"
```
{% include copy.html %}
5 changes: 3 additions & 2 deletions _data-prepper/pipelines/configuration/processors/csv.md
@@ -48,9 +48,10 @@ csv-pipeline:

When run, the processor will parse the message. Although only two column names are specified in processor settings, a third column name is automatically generated because the data contained in `ingest.csv` includes three columns, `1,2,3`:

```
```json
{"message": "1,2,3", "col1": "1", "col2": "2", "column3": "3"}
```

### Automatically detect column names

The following configuration automatically detects the header of a CSV file ingested through an [`s3 source`]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/sources/s3/):
@@ -80,7 +81,7 @@ csv-s3-pipeline:

For example, if the `ingest.csv` file in the Amazon Simple Storage Service (Amazon S3) bucket that the Amazon Simple Queue Service (SQS) queue is attached to contains the following data:

```
```text
Should,skip,this,line
a,b,c
1,2,3