9 changes: 6 additions & 3 deletions _data-prepper/common-use-cases/log-analytics.md
@@ -67,7 +67,7 @@ log-pipeline:
# Change to your credentials
username: "admin"
password: "admin"
# Add a certificate file if you are accessing an OpenSearch cluster with a self-signed certificate
# Add a certificate file if you are accessing an OpenSearch cluster with a self-signed certificate
#cert: /path/to/cert
# If you are connecting to an Amazon OpenSearch Service domain without
# Fine-Grained Access Control, enable these settings. Comment out the
@@ -78,6 +78,7 @@ log-pipeline:
# You should change this to correspond with how your OpenSearch indexes are set up.
index: apache_logs
```
{% include copy.html %}

This pipeline configuration is an example of Apache log ingestion. Don't forget that you can easily configure the Grok Processor for your own custom logs. You will need to modify the configuration for your OpenSearch cluster.
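
For example, a minimal `grok` processor sketch for a hypothetical custom log format might look like the following; the `log` field name and the pattern are illustrative placeholders, so replace them with the field and format your logs actually use:

```yaml
processor:
  - grok:
      match:
        # Hypothetical field name and pattern; adjust them to your own log format
        log: ['%{IPORHOST:client_ip} %{NUMBER:response_time:float} %{GREEDYDATA:request_body}']
```
{% include copy.html %}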

@@ -100,7 +101,7 @@ Note that you should adjust the file `path`, output `Host`, and `Port` according

The following is an example `fluent-bit.conf` file without SSL and basic authentication enabled on the HTTP source:

```
```text
[INPUT]
name tail
refresh_interval 5
@@ -115,14 +116,15 @@ The following is an example `fluent-bit.conf` file without SSL and basic authent
URI /log/ingest
Format json
```
{% include copy.html %}

If your HTTP source has SSL and basic authentication enabled, you will need to add the details of `http_User`, `http_Passwd`, `tls.crt_file`, and `tls.key_file` to the `fluent-bit.conf` file, as shown in the following example.

### Example: Fluent Bit file with SSL and basic authentication enabled

The following is an example `fluent-bit.conf` file with SSL and basic authentication enabled on the HTTP source:

```
```text
[INPUT]
name tail
refresh_interval 5
@@ -142,6 +144,7 @@ The following is an example `fluent-bit.conf` file with SSL and basic authentica
URI /log/ingest
Format json
```
{% include copy.html %}

# Next steps

38 changes: 21 additions & 17 deletions _data-prepper/common-use-cases/trace-analytics.md
@@ -116,7 +116,7 @@ The following example demonstrates how to build a pipeline that supports the [Op

Starting with Data Prepper version 2.0, the `otel_traces_prepper` processor is no longer supported. Use the `otel_traces` processor instead; it replaces `otel_traces_prepper` and supports some of Data Prepper's recent data model changes. See the following YAML file example:

```yml
```yaml
entry-pipeline:
delay: "100"
source:
@@ -167,6 +167,7 @@ service-map-pipeline:
password: admin
index_type: trace-analytics-service-map
```
{% include copy.html %}

To maintain similar ingestion throughput and latency, scale the `buffer_size` and `batch_size` by the estimated maximum batch size in the client request payload.
{: .tip}
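
As a purely hypothetical illustration, if your client request payloads are roughly four times larger than those the example values above were sized for, you might scale both settings by the same factor:

```yaml
buffer:
  bounded_blocking:
    buffer_size: 2048   # hypothetical: 4 x the example value of 512
    batch_size: 32      # hypothetical: scaled by the same factor as buffer_size
```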

@@ -186,21 +187,22 @@ source:
username: "my-user"
password: "my_s3cr3t"
```
{% include copy.html %}

#### Example: pipeline.yaml

The following is an example `pipeline.yaml` file without SSL and basic authentication enabled for the `otel-trace-pipeline` pipeline:

```yaml
otel-trace-pipeline:
# workers is the number of threads processing data in each pipeline.
# workers is the number of threads processing data in each pipeline.
  # We recommend the same value for all pipelines.
  # The default value is 1. Set a value based on the machine on which you are running Data Prepper.
workers: 8
workers: 8
# delay in milliseconds is how often the worker threads should process data.
  # We recommend not changing this config so that the entry-pipeline can process data as quickly as possible
# default value is 3_000 ms
delay: "100"
delay: "100"
source:
otel_trace_source:
#record_type: event # Add this when using Data Prepper 1.x. This option is removed in 2.0
@@ -209,8 +211,8 @@ otel-trace-pipeline:
unauthenticated:
buffer:
bounded_blocking:
      # buffer_size is the number of ExportTraceRequest from otel-collector that Data Prepper should hold in memory.
      # We recommend keeping the same buffer_size for all pipelines.
      # buffer_size is the number of ExportTraceRequest from otel-collector that Data Prepper should hold in memory.
      # We recommend keeping the same buffer_size for all pipelines.
# Make sure you configure sufficient heap
# default value is 512
buffer_size: 512
@@ -225,9 +227,9 @@ otel-trace-pipeline:
name: "entry-pipeline"
raw-trace-pipeline:
# Configure same as the otel-trace-pipeline
workers: 8
workers: 8
# We recommend using the default value for the raw-trace-pipeline.
delay: "3000"
delay: "3000"
source:
pipeline:
name: "entry-pipeline"
@@ -248,7 +250,7 @@ raw-trace-pipeline:
# Change to your credentials
username: "admin"
password: "admin"
# Add a certificate file if you are accessing an OpenSearch cluster with a self-signed certificate
# Add a certificate file if you are accessing an OpenSearch cluster with a self-signed certificate
#cert: /path/to/cert
# If you are connecting to an Amazon OpenSearch Service domain without
# Fine-Grained Access Control, enable these settings. Comment out the
@@ -262,7 +264,7 @@ raw-trace-pipeline:
# Change to your credentials
username: "admin"
password: "admin"
# Add a certificate file if you are accessing an OpenSearch cluster with a self-signed certificate
# Add a certificate file if you are accessing an OpenSearch cluster with a self-signed certificate
#cert: /path/to/cert
# If you are connecting to an Amazon OpenSearch Service domain without
# Fine-Grained Access Control, enable these settings. Comment out the
@@ -277,14 +279,14 @@ service-map-pipeline:
name: "entry-pipeline"
processor:
- service_map:
        # The window duration is the maximum length of time Data Prepper stores the most recent trace data to evaluate service-map relationships.
        # The window duration is the maximum length of time Data Prepper stores the most recent trace data to evaluate service-map relationships.
        # The default is 3 minutes, which means relationships between services can be detected from spans reported in the last 3 minutes.
# Set higher value if your applications have higher latency.
window_duration: 180
# Set higher value if your applications have higher latency.
window_duration: 180
buffer:
bounded_blocking:
      # buffer_size is the number of ExportTraceRequest from otel-collector that Data Prepper should hold in memory.
      # We recommend keeping the same buffer_size for all pipelines.
      # buffer_size is the number of ExportTraceRequest from otel-collector that Data Prepper should hold in memory.
      # We recommend keeping the same buffer_size for all pipelines.
# Make sure you configure sufficient heap
# default value is 512
buffer_size: 512
@@ -299,14 +301,15 @@ service-map-pipeline:
# Change to your credentials
username: "admin"
password: "admin"
# Add a certificate file if you are accessing an OpenSearch cluster with a self-signed certificate
# Add a certificate file if you are accessing an OpenSearch cluster with a self-signed certificate
#cert: /path/to/cert
# If you are connecting to an Amazon OpenSearch Service domain without
# Fine-Grained Access Control, enable these settings. Comment out the
# username and password above.
#aws_sigv4: true
#aws_region: us-east-1
```
{% include copy.html %}

You need to modify the preceding configuration for your OpenSearch cluster so that the configuration matches your environment. Note that it has two `opensearch` sinks that need to be modified.
{: .note}
@@ -328,7 +331,7 @@ You need to run OpenTelemetry Collector in your service environment. Follow [Get

The following is an example `otel-collector-config.yaml` file:

```
```yaml
receivers:
jaeger:
protocols:
@@ -356,6 +359,7 @@ service:
processors: [batch/traces]
exporters: [otlp/data-prepper]
```
{% include copy.html %}

After you run the OpenTelemetry Collector in your service environment, you must configure your application to use it. The OpenTelemetry Collector typically runs alongside your application.
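
One minimal sketch, assuming your application is instrumented with an OpenTelemetry SDK and the Collector listens on the default OTLP gRPC port on the same host, is to point the SDK at the Collector through environment variables (the endpoint and service name shown here are placeholders):

```bash
# Hypothetical values; adjust the endpoint and service name for your environment
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
export OTEL_TRACES_EXPORTER=otlp
export OTEL_SERVICE_NAME=my-service
```
{% include copy.html %}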

21 changes: 12 additions & 9 deletions _data-prepper/getting-started.md
@@ -19,7 +19,7 @@ There are two ways to install Data Prepper: you can run the Docker image or buil

The easiest way to use Data Prepper is by running the Docker image. We suggest that you use this approach if you have [Docker](https://www.docker.com) available. Run the following command:

```
```bash
docker pull opensearchproject/data-prepper:latest
```
{% include copy.html %}
@@ -36,27 +36,30 @@ Two configuration files are required to run a Data Prepper instance. Optionally,

For Data Prepper versions earlier than 2.0, the `.jar` file expects the pipeline configuration file path to be followed by the server configuration file path. See the following configuration path example:

```
```bash
java -jar data-prepper-core-$VERSION.jar pipelines.yaml data-prepper-config.yaml
```
{% include copy.html %}

Optionally, you can add `"-Dlog4j.configurationFile=config/log4j2.properties"` to the command to pass a custom Log4j 2 configuration file. If you don't provide a properties file, Data Prepper defaults to the `log4j2.properties` file in the `shared-config` directory.
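
For example, a pre-2.0 launch command that passes a custom Log4j 2 configuration might look like the following sketch; it simply combines the flag with the earlier command, and `$VERSION` and the file paths are placeholders from that example:

```bash
java "-Dlog4j.configurationFile=config/log4j2.properties" -jar data-prepper-core-$VERSION.jar pipelines.yaml data-prepper-config.yaml
```
{% include copy.html %}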


Starting with Data Prepper 2.0, you can launch Data Prepper by using the following `data-prepper` script that does not require any additional command line arguments:

```
```bash
bin/data-prepper
```
{% include copy.html %}

Configuration files are read from specific subdirectories in the application's home directory:
1. `pipelines/`: Used for pipeline configurations. Pipeline configurations can be written in one or more YAML files.
2. `config/data-prepper-config.yaml`: Used for the Data Prepper server configuration.
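
Assuming the default layout described above, the application's home directory might therefore look similar to the following sketch (the top-level directory name is only an illustration):

```text
data-prepper/                        # hypothetical home directory name
├── bin/data-prepper                 # launch script
├── config/data-prepper-config.yaml  # server configuration
├── config/log4j2.properties         # optional Log4j 2 configuration
└── pipelines/pipelines.yaml         # one or more pipeline YAML files
```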

You can supply your own pipeline configuration file path followed by the server configuration file path. However, this method will not be supported in a future release. See the following example:
```
```bash
bin/data-prepper pipelines.yaml data-prepper-config.yaml
```
{% include copy.html %}

The Log4j 2 configuration file is read from the `config/log4j2.properties` file located in the application's home directory.

@@ -69,7 +72,7 @@ To configure Data Prepper, see the following information for each use case:

Create a Data Prepper pipeline file named `pipelines.yaml` using the following configuration:

```yml
```yaml
simple-sample-pipeline:
workers: 2
delay: "5000"
@@ -96,7 +99,7 @@ The example pipeline configuration above demonstrates a simple pipeline with a s

After starting Data Prepper, you should see log output and some UUIDs after a few seconds:

```yml
```text
2021-09-30T20:19:44,147 [main] INFO com.amazon.dataprepper.pipeline.server.DataPrepperServer - Data Prepper server running at :4900
2021-09-30T20:19:44,681 [random-source-pool-0] INFO com.amazon.dataprepper.plugins.source.RandomStringSource - Writing to buffer
2021-09-30T20:19:45,183 [random-source-pool-0] INFO com.amazon.dataprepper.plugins.source.RandomStringSource - Writing to buffer
@@ -120,21 +123,21 @@ image and modify both the `pipelines.yaml` and `data-prepper-config.yaml` files.

For Data Prepper 2.0 or later, use this command:

```
```bash
docker run --name data-prepper -p 4900:4900 -v ${PWD}/pipelines.yaml:/usr/share/data-prepper/pipelines/pipelines.yaml -v ${PWD}/data-prepper-config.yaml:/usr/share/data-prepper/config/data-prepper-config.yaml opensearchproject/data-prepper:latest
```
{% include copy.html %}

For Data Prepper versions earlier than 2.0, use this command:

```
```bash
docker run --name data-prepper -p 4900:4900 -v ${PWD}/pipelines.yaml:/usr/share/data-prepper/pipelines.yaml -v ${PWD}/data-prepper-config.yaml:/usr/share/data-prepper/data-prepper-config.yaml opensearchproject/data-prepper:1.x
```
{% include copy.html %}

Once Data Prepper is running, it processes data until it is shut down. When you are finished, shut it down with the following command:

```
```bash
POST /shutdown
```
{% include copy-curl.html %}
10 changes: 5 additions & 5 deletions _data-prepper/pipelines/configuration/processors/aggregate.md
@@ -38,15 +38,15 @@ The `remove_duplicates` action processes the first event for a group immediately

The `put_all` action combines events belonging to the same group by overwriting existing keys and adding new keys, similarly to the Java `Map.putAll`. The action drops all events that make up the combined event. For example, when using `identification_keys: ["sourceIp", "destination_ip"]`, the `put_all` action processes the following three events:

```
```json
{ "sourceIp": "127.0.0.1", "destinationIp": "192.168.0.1", "status": 200 }
{ "sourceIp": "127.0.0.1", "destinationIp": "192.168.0.1", "bytes": 1000 }
{ "sourceIp": "127.0.0.1", "destinationIp": "192.168.0.1", "http_verb": "GET" }
```

Then the action combines the events into one. The pipeline then uses the following combined event:

```
```json
{ "sourceIp": "127.0.0.1", "destinationIp": "192.168.0.1", "status": 200, "bytes": 1000, "http_verb": "GET" }
```

@@ -93,7 +93,7 @@ You can customize the processor with the following configuration options:

For example, when using `identification_keys: ["sourceIp", "destination_ip", "request"]`, `key: latency`, and `buckets: [0.0, 0.25, 0.5]`, the `histogram` action processes the following events:

```
```json
{ "sourceIp": "127.0.0.1", "destinationIp": "192.168.0.1", "request" : "/index.html", "latency": 0.2 }
{ "sourceIp": "127.0.0.1", "destinationIp": "192.168.0.1", "request" : "/index.html", "latency": 0.55}
{ "sourceIp": "127.0.0.1", "destinationIp": "192.168.0.1", "request" : "/index.html", "latency": 0.25 }
@@ -139,7 +139,7 @@ You can set the percentage of events using the `percent` configuration, which in

For example, if `percent` is set to `50`, the action tries to process the following events in the one-second interval:

```
```json
{ "sourceIp": "127.0.0.1", "destinationIp": "192.168.0.1", "bytes": 2500 }
{ "sourceIp": "127.0.0.1", "destinationIp": "192.168.0.1", "bytes": 500 }
{ "sourceIp": "127.0.0.1", "destinationIp": "192.168.0.1", "bytes": 1000 }
@@ -148,7 +148,7 @@ For example, if percent is set to `50`, the action tries to process the followin

The pipeline processes 50% of the events, drops the other events, and does not generate a new event:

```
```json
{ "sourceIp": "127.0.0.1", "destinationIp": "192.168.0.1", "bytes": 500 }
{ "sourceIp": "127.0.0.1", "destinationIp": "192.168.0.1", "bytes": 3100 }
```
@@ -64,13 +64,14 @@ To get started, create the following `pipeline.yaml` file. You can use the follo
ad-pipeline:
source:
...
....
....
processor:
- anomaly_detector:
keys: ["latency"]
mode:
mode:
random_cut_forest:
```
{% include copy.html %}

When you run the `anomaly_detector` processor, the processor extracts the value for the `latency` key and then passes the value through the RCF ML algorithm. You can configure any key whose values are integers or real numbers. In the following example, you can configure `bytes` or `latency` as the key for an anomaly detector.
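
For instance, a minimal sketch that uses `bytes` as the monitored key, following the same structure as the file above, might look like this:

```yaml
processor:
  - anomaly_detector:
      keys: ["bytes"]
      mode:
        random_cut_forest:
```
{% include copy.html %}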

@@ -42,7 +42,7 @@ Field | Type | Required | Description

#### Example configuration

```
```yaml
processors:
- aws_lambda:
function_name: "my-lambda-function"
@@ -62,7 +62,6 @@ processors:
maximum_size: "5mb"
event_collect_timeout: PT10S
lambda_when: "event['status'] == 'process'"

```
{% include copy.html %}

@@ -98,7 +97,7 @@ Note the following limitations:

Integration tests for this plugin are executed separately from the main Data Prepper build process. Use the following Gradle command to run these tests:

```
```bash
./gradlew :data-prepper-plugins:aws-lambda:integrationTest -Dtests.processor.lambda.region="us-east-1" -Dtests.processor.lambda.functionName="lambda_test_function" -Dtests.processor.lambda.sts_role_arn="arn:aws:iam::123456789012:role/dataprepper-role"
```
{% include copy.html %}
5 changes: 3 additions & 2 deletions _data-prepper/pipelines/configuration/processors/csv.md
@@ -48,9 +48,10 @@ csv-pipeline:

When run, the processor will parse the message. Although only two column names are specified in processor settings, a third column name is automatically generated because the data contained in `ingest.csv` includes three columns, `1,2,3`:

```
```json
{"message": "1,2,3", "col1": "1", "col2": "2", "column3": "3"}
```

### Automatically detect column names

The following configuration automatically detects the header of a CSV file ingested through an [`s3 source`]({{site.url}}{{site.baseurl}}/data-prepper/pipelines/configuration/sources/s3/):
@@ -80,7 +81,7 @@ csv-s3-pipeline:

For example, if the `ingest.csv` file in the Amazon Simple Storage Service (Amazon S3) bucket that the Amazon Simple Queue Service (SQS) queue is attached to contains the following data:

```
```text
Should,skip,this,line
a,b,c
1,2,3