---
layout: default
title: Expanding a workload's data corpus
nav_order: 20
parent: Optimizing benchmarks
grand_parent: User guide
---

# Expanding a workload's data corpus

This tutorial shows you how to use the [`expand-data-corpus.py`](https://github.com/opensearch-project/opensearch-benchmark/blob/main/scripts/expand-data-corpus.py) script to increase the size of the data corpus for an OpenSearch Benchmark workload. This can be helpful when running the `http_logs` workload against a large OpenSearch cluster.

This script only works with the `http_logs` workload.
{: .warning}

## Prerequisites

To use this tutorial, make sure you fulfill the following prerequisites:

1. You have installed Python 3 or later.
2. The `http_logs` workload data corpus is already stored on the load generation host running OpenSearch Benchmark.

## Understanding the script

The `expand-data-corpus.py` script is designed to generate a larger data corpus by duplicating and modifying existing documents from the `http_logs` workload corpus. It primarily adjusts the timestamp field while keeping other fields intact. It also generates an offset file, which enables OpenSearch Benchmark to start up faster.

## Using `expand-data-corpus.py`

To use `expand-data-corpus.py`, use the following syntax:

```bash
./expand-data-corpus.py [options]
```

The script provides several customization options. The following are the most commonly used options:

- `--corpus-size`: The desired corpus size in GB
- `--output-file-suffix`: The suffix for the output file name

## Example

The following command generates a 100 GB corpus:

```bash
./expand-data-corpus.py --corpus-size 100 --output-file-suffix 100gb
```

The script starts generating documents as soon as you run it. Generating a full 100 GB corpus can take up to 30 minutes.

You can generate multiple corpora by running the script multiple times with different output suffixes. OpenSearch Benchmark uses all corpora generated by the script sequentially during ingestion.
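
The loop below sketches one way to script this. The sizes and suffixes are illustrative, and `echo` prints each command instead of executing it; remove `echo` to actually run the script:

```bash
# Generate three corpora of increasing size, each with a distinct suffix.
# 'echo' only prints the commands; remove it to run the script for real.
for size in 100 200 400; do
  echo ./expand-data-corpus.py --corpus-size "$size" --output-file-suffix "${size}gb"
done
```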

## Verifying the documents

After the script completes, check the following locations for new files:

- In the OpenSearch Benchmark data directory for `http_logs`:
  - `documents-100gb.json`: The generated corpus
  - `documents-100gb.json.offset`: The associated offset file

- In the `http_logs` workload directory:
  - `gen-docs-100gb.json`: The metadata for the generated corpus
  - `gen-idx-100gb.json`: The index specification for the generated corpus
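
A quick shell check can confirm that the corpus and offset files exist. The path below assumes the default OpenSearch Benchmark root directory (`~/.benchmark`); adjust it if your installation uses a different location:

```bash
# Check for the generated corpus and its offset file. The data directory
# path assumes the default ~/.benchmark root; adjust for your setup.
data_dir=~/.benchmark/benchmarks/data/http_logs
for f in "$data_dir/documents-100gb.json" "$data_dir/documents-100gb.json.offset"; do
  if [ -f "$f" ]; then
    echo "found: $f"
  else
    echo "missing: $f"
  fi
done
```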

## Using the corpus in a test

To use the newly generated corpus in an OpenSearch Benchmark test, run a command with the following syntax:

```bash
opensearch-benchmark execute-test --workload http_logs --workload-params=generated_corpus:t [other_options]
```

The `generated_corpus:t` parameter tells OpenSearch Benchmark to use the expanded corpus. You can append any additional workload parameters to the `--workload-params` option, separated by commas.
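
For example, you might build the parameter string by joining `key:value` pairs with commas. In this sketch, `bulk_indexing_clients` is only a hypothetical additional parameter, and `echo` prints the command rather than running it:

```bash
# Join workload parameters with commas. bulk_indexing_clients is an
# example parameter; check your workload's documentation for real names.
params="generated_corpus:t,bulk_indexing_clients:8"
echo opensearch-benchmark execute-test --workload http_logs --workload-params="$params"
```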

## Expert-level settings

Use `--help` to see all of the script's supported options. Be cautious when using the following expert-level settings because they may affect the corpus structure:

- `-f`: Specifies the input file to use as a base for generating new documents
- `-n`: Sets the number of documents to generate instead of the corpus size
- `-i`: Defines the interval between consecutive timestamps
- `-t`: Sets the starting timestamp for the generated documents
- `-b`: Defines the number of documents per batch when writing to the offset file
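
As an illustration only, the command below combines several of these flags. All values, including the timestamp format, are assumptions; check `./expand-data-corpus.py --help` for the exact formats. `echo` prints the command instead of running it:

```bash
# Illustrative only: generate a fixed number of documents with a custom
# starting timestamp and interval. The values and timestamp format are
# assumptions; verify them with ./expand-data-corpus.py --help.
echo ./expand-data-corpus.py -n 1000000 -t "2023-01-01" -i 1 --output-file-suffix custom
```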