---
layout: default
title: Expanding a workload's data corpus
nav_order: 20
parent: Optimizing benchmarks
grand_parent: User guide
---

# Expanding a workload's data corpus

This tutorial shows you how to use the [`expand-data-corpus.py`](https://github.com/opensearch-project/opensearch-benchmark/blob/main/scripts/expand-data-corpus.py) script to increase the size of the data corpus for an OpenSearch Benchmark workload. This can be helpful when running the `http_logs` workload against a large OpenSearch cluster.

This script only works with the `http_logs` workload.
{: .warning}

## Prerequisites

To use this tutorial, make sure you fulfill the following prerequisites:

1. You have installed Python 3 or later.
2. The `http_logs` workload data corpus is already stored on the load generation host running OpenSearch Benchmark.

## Understanding the script

The `expand-data-corpus.py` script is designed to generate a larger data corpus by duplicating and modifying existing documents from the `http_logs` workload corpus. It primarily adjusts the timestamp field while keeping other fields intact. It also generates an offset file, which enables OpenSearch Benchmark to start up faster.

## Using `expand-data-corpus.py`

To use `expand-data-corpus.py`, use the following syntax:

```bash
./expand-data-corpus.py [options]
```

The script provides several customization options. The following are the most commonly used options:

- `--corpus-size`: The desired corpus size in GB
- `--output-file-suffix`: The suffix for the output file name

## Example

The following command generates a 100 GB corpus:

```bash
./expand-data-corpus.py --corpus-size 100 --output-file-suffix 100gb
```

The script starts generating documents as soon as you run it. Generating a full 100 GB corpus can take up to 30 minutes.

You can generate multiple corpora by running the script multiple times with different output suffixes. OpenSearch Benchmark uses all corpora generated by the script sequentially during ingestion.
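
The loop below sketches one way to script this. The sizes and suffixes are illustrative, and `echo` prints each command instead of executing it; remove `echo` to actually run the script:

```bash
# Generate three corpora of increasing size, each with a distinct suffix.
# 'echo' only prints the commands; remove it to run the script for real.
for size in 100 200 400; do
  echo ./expand-data-corpus.py --corpus-size "$size" --output-file-suffix "${size}gb"
done
```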

## Verifying the documents

After the script completes, check the following locations for new files:

- In the OpenSearch Benchmark data directory for `http_logs`:
  - `documents-100gb.json`: The generated corpus
  - `documents-100gb.json.offset`: The associated offset file

- In the `http_logs` workload directory:
  - `gen-docs-100gb.json`: The metadata for the generated corpus
  - `gen-idx-100gb.json`: The index specification for the generated corpus
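
A quick shell check can confirm that the corpus and offset files exist. The path below assumes the default OpenSearch Benchmark root directory (`~/.benchmark`); adjust it if your installation uses a different location:

```bash
# Check for the generated corpus and its offset file. The data directory
# path assumes the default ~/.benchmark root; adjust for your setup.
data_dir=~/.benchmark/benchmarks/data/http_logs
for f in "$data_dir/documents-100gb.json" "$data_dir/documents-100gb.json.offset"; do
  if [ -f "$f" ]; then
    echo "found: $f"
  else
    echo "missing: $f"
  fi
done
```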

## Using the corpus in a test

To use the newly generated corpus in an OpenSearch Benchmark test, run a command with the following syntax:

```bash
opensearch-benchmark execute-test --workload http_logs --workload-params=generated_corpus:t [other_options]
```

The `generated_corpus:t` parameter tells OpenSearch Benchmark to use the expanded corpus. You can append any additional workload parameters to the `--workload-params` option, separated by commas.
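
For example, you might build the parameter string by joining `key:value` pairs with commas. In this sketch, `bulk_indexing_clients` is only a hypothetical additional parameter, and `echo` prints the command rather than running it:

```bash
# Join workload parameters with commas. bulk_indexing_clients is an
# example parameter; check your workload's documentation for real names.
params="generated_corpus:t,bulk_indexing_clients:8"
echo opensearch-benchmark execute-test --workload http_logs --workload-params="$params"
```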

## Expert-level settings

Use `--help` to see all of the script's supported options. Be cautious when using the following expert-level settings because they may affect the corpus structure:

- `-f`: Specifies the input file to use as a base for generating new documents
- `-n`: Sets the number of documents to generate instead of the corpus size
- `-i`: Defines the interval between consecutive timestamps
- `-t`: Sets the starting timestamp for the generated documents
- `-b`: Defines the number of documents per batch when writing to the offset file
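
As an illustration only, the command below combines several of these flags. All values, including the timestamp format, are assumptions; check `./expand-data-corpus.py --help` for the exact formats. `echo` prints the command instead of running it:

```bash
# Illustrative only: generate a fixed number of documents with a custom
# starting timestamp and interval. The values and timestamp format are
# assumptions; verify them with ./expand-data-corpus.py --help.
echo ./expand-data-corpus.py -n 1000000 -t "2023-01-01" -i 1 --output-file-suffix custom
```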