
Commit fadfee3

Add expand data corpus instructions (opensearch-project#8807)
* Add expand data corpus instructions
* Apply suggestions from code review
* Update _benchmark/user-guide/optimizing-benchmarks/expand-data-corpus.md

---------

Signed-off-by: Archer <[email protected]>
Signed-off-by: Naarcha-AWS <[email protected]>
Signed-off-by: Nathan Bower <[email protected]>
Co-authored-by: Nathan Bower <[email protected]>
1 parent 066b6cd commit fadfee3

1 file changed: _benchmark/user-guide/optimizing-benchmarks/expand-data-corpus.md (83 additions, 0 deletions)
@@ -0,0 +1,83 @@
---
layout: default
title: Expanding a workload's data corpus
nav_order: 20
parent: Optimizing benchmarks
grand_parent: User guide
---

# Expanding a workload's data corpus

This tutorial shows you how to use the [`expand-data-corpus.py`](https://github.com/opensearch-project/opensearch-benchmark/blob/main/scripts/expand-data-corpus.py) script to increase the size of the data corpus for an OpenSearch Benchmark workload. This can be helpful when running the `http_logs` workload against a large OpenSearch cluster.

This script only works with the `http_logs` workload.
{: .warning}

## Prerequisites

To use this tutorial, make sure you fulfill the following prerequisites (a quick way to check both is sketched after the list):

1. You have installed Python 3 or later.
2. The `http_logs` workload data corpus is already stored on the load generation host running OpenSearch Benchmark.

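The following is a minimal sketch for checking both prerequisites from the load generation host. The data directory path is an assumption based on the default OpenSearch Benchmark location; adjust it if you use a custom benchmark home.

```bash
# Confirm that Python 3 is available.
python3 --version

# Confirm that the http_logs corpus files are present. The path below is the
# default OpenSearch Benchmark data directory and may differ on your host.
ls -lh ~/.benchmark/benchmarks/data/http_logs/
```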

## Understanding the script

The `expand-data-corpus.py` script generates a larger data corpus by duplicating and modifying existing documents from the `http_logs` workload corpus. It primarily adjusts the timestamp field while keeping other fields intact. It also generates an offset file, which enables OpenSearch Benchmark to start up faster.

## Using `expand-data-corpus.py`

To run `expand-data-corpus.py`, use the following syntax:

```bash
./expand-data-corpus.py [options]
```

The script provides several customization options. The following are the most commonly used:

- `--corpus-size`: The desired corpus size, in GB.
- `--output-file-suffix`: The suffix for the output file name.

## Example

The following example command generates a 100 GB corpus:

```bash
./expand-data-corpus.py --corpus-size 100 --output-file-suffix 100gb
```

The script starts generating documents as soon as you run it. For a 100 GB corpus, it can take up to 30 minutes to generate the full corpus.

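Because a large corpus can take a while to generate, you may prefer to run the script in the background and check on it periodically. The following is a sketch only; the log file name is arbitrary, and the corpus path assumes the default OpenSearch Benchmark data directory.

```bash
# Run the generator in the background and capture its output in a log file.
nohup ./expand-data-corpus.py --corpus-size 100 --output-file-suffix 100gb > expand-corpus.log 2>&1 &

# Check progress by watching the size of the generated corpus file.
ls -lh ~/.benchmark/benchmarks/data/http_logs/documents-100gb.json
```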

You can generate multiple corpora by running the script multiple times with different output file suffixes. All corpora generated by the script are used by OpenSearch Benchmark sequentially during ingestion.

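For example, the following hypothetical commands create two additional corpora with distinct suffixes; the sizes and suffix names are illustrative only:

```bash
# Each run produces its own generated corpus and offset file, named after the suffix.
./expand-data-corpus.py --corpus-size 50 --output-file-suffix 50gb-a
./expand-data-corpus.py --corpus-size 50 --output-file-suffix 50gb-b
```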

## Verifying the documents

After the script completes, check the following locations for new files (a quick listing command is sketched after the list):

- In the OpenSearch Benchmark data directory for `http_logs`:
  - `documents-100gb.json`: The generated corpus
  - `documents-100gb.json.offset`: The associated offset file

- In the `http_logs` workload directory:
  - `gen-docs-100gb.json`: The metadata for the generated corpus
  - `gen-idx-100gb.json`: The index specification for the generated corpus

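The following sketch lists both sets of files. Both paths are assumptions based on the default OpenSearch Benchmark data and workload directories; adjust them to match your setup.

```bash
# Generated corpus and offset file (default data directory is an assumption).
ls -lh ~/.benchmark/benchmarks/data/http_logs/documents-100gb.json*

# Generated metadata and index specification (default workload directory is an assumption).
ls -l ~/.benchmark/benchmarks/workloads/default/http_logs/gen-*-100gb.json
```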

## Using the corpus in a test

To run an OpenSearch Benchmark test with the newly generated corpus, use the following syntax:

```bash
opensearch-benchmark execute-test --workload http_logs --workload-params=generated_corpus:t [other_options]
```

The `generated_corpus:t` parameter tells OpenSearch Benchmark to use the expanded corpus. Any additional workload parameters can be appended to the `--workload-params` option using commas.

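For example, the following hypothetical command combines `generated_corpus` with a second workload parameter. The `bulk_size` parameter and its value are illustrative assumptions, not part of the script's requirements:

```bash
# Multiple workload parameters are separated by commas inside --workload-params.
opensearch-benchmark execute-test \
  --workload http_logs \
  --workload-params=generated_corpus:t,bulk_size:10000
```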

## Expert-level settings

Use `--help` to see all of the script's supported options. Be cautious when using the following expert-level settings because they may affect the corpus structure. An example invocation is sketched after the list:

- `-f`: Specifies the input file to use as a base for generating new documents
- `-n`: Sets the number of documents to generate instead of the corpus size
- `-i`: Defines the interval between consecutive timestamps
- `-t`: Sets the starting timestamp for the generated documents
- `-b`: Defines the number of documents per batch when writing to the offset file

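The following is a sketch only, under the assumption that `-n` and `-b` accept integer values as described in the list above; the numbers and suffix are illustrative, not recommendations:

```bash
# Generate a fixed number of documents (rather than targeting a corpus size)
# and write offset entries in larger batches. Values are illustrative only.
./expand-data-corpus.py -n 1000000 -b 10000 --output-file-suffix 1m-docs
```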
