_benchmark/user-guide/optimizing-benchmarks/expand-data-corpus.md
---
layout: default
title: Expanding a workload's data corpus
nav_order: 20
parent: Optimizing benchmarks
grand_parent: User guide
---
# Expanding a workload's data corpus

This tutorial shows you how to use the [`expand-data-corpus.py`](https://github.com/opensearch-project/opensearch-benchmark/blob/main/scripts/expand-data-corpus.py) script to increase the size of the data corpus for an OpenSearch Benchmark workload. This can be helpful when running the `http_logs` workload against a large OpenSearch cluster.

This script only works with the `http_logs` workload.
{: .warning}
## Prerequisites

Before starting this tutorial, make sure that you fulfill the following prerequisites:

1. You have Python 3.x installed.
2. The `http_logs` workload data corpus is already stored on the load generation host running OpenSearch Benchmark.
## Understanding the script

The `expand-data-corpus.py` script generates a larger data corpus by duplicating and modifying existing documents from the `http_logs` workload corpus. It primarily adjusts the timestamp field while keeping the other fields intact. It also generates an offset file, which enables OpenSearch Benchmark to start up faster.
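Conceptually, the core of the duplicate-and-shift approach can be sketched as follows. This is a simplified illustration, not the actual script: the `timestamp` field name, the ISO-8601 format, and the line-delimited JSON layout are assumptions made for the sketch.

```python
import json
from datetime import datetime, timedelta

def expand_corpus(src_lines, copies, interval_seconds=1):
    """Duplicate each source document, shifting only its timestamp.

    src_lines: iterable of JSON strings, one document per line.
    copies: how many shifted copies of the source corpus to emit.
    """
    out = []
    for n in range(copies):
        shift = timedelta(seconds=n * interval_seconds)
        for line in src_lines:
            doc = json.loads(line)
            # Assumed field name and format; the real corpus has its own schema.
            ts = datetime.fromisoformat(doc["timestamp"])
            doc["timestamp"] = (ts + shift).isoformat()
            out.append(json.dumps(doc))
    return out
```

The offset file mentioned above is, roughly, a precomputed index of positions within the corpus file, which lets OpenSearch Benchmark seek directly to documents instead of scanning the whole file at startup.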
## Using `expand-data-corpus.py`

To use `expand-data-corpus.py`, use the following syntax:

```bash
./expand-data-corpus.py [options]
```
The script provides several customization options. The following options are the most commonly used:

- `--corpus-size`: The desired corpus size in GB.
- `--output-file-suffix`: The suffix for the output file name.
## Example

The following example command generates a 100 GB corpus:

```bash
./expand-data-corpus.py --corpus-size 100 --output-file-suffix 100gb
```

The script starts generating documents immediately. Generating a 100 GB corpus can take up to 30 minutes.
You can generate multiple corpora by running the script multiple times with different output file suffixes. During ingestion, OpenSearch Benchmark uses all corpora generated by the script sequentially.
## Verifying the documents

After the script completes, check the following locations for new files:

- In the OpenSearch Benchmark data directory for `http_logs`:
  - `documents-100gb.json`: The generated corpus
  - `documents-100gb.json.offset`: The associated offset file
- In the `http_logs` workload directory:
  - `gen-docs-100gb.json`: The metadata for the generated corpus
  - `gen-idx-100gb.json`: The index specification for the generated corpus
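The file names above follow directly from the `--output-file-suffix` value. The following helper is illustrative only (the two-directory layout is taken from the example above) and shows the naming pattern for any suffix:

```python
def expected_files(suffix):
    """Return the file names expected for a given --output-file-suffix value."""
    return {
        "data_dir": [
            f"documents-{suffix}.json",         # generated corpus
            f"documents-{suffix}.json.offset",  # associated offset file
        ],
        "workload_dir": [
            f"gen-docs-{suffix}.json",  # corpus metadata
            f"gen-idx-{suffix}.json",   # index specification
        ],
    }
```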
## Using the corpus in a test

To use the newly generated corpus in an OpenSearch Benchmark test, use the following syntax:

```bash
opensearch-benchmark execute-test --workload http_logs --workload-params=generated_corpus:t [other_options]
```

The `generated_corpus:t` parameter tells OpenSearch Benchmark to use the expanded corpus. You can append any additional workload parameters to the `--workload-params` option, separated by commas.
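The `--workload-params` value is a comma-separated list of `key:value` pairs. A minimal parser sketch of that format follows; the extra `bulk_size` parameter in the test is hypothetical and is shown only to illustrate appending a second parameter:

```python
def parse_workload_params(raw):
    """Split a comma-separated key:value string into a dict."""
    params = {}
    for pair in raw.split(","):
        # partition() keeps everything after the first colon as the value.
        key, _, value = pair.partition(":")
        params[key.strip()] = value.strip()
    return params
```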
## Expert-level settings

Use `--help` to see all of the script's supported options. Use caution with the following expert-level settings because they may affect the corpus structure:

- `-f`: Specifies the input file to use as a base for generating new documents.
- `-n`: Sets the number of documents to generate instead of the corpus size.
- `-i`: Defines the interval between consecutive timestamps.
- `-t`: Sets the starting timestamp for the generated documents.
- `-b`: Defines the number of documents per batch when writing to the offset file.
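To illustrate how the starting timestamp (`-t`) and interval (`-i`) settings interact, the following sketch generates evenly spaced timestamps. It is a simplified model of the behavior described above, not code from the script, and it assumes ISO-8601 timestamp values:

```python
from datetime import datetime, timedelta

def generate_timestamps(start, interval_seconds, count):
    """Yield `count` evenly spaced timestamps beginning at `start`,
    mirroring a starting timestamp (-t) stepped by a fixed interval (-i)."""
    t = datetime.fromisoformat(start)
    step = timedelta(seconds=interval_seconds)
    for _ in range(count):
        yield t.isoformat()
        t += step
```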