Commit
Address linter issues
Signed-off-by: Peter Alfonsi <[email protected]>
Peter Alfonsi committed Dec 31, 2024
1 parent 0f17a0e commit 543b91f
Showing 1 changed file with 8 additions and 8 deletions.
@@ -6,7 +6,7 @@ parent: Optimizing benchmarks
grand_parent: User guide
---

-# Randomizing Queries
+# Randomizing queries

By default, OpenSearch Benchmark runs identical queries for some number of iterations. But this isn't suitable for all tests. For example, when testing caching behavior, running many iterations of the same query would cause 1 miss and many hits, which doesn't approximate real usage very well.

@@ -39,15 +39,15 @@ To get cache hits, we can't completely randomize this; we have to reuse the same

Every time a query is sent to OpenSearch, some fraction `rf` (short for "repeat-frequency") of the time, we draw a pair from this saved list. This pair may have been seen before, so it could cause a cache hit. For example, if `rf` = 0.7, the cache hit ratio could be up to 70%. In practice, this may or may not be a hit, depending on benchmark duration and cache size.

-We draw saved pairs based on the Zipf distribution, which empirically matches usage traces for many real caches. Pair `i` is drawn with probability proportional to `1 / i^alpha`, where `alpha` is another parameter controlling how spread out the distribution is. So, pairs with small indices are drawn much more often than ones with large indices.
+We draw saved pairs based on the Zipf distribution, which empirically matches usage traces for many real caches. Pair `i` is drawn with probability proportional to `1 / i^alpha`, where `alpha` is another parameter controlling how spread out the distribution is. So, pairs with small indexes are drawn much more often than ones with large indexes.

Check failure on line 42 in _benchmark/user-guide/optimizing-benchmarks/randomizing-queries.md

GitHub Actions / style-job

[vale] reported by reviewdog 🐶 [OpenSearch.Spelling] Error: Zipf. If you are referencing a setting, variable, format, function, or repository, surround it with tic marks. Raw Output: {"message": "[OpenSearch.Spelling] Error: Zipf. If you are referencing a setting, variable, format, function, or repository, surround it with tic marks.", "location": {"path": "_benchmark/user-guide/optimizing-benchmarks/randomizing-queries.md", "range": {"start": {"line": 42, "column": 34}}}, "severity": "ERROR"}

Otherwise, the other `1-rf` fraction of the time, we generate a totally new random pair of values. This will not have been seen before, so it should be a cache miss.
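
The draw described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not OpenSearch Benchmark's actual implementation: the function name `make_pair_sampler` and its arguments are hypothetical, and the saved pairs are assumed to already exist as a list.

```python
import random


def make_pair_sampler(saved_pairs, new_pair_source, rf, alpha, rng=None):
    """Return a function that produces one value pair per call.

    Hypothetical helper for illustration only.
    saved_pairs     -- list of pre-generated pairs (index 0 is pair i=1)
    new_pair_source -- zero-argument function returning a fresh random pair
    rf              -- fraction of draws taken from saved_pairs
    alpha           -- Zipf exponent; larger values favor small indexes more
    """
    rng = rng or random.Random()
    # Zipf weights: pair i (1-based) gets weight 1 / i^alpha.
    weights = [1.0 / (i ** alpha) for i in range(1, len(saved_pairs) + 1)]

    def next_pair():
        if rng.random() < rf:
            # Reuse a saved pair; it may have been seen before (possible cache hit).
            return rng.choices(saved_pairs, weights=weights, k=1)[0]
        # Otherwise generate a new pair (expected cache miss).
        return new_pair_source()

    return next_pair
```

With `rf = 0.7`, roughly 70% of calls to the returned function reuse a saved pair, and low-index pairs dominate those reuses.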

## Usage

To use this feature on a workload you must make some changes to your workload's `workload.py` and supply some flags when running OSB.

-### CLI Flags
+### CLI flags

`--randomization-enabled` turns randomization on and off.

@@ -59,9 +59,9 @@ To use this feature on a workload you must make some changes to your workload's

### Changes in workload.py

-You specify how to generate the saved value pairs for each operation by registering a "standard value source" for that operation. This is just a Python function that takes no arguments and returns a dict with keys matching the keys in the query which should be randomized. Usually this function would have randomness in it. Finally, you modify the `register()` method to register this function with the operation name and field name which is randomized.
+You specify how to generate the saved value pairs for each operation by registering a "standard value source" for that operation. This is a Python function that takes no arguments and returns a dict with keys matching the keys in the query which should be randomized. Usually this function would have randomness in it. Finally, you modify the `register()` method to register this function with the operation name and field name which is randomized.

-For example, to randomize the `"total_amount"` field in the `"range"` operation from above, a standard value source might look like:
+For example, to randomize the `"total_amount"` field in the `"range"` operation from earlier, a standard value source might look like:

```
def random_money_values(max_value):
@@ -83,7 +83,7 @@ def register(registry):
registry.register_standard_value_source("range", "total_amount", range_query_standard_value_source)
```

-There may already be code in this function. Leave it there if so. If `workload.py` does not exist or lacks a `register(registry)` function, you can just create them.
+There may already be code in this function. Leave it there if so. If `workload.py` does not exist or lacks a `register(registry)` function, you can create them.
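
Because the diff hides most of the example code, the following is a separate, self-contained sketch of the pattern the text describes, not the code from the file. The names `example_amount_source` and `StubRegistry` and the 120.00 upper bound are all hypothetical. The stub exists only so the snippet runs outside OSB; the only call taken from the diff is `register_standard_value_source(operation_name, field_name, source)`.

```python
import random


def example_amount_source():
    """Hypothetical standard value source: no arguments, returns a dict
    whose keys match the query parameters to randomize."""
    max_cents = 12000  # assumed upper bound of 120.00, for illustration
    gte_cents = random.randint(0, max_cents)
    lte_cents = random.randint(gte_cents, max_cents)  # keep lte >= gte
    return {"gte": gte_cents / 100, "lte": lte_cents / 100}


def register(registry):
    # Keep any existing registrations in this function; add the new one.
    registry.register_standard_value_source("range", "total_amount", example_amount_source)


class StubRegistry:
    """Stand-in for the registry object passed to register(); illustration only."""

    def __init__(self):
        self.sources = {}

    def register_standard_value_source(self, op_name, field_name, source):
        self.sources[(op_name, field_name)] = source
```

Calling `register(StubRegistry())` and then invoking the stored source returns a dict such as `{"gte": ..., "lte": ...}` with different values on each call.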

#### Randomizing non-range queries

@@ -118,13 +118,13 @@ registry.register_query_randomization_info("bbox", "geo_bounding_box", [["top_le

The first argument, `"bbox"`, is the operation name. The second argument, `"geo_bounding_box"`, is the query type name.

-The third argument is a list of lists: `[["top_left"], ["bottom_right"]]`. Each entry in the outer list represents one parameter name that will be randomized. It's a list because we may have multiple different versions of the same name that represent roughly the same thing. For example, `"gte"` or `"gt"`. In this case there's just one option for each parameter name. At least one version of each parameter name must be present in the original query for it to be randomized.
+The third argument is a list of lists: `[["top_left"], ["bottom_right"]]`. Each entry in the outer list represents one parameter name that will be randomized. It's a list because we may have multiple different versions of the same name that represent roughly the same thing. For example, `"gte"` or `"gt"`. In this case there's only one option for each parameter name. At least one version of each parameter name must be present in the original query for it to be randomized.

The last argument is a list of optional parameters. If an optional parameter is present in the random standard value source, it will be put into the randomized version of the query. If it's not in the source, it's ignored. There are no optional parameters in this example, but the typical use case would be `"format"` in a range query.

If nothing is registered, it falls back to the default; equivalent to registering `registry.register_query_randomization_info(<operation_name>, "range", [["gte", "gt"], ["lte", "lt"]], ["format"])`.
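
To make the roles of the parameter-name lists and the optional parameters concrete, here is a rough sketch of how such a registration could be applied to a query. It is not OSB's internal code: the function `apply_randomized_values` is hypothetical, and the query layout `{query_type: {field: {...}}}` is an assumption.

```python
import copy


def apply_randomized_values(query, query_type, field, param_options, optional_params, values):
    """Return a copy of `query` with the listed parameters replaced by `values`.

    Hypothetical helper. `param_options` is a list of lists such as
    [["gte", "gt"], ["lte", "lt"]]; `values` is the dict returned by a
    standard value source. The input query is never modified.
    """
    body = query.get(query_type, {}).get(field)
    if body is None:
        return copy.deepcopy(query)

    chosen = []
    for options in param_options:
        present = [name for name in options if name in body]
        if not present:
            # No version of this parameter is in the query: leave it unchanged.
            return copy.deepcopy(query)
        chosen.append((present[0], options))

    new_query = copy.deepcopy(query)
    new_body = new_query[query_type][field]
    for name_in_query, options in chosen:
        # The value source may use any version of the name (for example "gte" for "gt").
        source_name = next(name for name in options if name in values)
        new_body[name_in_query] = values[source_name]
    for name in optional_params:
        # Optional parameters are copied only when the source supplies them.
        if name in values:
            new_body[name] = values[name]
    return new_query
```

For example, with the default range settings, a query that uses `"gt"` and `"lte"` keeps those parameter names and only has their values replaced.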

-The dict returned by the standard value source should match the parameter names you are trying to randomize. For example the standard value source for the above example is:
+The dict returned by the standard value source should match the parameter names you are trying to randomize. For example the standard value source for the earlier example is:

```
def bounding_box_source():