
Commit

Merge branch 'main' into osb-randomization-doc
Naarcha-AWS authored Jan 2, 2025
2 parents 543b91f + b52ec2f commit 1135203
Showing 6 changed files with 171 additions and 11 deletions.
124 changes: 124 additions & 0 deletions _analyzers/tokenizers/character-group.md
@@ -0,0 +1,124 @@
---
layout: default
title: Character group
parent: Tokenizers
nav_order: 20
has_children: false
has_toc: false
---

# Character group tokenizer

The `char_group` tokenizer splits text into tokens using specific characters as delimiters. It is suited to situations that call for straightforward tokenization, offering a simpler alternative to pattern-based tokenizers.

## Example usage

The following example request creates a new index named `my_index` and configures an analyzer with a `char_group` tokenizer. The tokenizer splits text on whitespace and on the `-` and `:` characters:

```json
PUT /my_index
{
"settings": {
"analysis": {
"tokenizer": {
"my_char_group_tokenizer": {
"type": "char_group",
"tokenize_on_chars": [
"whitespace",
"-",
":"
]
}
},
"analyzer": {
"my_char_group_analyzer": {
"type": "custom",
"tokenizer": "my_char_group_tokenizer"
}
}
}
},
"mappings": {
"properties": {
"content": {
"type": "text",
"analyzer": "my_char_group_analyzer"
}
}
}
}
```
{% include copy-curl.html %}

## Generated tokens

Use the following request to examine the tokens generated using the analyzer:

```json
POST /my_index/_analyze
{
"analyzer": "my_char_group_analyzer",
"text": "Fast-driving cars: they drive fast!"
}
```
{% include copy-curl.html %}

The response contains the generated tokens:

```json
{
"tokens": [
{
"token": "Fast",
"start_offset": 0,
"end_offset": 4,
"type": "word",
"position": 0
},
{
"token": "driving",
"start_offset": 5,
"end_offset": 12,
"type": "word",
"position": 1
},
{
"token": "cars",
"start_offset": 13,
"end_offset": 17,
"type": "word",
"position": 2
},
{
"token": "they",
"start_offset": 19,
"end_offset": 23,
"type": "word",
"position": 3
},
{
"token": "drive",
"start_offset": 24,
"end_offset": 29,
"type": "word",
"position": 4
},
{
"token": "fast!",
"start_offset": 30,
"end_offset": 35,
"type": "word",
"position": 5
}
]
}
```

## Parameters

The `char_group` tokenizer can be configured with the following parameters.

| **Parameter** | **Required/Optional** | **Data type** | **Description** |
| :--- | :--- | :--- | :--- |
| `tokenize_on_chars` | Required | Array | Specifies a set of characters on which the text should be tokenized. You can specify single characters (for example, `-` or `@`), escaped characters (for example, `\n`), or character classes such as `whitespace`, `letter`, `digit`, `punctuation`, or `symbol`. |
| `max_token_length` | Optional | Integer | Sets the maximum length of the produced token. If this length is exceeded, the token is split into multiple tokens at the length configured in `max_token_length`. Default is `255`. |
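
For example, the following request (an illustrative sketch, not taken from this commit; the index, tokenizer, and analyzer names are placeholders) limits tokens to five characters by setting `max_token_length`, so a longer word such as `driving` would be emitted as `drivi` and `ng`:

```json
PUT /my_index_short_tokens
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "my_short_tokenizer": {
          "type": "char_group",
          "tokenize_on_chars": [
            "whitespace"
          ],
          "max_token_length": 5
        }
      },
      "analyzer": {
        "my_short_analyzer": {
          "type": "custom",
          "tokenizer": "my_short_tokenizer"
        }
      }
    }
  }
}
```
{% include copy-curl.html %}
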
2 changes: 2 additions & 0 deletions _api-reference/nodes-apis/nodes-stats.md
@@ -206,6 +206,7 @@ Select the arrow to view the example response.
"suggest_total": 0,
"suggest_time_in_millis": 0,
"suggest_current": 0,
"search_idle_reactivate_count_total": 0,
"request" : {
"dfs_pre_query" : {
"time_in_millis" : 0,
@@ -892,6 +893,7 @@ search.point_in_time_current | Integer | The number of shard PIT contexts curren
search.suggest_total | Integer | The total number of shard suggest operations.
search.suggest_time_in_millis | Integer | The total amount of time for all shard suggest operations, in milliseconds.
search.suggest_current | Integer | The number of shard suggest operations that are currently running.
search.search_idle_reactivate_count_total | Integer | The total number of times that all shards have been activated from an idle state.
search.request | Object | Statistics about coordinator search operations for the node.
search.request.took.time_in_millis | Integer | The total amount of time taken for all search requests, in milliseconds.
search.request.took.current | Integer | The number of search requests that are currently running.
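
To see the new `search_idle_reactivate_count_total` counter, along with the other shard search statistics listed above, you can query the Nodes Stats API for the `search` index metric. This request is an illustrative addition, not part of the commit:

```json
GET /_nodes/stats/indices/search
```
{% include copy-curl.html %}
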
23 changes: 21 additions & 2 deletions _clients/python-low-level.md
@@ -106,7 +106,7 @@ client = OpenSearch(

## Connecting to Amazon OpenSearch Service

The following example illustrates connecting to Amazon OpenSearch Service:
The following example illustrates connecting to Amazon OpenSearch Service using IAM credentials:

```python
from opensearchpy import OpenSearch, RequestsHttpConnection, AWSV4SignerAuth
@@ -127,6 +127,25 @@ client = OpenSearch(
pool_maxsize = 20
)
```

To connect to Amazon OpenSearch Service through HTTP with a username and password, use the following code:

```python
from opensearchpy import OpenSearch

auth = ('admin', 'admin') # For testing only. Don't store credentials in code.

client = OpenSearch(
hosts=[{"host": host, "port": 443}],
http_auth=auth,
http_compress=True, # enables gzip compression for request bodies
use_ssl=True,
verify_certs=True,
ssl_assert_hostname=False,
ssl_show_warn=False,
)
```

{% include copy.html %}

## Connecting to Amazon OpenSearch Serverless
@@ -359,4 +378,4 @@ print(response)
## Next steps

- For Python client API, see the [`opensearch-py` API documentation](https://opensearch-project.github.io/opensearch-py/).
- For Python code samples, see [Samples](https://github.com/opensearch-project/opensearch-py/tree/main/samples).
- For Python code samples, see [Samples](https://github.com/opensearch-project/opensearch-py/tree/main/samples).
28 changes: 20 additions & 8 deletions _ingest-pipelines/processors/sparse-encoding.md
@@ -36,13 +36,31 @@ The following table lists the required and optional parameters for the `sparse_e
| Parameter | Data type | Required/Optional | Description |
|:---|:---|:---|:---|
`model_id` | String | Required | The ID of the model that will be used to generate the embeddings. The model must be deployed in OpenSearch before it can be used in neural search. For more information, see [Using custom models within OpenSearch]({{site.url}}{{site.baseurl}}/ml-commons-plugin/using-ml-models/) and [Neural sparse search]({{site.url}}{{site.baseurl}}/search-plugins/neural-sparse-search/).
`prune_type` | String | Optional | The prune strategy for sparse vectors. Valid values are `max_ratio`, `alpha_mass`, `top_k`, `abs_value`, and `none`. Default is `none`.
`prune_ratio` | Float | Optional | The ratio for the pruning strategy. Required when `prune_type` is specified.
`field_map` | Object | Required | Contains key-value pairs that specify the mapping of a text field to a `rank_features` field.
`field_map.<input_field>` | String | Required | The name of the field from which to obtain text for generating vector embeddings.
`field_map.<vector_field>` | String | Required | The name of the vector field in which to store the generated vector embeddings.
`description` | String | Optional | A brief description of the processor. |
`tag` | String | Optional | An identifier tag for the processor. Useful for debugging to distinguish between processors of the same type. |
`batch_size` | Integer | Optional | Specifies the number of documents to be batched and processed each time. Default is `1`. |

### Pruning sparse vectors

A sparse vector often has a long-tail distribution of token weights, with less important tokens occupying a significant amount of storage space. Pruning reduces the size of an index by removing tokens with lower semantic importance, yielding a slight decrease in search relevance in exchange for a more compact index.

The `sparse_encoding` processor can be used to prune sparse vectors by configuring the `prune_type` and `prune_ratio` parameters. The following table lists the supported pruning options for the `sparse_encoding` processor.

| Pruning type | Valid pruning ratio | Description |
|:---|:---|:---|
`max_ratio` | Float [0, 1) | Prunes a sparse vector by keeping only the elements whose values are at least `prune_ratio` times the largest value in the vector.
`abs_value` | Float (0, +∞) | Prunes a sparse vector by removing elements with values lower than the `prune_ratio`.
`alpha_mass` | Float [0, 1) | Prunes a sparse vector by keeping the largest elements until their cumulative sum of values reaches the `prune_ratio` fraction of the total sum.
`top_k` | Integer (0, +∞) | Prunes a sparse vector by keeping only the `prune_ratio` highest-weight elements.
`none` | N/A | Leaves sparse vectors unchanged.

Of these options, `max_ratio` pruning with a `prune_ratio` of `0.1` generalizes well on test datasets: it reduces storage requirements by approximately 40% while incurring less than a 1% loss in search relevance.
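
As an illustrative alternative (not part of this commit; the pipeline name, model ID placeholder, and the value `100` are assumptions), a pipeline that keeps only a fixed number of highest-weight tokens per vector could use `top_k` pruning:

```json
PUT /_ingest/pipeline/nlp-ingest-pipeline-top-k
{
  "description": "A sparse encoding pipeline that keeps only the 100 highest-weight tokens",
  "processors": [
    {
      "sparse_encoding": {
        "model_id": "<model ID>",
        "prune_type": "top_k",
        "prune_ratio": 100,
        "field_map": {
          "passage_text": "passage_embedding"
        }
      }
    }
  ]
}
```
{% include copy-curl.html %}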

## Using the processor

Follow these steps to use the processor in a pipeline. You must provide a model ID when creating the processor. For more information, see [Using custom models within OpenSearch]({{site.url}}{{site.baseurl}}/ml-commons-plugin/using-ml-models/).
@@ -59,6 +77,8 @@ PUT /_ingest/pipeline/nlp-ingest-pipeline
{
"sparse_encoding": {
"model_id": "aP2Q8ooBpBj3wT4HVS8a",
"prune_type": "max_ratio",
"prune_ratio": 0.1,
"field_map": {
"passage_text": "passage_embedding"
}
@@ -111,23 +131,15 @@ The response confirms that in addition to the `passage_text` field, the processo
"worlds" : 2.7839446,
"yes" : 0.75845814,
"##world" : 2.5432441,
"born" : 0.2682308,
"nothing" : 0.8625516,
"goodbye" : 0.17146169,
"greeting" : 0.96817183,
"birth" : 1.2788506,
"come" : 0.1623208,
"global" : 0.4371151,
"it" : 0.42951578,
"life" : 1.5750692,
"thanks" : 0.26481047,
"world" : 4.7300377,
"tiny" : 0.5462298,
"earth" : 2.6555297,
"universe" : 2.0308156,
"worldwide" : 1.3903781,
"hello" : 6.696973,
"so" : 0.20279501,
"?" : 0.67785245
},
"passage_text" : "hello world"
2 changes: 2 additions & 0 deletions _search-plugins/neural-sparse-with-pipelines.md
@@ -229,6 +229,8 @@ PUT /_ingest/pipeline/nlp-ingest-pipeline-sparse
{
"sparse_encoding": {
"model_id": "<bi-encoder or doc-only model ID>",
"prune_type": "max_ratio",
"prune_ratio": 0.1,
"field_map": {
"passage_text": "passage_embedding"
}
@@ -23,7 +23,8 @@ Field | Data type | Description
:--- | :--- | :---
`enabled` | Boolean | Controls whether the two-phase processor is enabled. Default is `true`.
`two_phase_parameter` | Object | A map of key-value pairs representing the two-phase parameters and their associated values. You can specify the value of `prune_ratio`, `expansion_rate`, `max_window_size`, or any combination of these three parameters. Optional.
`two_phase_parameter.prune_ratio` | Float | A ratio that represents how to split the high-weight tokens and low-weight tokens. The threshold is the token's maximum score multiplied by its `prune_ratio`. Valid range is [0,1]. Default is `0.4`
`two_phase_parameter.prune_type` | String | The pruning strategy for separating high-weight and low-weight tokens. Default is `max_ratio`. For valid values, see [Pruning sparse vectors]({{site.url}}{{site.baseurl}}/api-reference/ingest-apis/processors/sparse-encoding/#pruning-sparse-vectors).
`two_phase_parameter.prune_ratio` | Float | Defines how high-weight and low-weight tokens are separated. The threshold is calculated by multiplying the maximum token score by the `prune_ratio`. Valid values are in the [0,1] range when `prune_type` is set to `max_ratio`. Default is `0.4`.
`two_phase_parameter.expansion_rate` | Float | The rate at which documents will be fine-tuned during the second phase. The second-phase document number equals the query size (default is 10) multiplied by its expansion rate. Valid range is greater than 1.0. Default is `5.0`
`two_phase_parameter.max_window_size` | Int | The maximum number of documents that can be processed using the two-phase processor. Valid range is greater than 50. Default is `10000`.
`tag` | String | The processor's identifier. Optional.
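
As an illustrative sketch (not part of this commit; the pipeline name and parameter values are placeholders), a request search pipeline that sets these parameters explicitly might look like the following:

```json
PUT /_search/pipeline/two-phase-search-pipeline
{
  "request_processors": [
    {
      "neural_sparse_two_phase_processor": {
        "tag": "neural-sparse",
        "description": "A two-phase processor for neural sparse search",
        "enabled": true,
        "two_phase_parameter": {
          "prune_type": "max_ratio",
          "prune_ratio": 0.4,
          "expansion_rate": 5.0,
          "max_window_size": 10000
        }
      }
    }
  ]
}
```
{% include copy-curl.html %}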
