From 6622693dfd7da0a89a09b89831a5ef468d8458d4 Mon Sep 17 00:00:00 2001
From: Brian Flores
Date: Thu, 2 Jan 2025 07:33:55 -0800
Subject: [PATCH] Tutorial for using Asymmetric models (#3258)

* adds initial tutorial contents

After replicating the local model embeddings. I am able to provide a high level solution of what the tutorial entails.

Signed-off-by: Brian Flores

* Add: initial thoughts to highlevel steps

Provides more context to each step

Signed-off-by: Brian Flores

* expand more details on tutorial

Expands on the context of the previous commit and improvements in grammar and structure

Signed-off-by: Brian Flores

* fix typo

Signed-off-by: Brian Flores

* adds Asymmetric semantic search

Signed-off-by: Brian Flores

* Apply Code/Writing review

Signed-off-by: Brian Flores

* Removed Docker requirement

Signed-off-by: Brian Flores

* Make clear OpenSearch is ran locally in tutorial

Signed-off-by: Brian Flores

* Simplify explanation of file location

Signed-off-by: Brian Flores

---------

Signed-off-by: Brian Flores
(cherry picked from commit 33e26ba4f14c46353e625f224df1acd614b6efbc)
---
 .../asymmetric_embedding_model.md | 532 ++++++++++++++++++
 1 file changed, 532 insertions(+)
 create mode 100644 docs/tutorials/semantic_search/asymmetric_embedding_model.md

diff --git a/docs/tutorials/semantic_search/asymmetric_embedding_model.md b/docs/tutorials/semantic_search/asymmetric_embedding_model.md
new file mode 100644
index 0000000000..1d46986ccc
--- /dev/null
+++ b/docs/tutorials/semantic_search/asymmetric_embedding_model.md
@@ -0,0 +1,532 @@
# Tutorial: Running Asymmetric Semantic Search in OpenSearch

This tutorial demonstrates how to generate text embeddings using an asymmetric embedding model in OpenSearch. The example model used in this tutorial is the multilingual `intfloat/multilingual-e5-small` model from Hugging Face. You'll learn how to prepare the model, register it in OpenSearch, and run inference to generate embeddings.

> **Note**: Make sure to replace all placeholders (e.g., `your_`) with your specific values.

---

## Prerequisites

- OpenSearch installed on your machine
- Access to the Hugging Face `intfloat/multilingual-e5-small` model (or another model of your choice)
- Basic knowledge of Linux commands

---

## Step 1: Start OpenSearch

For instructions on installing and running OpenSearch, see [Install and configure OpenSearch](https://opensearch.org/docs/latest/install-and-configure/install-opensearch/index/).

Run OpenSearch and complete the following setup. In this tutorial, the OpenSearch cluster runs locally.

### Update cluster settings

Ensure that your cluster is configured to allow registering models by updating the cluster settings using the following request:

```
PUT _cluster/settings
{
  "persistent": {
    "plugins.ml_commons.allow_registering_model_via_url": "true",
    "plugins.ml_commons.only_run_on_ml_node": "false",
    "plugins.ml_commons.model_access_control_enabled": "true",
    "plugins.ml_commons.native_memory_threshold": "99"
  }
}
```

This configuration allows OpenSearch to register machine learning models from external URLs and to run models on nodes that are not dedicated ML nodes.
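The requests in this tutorial are shown in the OpenSearch Dev Tools console format. If you prefer to send them from a terminal, an equivalent `curl` command for the preceding request might look like the following sketch, which assumes a local cluster at `https://localhost:9200` secured with the default `admin` user; adjust the credentials and TLS flags to match your installation:

```
# Assumption: local cluster at https://localhost:9200 with basic auth; -k skips certificate verification.
curl -k -u admin:your_admin_password -X PUT "https://localhost:9200/_cluster/settings" \
  -H 'Content-Type: application/json' \
  -d '{
    "persistent": {
      "plugins.ml_commons.allow_registering_model_via_url": "true",
      "plugins.ml_commons.only_run_on_ml_node": "false",
      "plugins.ml_commons.model_access_control_enabled": "true",
      "plugins.ml_commons.native_memory_threshold": "99"
    }
  }'
```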

---

## Step 2: Prepare the model for OpenSearch

In this tutorial, you'll use the Hugging Face `intfloat/multilingual-e5-small` model, which generates multilingual embeddings. Follow these steps to prepare and zip the model for use in OpenSearch.

### a. Clone the model from Hugging Face

To download the model, use the following steps:

1. Install Git Large File Storage (LFS) if you haven't already:

    ```
    git lfs install
    ```

2. Clone the model repository:

    ```
    git clone https://huggingface.co/intfloat/multilingual-e5-small
    ```

This will download the model files into a directory on your local machine.

### b. Zip the model files

To upload the model to OpenSearch, you must zip the necessary model files (`model.onnx`, `sentencepiece.bpe.model`, and `tokenizer.json`). These files are located in the `onnx` directory of the cloned repository.

Run the following command in the directory containing these files:

```
zip -r intfloat-multilingual-e5-small-onnx.zip model.onnx tokenizer.json sentencepiece.bpe.model
```

This command creates a zip file named `intfloat-multilingual-e5-small-onnx.zip` containing all of the necessary files.

### c. Calculate the model file's hash

Before registering the model, you need to calculate the SHA-256 hash of the zip file. Run the following command to generate the hash:

```
shasum -a 256 intfloat-multilingual-e5-small-onnx.zip
```

Note the hash value; you'll need it during model registration.

### d. Serve the model file using a Python HTTP server

To allow OpenSearch to access the model file, you can serve it over HTTP. Because this tutorial uses a local development environment, you can use Python's built-in HTTP server. Navigate to the directory containing the zip file and run the following command:

```
python3 -m http.server 8080 --bind 0.0.0.0
```

This serves the zip file at `http://0.0.0.0:8080/intfloat-multilingual-e5-small-onnx.zip`. After registering the model, you can stop the server by pressing `Ctrl + C`.

---

## Step 3: Register a model group

Before registering the model itself, create a model group. This helps organize models in OpenSearch. Run the following request to create a new model group:

```
POST /_plugins/_ml/model_groups/_register
{
  "name": "Asymmetric Model Group",
  "description": "A model group for local asymmetric models"
}
```

Note the `model_group_id` returned in the response; you'll use it to register the model.

---

## Step 4: Register the model

Now that you have the model zip file and the model group ID, you can register the model in OpenSearch:

```
POST /_plugins/_ml/models/_register
{
  "name": "e5-small-onnx",
  "version": "1.0.0",
  "description": "Asymmetric multilingual-e5-small model",
  "model_format": "ONNX",
  "model_group_id": "your_group_id",
  "model_content_hash_value": "your_model_zip_content_hash_value",
  "model_config": {
    "model_type": "bert",
    "embedding_dimension": 384,
    "framework_type": "sentence_transformers",
    "query_prefix": "query: ",
    "passage_prefix": "passage: ",
    "all_config": "{ \"_name_or_path\": \"intfloat/multilingual-e5-small\", \"architectures\": [ \"BertModel\" ], \"attention_probs_dropout_prob\": 0.1, \"hidden_size\": 384, \"num_attention_heads\": 12, \"num_hidden_layers\": 12, \"tokenizer_class\": \"XLMRobertaTokenizer\" }"
  },
  "url": "http://localhost:8080/intfloat-multilingual-e5-small-onnx.zip"
}
```

Replace `your_group_id` and `your_model_zip_content_hash_value` with the values from the previous steps. This initiates the model registration process, and you'll receive a task ID in the response.
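The registration response contains the task ID for the registration task and is typically similar to the following (field names can vary slightly between OpenSearch versions):

```
{
  "task_id": "your_task_id",
  "status": "CREATED"
}
```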

To check the status of the registration, run the following request:

```
GET /_plugins/_ml/tasks/your_task_id
```

Once the task completes successfully, note the `model_id` returned in the response; you'll need it for deployment and inference.

---

## Step 5: Deploy the model

After the model is registered, deploy it by running the following request:

```
POST /_plugins/_ml/models/your_model_id/_deploy
```

Check the status of the deployment using the task ID:

```
GET /_plugins/_ml/tasks/your_task_id
```

When the model is successfully deployed, its state changes to **DEPLOYED**, and you can use it for inference.

---

## Step 6: Run inference

Now that your model is deployed, you can use it to generate text embeddings for both queries and passages.

### Generating passage embeddings

To generate embeddings for a passage, use the following request:

```
POST /_plugins/_ml/_predict/text_embedding/your_model_id
{
  "parameters": {
    "content_type": "passage"
  },
  "text_docs": [
    "Today is Friday, tomorrow will be my break day. After that, I will go to the library. When is lunch?"
  ],
  "target_response": ["sentence_embedding"]
}
```

The response includes a sentence embedding of size 384:

```
{
  "inference_results": [
    {
      "output": [
        {
          "name": "sentence_embedding",
          "data_type": "FLOAT32",
          "shape": [384],
          "data": [0.0419328, 0.047480892, ..., 0.31158513, 0.21784715]
        }
      ]
    }
  ]
}
```

### Generating query embeddings

Similarly, you can generate embeddings for a query:

```
POST /_plugins/_ml/_predict/text_embedding/your_model_id
{
  "parameters": {
    "content_type": "query"
  },
  "text_docs": ["What day is it today?"],
  "target_response": ["sentence_embedding"]
}
```

The response will look like this:

```
{
  "inference_results": [
    {
      "output": [
        {
          "name": "sentence_embedding",
          "data_type": "FLOAT32",
          "shape": [384],
          "data": [0.2338349, -0.13603798, ..., 0.37335885, 0.10653384]
        }
      ]
    }
  ]
}
```
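Both requests above embed a single string, but `text_docs` accepts a list, so you can embed several passages in one call; the response then contains a `sentence_embedding` output for each input document. For example:

```
POST /_plugins/_ml/_predict/text_embedding/your_model_id
{
  "parameters": {
    "content_type": "passage"
  },
  "text_docs": [
    "Today is Friday, tomorrow will be my break day.",
    "After that, I will go to the library."
  ],
  "target_response": ["sentence_embedding"]
}
```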

---

# Applying semantic search using an ML inference processor

In this section, you'll run semantic search on facts about New York City. First, you'll create an ingest pipeline that uses the ML inference processor to generate embeddings at ingestion time. Then you'll create a search pipeline that embeds the query and runs the search using the same asymmetric embedding model.

## 2. Create an ingest pipeline

### 2.1 Create a test k-NN index

```
PUT nyc_facts
{
  "settings": {
    "index": {
      "default_pipeline": "asymmetric_embedding_ingest_pipeline",
      "knn": true,
      "knn.algo_param.ef_search": 100
    }
  },
  "mappings": {
    "properties": {
      "fact_embedding": {
        "type": "knn_vector",
        "dimension": 384,
        "method": {
          "name": "hnsw",
          "space_type": "l2",
          "engine": "nmslib",
          "parameters": {
            "ef_construction": 128,
            "m": 24
          }
        }
      }
    }
  }
}
```

### 2.2 Create an ingest pipeline

```
PUT _ingest/pipeline/asymmetric_embedding_ingest_pipeline
{
  "description": "Ingest passage text and generate an embedding using an asymmetric model",
  "processors": [
    {
      "ml_inference": {
        "model_input": "{\"text_docs\":[\"${input_map.text_docs}\"],\"target_response\":[\"sentence_embedding\"],\"parameters\":{\"content_type\":\"passage\"}}",
        "function_name": "text_embedding",
        "model_id": "{{ _.model_id }}",
        "input_map": [
          {
            "text_docs": "description"
          }
        ],
        "output_map": [
          {
            "fact_embedding": "$.inference_results[0].output[0].data",
            "embedding_size": "$.inference_results.*.output.*.shape[0]"
          }
        ]
      }
    }
  ]
}
```

Replace `{{ _.model_id }}` with the `model_id` of your deployed model. The `content_type` is set to `passage` because this pipeline embeds document text; the search pipeline created later uses `query`, so that passages and queries are embedded with the prefixes the asymmetric model expects.

### 2.3 Simulate the pipeline

You can test the pipeline using the `_simulate` endpoint. Simulate the pipeline by running the following request:

```
POST /_ingest/pipeline/asymmetric_embedding_ingest_pipeline/_simulate
{
  "docs": [
    {
      "_index": "my-index",
      "_id": "1",
      "_source": {
        "title": "Central Park",
        "description": "A large public park in the heart of New York City, offering a wide range of recreational activities."
      }
    }
  ]
}
```

The response contains the embedding generated by the model when the document is ingested through the pipeline:

```
{
  "docs": [
    {
      "doc": {
        "_index": "my-index",
        "_id": "1",
        "_source": {
          "description": "A large public park in the heart of New York City, offering a wide range of recreational activities.",
          "fact_embedding": [
            [
              0.06344555,
              0.30067796,
              ...
              0.014804064,
              -0.022822019
            ]
          ],
          "title": "Central Park",
          "embedding_size": [
            384.0
          ]
        },
        "_ingest": {
          "timestamp": "2024-12-16T20:59:07.152169Z"
        }
      }
    }
  ]
}
```

### 2.4 Test data ingestion

When you perform bulk ingestion, the ingest pipeline will generate embeddings for each document:

```
POST /_bulk
{ "index": { "_index": "nyc_facts" } }
{ "title": "Central Park", "description": "A large public park in the heart of New York City, offering a wide range of recreational activities." }
{ "index": { "_index": "nyc_facts" } }
{ "title": "Empire State Building", "description": "An iconic skyscraper in New York City offering breathtaking views from its observation deck." }
{ "index": { "_index": "nyc_facts" } }
{ "title": "Statue of Liberty", "description": "A colossal neoclassical sculpture on Liberty Island, symbolizing freedom and democracy in the United States." }
{ "index": { "_index": "nyc_facts" } }
{ "title": "Brooklyn Bridge", "description": "A historic suspension bridge connecting Manhattan and Brooklyn, offering pedestrian walkways with great views." }
{ "index": { "_index": "nyc_facts" } }
{ "title": "Times Square", "description": "A bustling commercial and entertainment hub in Manhattan, known for its neon lights and Broadway theaters." }
{ "index": { "_index": "nyc_facts" } }
{ "title": "Yankee Stadium", "description": "Home to the New York Yankees, this baseball stadium is a historic landmark in the Bronx." }
{ "index": { "_index": "nyc_facts" } }
{ "title": "The Bronx Zoo", "description": "One of the largest zoos in the world, located in the Bronx, featuring diverse animal exhibits and conservation efforts." }
{ "index": { "_index": "nyc_facts" } }
{ "title": "New York Botanical Garden", "description": "A large botanical garden in the Bronx, known for its diverse plant collections and stunning landscapes." }
{ "index": { "_index": "nyc_facts" } }
{ "title": "Flushing Meadows-Corona Park", "description": "A major park in Queens, home to the USTA Billie Jean King National Tennis Center and the Unisphere." }
{ "index": { "_index": "nyc_facts" } }
{ "title": "Citi Field", "description": "The home stadium of the New York Mets, located in Queens, known for its modern design and fan-friendly atmosphere." }
{ "index": { "_index": "nyc_facts" } }
{ "title": "Rockefeller Center", "description": "A famous complex of commercial buildings in Manhattan, home to the NBC studios and the annual ice skating rink." }
{ "index": { "_index": "nyc_facts" } }
{ "title": "Queens Botanical Garden", "description": "A peaceful, beautiful botanical garden located in Flushing, Queens, featuring seasonal displays and plant collections." }
{ "index": { "_index": "nyc_facts" } }
{ "title": "Arthur Ashe Stadium", "description": "The largest tennis stadium in the world, located in Flushing Meadows-Corona Park, Queens, hosting the U.S. Open." }
{ "index": { "_index": "nyc_facts" } }
{ "title": "Wave Hill", "description": "A public garden and cultural center in the Bronx, offering stunning views of the Hudson River and a variety of nature programs." }
{ "index": { "_index": "nyc_facts" } }
{ "title": "Louis Armstrong House", "description": "The former home of jazz legend Louis Armstrong, located in Corona, Queens, now a museum celebrating his life and music." }
```
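After the bulk request completes, you can optionally verify that all 15 documents were indexed (newly ingested documents can take a moment to become visible to search):

```
GET /nyc_facts/_count
```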

## 3. Run semantic search

### 3.1 Create a search pipeline

Create a search pipeline that converts your query into an embedding and runs a k-NN search on the index to return the best-matching documents:

```
PUT /_search/pipeline/asymmetric_embedding_search_pipeline
{
  "description": "Generate an embedding for the query text using an asymmetric model and run a k-NN search",
  "request_processors": [
    {
      "ml_inference": {
        "query_template": "{\"size\": 3,\"query\": {\"knn\": {\"fact_embedding\": {\"vector\": ${query_embedding},\"k\": 4}}}}",
        "function_name": "text_embedding",
        "model_id": "{{ _.model_id }}",
        "model_input": "{ \"text_docs\": [\"${input_map.query}\"], \"target_response\": [\"sentence_embedding\"], \"parameters\" : {\"content_type\" : \"query\" } }",
        "input_map": [
          {
            "query": "query.term.fact_embedding.value"
          }
        ],
        "output_map": [
          {
            "query_embedding": "$.inference_results[0].output[0].data",
            "embedding_size": "$.inference_results.*.output.*.shape[0]"
          }
        ]
      }
    }
  ]
}
```

Replace `{{ _.model_id }}` with the `model_id` of your deployed model.

### 3.2 Run the search

Run a query about sporting activities in New York City:

```
GET /nyc_facts/_search?search_pipeline=asymmetric_embedding_search_pipeline
{
  "query": {
    "term": {
      "fact_embedding": {
        "value": "What are some places for sports in NYC?",
        "boost": 1
      }
    }
  }
}
```

The response contains the top three matching documents:

```json
{
  "took": 22,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 4,
      "relation": "eq"
    },
    "max_score": 0.12496973,
    "hits": [
      {
        "_index": "nyc_facts",
        "_id": "hb9X0ZMBICPs-TP0ijZX",
        "_score": 0.12496973,
        "_source": {
          "fact_embedding": [
            ...
          ],
          "embedding_size": [
            384.0
          ],
          "description": "A large public park in the heart of New York City, offering a wide range of recreational activities.",
          "title": "Central Park"
        }
      },
      {
        "_index": "nyc_facts",
        "_id": "ir9X0ZMBICPs-TP0ijZX",
        "_score": 0.114651985,
        "_source": {
          "fact_embedding": [
            ...
          ],
          "embedding_size": [
            384.0
          ],
          "description": "Home to the New York Yankees, this baseball stadium is a historic landmark in the Bronx.",
          "title": "Yankee Stadium"
        }
      },
      {
        "_index": "nyc_facts",
        "_id": "j79X0ZMBICPs-TP0ijZX",
        "_score": 0.110090025,
        "_source": {
          "fact_embedding": [
            ...
          ],
          "embedding_size": [
            384.0
          ],
          "description": "A famous complex of commercial buildings in Manhattan, home to the NBC studios and the annual ice skating rink.",
          "title": "Rockefeller Center"
        }
      }
    ]
  }
}
```

---

## References

- Wang, Liang, et al. (2024). *Multilingual E5 Text Embeddings: A Technical Report*. arXiv preprint arXiv:2402.05672. [Link](https://arxiv.org/abs/2402.05672)

---