Library and documentation to support Vespa data science use cases.

Motivation
+
This library contains application-specific code related to data manipulation and analysis for different Vespa use cases. The Vespa python API is used to interact with Vespa applications from Python for faster exploration.
+
The main goal of this space is to facilitate prototyping and experimentation for data scientists. Please visit Vespa sample apps for production-ready use cases and Vespa docs for in-depth Vespa documentation.
+
+
+
Install
+
Code to support and reproduce the use cases documented here can be found in the learntorank library.
+
Install via PyPI:
+
pip install learntorank
+
+
+
Development
+
All the code and content of this repo is created using nbdev by editing notebooks. The main points required to contribute are summarized below, but we suggest going through the nbdev tutorials to learn more.
+
+
Setting up environment
+
+
Create and activate a virtual environment of your choice. We recommend pipenv.
+
pipenv shell
+
Install Jupyter Lab (or Jupyter Notebook if you prefer).
+
pip3 install jupyterlab
+
Create a new kernel for Jupyter that uses the virtual environment created at step 1.
+
+
Check where the current list of kernels is located with jupyter kernelspec list.
+
Copy one of the existing folders and rename it to learntorank.
+
Modify the kernel.json file inside the new folder to point to the python3 executable associated with your virtual environment.
+
+
Install nbdev library:
+
pip3 install nbdev
+
Install learntorank in development mode:
+
pip3 install -e .[dev]
+
+
+
+
Most used nbdev commands
+
From your terminal:
+
+
nbdev_help: List all nbdev commands available.
+
nbdev_readme: Update README.md based on index.ipynb
+
Preview documentation while editing the notebooks:
+
+
nbdev_preview --port 3000
+
+
Workflow before pushing code:
+
+
nbdev_test --n_workers 2: Execute all the tests inside notebooks.
+
+
Tests can run in parallel but since we create Docker containers we suggest a low number of workers to preserve memory.
+
+
nbdev_export: Export code from notebooks to the python library.
+
nbdev_clean: Clean notebooks to avoid merge conflicts.
Evaluate query results according to the match ratio metric.

|  | Type | Default | Details |
|---|---|---|---|
| query_results | VespaQueryResponse |  | Raw query results returned by Vespa. |
| relevant_docs | typing.List[typing.Dict] |  | Each dict contains a doc id and optionally a doc score. |
| id_field | str |  | The Vespa field representing the document id. |
| default_score | int |  | Score to assign to the additional documents that are not relevant. Defaults to 0. |
| detailed_metrics | bool | False | Return intermediate computations if available. |
| Returns | typing.Dict |  | Returns the match ratio. In addition, if detailed_metrics=True, returns the number of retrieved docs (_retrieved_docs) and the number of docs available in the corpus (_docs_available). |
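
As a rough illustration of how these parameters fit together, the sketch below calls the metric's evaluate_query method directly. The MatchRatio class and the learntorank.evaluation import path are taken from the evaluation examples later in this document; the exact evaluate_query signature and the query_results variable (coming from a previous send_query call) are assumptions.

```python
from learntorank.evaluation import MatchRatio

# query_results: a VespaQueryResponse from a previous send_query call (assumed available)
match_ratio = MatchRatio()
evaluation = match_ratio.evaluate_query(
    query_results,                                  # raw Vespa query response
    relevant_docs=[{"id": "7407715", "score": 1}],  # labeled relevant documents
    id_field="doc_id",                              # Vespa field holding the document id
    default_score=0,                                # score assigned to non-relevant documents
    detailed_metrics=True,                          # also return _retrieved_docs and _docs_available
)
print(evaluation)
```
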
Evaluate query results according to the query time metric.

|  | Type | Default | Details |
|---|---|---|---|
| query_results | VespaQueryResponse |  | Raw query results returned by Vespa. |
| relevant_docs | typing.List[typing.Dict] |  | Each dict contains a doc id and optionally a doc score. |
| id_field | str |  | The Vespa field representing the document id. |
| default_score | int |  | Score to assign to the additional documents that are not relevant. Defaults to 0. |
| detailed_metrics | bool | False | Return intermediate computations if available. |
| Returns | typing.Dict |  | Returns the query time (search_time) a client would observe, excluding network latency. In addition, if detailed_metrics=True, returns the time to execute the first protocol phase/matching phase (search_time_query_time) and the time to execute the summary fill protocol phase for the globally ordered top-k hits (search_time_summary_fetch_time). |
Evaluate query results according to the recall metric.

There is an assumption that only documents with score > 0 are relevant. Recall is equal to zero in case no relevant documents with score > 0 are provided.

|  | Type | Default | Details |
|---|---|---|---|
| query_results | VespaQueryResponse |  | Raw query results returned by Vespa. |
| relevant_docs | typing.List[typing.Dict] |  | Each dict contains a doc id and optionally a doc score. |
| id_field | str |  | The Vespa field representing the document id. |
| default_score | int |  | Score to assign to the additional documents that are not relevant. Defaults to 0. |
Evaluate query results according to the normalized discounted cumulative gain (NDCG) metric.

There is an assumption that documents returned by the query that are not included in the set of relevant documents have a score equal to zero. Similarly, if the query returns fewer than at documents (N < at), we assume that the at - N missing scores are equal to zero.

|  | Type | Default | Details |
|---|---|---|---|
| query_results | VespaQueryResponse |  | Raw query results returned by Vespa. |
| relevant_docs | typing.List[typing.Dict] |  | Each dict contains a doc id and optionally a doc score. |
| id_field | str |  | The Vespa field representing the document id. |
| default_score | int |  | Score to assign to the additional documents that are not relevant. Defaults to 0. |
| detailed_metrics | bool | False | Return intermediate computations if available. |
| Returns | typing.Dict |  | Returns the normalized discounted cumulative gain. In addition, if detailed_metrics=True, returns the ideal discounted cumulative gain (_ideal_dcg) and the discounted cumulative gain (_dcg). |
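
For reference, the quantities above follow the standard textbook definitions; whether the library uses the linear gain shown below or the exponential 2^{rel_i} - 1 variant is not specified in this section:

$$
\mathrm{DCG}@k = \sum_{i=1}^{k} \frac{rel_i}{\log_2(i+1)}, \qquad
\mathrm{NDCG}@k = \frac{\mathrm{DCG}@k}{\mathrm{IDCG}@k}
$$

where rel_i is the score of the document at position i and IDCG@k is the DCG@k obtained by ordering the relevant documents by decreasing score.
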
[{'query_id': '1101971',
  'query': 'why say the sky is the limit',
  'relevant_docs': [{'id': '7407715', 'score': 1}]},
 {'query_id': '712898',
  'query': 'what is an cvc in radiology',
  'relevant_docs': [{'id': '7661336', 'score': 1}]}]
Use recall to specify which documents should be included in the evaluation.
+
In the example below, we include documents with id equal to 0, 1 and 2. Since the relevant documents for this query are the documents with id 0 and 3, we should get recall equal to 0.5.
It takes a text input and returns an array of floats depending on which model is used to solve the task.

|  | Type | Default | Details |
|---|---|---|---|
| model_id | str |  | Id used to identify the model on Vespa applications. |
| model | str |  | Id of the model as used by the model hub. Alternatively, it can also be the path to the folder containing the model files, as long as the model config is also there. |
| tokenizer | typing.Optional[str] | None | Id of the tokenizer as used by the model hub. Alternatively, it can also be the path to the folder containing the tokenizer files, as long as the model config is also there. |

|  | Type | Default | Details |
|---|---|---|---|
| model_id | str |  | Unique model id to represent the model within a Vespa application. |
| query_input_size | int |  | The size of the input vector dedicated to the query text. |
| doc_input_size | int |  | The size of the input vector dedicated to the document text. |
| tokenizer | typing.Union[str, os.PathLike] |  | The name or a path to a saved BERT model tokenizer from the transformers library. |
| model | typing.Union[str, os.PathLike, NoneType] | None | The name or a path to a saved model that is compatible with the tokenizer. The model is optional at construction since you might want to train it first. You must add a model via add_model before deploying a Vespa application that uses this class. |
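
A minimal construction sketch using the parameters documented above. The class name and import path (learntorank.ml.BertModelConfig) and the bert-tiny checkpoint id are assumptions used for illustration.

```python
from learntorank.ml import BertModelConfig  # import path assumed

bert_config = BertModelConfig(
    model_id="pretrained_bert_tiny",                # unique id within the Vespa application
    query_input_size=32,                            # input positions reserved for the query text
    doc_input_size=96,                              # input positions reserved for the document text
    tokenizer="google/bert_uncased_L-2_H-128_A-2",  # tokenizer name from the transformers model hub
    model="google/bert_uncased_L-2_H-128_A-2",      # model name (optional at construction time)
)
```
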
Create BERT encodings following the same pattern used during Vespa serving. Useful to generate training data and to ensure training and serving compatibility.

|  | Type | Default | Details |
|---|---|---|---|
| queries | typing.List[str] |  | Query texts. |
| docs | typing.List[str] |  | Document texts. |
| return_tensors | bool | False | Return tensors. |
| Returns | typing.Dict |  | Dict containing input_ids, token_type_ids and attention_mask encodings. |
Add a ranking profile based on a specific model config.

|  | Type | Default | Details |
|---|---|---|---|
| app_package | ApplicationPackage |  | Application package to include the ranking model. |
| model_config | ModelConfig |  | Model config instance specifying the model to be used in the RankProfile. |
| schema | NoneType | None | Name of the schema to add the model ranking to. |
| include_model_summary_features | bool | False | True to include model-specific summary features, such as inputs and outputs that are useful for debugging. Defaults to False, as this requires an extra model evaluation when fetching summary features. |
| document_field_indexing | NoneType | None | List of indexing attributes for the document fields required by the ranking model. |
Number of documents: 4
Number of train queries: 2
Number of train relevance judgments: 2
Number of dev queries: 2
Number of dev relevance judgments: 2
The final sample contains n_relevant train relevant documents, n_relevant dev relevant documents and n_irrelevant random documents sampled from the entire corpus.

All the relevant sampled documents, both from the train and dev sets, are guaranteed to be in the corpus_sample, which will contain 2 * n_relevant + n_irrelevant documents.
The sampled corpus is a dict containing document id as key and the passage text as value.
+
+
sample.corpus
+
+
{'890370': 'the map of europe gives you a clear view of the political boundaries that segregate the countries in the continent including germany uk france spain italy greece romania ukraine hungary austria sweden finland norway czech republic belgium luxembourg switzerland croatia and albaniahe map of europe gives you a clear view of the political boundaries that segregate the countries in the continent including germany uk france spain italy greece romania ukraine hungary austria sweden finland norway czech republic belgium luxembourg switzerland croatia and albania',
+ '5060205': 'Setting custom HTTP headers with cURL can be done by using the CURLOPT_HTTPHEADER option, which can be set with the curl_setopt function. To add headers to your HTTP request you need to put them into a PHP Array, which you can then pass to the cul_setopt function, like demonstrated in the below example.',
+ '6096573': "The sugar in RNA is ribose, whereas the sugar in DNA is deoxyribose. The only difference between the two is that in deoxyribose, there is an oxygen missing from the 2' carbon …(there is a H there instead of an OH). This makes DNA more stable/less reactive than RNA. 1 person found this useful.",
+ '3092885': 'All three C-Ph bonds are typical of sp 3 - sp 2 carbon-carbon bonds with lengths of approximately 1.47 A, å while The-C o bond length is approximately.1 42. A å the presence of three adjacent phenyl groups confers special properties manifested in the reactivity of. the alcohol',
+ '7275560': 'shortest phase of mitosis Anaphase is the shortest phase of mitosis. During anaphase the arranged chromosomes at the metaphase plate are migrate towards their respective poles. Before this migration started, chromosomes are divided into sister chromatids, by the separation of joined centromere of two sister chromatids of a chromosomes.'}
+
+
+
The size of the sampled document corpus is equal to 2 * n_relevant + n_irrelevant.
+
+
len(sample.corpus)
+
+
5
+
+
+
Sampled queries are dicts containing the query id as key and the query text as value.

|  | Type | Default | Details |
|---|---|---|---|
| body_batch |  | None | Contains all the request parameters. Set to None if using 'query_batch'. |
| query_batch | typing.Optional[typing.List[str]] | None | Query strings. Set to None if using 'body_batch'. |
| query_model | typing.Optional[main.QueryModel] | None | Query model to use when sending query strings. Set to None if using 'body_batch'. |
| recall_batch | typing.Optional[typing.List[typing.Tuple]] | None | One tuple for each query. Tuple of size 2, where the first element is the name of the field to use for recall and the second element is a list of the values to be recalled. |
| asynchronous | bool | True | Set to True to send data in async mode. Defaults to True. |
| connections | typing.Optional[int] | 100 | Number of allowed concurrent connections, valid only if asynchronous=True. |
| total_timeout | int | 100 | Total timeout in secs for each of the concurrent requests when using asynchronous=True. |
| kwargs |  |  |  |
| Returns | typing.List[vespa.io.VespaQueryResponse] |  | HTTP POST responses. |
+
+
+
+
Use body_batch to send a batch of body requests.

body_batch = [
    {"yql": "select * from sources * where test"},
    {"yql": "select * from sources * where test2"}
]
result = send_query_batch(app=app, body_batch=body_batch)

Use query_batch to send a batch of query strings to be ranked according to a QueryModel.

result = send_query_batch(
    app=app,
    query_batch=["this is a test", "this is a test 2"],
    query_model=QueryModel(
        match_phase=OR(),
        ranking=Ranking()
    ),
    hits=10,
)

Use recall_batch to send one tuple for each query in query_batch.

result = send_query_batch(
    app=app,
    query_batch=["this is a test", "this is a test 2"],
    query_model=QueryModel(match_phase=OR(), ranking=Ranking()),
    hits=10,
    recall_batch=[("doc_id", [2, 7]), ("doc_id", [0, 5])],
)
Collect Vespa features based on a set of labeled data.

|  | Type | Default | Details |
|---|---|---|---|
| app | Vespa |  | Connection to a Vespa application. |
| labeled_data |  |  | Labeled data containing query, query_id and relevant ids. See the examples below for the data format. |
| id_field | str |  | The Vespa field representing the document id. |
| query_model | QueryModel |  | Query model. |
| number_additional_docs | int |  | Number of additional documents to retrieve for each relevant document. Duplicate documents will be dropped. |
| fields | typing.List[str] |  | Vespa fields to collect, e.g. ["rankfeatures", "summaryfeatures"]. |
| keep_features | typing.Optional[typing.List[str]] | None | List containing the names of the features that should be returned. Defaults to None, which returns all the features contained in the 'fields' argument. |
| relevant_score | int | 1 | Score to assign to relevant documents. Defaults to 1. |
| default_score | int | 0 | Score to assign to the additional documents that are not relevant. Defaults to 0. |
| kwargs |  |  |  |
| Returns | DataFrame |  | DataFrame containing the document id (document_id), query id (query_id), scores (relevant) and the Vespa rank features returned by the RankProfile of the query model used. |
+
+
+
+
Usage:

Define labeled_data as a list of dicts containing relevant documents:

labeled_data = [
    {
        "query_id": 0,
        "query": "give me title 1",
        "relevant_docs": [{"id": "1", "score": 1}],
    },
    {
        "query_id": 1,
        "query": "give me title 3",
        "relevant_docs": [{"id": "3", "score": 1}],
    },
]
Compute bootstrap estimates of the data distribution.

|  | Type | Default | Details |
|---|---|---|---|
| data | DataFrame |  | Data containing the columns we want to generate bootstrap estimates from. |
| estimator | typing.Callable | mean | Estimator function that accepts an array-like argument. |
| n_boot | int | 1000 | Number of bootstrap estimates to compute. |
| columns_to_exclude | typing.List[str] | None | Column names to exclude. |
+
+
+
+
Usage:
+
Generate data with columns containing data that we want to compute estimates from. The values in column a come from a Normal distribution with mean 0 and standard deviation 1. The values in column b come from a Normal distribution with mean 100 and standard deviation 10.
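
The data-generation cell is not reproduced above; a minimal sketch with numpy and pandas could look like this. The bootstrap function itself is the one documented in the table above and is referred to here only through a hypothetical placeholder name.

```python
import numpy as np
import pandas as pd

# column a ~ Normal(0, 1), column b ~ Normal(100, 10)
data = pd.DataFrame({
    "a": np.random.normal(loc=0, scale=1, size=1000),
    "b": np.random.normal(loc=100, scale=10, size=1000),
})

# estimates = compute_bootstrap_estimates(data=data, n_boot=1000)  # placeholder name for the function above
```
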
By default, the function generates the mean of each column n_boot times. Each value represents the mean obtained from a bootstrap sample of the original data.
We can check if the estimates make sense by computing the mean of the bootstrap estimates and comparing it with the mean of the Normal distribution they were generated from.
+
+
estimates.mean()
+
+
a 0.089538
+b 100.099900
+dtype: float64
+
+
+
+
+
Specify function. Example: Standard deviation.
+
We can specify other functions, such as np.std to compute the standard deviation.
If we take the mean of the bootstrap estimates of the standard deviation, we should recover a value close to the standard deviation of the distribution that the data were generated from.
To limit run-time of this notebook, do not use all models - modify the set:
+
+
# Limit number of models to test - here, first two models
+use_models = clip.available_models()[0:2]
+
+
Each model has an embedding size, needed in the text-image search application schema:
+
+
embedding_info = {name: clip.load(name)[0].visual.output_dim for name in use_models}
+embedding_info
+
+
+
Create and deploy a text-image search app
+
+
Create the Vespa application package
+
The function create_text_image_app below uses the Vespa python API to create an application package with fields to store each of the different types of image embedding associated with the CLIP models. It also declares the types of the text embeddings that we are going to send along with the query when searching for images, and creates one ranking profile for each (text, image) embedding model:
+
+
from embedding import create_text_image_app
+
+app_package = create_text_image_app(embedding_info)
+
+
Inspect the schema of the resulting application package:
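
The inspection cell is not reproduced here; one way to do it, using the same schema_to_text property that appears later in this document, would be:

```python
print(app_package.schema.schema_to_text)
```
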
Get a sample data set. See download_flickr8k.sh for how to download images. Set location of images:
+
For each of the CLIP models, compute the image embeddings and send it to the Vespa app:
+
+
from embedding import compute_and_send_image_embeddings
+
+compute_and_send_image_embeddings(app=app, batch_size=128, clip_model_names=use_models)
+
+
+
+
Define QueryModel’s to be evaluated
+
Create one QueryModel for each of the CLIP models. In order to do that, we need a function that takes a query as input and outputs the body of a Vespa query request - example:
+
+
from embedding import create_vespa_query_body_function
+
+vespa_query_body_function = create_vespa_query_body_function("RN50")
+vespa_query_body_function("this is a test query")["yql"]
+
+
With a method to create Vespa query body functions, we can create QueryModels that will be used to evaluate each search configuration that is to be tested. In this case, each query model will represent a CLIP model text-image representation:
+
+
from learntorank.query import QueryModel
+
+query_models = [QueryModel(
+ name=model_name,
+ body_function=create_vespa_query_body_function(model_name)
+) for model_name in use_models]
+
+
A query model contains all the information that is necessary to define how the search app will match and rank documents. Use it to query the application:
+
+
from embedding import plot_images
+from learntorank.query import send_query
+
+query_result = send_query(app, query="a person surfing", query_model=query_models[-1], hits =4)
+
+
To inspect the results, use query_result.hits[0]. Display top two:
+
+
from IPython.display import Image, display
+
+image_file_names = [ hit["fields"]["image_file_name"] for hit in query_result.hits[:2] ]
+
+for image in image_file_names:
+ display(Image(filename=os.path.join(os.environ["IMG_DIR"], image)))
+
+
+
+
Evaluate
+
Now that there is one QueryModel for each CLIP model available, it is possible to evaluate and compare them.
+
Define search evaluation metrics:
+
+
from learntorank.evaluation import MatchRatio, Recall, ReciprocalRank
+
+eval_metrics = [
+ MatchRatio(), # Match ratio is just to show the % of documents that are matched by ANN
+ Recall(at=100),
+ ReciprocalRank(at=100)
+]
+
+
Load labeled data. It was assumed that a (caption, image) pair is relevant if all three experts agreed that the caption accurately described the image:
topics contain data about the 50 topics available, including query, question and narrative.
+
+
topics["1"]
+
+
{'query': 'coronavirus origin',
+ 'question': 'what is the origin of COVID-19',
+ 'narrative': "seeking range of information about the SARS-CoV-2 virus's origin, including its evolution, animal source, and first transmission into humans"}
+
+
+
relevance_data contains the relevance judgments for each of the 50 topics.
+
+
relevance_data.head(5)
+
+

|   | topic_id | round_id | cord_uid | relevancy |
|---|---|---|---|---|
| 0 | 1 | 4.5 | 005b2j4b | 2 |
| 1 | 1 | 4.0 | 00fmeepz | 1 |
| 2 | 1 | 0.5 | 010vptx3 | 2 |
| 3 | 1 | 2.5 | 0194oljo | 1 |
| 4 | 1 | 4.0 | 021q9884 | 1 |

+
Format the labeled data into expected pyvespa format
+
pyvespa expects labeled data to follow the format illustrated below. It is a list of dicts where each dict represents a query containing query_id, query and a list of relevant_docs. Each relevant document contains a required id key and an optional score key.
We can create labeled_data from the topics and relevance_data that we downloaded before. We are only going to include documents with relevance score > 0 into the final list.
+
+
labeled_data = [
    {
        "query_id": int(topic_id),
        "query": topics[topic_id]["query"],
        "relevant_docs": [
            {
                "id": row["cord_uid"],
                "score": row["relevancy"]
            } for idx, row in relevance_data[relevance_data.topic_id == int(topic_id)].iterrows() if row["relevancy"] > 0
        ]
    } for topic_id in topics.keys()
]
+
+
+
+
Define query models to be evaluated
+
We are going to define two query models to be evaluated here. Both will match all the documents that share at least one term with the query. This is defined by setting match_phase = OR().
+
The difference between the query models happens in the ranking phase. The or_default model will rank documents based on nativeRank while the or_bm25 model will rank documents based on BM25. Discussion about those two types of ranking is out of the scope of this tutorial. It is enough to know that they rank documents according to two different formulas.
+
Those ranking profiles were defined by the team behind the cord19 app and can be found here.
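
A sketch of how the two query models could be defined with the query components used elsewhere in this document; the import path for OR and Ranking and the rank-profile names default and bm25 are assumptions based on the cord19 application.

```python
from learntorank.query import QueryModel, OR, Ranking  # import path assumed

or_default = QueryModel(
    name="or_default",
    match_phase=OR(),                 # match documents sharing at least one query term
    ranking=Ranking(name="default"),  # nativeRank-based rank profile (assumed name)
)

or_bm25 = QueryModel(
    name="or_bm25",
    match_phase=OR(),
    ranking=Ranking(name="bm25"),     # BM25-based rank profile (assumed name)
)
```
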
The files used in this section were originally found at https://ir.nist.gov/covidSubmit/data.html. We will download both the topics and the relevance judgements data. Do not worry about what they are just yet, we will explore them soon.
The topics file is in XML format. We can parse it and store in a dictionary called topics. We want to extract a query, a question and a narrative from each topic.
+
+
import xml.etree.ElementTree as ET

topics = {}
root = ET.parse("topics-rnd5.xml").getroot()
for topic in root.findall("topic"):
    topic_number = topic.attrib["number"]
    topics[topic_number] = {}
    for query in topic.findall("query"):
        topics[topic_number]["query"] = query.text
    for question in topic.findall("question"):
        topics[topic_number]["question"] = question.text
    for narrative in topic.findall("narrative"):
        topics[topic_number]["narrative"] = narrative.text
+
+
There are a total of 50 topics. For example, we can see the first topic below:
+
+
topics["1"]
+
+
{'query': 'coronavirus origin',
+ 'question': 'what is the origin of COVID-19',
+ 'narrative': "seeking range of information about the SARS-CoV-2 virus's origin, including its evolution, animal source, and first transmission into humans"}
+
+
+
Each topic has many relevance judgements associated with it.
+
+
+
Relevance judgements
+
We can load the relevance judgement data directly into a pandas DataFrame.
The relevance data contains all the relevance judgements made throughout the 5 rounds of the competition. A relevancy equal to 0 means irrelevant, 1 relevant and 2 highly relevant.
+
+
relevance_data.head()
+
+

|   | topic_id | round_id | cord_uid | relevancy |
|---|---|---|---|---|
| 0 | 1 | 4.5 | 005b2j4b | 2 |
| 1 | 1 | 4.0 | 00fmeepz | 1 |
| 2 | 1 | 0.5 | 010vptx3 | 2 |
| 3 | 1 | 2.5 | 0194oljo | 1 |
| 4 | 1 | 4.0 | 021q9884 | 1 |

We are going to remove two rows that have relevancy equal to -1, which I am assuming is an error.
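
A one-line sketch of that filtering step with pandas:

```python
# keep only rows with a non-negative relevancy judgment
relevance_data = relevance_data[relevance_data.relevancy >= 0]
```
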
Define some labeled data. pyvespa expects labeled data to follow the format illustrated below. It is a list of dicts where each dict represents a query containing query_id, query and a list of relevant_docs. Each relevant document contains a required id key and an optional score key.
Use recall to specify which documents should be included in the evaluation.
+
+
In the example below, we include documents with id equal to 0, 1 and 2. Since the relevant documents for this query are the documents with id 0 and 3, we should get recall equal to 0.5.
ToDo: This notebook is still work in progress and cannot yet be auto-run
+
+
+
+
+
+
Create the application package
+
Create an application package:
+
+
from vespa.package import ApplicationPackage
+
+app_package = ApplicationPackage(name="imagesearch")
+
+
Add a field to hold the name of the image file. This is used in the sample app to load the final images that should be displayed to the end user.
+
The summary indexing ensures this field is returned as part of the query response. The attribute indexing stores the field in memory as an attribute for sorting, querying, and grouping:
+
+
from vespa.package import Field
+
+app_package.schema.add_fields(
+ Field(name="image_file_name", type="string", indexing=["summary", "attribute"]),
+)
+
+
Add a field to hold an image embedding. The embeddings are usually generated by an ML model. We can add multiple embedding fields to our application, which is useful when running experiments. For example, the sample app adds 6 image embeddings, one for each of the six pre-trained CLIP models available at the time.

In the example below, the embedding vector has size 512 and is of type float. The index is required to enable approximate matching, and the HNSW instance configures the HNSW index:
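
A sketch of such a field using pyvespa; the field name and the HNSW settings below are illustrative assumptions rather than the exact values used by the sample app.

```python
from vespa.package import Field, HNSW

app_package.schema.add_fields(
    Field(
        name="embedding_image",           # illustrative field name
        type="tensor<float>(x[512])",     # 512-dimensional float embedding
        indexing=["attribute", "index"],  # keep in memory and build an approximate index
        ann=HNSW(
            distance_metric="euclidean",
            max_links_per_node=16,
            neighbors_to_explore_at_insert=500,
        ),
    )
)
```
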
Add a rank profile that ranks the images by how close the image embedding vector is to the query embedding vector. The tensors used in queries must have their type declared in the application package; the code below declares the text embedding that will be sent with the query - it has the same size and type as the image embedding:
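
A sketch of the corresponding query tensor declaration and rank profile, again with illustrative names; closeness(embedding_image) assumes the field sketched above and a query tensor called embedding_text.

```python
from vespa.package import QueryTypeField, RankProfile

# declare the type of the text embedding sent with the query
app_package.query_profile_type.add_fields(
    QueryTypeField(
        name="ranking.features.query(embedding_text)",
        type="tensor<float>(x[512])",
    )
)

# rank images by how close their embedding is to the query embedding
app_package.schema.add_rank_profile(
    RankProfile(
        name="embedding_similarity",
        inherits="default",
        first_phase="closeness(embedding_image)",
    )
)
```
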
The application package created above can be deployed using Docker or Vespa Cloud. Follow the instructions based on the desired deployment mode. Either option will create a Vespa connection instance that can be stored in a variable that will be denoted here as app.
+
We can then use app to interact with the deployed application:
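
For example, a local Docker deployment with pyvespa could look like the sketch below (the vespa.deployment import path may differ between pyvespa versions; Vespa Cloud deployments use a VespaCloud object instead).

```python
from vespa.deployment import VespaDocker

vespa_docker = VespaDocker()
app = vespa_docker.deploy(application_package=app_package)  # returns a Vespa connection instance
```
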
One of the advantages of having a python API is that it can integrate with commonly used ML frameworks. The sample application shows how to create a PyTorch DataLoader to generate batches of image data and use CLIP models to generate image embeddings.
+
+
+
Query the application
+
The following query will use approximate nearest neighbor search to match the closest images to the query text and rank the images according to their distance to the query text. The sample application used CLIP models to generate image and query embeddings.
The sample application illustrates how to evaluate different CLIP models through the evaluate method:
+
+
result = app.evaluate(
    labeled_data=labeled_data,   # Labeled data to define which images should be returned to a given query
    eval_metrics=eval_metrics,   # Metrics used
    query_model=query_models,    # Each query model uses a different CLIP model version
    id_field="image_file_name",  # The name of the id field used by the labeled data to identify the image
    per_query=True               # Return results per query rather than aggregated
)
+
+
The figure below is the reciprocal rank at 100 computed based on the output of the evaluate method.
+ Data pipelines, model fitting and feature selection
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
This notebook is WIP and not runnable - ToDo FIXME
+
+
Data
+
This section describes the data that we are going to use to give a brief overview of the pyvespa ranking framework. The data was collected from a running Vespa application indexed with MS MARCO data. For each relevant (document_id, query_id)-pair we collected 9 random matched documents. Relevant documents have label=1 and non-relevant documents have label=0. In addition, many Vespa ranking features computed based on document and query interaction are included.
The ListwiseRankingFramework uses TensorFlow Ranking to minimize a listwise loss function that is a smooth approximation of the NDCG metric. The following parameters need to be specified:
+
+
from learntorank.ranking import ListwiseRankingFramework

ranking_framework = ListwiseRankingFramework(
    #
    # Task related
    #
    number_documents_per_query=10,   # The size of the list for each sample
    top_n=10,                        # Which NDCG position we want to optimize, e.g. NDCG@10
    #
    # Data pipeline
    #
    batch_size=32,                   # Batch size used when fitting models to the data
    shuffle_buffer_size=1000,        # The buffer size used when shuffling data batches
    #
    # Hyperparameter tuning
    #
    tuner_max_trials=3,              # How many trials to execute when searching hyperparameters
    tuner_executions_per_trial=1,    # How many model fits per trial
    tuner_epochs=10,                 # How many epochs to use per execution of the trial
    tuner_early_stop_patience=None,  # Set the patience number for early stopping
    #
    # Final model
    #
    final_epochs=30                  # Number of epochs to use when fitting the model with specific hyperparameters
)
+
+
WARNING:tensorflow:There are non-GPU devices in `tf.distribute.Strategy`, not using nccl allreduce.
+INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:CPU:0',)
+
+
+
+
Data pipeline
+
It is possible to create TensorFlow data pipelines (tf.data.Dataset) either from in-memory data frames or directly from .csv files, avoiding the need to load large files into memory. The data pipelines are suited for listwise ranking and can be used as part of a custom TensorFlow workflow if desired.
+
Create a tf.data.Dataset from in-memory data frames:
The ranking framework comes with some pre-defined models in case you don't want to use the data pipelines to create your own workflow. It is possible to specify either a DataFrame or a .csv file path as the train and dev input data. If the hyperparameters argument is not specified, it will search through the hyperparameter space according to the arguments defined when creating an instance of the ListwiseRankingFramework.
There are some pre-defined algorithms that can be used for feature selection. The goal is to find a subset of features that are responsible for most of the evaluation metric gains.
+
+
Lasso model search
+
Fit a lasso model with all feature_names. Sequentially remove the feature with the smallest absolute weight until there is only one feature in the model.
[
+f"Number of features {len(result['weights']['feature_names'])}; Eval metric: {result['evaluation']}"
+for result in results
+]
+
+
['Number of features 6; Eval metric: 0.7820510864257812',
+ 'Number of features 5; Eval metric: 0.7812100052833557',
+ 'Number of features 4; Eval metric: 0.7958707809448242',
+ 'Number of features 3; Eval metric: 0.7378504872322083',
+ 'Number of features 2; Eval metric: 0.7098456025123596',
+ 'Number of features 1; Eval metric: 0.7048170566558838']
+
+
+
+
[result['weights']['feature_names'] for result in results]
Starting in version 0.5.0 we can bypass the pyvespa high-level API and create a QueryModel with the full flexibility of the Vespa Query API. This is useful for use cases not covered by the pyvespa API and for users that are familiar with and prefer to work with the Vespa Query API.
+
+
def body_function(query):
    body = {
        'yql': 'select * from sources * where userQuery();',
        'query': query,
        'type': 'any',
        'ranking': {'profile': 'bm25', 'listFeatures': 'false'}
    }
    return body

flexible_query_model = QueryModel(body_function=body_function)
+
+
The flexible_query_model defined above is equivalent to the standard_query_model, as we can see when querying the app. We will use the cord19 app in our demonstration.
+
+
from vespa.application import Vespa
+
+app = Vespa(url ="https://api.cord19.vespa.ai")
+
+
+
from learntorank.query import send_query
+
+standard_result = send_query(
+ app=app,
+ query="this is a test",
+ query_model=standard_query_model
+)
+standard_result.get_hits().head(3)
+
+
+
flexible_result = send_query(
+ app=app,
+ query="this is a test",
+ query_model=flexible_query_model
+)
+flexible_result.get_hits().head(3)
Each feature data point will have the shape equal to (batch_size, number_documents_per_query, number_features) and each label data point will have shape equal to (batch_size, number_documents_per_query).
+
+
import tensorflow as tf
+
+
The code below creates a TensorFlow data pipeline (tf.data.Dataset) from our DataFrame and groups the rows by the query_id variable to form a listwise dataset. We then configure the data pipeline to shuffle and set a batch size.
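
A minimal sketch of such a pipeline, assuming the DataFrame df is sorted by query_id, contains exactly 10 documents per query, and has 3 feature columns plus a label column:

```python
feature_names = ["feature_1", "feature_2", "feature_3"]  # illustrative column names

# reshape into (num_queries, docs_per_query, num_features) and (num_queries, docs_per_query)
features = df[feature_names].to_numpy(dtype="float32").reshape(-1, 10, 3)
labels = df["label"].to_numpy(dtype="float32").reshape(-1, 10)

listwise_ds = (
    tf.data.Dataset.from_tensor_slices((features, labels))
    .shuffle(buffer_size=1000)
    .batch(32)
)
```
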
We can see that the shapes of the features and the labels are as expected.
+
+
for d in listwise_ds.take(1):
    print(d[0].shape)
    print(d[1].shape)
+
+
(32, 10, 3)
+(32, 10)
+
+
+
+
+
Create and compile model
+
We are going to create a linear model that takes listwise data as input, with shape (batch_size, number_documents_per_query, number_features), and outputs one prediction per document, with shape (batch_size, number_documents_per_query).
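
The layer definitions are not reproduced above; assuming 10 documents per query and 3 features, a simple linear scorer compatible with the Sequential call below could be:

```python
input_layer = tf.keras.layers.InputLayer(input_shape=(10, 3), name="input")
dense_layer = tf.keras.layers.Dense(1, use_bias=False)  # one linear score per document
output_layer = tf.keras.layers.Reshape((10,))            # (batch, 10, 1) -> (batch, 10)
```
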
model = tf.keras.Sequential(layers=[input_layer, dense_layer, output_layer])
+
+
In this tutorial, we want to optimize the Normalized Discounted Cumulative Gain at position 10 (NDCG@10). We then select a loss function that is a smooth approximation of the NDCG metric and create a stateless NDCG@10 metric to use when compiling the model defined above.
+
+
import tensorflow_ranking as tfr

ndcg = tfr.keras.metrics.NDCGMetric(topn=10)

def ndcg_stateless(y_true, y_pred):
    """
    Create stateless metric so that we can compute the validation metric
    from scratch at the end of each epoch.
    """
    ndcg.reset_states()
    return ndcg(y_true, y_pred)

optimizer = tf.keras.optimizers.Adagrad(learning_rate=2)
model.compile(
    optimizer=optimizer,
    loss=tfr.keras.losses.ApproxNDCGLoss(),
    metrics=ndcg_stateless,
)
After training the model by minimizing a listwise loss function, we can simplify the model before deploying it to Vespa. At inference time, Vespa will evaluate each document individually and use a ranking function to rank documents.
+
Therefore, the input layer will expect a tensor named input with shape equal to (1, number_features).
We are going to save the simpler_model to disk and then use the tf2onnx tool to convert the model to ONNX format.
+
+
simpler_model.save("simpler_keras_model")
+
+
WARNING:tensorflow:Compiled the loaded model, but the compiled metrics have yet to be built. `model.compile_metrics` will be empty until you train or evaluate the model.
+INFO:tensorflow:Assets written to: simpler_keras_model/assets
+
+
+
INFO:tensorflow:Assets written to: simpler_keras_model/assets
<frozen runpy>:128: RuntimeWarning: 'tf2onnx.convert' found in sys.modules after import of package 'tf2onnx', but prior to execution of 'tf2onnx.convert'; this may result in unpredictable behaviour
+2023-08-08 14:09:40,224 - WARNING - '--tag' not specified for saved_model. Using --tag serve
+2023-08-08 14:09:40,328 - INFO - Signatures found in model: [serving_default].
+2023-08-08 14:09:40,328 - WARNING - '--signature_def' not specified, using first signature: serving_default
+2023-08-08 14:09:40,328 - INFO - Output names: ['dense']
+2023-08-08 14:09:40,328 - WARNING - Could not search for non-variable resources. Concrete function internal representation may have changed.
+WARNING:tensorflow:From /usr/local/lib/python3.11/site-packages/tf2onnx/tf_loader.py:557: extract_sub_graph (from tensorflow.python.framework.graph_util_impl) is deprecated and will be removed in a future version.
+Instructions for updating:
+This API was designed for TensorFlow v1. See https://www.tensorflow.org/guide/migrate for instructions on how to migrate your code to TensorFlow v2.
+2023-08-08 14:09:40,379 - WARNING - From /usr/local/lib/python3.11/site-packages/tf2onnx/tf_loader.py:557: extract_sub_graph (from tensorflow.python.framework.graph_util_impl) is deprecated and will be removed in a future version.
+Instructions for updating:
+This API was designed for TensorFlow v1. See https://www.tensorflow.org/guide/migrate for instructions on how to migrate your code to TensorFlow v2.
+2023-08-08 14:09:40,388 - INFO - Using tensorflow=2.13.0, onnx=1.14.0, tf2onnx=1.8.4/cd55bf
+2023-08-08 14:09:40,388 - INFO - Using opset <onnx, 9>
+2023-08-08 14:09:40,389 - INFO - Computed 0 values for constant folding
+2023-08-08 14:09:40,395 - INFO - Optimizing ONNX model
+2023-08-08 14:09:40,402 - INFO - After optimization: Identity -5 (5->0)
+2023-08-08 14:09:40,403 - INFO -
+2023-08-08 14:09:40,403 - INFO - Successfully converted TensorFlow model simpler_keras_model to ONNX
+2023-08-08 14:09:40,403 - INFO - Model inputs: ['input:0']
+2023-08-08 14:09:40,403 - INFO - Model outputs: ['dense']
+2023-08-08 14:09:40,403 - INFO - ONNX model is saved at simpler_keras_model.onnx
+
+
+
We can inspect the onnx model input and output. We first load the ONNX model:
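
A sketch of that inspection using the onnx package; the expected input and output names come from the conversion log above.

```python
import onnx

onnx_model = onnx.load("simpler_keras_model.onnx")
print(onnx_model.graph.input)   # expect a single input named 'input:0'
print(onnx_model.graph.output)  # expect a single output named 'dense'
```
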
This section will use the Vespa python API pyvespa to create an application package with a ranking function that uses the tensorflow model exported to ONNX.
+
The data used to train the model was derived from a Vespa application based on the MS Marco passage dataset. So, we are going to name the application msmarco, and start by adding two fields: id to hold the document id and text to hold the passages from the msmarco dataset.
+
indexing configuration: We add "summary" to the indexing parameter because we want to include both the id and the text field in the query results. The "attribute" indicates that the field id will be stored in-memory. The "index" indicates that Vespa will create a search index for the text field.
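
A sketch of that package definition with pyvespa, matching the schema printed below:

```python
from vespa.package import ApplicationPackage, Field

app_package = ApplicationPackage(name="msmarco")

app_package.schema.add_fields(
    Field(name="id", type="string", indexing=["summary", "attribute"]),
    Field(name="text", type="string", indexing=["summary", "index"]),
)
```
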
Note that at each step along the application package definition, we can inspect the content of the Vespa search definition file:
+
+
print(app_package.schema.schema_to_text)
+
+
schema msmarco {
+ document msmarco {
+ field id type string {
+ indexing: summary | attribute
+ }
+ field text type string {
+ indexing: summary | index
+ }
+ }
+}
+
+
+
Add simpler_keras_model.onnx to the schema:

- The model_name is an id that can be used in the ranking function to identify which model to use.
- The model_file_path is the current path of the .onnx file. When deploying the application, pyvespa will move the file to the correct location inside the Vespa application package folder.
- The inputs maps the name of the inputs contained in the ONNX model to the name of the Vespa source that will be used as input to the model. In this case we will create a function called vespa_input that outputs a tensor of type float with the expected shape (1, 3).
- The outputs maps the output name in the ONNX file to the output name that will be recognized by Vespa.
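
A sketch of how the model could be added with pyvespa's OnnxModel, using the names that appear in the schema below; the exact helper and its signature may differ between pyvespa versions.

```python
from vespa.package import OnnxModel

app_package.schema.add_model(
    OnnxModel(
        model_name="ltr_tensorflow",                 # id used by the ranking function
        model_file_path="simpler_keras_model.onnx",  # local path; pyvespa copies it to files/ on deploy
        inputs={"input:0": "vespa_input"},           # ONNX input -> Vespa source (the vespa_input function)
        outputs={"dense": "dense"},                  # ONNX output -> name visible to Vespa
    )
)
```
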
It is possible to see the addition of the onnx-model section in the search definition below. Note that the model file is expected to be under the files folder inside the final application package folder, but pyvespa takes care of the model file placement when deploying the application.
+
+
print(app_package.schema.schema_to_text)
+
+
schema msmarco {
+ document msmarco {
+ field id type string {
+ indexing: summary | attribute
+ }
+ field text type string {
+ indexing: summary | index
+ }
+ }
+ onnx-model ltr_tensorflow {
+ file: files/ltr_tensorflow.onnx
+ input input:0: vespa_input
+ output dense: dense
+ }
+}
+
+
+
Add a rank profile named tensorflow that uses the TensorFlow model to rank documents:

- first_phase: We use the Vespa ranking feature onnx to access the ONNX model named ltr_tensorflow and use the output dense. We apply sum because Vespa requires the relevance score to be a scalar, and the output of the ONNX model in this case is a tensor of shape (1, 1).
- vespa_input function: The ONNX model was trained with the features fieldMatch(text).queryCompleteness, fieldMatch(text).significance and nativeRank(text), and expects a tensor of shape (1, 3) containing those features.
- summary_features: Summary features allow us to specify Vespa features to be included in the output of a query. In this case, we want access to the model inputs and output to check whether the Vespa model evaluation matches what we get from the original TensorFlow model.
The code below shows the YQL expression that will be used to select the documents to be ranked.
+
+
"select * from sources * where ({{grammar: 'any', defaultIndex: 'text'}}userInput('{}'))".format(query_text)
+
+
"select * from sources * where ({grammar: 'any', defaultIndex: 'text'}userInput('why say the sky is the limit'))"
+
+
+
The function get_vespa_prediction_and_features will match documents using the YQL expression above and rank the documents with the rank-profile tensorflow that we defined in the Vespa application package.
+
+
def get_vespa_prediction_and_features(query_text):
    # Send query and extract hits
    hits = app.query(
        body={
            "yql": "select * from sources * where ({{'grammar': 'any', 'defaultIndex': 'text'}}userInput('{}'));".format(query_text),
            "ranking": "tensorflow"
        }
    ).hits
    result = []
    # For each hit, extract the inputs to the model along with model predictions computed by Vespa
    for hit in hits:
        result.append({
            "fieldMatch(text).queryCompleteness": hit["fields"]["summaryfeatures"]["fieldMatch(text).queryCompleteness"],
            "fieldMatch(text).significance": hit["fields"]["summaryfeatures"]["fieldMatch(text).significance"],
            "nativeRank(text)": hit["fields"]["summaryfeatures"]["nativeRank(text)"],
            "vespa_prediction": hit["relevance"],
        })
    return pd.DataFrame.from_records(result)
For the passage ranking use case, we will use the MS MARCO passage dataset through the ir_datasets library. Besides being convenient, ir_datasets solves encoding errors in the original dataset source files.
+
+
import ir_datasets
+import pandas as pd
+
+
+
Data Exploration
+
+
Document corpus
+
Start by loading the data. The dataset will be downloaded once and cached on disk for future use, so it takes a while the first time the command below is run.
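
The loading cell is not reproduced here; with the ir_datasets package imported above it could look like the sketch below, where the exact dataset id is an assumption.

```python
# "judged" restricts the training split to queries that have relevance judgments (assumed id)
passage_train = ir_datasets.load("msmarco-passage/train/judged")
```
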
It is interesting to check what is the range of values of the relevance score. The code below shows that the only score available is 1, indicating that the particular document id is relevant to the query id.
+
+
set([score
+for relevant in train_qrels_dict.values()
+for score in relevant.values()]
+ )
+
+
{1}
+
+
+
+
+
Queries
+
Number of training queries:
+
+
passage_train.queries_count()
+
+
502939
+
+
+
The number of queries differs from the number of relevant documents because some of the queries have more than one relevant document associated with them.
Given the large amount of data, it is useful to properly sample data when prototyping, which can be done with the sample_data function. This might take some time in case the full dataset needs to be downloaded for the first time.
+
+
from learntorank.passage import sample_data
+
+passage_sample = sample_data(n_relevant=100, n_irrelevant=800)
+ Compare different metrics and their uncertainty in the passage ranking dataset.
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
When working with search engine apps, be it a text search or a recommendation system, part of the job is doing experiments around components such as ranking functions and deciding which experiments deliver the best result.
+
This tutorial builds a text search app with Vespa, feeds a sample of the passage ranking dataset to the app, and evaluates two ranking functions across three different metrics. In addition to returning point estimates of the evaluation metrics, we compute confidence intervals as illustrated in the plot below. Measuring uncertainty around the metric estimates gives us a better sense of how significant the impact of our changes to the application is.
+
+
The code and the data used in this end-to-end tutorial are available and can be reproduced in a Jupyter Notebook.
+
+
Create the Vespa application package
+
Create a Vespa application package to perform passage ranking experiments using the create_basic_search_package function.
+
+
from learntorank.passage import create_basic_search_package
+
+app_package = create_basic_search_package()
In this tutorial, we are going to compare two ranking functions. One is based on NativeRank, and the other is based on BM25.
+
+
+
Deploy the application
+
Deploy the application package in a Docker container for local development. Alternatively, it is possible to deploy the application package to Vespa Cloud.
Number of documents: 1000
Number of train queries: 100
Number of train relevance judgments: 100
Number of dev queries: 100
Number of dev relevance judgments: 100
Create the bm25 QueryModel, which uses Vespa's weakAnd operator to match documents relevant to the query and uses the bm25 rank-profile that we defined in the application package above to rank the documents.
[{'fields': {'doc_id': '7407715',
+ 'documentid': 'id:PassageRanking:PassageRanking::7407715',
+ 'sddocname': 'PassageRanking',
+ 'summaryfeatures': {'bm25(text)': 11.979235042476953,
+ 'vespa.summaryFeatures.cached': 0.0},
+ 'text': 'The Sky is the Limit also known as TSITL is a global '
+ 'effort designed to influence, motivate and inspire '
+ 'people all over the world to achieve their goals and '
+ 'dreams in life. TSITL’s collaborative community on '
+ 'social media provides you with a vast archive of '
+ 'motivational pictures/quotes/videos.'},
+ 'id': 'id:PassageRanking:PassageRanking::7407715',
+ 'relevance': 11.979235042476953,
+ 'source': 'PassageRanking_content'},
+ {'fields': {'doc_id': '84721',
+ 'documentid': 'id:PassageRanking:PassageRanking::84721',
+ 'sddocname': 'PassageRanking',
+ 'summaryfeatures': {'bm25(text)': 11.310323797415357,
+ 'vespa.summaryFeatures.cached': 0.0},
+ 'text': 'Sky Customer Service 0870 280 2564. Use the Sky contact '
+ 'number to get in contact with the Sky customer services '
+ 'team to speak to a representative about your Sky TV, Sky '
+ 'Internet or Sky telephone services. The Sky customer '
+ 'Services team is operational between 8:30am and 11:30pm '
+ 'seven days a week.'},
+ 'id': 'id:PassageRanking:PassageRanking::84721',
+ 'relevance': 11.310323797415357,
+ 'source': 'PassageRanking_content'}]
+
+
+
+
+
+
Evaluate query models
+
In this section, we want to evaluate and compare the bm25_query_model defined above with the native_query_model defined below:
It is straightforward to obtain point estimates of the evaluation metrics for each query model being compared. In this case, we computed the mean and the standard deviation for each of the metrics.
Given the nature of the data distribution of the metrics described above, it is not trivial to compute a confidence interval from the mean and the standard deviation computed above. In the next section, we solve this by using bootstrap sampling on a per query metric evaluation.
+
+
+
Uncertainty estimates
+
Instead of returning aggregated point estimates, we can also compute the metrics per query by setting per_query=True. This gives us more granular information on the distribution function of the metrics.
+ Accelerated model evaluation using ONNX Runtime in the stateless cluster
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
Vespa has implemented accelerated model evaluation using ONNX Runtime in the stateless cluster. This opens up new usage areas for Vespa, such as serving model predictions.
+
+
Define the model server
+
The SequenceClassification task takes a text input and returns an array of floats that depends on the model used to solve the task. The model argument can be the id of the model as defined by the huggingface model hub.
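
A construction sketch using the model_id and model parameters documented earlier in this page; the import path and the bert-tiny checkpoint id are assumptions.

```python
from learntorank.ml import SequenceClassification  # import path assumed

task = SequenceClassification(
    model_id="bert_tiny",                       # id used to identify the model in the Vespa application
    model="google/bert_uncased_L-2_H-128_A-2",  # model id on the Hugging Face model hub (example)
)
```
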