From 1a7e4248c35a0941f7d6c39d94f999d2e8562d3b Mon Sep 17 00:00:00 2001
From: MariusArhaug Synopsis: Example
+ Language of the text field, must be a valid language code from the RFC5646 standard.
+ Document type identifier for the vespa dump file.
-Constrained in rank-profile. True or false. By default this is false. When enabled Vespa will use the significance calculation based on provided significance models in the service.xml for the rank-profile it is defined in.
+Contained in rank-profile. True or false; the default is false. When enabled, Vespa uses the significance models configured in services.xml to calculate significance for the rank-profile in which it is defined.
-The significance tag can include multiple models. Their order determines the model precedence for a given language, with the last element having the highest. The models' document frequency is used to set a token's significance value based on the inverse document frequency (IDF). To enable the use of these models, the schema needs to have a rank-profile field with the significance element and the use-model flag set to true.
+The significance element can include multiple models. Their order determines the model precedence for a given language, with the last element having the highest precedence. The models' document frequencies are used to set a token's significance. To enable the use of these models, the schema needs a rank-profile with the significance element and use-model set to true.
-Example of significance model with multiple models. These models are either provided by Vespa or can be generated with the vepsa-signficance cli.
+Example with multiple model files. These models are either provided by Vespa or can be generated with the vespa-significance CLI.
 vespa-set-node-state
-vespa-significance
-vepsa-signficance
is a tool that generates a significance model file based on this file format. Its input is a vespa-feed
file.
+vespa-significance
+vespa-significance
is a tool that generates a significance model file. Its input is a vespa-feed file.
vespa-significance [options]
vespa-significance
-h, --help
-
- Help text
-
+ Help text
- -i, --input <input file>
+ -i, --in <input file>
- Vespa dump file to be used for generating the significance model
+ Vespa-feed file to be used for generating the significance model
-o, --out <output file>
@@ -1945,17 +1943,24 @@ vespa-significance
-f, --field <field>
- Name of the text field to be used for tokenization
+ Name of the text field to be used for the significance model
-l, --language <language>
- Language of the text field, must be a valid language code from the RFC5646 standard.
+
+ It is used with
+ OpenNLP's tokenizer to tokenize the text field based on that language's rules.
+
diff --git a/en/reference/schema-reference.html b/en/reference/schema-reference.html
index eae80f23d6..41cd3aa8fd 100644
--- a/en/reference/schema-reference.html
+++ b/en/reference/schema-reference.html
@@ -2488,7 +2488,7 @@ -d, --doc-type <doc-id>
- Document type identifier for the dump file
+
+ It becomes a part of the id for put operations in the vespa-feed file. { "put": "id::<doc-id>::1" }
+ onnx-model
significance
significance {
use-model: true
diff --git a/en/reference/services-search.html b/en/reference/services-search.html
index 65ee2c76e7..7ca79d44c9 100644
--- a/en/reference/services-search.html
+++ b/en/reference/services-search.html
@@ -331,10 +331,10 @@
renderer
significance
<significance>
<model model-id="wikimedia"/>
@@ -345,6 +345,30 @@
significance
+Name | Occurrence | Description | Type | Default
+---|---|---|---|---
+model | One To Many | Used to point to the significance model file | model-type | N/A
 Specifies how a search chain should be instantiated, and how the contained searchers should be ordered.
diff --git a/en/reference/significance-reference.html b/en/reference/significance-reference.html
deleted file mode 100644
index 0c72a62320..0000000000
--- a/en/reference/significance-reference.html
+++ /dev/null
@@ -1,107 +0,0 @@
----
-# Copyright Vespa.ai. Licensed under the terms of the Apache 2.0 license. See LICENSE in the project root.
-title: "Significance Reference"
-redirect_from:
-- /documentation/reference/significance-reference.html
----
-
Reference configuration for embedders.
- -
- Significance model uses the model type configuration.
- The model type configuration accepts attributes model-id, url or path,
- and multiple of these can be specified as a single config value. The model order determines the precedence for a given language
- with the last element having the highest precedence.
-
- If model-id is specified and the application is deployed on Vespa Cloud, the model-id is used.
- Otherwise, if url is specified, it is used.
- Otherwise, path is used.
-
- When using path, the model files must be supplied in the
- Vespa application package.
-
- A significance component is comprised of one or multiple significance models, for one or multiple languages. It uses these models' document frquencies to calculate the inverse document frequency (IDF) of terms in a query. -
-
- The significance component is configured in services.xml, with the significance
tag:
-
{% highlight xml %} -- -- -{% endhighlight %}- -- -- - -
-Name | Occurrence | Description | Type | Default
----|---|---|---|---
-model | One To Many | Use to point to the significance model file | model-type | N/A
-The significance model file is a JSON file with the following format:
-{% highlight json %}
-{
-  "version": 1,
-  "id": "wikipedia",
-  "description": "Some optional description",
-  "languages": {
-    "en": {
-      "description": "Some optional description for English model",
-      "document-count": 1000,
-      "document-frequencies": {
-        "and": 500,
-        "car": 100,
-        ...
-      }
-    },
-    "no": {
-      "description": "Some optional description for Norwegian model",
-      "document-count": 800,
-      "document-frequencies": {
-        "bil": 80,
-        "og": 400,
-        ...
-      }
-    }
-  }
-}
-{% endhighlight %}
-
-
-Each file contains a map of languages and their document frequencies. - -
-
\ No newline at end of file
diff --git a/en/reference/significance.html b/en/reference/significance.html
deleted file mode 100644
index 1a63ca6a87..0000000000
--- a/en/reference/significance.html
+++ /dev/null
@@ -1,217 +0,0 @@
----
-# Copyright Vespa.ai. Licensed under the terms of the Apache 2.0 license. See LICENSE in the project root.
-title: "Significance Model Reference"
-redirect_from:
-- /documentation/reference/significance.html
----
-
-The
-significance model feature
-implements the inverse document frequency query term for tokens based on an existing or user defined significance model. A siginficance model is a mapping from query terms to a floating point value. The significance model(s) are either provided by Vespa or can be generated using the Vespa-CLI command vespa-significance.
-
-
The IDF of query term i in field t is currently calculated per field per content node:
- - -- N is the total number of documents on the content node, and n(qi) is the number of documents containing the query term qi for field t. -
- -With the user or Vepsa defined significance models, the IDF calculation can be overridden
- -
-In the following example, we show how to reference a significance model in the service.xml
.
-Note that the field must be enabled for usage with the bm25 feature
-by setting the use-model flag in the
-significance rank-profile
-section of the field definition.
-
-
-<container version="1.0">
-  <search>
-    <significance>
-      <model model-id="wikimedia"/>
-      <model url="https://some/uri/bibel-multilingual.json" />
-      <model path="models/reddit-norge.no.json.zst" />
-    </significance>
-  </search>
-</container>
-
-
-Note that it is possible to specify multiple significance models in the service.xml file.
-
--
-schema example {
-    document example {
-        field content type string {
-            indexing: index | summary
-            index: enable-bm25
-        }
-    }
-    rank-profile default {
-        significance {
-            use-model: true
-        }
-    }
-}
-
-
diff --git a/en/significance.html b/en/significance.html
new file mode 100644
index 0000000000..c224879c38
--- /dev/null
+++ b/en/significance.html
@@ -0,0 +1,103 @@
+---
+# Copyright Vespa.ai. Licensed under the terms of the Apache 2.0 license. See LICENSE in the project root.
+title: "Using Significance Model"
+redirect_from:
+- /documentation/reference/significance.html
+---
+
+The
+significance model feature
+computes the inverse document frequency (IDF) of query terms from an existing or user-defined significance model. A significance model is a mapping from query terms to a floating point value. Significance models are either provided by Vespa or can be generated using the vespa-significance CLI tool.
+
+
The bm25 and native rank features use the significance value of each query term searching an index field when calculating the score of a document. These ranking features have shortcomings; for example, the bm25 rank feature suffers from the following limitations: +
+ +By explicitly using a Vespa-provided or user-defined significance model, the calculations of these rank features can be overridden.
+ +
+In the following example, we show how to reference a significance model in services.xml.
+Note that the field must be enabled for use with the bm25 feature,
+and that use-model must be set to true in the
+significance rank-profile
+section of the schema.
+
+ A significance component consists of one or more significance models, for one or more languages. It uses these models' document frequencies to calculate the inverse document frequency (IDF) of terms in a query. +
+ ++
{% highlight xml %}
+{% endhighlight %}
+
+Note that it is possible to specify multiple significance models in the services.xml file.
+
++
+schema example {
+    document example {
+        field content type string {
+            indexing: index | summary
+            index: enable-bm25
+        }
+    }
+    rank-profile default {
+        significance {
+            use-model: true
+        }
+    }
+}
+
+
+ The significance model file is a JSON file with the following format:
+{% highlight json %}
+{
+  "version": 1,
+  "id": "wikipedia",
+  "description": "Some optional description",
+  "languages": {
+    "en": {
+      "description": "Some optional description for English model",
+      "document-count": 1000,
+      "document-frequencies": {
+        "and": 500,
+        "car": 100,
+        ...
+      }
+    },
+    "no": {
+      "description": "Some optional description for Norwegian model",
+      "document-count": 800,
+      "document-frequencies": {
+        "bil": 80,
+        "og": 400,
+        ...
+      }
+    }
+  }
+}{% endhighlight %}
+
+
+Each file contains a map of languages and their document frequencies. A term's document frequency is the number of documents in the corpus that contain the term. The document count is the total number of documents in the corpus. + +
+
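To make the relationship between document-count and document-frequencies concrete, here is a minimal Python sketch that reads a model in the JSON format above and derives a per-term IDF. It is an illustration only: the formula log(N / n) is a textbook IDF chosen for the example and is not claimed to be the formula Vespa applies, and the helper name textbook_idf, the sample counts, and the treatment of unseen terms are all assumptions.

```python
import json
import math

# Hypothetical sample following the significance model file format shown above;
# the terms and counts are illustrative only.
MODEL_JSON = """
{
  "version": 1,
  "id": "wikipedia",
  "description": "Some optional description",
  "languages": {
    "en": {
      "description": "Some optional description for English model",
      "document-count": 1000,
      "document-frequencies": {"and": 500, "car": 100}
    }
  }
}
"""


def textbook_idf(model: dict, language: str, term: str) -> float:
    """Return log(N / n) for a term: an assumed textbook IDF, not Vespa's formula."""
    lang = model["languages"][language]
    n_docs = lang["document-count"]
    # Assumption: a term missing from the model is treated as occurring in one document.
    doc_freq = lang["document-frequencies"].get(term, 1)
    return math.log(n_docs / doc_freq)


model = json.loads(MODEL_JSON)
for term in ("and", "car"):
    # A frequent term ("and") gets a lower value than a rarer one ("car").
    print(term, round(textbook_idf(model, "en", term), 3))
```

The sketch only shows why the file stores both numbers: a term's weight depends on its document frequency relative to the document count for that language.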