add significance documentation

vespa-engine · Jun 18, 2024 · 734d55f · 734d55f
1 parent b0a1e72
commit 734d55f
Showing 7 changed files with 408 additions and 0 deletions.
diff --git a/_data/sidebar.yml b/_data/sidebar.yml
@@ -129,6 +129,8 @@ docs:
         url: /en/stateless-model-evaluation.html
       - page: Ranking With BM25
         url: /en/reference/bm25.html
+      - page: Ranking With Significance Model
+        url: /en/reference/significance.html
       - page: Ranking With nativeRank
         url: /en/nativerank.html
       - page: Accelerated OR search using the WAND algorithm
@@ -410,6 +412,8 @@ docs:
         url: /en/reference/stateless-model-reference.html
       - page: Embedding Model Reference
         url: /en/reference/embedding-reference.html
+      - page: Significance Model Reference
+        url: /en/reference/significance-reference.html
 
   - title: Queries and results reference
     documents:

diff --git a/en/operations-selfhosted/vespa-cmdline-tools.html b/en/operations-selfhosted/vespa-cmdline-tools.html
@@ -1910,6 +1910,56 @@ <h2 id="vespa-set-node-state">vespa-set-node-state</h2>
 
 <!--h2 id="vespa-slobrok-cmd">vespa-slobrok-cmd</h2-->
 
+<h2 id="vespa-significance">vespa-significance</h2>
+<p><code>vespa-significance</code> is a tool that generates a significance model file in the <a href="../reference/significance-reference.html">significance model file format</a>. Its input is a <code>vespa-feed</code> file, such as a document dump in JSONL format.
+</p>
+<p>Synopsis: <code>vespa-significance [options]</code></p>
+<p>Example</p>
+<pre>
+$ vespa-significance --in vespa-dump.jsonl --out en_model.json --field text --language EN --doc-type "en"
+</pre>
+<table class="table">
+  <thead>
+  <tr>
+    <th>Option</th>
+    <th>Description</th>
+  </tr>
+  </thead>
+  <tbody>
+    <tr>
+      <th>-h, --help</th>
+      <td>
+        Help text
+      </td>
+    </tr>
+    <tr>
+      <th>-i, --in &lt;input file&gt;</th>
+      <td>
+        Vespa dump file to be used for generating the significance model
+      </td>
+    </tr><tr>
+      <th>-o, --out &lt;output file&gt;</th>
+      <td>
+        Output file for the significance model
+      </td>
+    </tr><tr>
+      <th>-f, --field &lt;field&gt;</th>
+      <td>
+        Name of the text field to tokenize
+      </td>
+    </tr><tr>
+      <th>-l, --language &lt;language&gt;</th>
+      <td>
+        Language of the text field. Must be a valid language tag as defined by <a href="https://www.rfc-editor.org/rfc/rfc5646">RFC 5646</a>.
+      </td>
+    </tr><tr>
+      <th>-d, --doc-type &lt;doc-type&gt;</th>
+      <td>
+        Document type of the documents in the dump file
+      </td>
+    </tr>
+  </tbody>
+</table>
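The following is a minimal Python sketch of the kind of counting such a tool performs: for each token, count the number of documents that contain it. It is not the implementation of <code>vespa-significance</code>. The function name <code>build_significance_model</code>, the <code>model_id</code> parameter, the assumed feed shape (one <code>put</code> operation per line with a <code>fields</code> object) and the lowercase whitespace tokenizer are all assumptions made for illustration; the real tool tokenizes with Vespa's linguistics and may differ in other details.

```python
import json
from collections import Counter


def build_significance_model(feed_lines, field, language, model_id="example"):
    """Count, per token, how many documents contain it (document frequency)."""
    document_count = 0
    document_frequencies = Counter()
    for line in feed_lines:
        line = line.strip()
        if not line:
            continue
        operation = json.loads(line)
        text = operation.get("fields", {}).get(field)
        if not isinstance(text, str):
            continue  # assumption: skip documents without the requested field
        document_count += 1
        # Stand-in tokenizer (assumption): lowercase and split on whitespace.
        # set() ensures a token is counted at most once per document.
        document_frequencies.update(set(text.lower().split()))
    return {
        "version": 1,
        "id": model_id,
        "languages": {
            language: {
                "document-count": document_count,
                "document-frequencies": dict(sorted(document_frequencies.items())),
            }
        },
    }


# Hypothetical two-document feed, one JSON operation per line.
feed = [
    '{"put": "id:ns:doc::1", "fields": {"text": "the car and the bus"}}',
    '{"put": "id:ns:doc::2", "fields": {"text": "the bus"}}',
]
model = build_significance_model(feed, field="text", language="en")
print(json.dumps(model, indent=2))
```

Counting each token once per document, rather than once per occurrence, is what makes the result a document frequency and not a term frequency.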
 
 
 <h2 id="vespa-start-configserver">vespa-start-configserver</h2>

diff --git a/en/reference/schema-reference.html b/en/reference/schema-reference.html
@@ -132,6 +132,7 @@ <h2 id="elements">Elements</h2>
         <a href="#inputs">inputs</a>
         <a href="#constants">constants</a>
         <a href="#onnx-model">onnx-model</a>
+        <a href="#significance">significance</a>
         <a href="#rank-properties">rank-properties</a>
         <a href="#match-features">match-features</a>
         <a href="#mutate">mutate</a>
@@ -1490,6 +1491,10 @@ <h2 id="rank-profile">rank-profile</h2>
   <td>Zero or many</td>
   <td>An onnx model to make available in this profile.</td>
 </tr>
+<tr><td><a href="#significance">significance</a></td>
+  <td>Zero or one</td>
+  <td>Enables the use of the significance models configured in services.xml.</td>
+</tr>
 <tr><td><a href="#rank-properties">rank-properties</a></td>
   <td>Zero or one</td>
 <td>List of any rank property key-values to be used by rank features.</td>
@@ -2481,6 +2486,15 @@ <h2 id="onnx-model">onnx-model</h2>
 </table>
 <p>For more details including examples, see <a href="../onnx.html">ranking with ONNX models.</a></p>
 
+<h2 id="significance">significance</h2>
+<p>
+Contained in <a href="#rank-profile">rank-profile</a>. Set <code>use-model</code> to true or false; the default is false. When true, Vespa calculates significance for this rank profile using the significance models configured in services.xml.
+</p>
+<pre>
+significance {
+    use-model: true
+}
+</pre>
 
 
 <h2 id="document-summary">document-summary</h2>

diff --git a/en/reference/services-container.html b/en/reference/services-container.html
@@ -29,6 +29,7 @@
         <a href="services-search.html#provider">provider</a>
         <a href="services-processing.html#chain">chain</a>
         <a href="services-search.html#renderer">renderer</a>
+        <a href="services-search.html#significance">significance</a>
     <a href="services-docproc.html">document-processing</a>
         <a href="#include">include [dir]</a>
         <a href="services-docproc.html#documentprocessor">documentprocessor</a>

diff --git a/en/reference/services-search.html b/en/reference/services-search.html
@@ -33,6 +33,7 @@
         <a href="#source">source [id]</a>
             <a href="#searcher">searcher [id, class, bundle, provides, before, after]</a>
     <a href="#renderer">renderer [id, class, bundle]</a>
+    <a href="#significance">significance</a>
     <a href="#threadpool">threadpool</a>
         <a href="#threadpool-threads">threads [ boost ]</a>
         <a href="#threadpool-queue">queue</a>
@@ -328,6 +329,20 @@ <h2 id="renderer">renderer</h2>
           bundle="the name in &lt;artifactId&gt; in pom.xml" /&gt;
 </pre>
 
+<h2 id="significance">significance</h2>
+<p>
+The <code>significance</code> element can contain multiple models. Their order determines model precedence for a given language: the last model has the highest precedence. The document frequencies in the models are used to set each token's significance value, based on inverse document frequency (IDF). This is handled by the <a href="../search-definitions.html#rank-profile">Significance Searcher</a><!-- TODO fix href -->. To enable these models, a rank-profile in the schema must contain the <em>significance</em> element with <em>use-model</em> set to <em>true</em>.
+</p>
+
+<p>Example of a significance configuration with multiple models. The models are either provided by <em>Vespa</em> or generated with the <a href="../operations-selfhosted/vespa-cmdline-tools.html#vespa-significance">vespa-significance</a> CLI.</p>
+<pre data-test="file" data-path="my-app/src/main/application/services.xml">
+&lt;significance&gt;
+    &lt;model model-id="wikimedia"/&gt;
+    &lt;model url="https://some/uri/bibel-multilingual.json" /&gt;
+    &lt;model path="models/reddit-norge.no.json.zst" /&gt;
+&lt;/significance&gt;
+</pre>
 
 
 <h2 id="chain">chain</h2>

diff --git a/en/reference/significance-reference.html b/en/reference/significance-reference.html
@@ -0,0 +1,107 @@
+---
+# Copyright Vespa.ai. Licensed under the terms of the Apache 2.0 license. See LICENSE in the project root.
+title: "Significance Reference"
+redirect_from:
+- /documentation/reference/significance-reference.html
+---
+
+<p>Reference configuration for <a href="significance.html">significance models</a>.</p>
+
+<h2 id="model-config-reference">Model config reference</h2>
+<p>
+  A significance model is configured using the <a href="config-files.html#model">model</a> type.
+  The <em>model</em> type accepts the attributes <code>model-id</code>, <code>url</code> and <code>path</code>,
+  and more than one of these can be specified on a single config value. The model order determines precedence for a given language: the last element has the highest precedence.
+</p>
+  <ul>
+    <li>If a <code>model-id</code> is specified and the application is deployed on Vespa Cloud, the <code>model-id</code> is used.</li>
+    <li>Otherwise, if a <code>url</code> is specified, it is used.</li>
+    <li>Otherwise, <code>path</code> is used.</li>
+  </ul>
+<p>
+  When using <code>path</code>, the model files must be supplied in the
+  Vespa <a href="../application-packages.html#deploying-remote-models">application package</a>.
+</p>
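The selection rule in the list above can be sketched in Python as follows. This is an illustration of the documented rule only, not Vespa's implementation: the function name <code>resolve_model_source</code>, the <code>on_vespa_cloud</code> flag and the error raised when no attribute is usable are assumptions.

```python
from typing import Optional, Tuple


def resolve_model_source(model_id: Optional[str],
                         url: Optional[str],
                         path: Optional[str],
                         on_vespa_cloud: bool) -> Tuple[str, str]:
    """Return (attribute name, value) for the attribute that would be used."""
    # model-id is only used when the application is deployed on Vespa Cloud.
    if model_id is not None and on_vespa_cloud:
        return ("model-id", model_id)
    # Otherwise fall back to url, then to path.
    if url is not None:
        return ("url", url)
    if path is not None:
        return ("path", path)
    # Assumption: treat a model with no usable attribute as an error.
    raise ValueError("no usable model source specified")


print(resolve_model_source("wikimedia", "https://some/uri/model.json",
                           "models/model.json", on_vespa_cloud=False))
```

With <code>on_vespa_cloud=False</code> the <code>model-id</code> is skipped and the <code>url</code> is chosen, which is why a self-hosted deployment needs a <code>url</code> or <code>path</code> in addition to any <code>model-id</code>.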
+
+<h2 id="significance">Significance</h2>
+<p>
+  A significance component consists of one or more significance models, covering one or more languages. It uses the document frequencies in these models to calculate the inverse document frequency (IDF) of the terms in a query.
+</p>
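As a rough illustration of how a document frequency translates into an IDF value, the Python sketch below uses the textbook definition, idf = ln(N / df). The exact formula and normalization Vespa applies are not stated here, so treat the formula, the function name <code>idf</code> and the handling of unseen terms as assumptions.

```python
import math


def idf(document_count: int, document_frequency: int) -> float:
    """Textbook inverse document frequency: ln(N / df)."""
    # Assumption: clamp df to [1, N] so unseen terms get the maximum value
    # instead of causing a division by zero.
    df = max(1, min(document_frequency, document_count))
    return math.log(document_count / df)


# Hypothetical model: 1000 documents, "and" occurs in 500, "car" in 100.
document_count = 1000
document_frequencies = {"and": 500, "car": 100}

for term in ("and", "car", "zeppelin"):
    value = idf(document_count, document_frequencies.get(term, 0))
    print(f"{term}: {value:.3f}")
```

A common term such as "and" gets a low value and a rare term such as "car" a higher one, which is the property that makes document frequencies useful for weighting query terms.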
+<p>
+  The significance component is configured in <a href="services.html">services.xml</a>, with the <code>significance</code> tag:
+</p>
+
+<pre>{% highlight xml %}
+<container version="1.0">
+    <search>
+        <significance>
+            <model model-id="wikimedia"/>
+            <model url="https://some/uri/bibel-multilingual.json" />
+            <model path="models/reddit-norge.no.json.zst" />
+        </significance>
+    </search>
+</container>
+{% endhighlight %}</pre>
+
+<h3 id="significance-reference-config">Significance reference config</h3>
+<table class="table">
+  <thead>
+    <tr>
+      <th>Name</th>
+      <th>Occurrence</th>
+      <th>Description</th>
+      <th>Type</th>
+      <th>Default</th>
+    </tr>
+  </thead>
+  <tbody>
+    <tr>
+      <td>model</td>
+      <td>One to many</td>
+      <td>Points to a significance model file</td>
+      <td><a href="#model-config-reference">model-type</a></td>
+      <td>N/A</td>
+    </tr>
+
+  </tbody>
+</table>
+
+
+<h2 id="significance-model-file-format">Significance model file format</h2>
+
+<p>
+The significance model file is a JSON file with the following format:
+</p>
+<pre>{% highlight json %}
+{
+    "version": 1,
+    "id": "wikipedia",
+    "description": "Some optional description",
+    "languages": {
+        "en": {
+            "description": "Some optional description for English model",
+            "document-count": 1000,
+            "document-frequencies": {
+                "and": 500,
+                "car": 100,
+                ...
+            }
+        },
+        "no": {
+            "description": "Some optional description for Norwegian model",
+            "document-count": 800,
+            "document-frequencies": {
+                "bil": 80,
+                "og": 400,
+                ...
+            }
+        }
+    }
+}
+{% endhighlight %}
+</pre>
+<p>
+Each file contains a map from language code to that language's document count and per-term document frequencies.
+</p>
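Because several model files can cover the same language, a reader of these files has to decide which one applies. The Python sketch below shows one way to apply the documented rule that the last configured model has the highest precedence for a language. It assumes that a later model replaces the whole entry for a language rather than merging individual terms; that assumption, the function name <code>select_language_models</code> and the sample data are illustrative and not taken from Vespa's implementation.

```python
import json


def select_language_models(models):
    """Pick one model entry per language; later models in the list win."""
    selected = {}
    for model in models:  # models are given in configuration order
        for language, language_model in model.get("languages", {}).items():
            # Assumption: the whole language entry is replaced, not merged.
            selected[language] = language_model
    return selected


# Two hypothetical model files in the format shown above.
first = json.loads("""
{"version": 1, "id": "first", "languages": {
    "en": {"document-count": 1000, "document-frequencies": {"car": 100}},
    "no": {"document-count": 800, "document-frequencies": {"bil": 80}}}}
""")
second = json.loads("""
{"version": 1, "id": "second", "languages": {
    "en": {"document-count": 2000, "document-frequencies": {"car": 150}}}}
""")

selected = select_language_models([first, second])
for language, language_model in sorted(selected.items()):
    print(language, language_model["document-count"])
```

Here English comes from the second file because it is listed last, while Norwegian still comes from the first file because no later model covers it.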
+