add significance documentation
MariusArhaug committed Jun 18, 2024
1 parent b0a1e72 commit 734d55f
Showing 7 changed files with 408 additions and 0 deletions.
4 changes: 4 additions & 0 deletions _data/sidebar.yml
@@ -129,6 +129,8 @@ docs:
url: /en/stateless-model-evaluation.html
- page: Ranking With BM25
url: /en/reference/bm25.html
- page: Ranking With Significance Model
url: /en/reference/significance.html
- page: Ranking With nativeRank
url: /en/nativerank.html
- page: Accelerated OR search using the WAND algorithm
@@ -410,6 +412,8 @@ docs:
url: /en/reference/stateless-model-reference.html
- page: Embedding Model Reference
url: /en/reference/embedding-reference.html
- page: Significance Model Reference
url: /en/reference/significance-reference.html

- title: Queries and results reference
documents:
50 changes: 50 additions & 0 deletions en/operations-selfhosted/vespa-cmdline-tools.html
@@ -1910,6 +1910,56 @@ <h2 id="vespa-set-node-state">vespa-set-node-state</h2>

<!--h2 id="vespa-slobrok-cmd">vespa-slobrok-cmd</h2-->

<h2 id="vespa-significance">vespa-significance</h2>
<p><code>vespa-significance</code> is a tool that generates a significance model file in the <a href="../reference/significance-reference.html">significance model file format</a>. Its input is a <code>vespa-feed</code> file.
</p>
<p>Synopsis: <code>vespa-significance [options]</code></p>
<p>Example</p>
<pre>
$ vespa-significance --in vespa-dump.jsonl --out en_model.json --field text --language EN --doc-type "en"
</pre>
<table class="table">
<thead>
<tr>
<th>Option</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<th>-h, --help</th>
<td>
Help text
</td>
</tr>
<tr>
<th>-i, --in &lt;input file&gt;</th>
<td>
Vespa dump file to be used for generating the significance model
</td>
</tr><tr>
<th>-o, --out &lt;output file&gt;</th>
<td>
Output file for the significance model
</td>
</tr><tr>
<th>-f, --field &lt;field&gt;</th>
<td>
Name of the text field to be used for tokenization
</td>
</tr><tr>
<th>-l, --language &lt;language&gt;</th>
<td>
Language of the text field. Must be a valid language tag as defined by <a href="https://www.rfc-editor.org/rfc/rfc5646">RFC 5646</a>.
</td>
</tr><tr>
<th>-d, --doc-type &lt;doc-type&gt;</th>
<td>
Document type identifier for the dump file
</td>
</tr>
</tbody>
</table>


<h2 id="vespa-start-configserver">vespa-start-configserver</h2>
14 changes: 14 additions & 0 deletions en/reference/schema-reference.html
@@ -132,6 +132,7 @@ <h2 id="elements">Elements</h2>
<a href="#inputs">inputs</a>
<a href="#constants">constants</a>
<a href="#onnx-model">onnx-model</a>
<a href="#significance">significance</a>
<a href="#rank-properties">rank-properties</a>
<a href="#match-features">match-features</a>
<a href="#mutate">mutate</a>
@@ -1490,6 +1491,10 @@ <h2 id="rank-profile">rank-profile</h2>
<td>Zero or many</td>
<td>An onnx model to make available in this profile.</td>
</tr>
<tr><td><a href="#significance">significance</a></td>
<td>Zero or one</td>
<td>Enables the use of significance models defined in <em>services.xml</em>.</td>
</tr>
<tr><td><a href="#rank-properties">rank-properties</a></td>
<td>Zero or one</td>
<td>List of any rank property key-values to be used by rank features.</td>
@@ -2481,6 +2486,15 @@ <h2 id="onnx-model">onnx-model</h2>
</table>
<p>For more details including examples, see <a href="../onnx.html">ranking with ONNX models.</a></p>

<h2 id="significance">significance</h2>
<p>
Contained in <a href="#rank-profile">rank-profile</a>.
Enables significance models for the rank profile it is defined in.
The <code>use-model</code> value is <code>true</code> or <code>false</code>; the default is <code>false</code>.
When set to <code>true</code>, Vespa calculates term significance from the significance models configured in <em>services.xml</em>.
</p>
<pre>
significance {
    use-model: true
}
</pre>


<h2 id="document-summary">document-summary</h2>
1 change: 1 addition & 0 deletions en/reference/services-container.html
@@ -29,6 +29,7 @@
<a href="services-search.html#provider">provider</a>
<a href="services-processing.html#chain">chain</a>
<a href="services-search.html#renderer">renderer</a>
<a href="services-search.html#significance">significance</a>
<a href="services-docproc.html">document-processing</a>
<a href="#include">include [dir]</a>
<a href="services-docproc.html#documentprocessor">documentprocessor</a>
15 changes: 15 additions & 0 deletions en/reference/services-search.html
@@ -33,6 +33,7 @@
<a href="#source">source [id]</a>
<a href="#searcher">searcher [id, class, bundle, provides, before, after]</a>
<a href="#renderer">renderer [id, class, bundle]</a>
<a href="#significance">significance</a>
<a href="#threadpool">threadpool</a>
<a href="#threadpool-threads">threads [ boost ]</a>
<a href="#threadpool-queue">queue</a>
@@ -328,6 +329,20 @@ <h2 id="renderer">renderer</h2>
bundle="the name in &lt;artifactId&gt; in pom.xml" /&gt;
</pre>

<h2 id="significance">significance</h2>
<p>
The <code>significance</code> element can contain multiple models. Their order determines model precedence for a given language, with the last element having the highest precedence. The document frequencies in the models are used to set a token's significance value, based on its inverse document frequency (IDF). This is handled by the <a href="../search-definitions.html#rank-profile">Significance Searcher</a><!-- TODO: fix href -->. To enable these models, a rank-profile in the schema must contain the <em>significance</em> element with <em>use-model</em> set to <em>true</em>.
</p>
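<p>
The precedence rule can be illustrated with a short Python sketch. This is not Vespa's implementation: the function name, the in-memory model layout, and the language coverage of each model are assumptions made for the example.
</p>

```python
from typing import Optional


def select_model(models: list[dict], language: str) -> Optional[dict]:
    """Return the last configured model that covers `language`, or None.

    `models` is in configuration order, so scanning it in reverse
    gives the last element the highest precedence.
    """
    for model in reversed(models):
        if language in model["languages"]:
            return model
    return None


# Hypothetical models, listed in the order they would appear in services.xml.
models = [
    {"id": "wikimedia", "languages": {"en": {}, "no": {}}},
    {"id": "bibel-multilingual", "languages": {"en": {}}},
    {"id": "reddit-norge", "languages": {"no": {}}},
]

english = select_model(models, "en")
norwegian = select_model(models, "no")
german = select_model(models, "de")
```

<p>
With this hypothetical coverage, English resolves to the second model and Norwegian to the third, because each is the last model in the list that contains the language.
</p>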

<p>Example of a significance configuration with multiple models. The models are either provided by <em>Vespa</em> or generated with the <a href="../operations-selfhosted/vespa-cmdline-tools.html#vespa-significance">vespa-significance</a> CLI.</p>
<pre data-test="file" data-path="my-app/src/main/application/services.xml">
&lt;significance&gt;
    &lt;model model-id="wikimedia"/&gt;
    &lt;model url="https://some/uri/bibel-multilingual.json" /&gt;
    &lt;model path="models/reddit-norge.no.json.zst" /&gt;
&lt;/significance&gt;
</pre>


<h2 id="chain">chain</h2>
107 changes: 107 additions & 0 deletions en/reference/significance-reference.html
@@ -0,0 +1,107 @@
---
# Copyright Vespa.ai. Licensed under the terms of the Apache 2.0 license. See LICENSE in the project root.
title: "Significance Reference"
redirect_from:
- /documentation/reference/significance-reference.html
---

<p>Reference configuration for <a href="significance.html">significance models</a>.</p>

<h2 id="model-config-reference">Model config reference</h2>
<p>
A significance model uses the <a href="config-files.html#model">model</a> type configuration.
When several models are configured, their order determines the precedence for a given language, with the last element having the highest precedence.
The <em>model</em> type configuration accepts the attributes <code>model-id</code>, <code>url</code>, and <code>path</code>.
More than one of these can be specified in a single config value, in which case they are resolved as follows:
</p>
<ul>
<li>If a <code>model-id</code> is specified and the application is deployed on Vespa Cloud, the <code>model-id</code> is used.</li>
<li>Otherwise, if a <code>url</code> is specified, it is used.</li>
<li>Otherwise, <code>path</code> is used.</li>
</ul>
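<p>
The resolution order above can be written out as a small Python sketch. It only illustrates the three rules in the list; the function name and the <code>on_vespa_cloud</code> flag are hypothetical and not part of any Vespa API.
</p>

```python
from typing import Optional


def resolve_model_source(
    model_id: Optional[str] = None,
    url: Optional[str] = None,
    path: Optional[str] = None,
    on_vespa_cloud: bool = False,
) -> tuple[str, Optional[str]]:
    """Return (attribute name, value) for the attribute that is used."""
    # Rule 1: model-id is used only when deployed on Vespa Cloud.
    if model_id is not None and on_vespa_cloud:
        return ("model-id", model_id)
    # Rule 2: otherwise a url is used when one is given.
    if url is not None:
        return ("url", url)
    # Rule 3: otherwise fall back to path.
    return ("path", path)


cloud_choice = resolve_model_source(
    model_id="wikimedia",
    url="https://some/uri/bibel-multilingual.json",
    on_vespa_cloud=True,
)
self_hosted_choice = resolve_model_source(
    model_id="wikimedia",
    url="https://some/uri/bibel-multilingual.json",
    path="models/reddit-norge.no.json.zst",
)
```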
<p>
When using <code>path</code>, the model files must be supplied in the
Vespa <a href="../application-packages.html#deploying-remote-models">application package</a>.
</p>

<h2 id="significance">Significance</h2>
<p>
A significance component consists of one or more significance models, covering one or more languages. It uses the document frequencies in these models to calculate the inverse document frequency (IDF) of terms in a query.
</p>
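<p>
As a rough illustration of how a document count and a document frequency combine into an IDF value, the Python sketch below uses the textbook formulation log(N / df). This is an assumption chosen for the example; the exact formula Vespa applies is not stated here and may differ.
</p>

```python
import math


def inverse_document_frequency(document_count: int, document_frequency: int) -> float:
    """Textbook IDF, log(N / df). Assumed formula, not necessarily Vespa's."""
    if not 0 < document_frequency <= document_count:
        raise ValueError("document_frequency must be in the range 1..document_count")
    return math.log(document_count / document_frequency)


# A term found in 500 of 1000 documents is common, so it scores low.
idf_common = inverse_document_frequency(1000, 500)
# A term found in 100 of 1000 documents is rarer, so it scores higher.
idf_rare = inverse_document_frequency(1000, 100)
```

<p>
The rarer term receives the larger value, which is the property that makes IDF useful as a measure of term significance.
</p>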
<p>
The significance component is configured in <a href="services.html">services.xml</a>, with the <code>significance</code> tag:
</p>

<pre>{% highlight xml %}
<container version="1.0">
    <search>
        <significance>
            <model model-id="wikimedia"/>
            <model url="https://some/uri/bibel-multilingual.json" />
            <model path="models/reddit-norge.no.json.zst" />
        </significance>
    </search>
</container>
{% endhighlight %}</pre>

<h3 id="significance-reference-config">Significance reference config</h3>
<table class="table">
<thead>
<tr>
<th>Name</th>
<th>Occurrence</th>
<th>Description</th>
<th>Type</th>
<th>Default</th>
</tr>
</thead>
<tbody>
<tr>
<td>model</td>
<td>One to many</td>
<td>Points to a significance model file.</td>
<td><a href="#model-config-reference">model-type</a></td>
<td>N/A</td>
</tr>

</tbody>
</table>


<h2 id="significance-model-file-format">Significance model file format</h2>

<p>
The significance model file is a JSON file with the following format:
</p>
<pre>{% highlight json %}
{
    "version": 1,
    "id": "wikipedia",
    "description": "Some optional description",
    "languages": {
        "en": {
            "description": "Some optional description for English model",
            "document-count": 1000,
            "document-frequencies": {
                "and": 500,
                "car": 100,
                ...
            }
        },
        "no": {
            "description": "Some optional description for Norwegian model",
            "document-count": 800,
            "document-frequencies": {
                "bil": 80,
                "og": 400,
                ...
            }
        }
    }
}
{% endhighlight %}</pre>
<p>
Each file maps language codes to a language model, which holds a <code>document-count</code> and the <code>document-frequencies</code> of its terms.
</p>
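<p>
To make the structure concrete, the following Python sketch builds a file of this shape from a toy corpus. It is only an illustration: the whitespace tokenizer, the <code>toy-example</code> id, and the sample sentences are assumptions, and the <code>vespa-significance</code> tool may tokenize text differently.
</p>

```python
import json
from collections import Counter


def build_language_model(documents: list[str]) -> dict:
    """Count, for each term, the number of documents that contain it."""
    frequencies: Counter = Counter()
    for text in documents:
        # set() makes a term count once per document, however often it occurs.
        frequencies.update(set(text.lower().split()))
    return {
        "document-count": len(documents),
        "document-frequencies": dict(sorted(frequencies.items())),
    }


# Hypothetical corpus of three English documents.
corpus = ["The car and the bike", "A car", "Bread and butter"]

model = {
    "version": 1,
    "id": "toy-example",
    "languages": {"en": build_language_model(corpus)},
}

print(json.dumps(model, indent=2))
```

<p>
Note that a document frequency counts documents, not occurrences: "the" appears twice in the first sentence but contributes 1 to its frequency.
</p>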
