revised docs

vespa-engine · Oct 4, 2024 · 64ef5e8
1 parent c9edc02
commit 64ef5e8

Showing 2 changed files with 40 additions and 26 deletions.
diff --git a/en/operations-selfhosted/vespa-cmdline-tools.html b/en/operations-selfhosted/vespa-cmdline-tools.html
@@ -15,6 +15,14 @@
 use-cases we recommend the <a href="../vespa-cli.html">Vespa CLI</a> which
 should work against most Vespa applications regardless of how they are deployed.' %}
 
+<p>
+You can run these tools in the <a href="https://hub.docker.com/r/vespaengine/vespa/tags">Vespa Docker image</a>:
+<p>
+<pre>
+docker run --entrypoint bash vespaengine/vespa ./opt/vespa/bin/[tool] [args]
+</pre>
+</p>
+
 
 <!--h2 id="vespa-config-ctl">vespa-config-ctl</h2-->
 <!--h2 id="vespa-config-loadtester">vespa-config-loadtester</h2-->
@@ -1911,12 +1919,18 @@ <h2 id="vespa-set-node-state">vespa-set-node-state</h2>
 <!--h2 id="vespa-slobrok-cmd">vespa-slobrok-cmd</h2-->
 
 <h2 id="vespa-significance">vespa-significance</h2>
-<p>The <em>vespa-signficance</em> cli is a tool that generates a significance model <a href="../reference/significance-reference.html#significance-file-format">file</a>. Its input is a <a href="../reference/document-json-format.html"><em>vespa-feed</em></a> file. 
+<p>
+    <em>vespa-significance</em> generates a <a href="../significance.html#significance-model-file">significance model file</a>
+    from documents in a <a href="../reference/document-json-format.html">vespa-feed file</a>.
 </p>
-<p>Synopsis: <code>vespa-significance [options]</code></p>
-<p>Example</p>
+<p>Synopsis: <code>vespa-significance generate [options]</code></p>
+<p>Example:</p>
 <pre>
-$ vespa-significance --in vespa-dump.jsonl --out en_model.json --field text --language EN --doc-type "en"
+$ vespa-significance generate --in vespa-dump.jsonl --out en_model.json --field text --language en --doc-type en
+</pre>
+<p>When running in a container (Docker or Podman), it is useful to mount a folder that contains the vespa-feed documents and receives the generated model file, e.g.:</p>
+<pre>
+$ podman run -it --entrypoint bash -v $PWD/data:/data -w /data vespaengine/vespa /opt/vespa/bin/vespa-significance generate -i docs.jsonl -o model.zst -f text -l en -d en
 </pre>
 <table class="table">
   <thead>
@@ -1933,33 +1947,30 @@ <h2 id="vespa-significance">vespa-significance</h2>
     <tr>
       <th>-i, --in &lt;input file&gt;</th>
       <td>
-        <a href="../reference/document-json-format.html">Vespa-feed</a>  file to be used for generating the significance model
+        <a href="../reference/document-json-format.html">vespa-feed file</a> with documents in JSON or JSONL format.
       </td>
     </tr><tr>
       <th>-o, --out &lt;output file&gt;</th>
       <td>
-        Output file for the significance model, with <a href="../significance.html#significance-file-format">this</a> JSON file format
+        <a href="../significance.html#significance-model-file">Significance model file</a> in JSON format.
       </td>
     </tr><tr>
       <th> -f, --field &lt;field&gt;</th>
       <td>
-        Name of the text field to be used for significance model 
+        Name of the text field to use for the significance model.
       </td>
     </tr><tr>
       <th> -l, --language &lt;language&gt;</th>
       <td>
         <p>
-          Language of the text field, must be a valid language code from the <a href="https://www.rfc-editor.org/rfc/rfc5646">RFC5646</a> standard. 
-        <br >
-          It is used with
-          OpenNLP's tokenizer to tokenize the text field based on that language's rules.
-        </p>
+          Language of the text field, specified as a language code, e.g. <code>en</code> for English.<br>
+          It is used by the OpenNLP tokenizer; see the supported languages and their codes <a href="../linguistics.html#default-languages">here</a>.
       </td>
     </tr><tr>
       <th> -d, --doc-type &lt;doc-id&gt;</th>
       <td>
         <p>Document type identifier for the vespa dump file. <br>
-          It becomes a part of the id for <a href="../reference/document-json-format.html#put">put</a> operations in the vespa-feed file. <code>&#123; "put": "id::&lt;doc-id&gt;::1" &#125; </code>
+          It becomes part of the id for <a href="../reference/document-json-format.html#put">put</a> operations in the vespa-feed file. <code>&#123; "put": "id::&lt;doc-id&gt;::1" &#125; </code>
         </p>
       </td>
     </tr>
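
As a rough illustration of the kind of data `vespa-significance generate` collects, the sketch below counts per-term document frequencies and the document count from a JSONL feed, then derives IDF with the formula given in significance.md. It is not the tool's implementation: the whitespace tokenizer, the function names `document_frequencies` and `idf`, and the sample document ids are assumptions made for this example (the actual tool tokenizes with OpenNLP).

```python
import json
import math
from collections import Counter


def document_frequencies(jsonl_lines, field):
    """Return (document_count, Counter mapping term -> number of documents containing it)."""
    doc_count = 0
    frequencies = Counter()
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue
        operation = json.loads(line)
        text = operation.get("fields", {}).get(field)
        if not isinstance(text, str):
            continue  # skip operations that lack the chosen text field
        doc_count += 1
        # Naive tokenizer (assumption): lowercase and split on whitespace.
        # Using a set makes each term count at most once per document.
        frequencies.update(set(text.lower().split()))
    return doc_count, frequencies


def idf(doc_count, term_frequency):
    """IDF(t, N) = log(N / n_t), matching the formula in significance.md."""
    return math.log(doc_count / term_frequency)


# Hypothetical two-document feed, one put operation per line.
feed = [
    '{"put": "id:ns:en::1", "fields": {"text": "the neurotransmitter"}}',
    '{"put": "id:ns:en::2", "fields": {"text": "the cat"}}',
]
n, df = document_frequencies(feed, "text")
print(n, df["the"], df["neurotransmitter"])
print(idf(n, df["the"]), idf(n, df["neurotransmitter"]))
```

In this toy feed "the" occurs in every document, so its IDF is lower than that of "neurotransmitter", which occurs in only one.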

diff --git a/en/significance.md b/en/significance.md
@@ -4,8 +4,7 @@ title: "Significance Model"
 ---
 
 *Significance* is a measure of how rare a term is in a collection of documents.
-Rare terms like "neurotransmitter" get more weight during ranking than common terms like "the".
-
+Rare terms like "neurotransmitter" are weighted higher during ranking than common terms like "the".
 Significance is calculated as inverse document frequency (IDF):
 
 $$ IDF(t, N) = log(\frac{N}{n_t}) $$
@@ -17,9 +16,8 @@ where:
 Variations of IDF are used in [bm25](reference/bm25.html) and [nativeRank](reference/nativerank.html).
 
 *Significance model* provides the data necessary to calculate IDF, i.e. $$ n_t $$ for each term and $$ N $$ for the document collection.
-
 We distinguish between *local and global* significance models.
-Local models are node-specific and a global model is shared across nodes.
+A local model is node-specific and a global model is shared across nodes.
 
 # Local significance model
 
@@ -46,8 +44,8 @@ A global significance model addresses these issues.
 
 In a *global significance model*, significance values are shared across nodes and don’t change when new documents are added. There are two ways to provide a global model:
 
-1. Add significance values to a query.
-2. Specify models in [services.xml](reference/services.html).
+1. Include [significance values in a query](#significance-values-in-a-query).
+2. Specify [models in services.xml](#significance-models-in-servicesxml).
 
 
 ## Significance values in a query
@@ -69,7 +67,7 @@ private void setIDF(WordItem item, frequency: long, numDocuments: long) {
 
 ## Significance models in services.xml
 
-The `significance` element in [services.xml#significance](reference/services.html) specifies one or more models:
+The `significance` element in [services.xml](reference/services-search.html#significance) specifies one or more models:
 
 ```xml
 <container version="1.0">
@@ -88,8 +86,8 @@ The `path` should be relative to the package root.
 The order in which the models are specified determines the model precedence, with the last model overriding the previous ones.
 See [model resolution](#model-resolution).
 
-In addition to adding models in [services.xml](reference/services.html),
-the `significance` feature must be enabled in the `rank-profile` section of the schema, e.g.
+In addition to adding models in [services.xml](reference/services-search.html#significance),
+the `significance` feature must be enabled in the [`rank-profile` section of the schema](reference/schema-reference.html#significance), e.g.
 
 ```xml
 schema example {
@@ -110,7 +108,7 @@ schema example {
 
 The model will be applied to all query terms except those that already have significance values from the query.
 
-### Significance model file format
+### Significance model file
 
 The significance model file is a JSON file that contains term document frequencies and document count for one or more languages, e.g.
 
@@ -142,14 +140,19 @@ The significance model file is a JSON file that contains term document frequenci
 }
 ```
 
+Significance model files can be compressed with <a href="https://facebook.github.io/zstd/">zstandard</a>.
+
+Vespa provides a <a href="operations-selfhosted/vespa-cmdline-tools.html#vespa-significance">CLI tool for generating model files from vespa-feed document files</a>.
+
 ### Model resolution
 
-Model resolution selects a model from the models specified in [services.xml](reference/services.html) based on the language of the query.
-The language can be either explicitly tagged or implicitly detected.
+Model resolution selects a model from the models specified in [services.xml](#significance-models-in-servicesxml) based on the language of the query.
+The language can be either [explicitly tagged](reference/query-api-reference.html#model.language) or [implicitly detected](linguistics.html#query-language-detection).
 
 The resolution logic is as follows:
 - When language is explicitly tagged
-  - Select the last specified model with the tagged language. Fail if none are available.
+  - Select the last model specified in [services.xml](#significance-models-in-servicesxml) that has the tagged language.
+    Fail if none are available.
   - If the language is tagged as “un” (unknown), select the model for “un” first, fall back to “en” (English).
     Fail if none are available.
 - When language is implicitly detected