revised docs
glebashnik committed Oct 4, 2024
1 parent c9edc02 commit 64ef5e8
Showing 2 changed files with 40 additions and 26 deletions.
37 changes: 24 additions & 13 deletions en/operations-selfhosted/vespa-cmdline-tools.html
Expand Up @@ -15,6 +15,14 @@
use-cases we recommend the <a href="../vespa-cli.html">Vespa CLI</a> which
should work against most Vespa applications regardless of how they are deployed.' %}

<p>
You can run these tools in the <a href="https://hub.docker.com/r/vespaengine/vespa/tags">Vespa Docker image</a>:
</p>
<pre>
docker run --entrypoint bash vespaengine/vespa ./opt/vespa/bin/[tool] [args]
</pre>


<!--h2 id="vespa-config-ctl">vespa-config-ctl</h2-->
<!--h2 id="vespa-config-loadtester">vespa-config-loadtester</h2-->
Expand Down Expand Up @@ -1911,12 +1919,18 @@ <h2 id="vespa-set-node-state">vespa-set-node-state</h2>
<!--h2 id="vespa-slobrok-cmd">vespa-slobrok-cmd</h2-->

<h2 id="vespa-significance">vespa-significance</h2>
<p>The <em>vespa-significance</em> cli is a tool that generates a significance model <a href="../reference/significance-reference.html#significance-file-format">file</a>. Its input is a <a href="../reference/document-json-format.html"><em>vespa-feed</em></a> file.
<p>
<em>vespa-significance</em> generates a <a href="../significance.html#significance-model-file">significance model file</a>
from documents in a <a href="../reference/document-json-format.html">vespa-feed file</a>.
</p>
<p>Synopsis: <code>vespa-significance [options]</code></p>
<p>Example</p>
<p>Synopsis: <code>vespa-significance generate [options]</code></p>
<p>Example:</p>
<pre>
$ vespa-significance --in vespa-dump.jsonl --out en_model.json --field text --language EN --doc-type "en"
$ vespa-significance generate --in vespa-dump.jsonl --out en_model.json --field text --language en --doc-type en
</pre>
<p>When running in a container (Docker or Podman), it is useful to mount a folder that contains the vespa-feed documents and receives the generated model file, e.g.:</p>
<pre>
$ podman run -it --entrypoint bash -v $PWD/data:/data -w /data vespaengine/vespa /opt/vespa/bin/vespa-significance generate -i docs.jsonl -o model.zst -f text -l en -d en
</pre>
<table class="table">
<thead>
Expand All @@ -1933,33 +1947,30 @@ <h2 id="vespa-significance">vespa-significance</h2>
<tr>
<th>-i, --in &lt;input file&gt;</th>
<td>
<a href="../reference/document-json-format.html">Vespa-feed</a> file to be used for generating the significance model
<a href="../reference/document-json-format.html">vespa-feed file</a> with documents in JSON or JSONL format.
</td>
</tr><tr>
<th>-o, --out &lt;output file&gt;</th>
<td>
Output file for the significance model, with <a href="../significance.html#significance-file-format">this</a> JSON file format
<a href="../significance.html#significance-model-file">Significance model file</a> in JSON format.
</td>
</tr><tr>
<th> -f, --field &lt;field&gt;</th>
<td>
Name of the text field to be used for significance model
Name of the text field to use for the significance model.
</td>
</tr><tr>
<th> -l, --language &lt;language&gt;</th>
<td>
<p>
Language of the text field, must be a valid language code from the <a href="https://www.rfc-editor.org/rfc/rfc5646">RFC5646</a> standard.
<br >
It is used with
OpenNLP's tokenizer to tokenize the text field based on that language's rules.
</p>
Language of the text field specified as a code, e.g. <code>en</code> for English.<br>
It is used by the OpenNLP tokenizer; see the <a href="../linguistics.html#default-languages">supported languages and their codes</a>.
</td>
</tr><tr>
<th> -d, --doc-type &lt;doc-id&gt;</th>
<td>
<p>Document type identifier for the vespa dump file. <br>
It becomes a part of the id for <a href="../reference/document-json-format.html#put">put</a> operations in the vespa-feed file. <code>&#123; "put": "id::&lt;doc-id&gt;::1" &#125; </code>
It becomes part of the id for <a href="../reference/document-json-format.html#put">put</a> operations in the vespa-feed file. <code>&#123; "put": "id::&lt;doc-id&gt;::1" &#125; </code>
</p>
</td>
</tr>
Expand Down
29 changes: 16 additions & 13 deletions en/significance.md
Expand Up @@ -4,8 +4,7 @@ title: "Significance Model"
---

*Significance* is a measure of how rare a term is in a collection of documents.
Rare terms like "neurotransmitter" get more weight during ranking than common terms like "the".

Rare terms like "neurotransmitter" are weighted higher during ranking than common terms like "the".
Significance is calculated as inverse document frequency (IDF):

$$ IDF(t, N) = \log\left(\frac{N}{n_t}\right) $$
Expand All @@ -17,9 +16,8 @@ where:
Variations of IDF are used in [bm25](reference/bm25.html) and [nativeRank](reference/nativerank.html).

A *significance model* provides the data necessary to calculate IDF, i.e. $$ n_t $$ for each term and $$ N $$ for the document collection.

We distinguish between *local and global* significance models.
Local models are node-specific and a global model is shared across nodes.
A local model is node-specific and a global model is shared across nodes.
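
The IDF formula above can be illustrated with a minimal Python sketch. The function name `idf`, the term counts, and the choice of the natural logarithm are assumptions made for illustration; this is not how Vespa computes significance internally, and bm25 and nativeRank use their own variations.

```python
import math


def idf(document_count: int, document_frequency: int) -> float:
    """Return log(N / n_t), using the natural logarithm (an assumption)."""
    if document_frequency <= 0 or document_frequency > document_count:
        raise ValueError("document_frequency must be in the range 1..document_count")
    return math.log(document_count / document_frequency)


# Hypothetical collection statistics: N documents and n_t per term.
document_count = 1_000_000
document_frequencies = {"the": 900_000, "neurotransmitter": 50}

# The rare term ends up with a higher significance than the common one.
significance = {
    term: idf(document_count, n_t) for term, n_t in document_frequencies.items()
}
```

A term that occurs in every document gets a significance of zero under this definition, which is why very common terms contribute little to ranking.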

# Local significance model

Expand All @@ -46,8 +44,8 @@ A global significance model addresses these issues.

In a *global significance model*, significance values are shared across nodes and don’t change when new documents are added. There are two ways to provide a global model:

1. Add significance values to a query.
2. Specify models in [services.xml](reference/services.html).
1. Include [significance values in a query](#significance-values-in-a-query).
2. Specify [models in services.xml](#significance-models-in-servicesxml).


## Significance values in a query
Expand All @@ -69,7 +67,7 @@ private void setIDF(WordItem item, frequency: long, numDocuments: long) {

## Significance models in services.xml

The `significance` element in [services.xml#significance](reference/services.html) specifies one or more models:
The `significance` element in [services.xml](reference/services-search.html#significance) specifies one or more models:

```xml
<container version="1.0">
Expand All @@ -88,8 +86,8 @@ The `path` should be relative to the package root.
The order in which the models are specified determines the model precedence, with the last model overriding the previous ones.
See [model resolution](#model-resolution).

In addition to adding models in [services.xml](reference/services.html),
the `significance` feature must be enabled in the `rank-profile` section of the schema, e.g.
In addition to adding models in [services.xml](reference/services-search.html#significance),
the `significance` feature must be enabled in the [`rank-profile` section of the schema](reference/schema-reference.html#significance), e.g.

```xml
schema example {
Expand All @@ -110,7 +108,7 @@ schema example {

The model will be applied to all query terms except those that already have significance values from the query.

### Significance model file format
### Significance model file

The significance model file is a JSON file that contains term document frequencies and document count for one or more languages, e.g.

Expand Down Expand Up @@ -142,14 +140,19 @@ The significance model file is a JSON file that contains term document frequenci
}
```

Significance model files can be compressed with [zstandard](https://facebook.github.io/zstd/).

Vespa provides a [CLI tool for generating model files from vespa-feed document files](operations-selfhosted/vespa-cmdline-tools.html#vespa-significance).

### Model resolution

Model resolution selects a model from the models specified in [services.xml](reference/services.html) based on the language of the query.
The language can be either explicitly tagged or implicitly detected.
Model resolution selects a model from the models specified in [services.xml](#significance-models-in-servicesxml) based on the language of the query.
The language can be either [explicitly tagged](reference/query-api-reference.html#model.language) or [implicitly detected](linguistics.html#query-language-detection).

The resolution logic is as follows:
- When language is explicitly tagged
- Select the last specified model with the tagged language. Fail if none are available.
- Select the last model specified in [services.xml](#significance-models-in-servicesxml) that has the tagged language.
Fail if none are available.
- If the language is tagged as “un” (unknown), select the model for “un” first, fall back to “en” (English).
Fail if none are available.
- When language is implicitly detected
Expand Down
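
The explicitly tagged branch of the resolution logic can be sketched as follows. This is an illustrative sketch, not Vespa's implementation: the function `resolve_tagged`, the representation of a model as a `(path, languages)` pair, and the file paths are hypothetical. It also assumes that the "last specified" rule applies within the “un” and “en” candidates, and it does not cover the implicitly detected case.

```python
from typing import List, Set, Tuple

# A model is represented as (path, languages it contains), listed in the
# order the models appear in services.xml. This shape is an assumption.
Model = Tuple[str, Set[str]]


def resolve_tagged(models: List[Model], language: str) -> str:
    """Return the path of the model selected for an explicitly tagged language."""
    # For "un" (unknown), try "un" first and then fall back to "en".
    candidates = ["un", "en"] if language == "un" else [language]
    for candidate in candidates:
        # The last specified model takes precedence, so search in reverse order.
        for path, languages in reversed(models):
            if candidate in languages:
                return path
    raise LookupError(f"no significance model available for language '{language}'")


# Hypothetical models in services.xml order.
models = [
    ("models/first.json", {"en", "no"}),
    ("models/second.json", {"en"}),
]
```

With these hypothetical models, a query tagged `en` resolves to `models/second.json` because it is specified last, `no` resolves to `models/first.json`, `un` falls back to the last model that has `en`, and `de` fails.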
