diff --git a/_data/sidebar.yml b/_data/sidebar.yml index 2d677c8bd3..f45f4f6241 100644 --- a/_data/sidebar.yml +++ b/_data/sidebar.yml @@ -125,10 +125,12 @@ docs: url: /en/xgboost.html - page: Ranking With LightGBM Models url: /en/lightgbm.html - - page: Stateless model evaluation + - page: Stateless Model Evaluation url: /en/stateless-model-evaluation.html - page: Ranking With BM25 url: /en/reference/bm25.html + - page: Significance Model + url: /en/significance.html - page: Ranking With nativeRank url: /en/nativerank.html - page: Accelerated OR search using the WAND algorithm diff --git a/en/operations-selfhosted/vespa-cmdline-tools.html b/en/operations-selfhosted/vespa-cmdline-tools.html index f7099f87df..469bee9f1c 100644 --- a/en/operations-selfhosted/vespa-cmdline-tools.html +++ b/en/operations-selfhosted/vespa-cmdline-tools.html @@ -15,6 +15,14 @@ use-cases we recommend the Vespa CLI which should work against most Vespa applications regardless of how they are deployed.' %} +

+You can run these tools in the Vespa Docker image: +

+

+docker run --entrypoint bash vespaengine/vespa /opt/vespa/bin/[tool] [args]
+
+

+ @@ -1908,6 +1916,69 @@

vespa-set-node-state

+

vespa-significance

+

+ Generates a significance model file from Vespa documents. + Available in Vespa as of version 8.426.8. +

+

+ The generated model uses the same tokenizer as the default query processor, see linguistics in Vespa for details. + When using a custom tokenizer, the model generator needs to be modified accordingly. + Tokens are converted to lower-case without stemming. + This corresponds to how the model is applied to query terms. +

+

Synopsis: vespa-significance generate [options]

+

Example:

+
+$ vespa-significance generate --in vespa-dump.jsonl --out en_model.json --field text --language en
+
+

When running in Docker, it is useful to mount a folder that holds the Vespa feed documents and receives the generated model file, e.g.:

+
+$ podman run -it --entrypoint bash -v $PWD/data:/data -w /data vespaengine/vespa:latest /opt/vespa/bin/vespa-significance generate --in docs.jsonl --out model.zst --field text --language en
+
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
OptionDescription
-h, --helpHelp text
-i, --in <input file> + JSON Lines (JSONL) file where each line is a Vespa document in JSON format. +
-o, --out <output file> + Significance model file in JSON format. +
-f, --field <field> + Name of the text field to use for significance model. +
-l, --language <language> + Language of the text field specified as a code, e.g. en for English.
+ It is used by the OpenNLP tokenizer; see supported languages with codes here. +
--zst <compression> + If set to true, compresses the output file with zstandard. + Default false. +

vespa-start-configserver

diff --git a/en/reference/query-api-reference.html b/en/reference/query-api-reference.html index 1c97f46c2c..5f4d762306 100644 --- a/en/reference/query-api-reference.html +++ b/en/reference/query-api-reference.html @@ -83,6 +83,7 @@

Parameters

  • ranking.globalPhase.rerankCount
  • ranking.matching
  • ranking.matchPhase
  • +
  • ranking.significance.useModel
  • @@ -697,6 +698,18 @@

    Ranking

    + + ranking.significance.useModel + + Boolean + false + +

    + Enables or disables the use of significance models specified in services.xml. + Overrides use-model set in the rank profile. +

    + + ranking.freshness diff --git a/en/reference/schema-reference.html b/en/reference/schema-reference.html index 6e14b66c50..323ae5e919 100644 --- a/en/reference/schema-reference.html +++ b/en/reference/schema-reference.html @@ -132,6 +132,7 @@

    Elements

    inputs constants onnx-model + significance rank-properties match-features mutate @@ -1493,6 +1494,10 @@

    rank-profile

    Zero or many An onnx model to make available in this profile. +significance + Zero or one + Enables the use of significance models defined in services.xml. + rank-properties Zero or one List of any rank property key-values to be used by rank features. @@ -2484,6 +2489,37 @@

    onnx-model

    For more details including examples, see ranking with ONNX models.

    +

    significance

    +

    +Contained in rank-profile. +Configures a significance model. +

    +significance {
    +    use-model: true
    +}
    +
    +

    + +

    +The body must contain: + + + + + + + + + + + + + +
    nameoccurrencedescription
    use-modelOneEnable or disable the use of significance models specified in services.xml.
    +

    +

    + For more details see Significance Model. +

    document-summary

    diff --git a/en/reference/services-container.html b/en/reference/services-container.html index ef15995690..0ff1a96338 100644 --- a/en/reference/services-container.html +++ b/en/reference/services-container.html @@ -27,6 +27,7 @@ chain renderer threadpool + significance document-processing include [dir] documentprocessor diff --git a/en/reference/services-search.html b/en/reference/services-search.html index 321808a402..b187493bd3 100644 --- a/en/reference/services-search.html +++ b/en/reference/services-search.html @@ -33,6 +33,7 @@ source [id] searcher [id, class, bundle, provides, before, after] renderer [id, class, bundle] + significance threadpool threads [ boost ] queue @@ -328,7 +329,62 @@

    renderer

    bundle="the name in <artifactId> in pom.xml" /> +

    significance

    +

    +Contained in search. +Specifies one or more global significance models. +

    + +
    +<significance>
    +    <model model-id="significance-en-wikipedia-v1"/>
    +    <model url="https://some/uri/my-model.model.multilingual.json"/>
    +    <model path="models/my-model.no.json.zst"/>
    +</significance>
    +
    + +

    +The models are either provided by Vespa or generated with the vespa-significance tool. +The order determines model precedence, with the last element having the highest priority. +To use these models, the schema must enable significance models in the rank-profile. +

    + +

    +Sub-elements: +

    +

    + +

    model

    +

    +Contained in significance. +Specifies a global significance model. +Models are identified by model-id, by url, or by path to a model file in the application package. +

    +

    +Models with model-id are provided by Vespa and listed here. +Example with model-id: +

    +<model model-id="significance-en-wikipedia-v1"/>
    +
    +

    +

    +Models specified with url or path are JSON files, which can also be compressed with zstandard. +Model files can be generated using the vespa-significance tool. +Example with url: +

    +<model url="https://some/uri/mymodel.multilingual.json"/>
    +
    + +Models with path should be placed in the application package. +The path is relative to the application package root. +Example with path: +
    +<model path="models/mymodel.no.json.zst"/>
    +
    +

    chain

    @@ -339,7 +395,7 @@

    chain

    Note that provider and source elements are also chains. Specify a search chain in a query using searchChain.

    -

    Example which inherits from the built in vespa chain so that +

    Example which inherits from the built in vespa chain so that the searcher can dispatch queries to the content clusters:

     <chain id="common" inherits="vespa">
    diff --git a/en/significance.md b/en/significance.md
    new file mode 100644
    index 0000000000..b0b349d7fa
    --- /dev/null
    +++ b/en/significance.md
    @@ -0,0 +1,177 @@
    +---
    +# Copyright Vespa.ai. Licensed under the terms of the Apache 2.0 license. See LICENSE in the project root.
    +title: "Significance Model"
    +---
    +
    +*Significance* is a measure of how rare a term is in a collection of documents.
    +Rare terms like "neurotransmitter" are weighted higher during ranking than common terms like "the".
    +Significance is often calculated as the inverse document frequency (IDF):
    +
    +$$ IDF(t, N) = \log\left(\frac{N}{n_t}\right) $$
    +
    +where:
    +- $$ N $$ is the total number of documents in the collection
    +- $$ n_t $$ is the number of documents containing the term $$ t $$
    +
    +Variations of IDF are used in [bm25](reference/bm25.html) and [nativeRank](reference/nativerank.html).
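    +
    +As a worked illustration of the formula above, assume the natural logarithm (the base is not specified here)
    +and a hypothetical collection where a term occurs in 13 of 101 documents:
    +
    +$$ IDF(t, N) = \log\left(\frac{101}{13}\right) \approx 2.05 $$
    +
    +A term found in half of the documents would instead get $$ \log(2) \approx 0.69 $$, so the rarer term is weighted higher.
    +The values actually used in ranking may differ, since bm25 and nativeRank apply their own variations of IDF.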
    +
    +A *significance model* provides the data necessary to calculate IDF, i.e. $$ n_t $$ for each term and $$ N $$ for the document collection.
    +We distinguish between *local and global* significance models.
    +A local model is node-specific and a global model is shared across nodes.
    +
    +# Local significance model
    +
    +For `string` fields indexed with [bm25](reference/bm25.html) or [nativeRank](reference/nativerank.html),
    +Vespa creates a local significance model on each content node.
    +Each node uses its own local model for the queries it processes.
    +
    +Different nodes can have different significance values for the same term.
    +In large collections, this difference is usually small and doesn’t affect ranking quality.
    +
    +One issue with the local models is that ranking is non-deterministic in the following cases:
    +1. When new documents are added, local models on affected content nodes are updated.
    +2. When the content cluster [redistributes documents](elasticity.html) across nodes, e.g. when nodes are added or removed for scaling or failure recovery, the models change on the nodes involved.
    +3. When using [grouped distribution](elasticity.html#grouped-distribution),
    +queries can return different results depending on which group processes them.
    +
    +Another issue is that local significance models are not available in [streaming search](streaming-search.html)
    +because inverted indexes are not constructed, so IDF values can't be extracted.
    +All significance values are set to 1, which is the default value for unknown terms.
    +The lack of significance values may degrade the ranking quality.
    +
    +A global significance model addresses these issues.
    +
    +# Global significance model
    +
    +In a *global significance model*, significance values are shared across nodes and don’t change when new documents are added. There are three ways to provide a global model:
    +
    +1. Include [significance values in a query](#significance-values-in-a-query).
    +2. Set [significance values in a searcher](#significance-values-in-a-searcher).
    +3. Specify [models in services.xml](#significance-models-in-servicesxml).
    +
    +## Significance values in a query
    +
    +Document frequency and document count can be specified in YQL, e.g.:
    +```sql
    +select * from example where content contains ({documentFrequency: {frequency: 13, count: 101}}"colors")
    +```
    +
    +Alternatively, significance values can be specified in YQL directly and used instead of computed IDF values, e.g.:
    +```sql
    +select * from example where content contains ({significance:0.9}"neurotransmitter")
    +```
    +
    +## Significance values in a searcher
    +
    +Document frequency and significance values can also be set in a [custom searcher](https://docs.vespa.ai/en/searcher-development.html#writing-a-searcher):
    +
    +```java
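    +// Sets how many documents contain the term (frequency) out of numDocuments in total.
    +private void setDocumentFrequency(WordItem item, long frequency, long numDocuments) {
    +    // The setter is called on the query item itself; getWord() only returns the term string.
    +    item.setDocumentFrequency(new DocumentFrequency(frequency, numDocuments));
    +}
    +
    +// Sets an explicit significance value, used instead of a computed IDF value.
    +private void setSignificance(WordItem item, float significance) {
    +    item.setSignificance(significance);
    +}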
    +```
    +
    +
    +## Significance models in services.xml
    +
    +The [`significance` element in services.xml](reference/services-search.html#significance) specifies one or more models:
    +
    +```xml
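    +<container version="1.0">
    +    <search>
    +        <significance>
    +            <model model-id="significance-en-wikipedia-v1"/>
    +            <model url="https://some/uri/my-model.model.multilingual.json"/>
    +            <model path="models/my-model.no.json.zst"/>
    +        </significance>
    +    </search>
    +</container>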
    +```
    +
    +Vespa Cloud users have access to [pre-built models](https://cloud.vespa.ai/en/model-hub#significance-models), identified by `model-id`.
    +In addition, all users can specify their own models by providing a `url` to an external resource or a `path` to a model file within the application package.
    +Vespa provides a [command line tool](operations-selfhosted/vespa-cmdline-tools.html#vespa-significance) to generate [model files](#significance-model-file) from documents.
    +The order in which the models are specified determines the model precedence, see [model resolution](#model-resolution) for details.
    +
    +In addition to adding models in [services.xml](reference/services-search.html#significance),
    +the `significance` feature must be enabled in the [`rank-profile` section of the schema](reference/schema-reference.html#significance), e.g.
    +
    +```
    +schema example {
    +    document example {
    +        field content type string {
    +            indexing: index | summary
    +            index: enable-bm25
    +        }
    +    }
    +
    +    rank-profile default {
    +        significance {
    +            use-model: true
    +        }
    +    }
    +}
    +```
    +
    +The model will be applied to all query terms except those that already have significance values from the query.
    +
    +Specifying significance models in services.xml is available in Vespa as of version 8.426.8.
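    +
    +As a sketch of how the rank profile setting can be overridden for a single request, the JSON body below could be POSTed to the query API,
    +assuming the `example` schema above. It uses the `ranking.significance.useModel` parameter described in the
    +[query API reference](reference/query-api-reference.html) and tags the query language explicitly; the query term is illustrative only:
    +
    +```json
    +{
    +  "yql": "select * from example where content contains \"colors\"",
    +  "model.language": "en",
    +  "ranking.significance.useModel": true
    +}
    +```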
    +
    +### Significance model file
    +
    +The significance model file is a JSON file that contains term document frequencies and document count for one or more languages, e.g.
    +
    +```json
    +{
    +  "version": 1,
    +  "id": "wikipedia",
    +  "description": "Some optional description",
    +  "languages": {
    +    "en": {
    +      "description": "Some optional description for English model",
    +      "document-count": 1000,
    +      "document-frequencies": {
    +        "and": 500,
    +        "car": 100,
    +        ...
    +      }
    +    },
    +    "no": {
    +      "description": "Some optional description for Norwegian model",
    +      "document-count": 800,
    +      "document-frequencies": {
    +        "bil": 80,
    +        "og": 400,
    +        ...
    +      }
    +    }
    +  }
    +}
    +```
    +
    +A significance model file can be compressed with [zstandard](https://facebook.github.io/zstd/)
    +when included in the application package or made available via a URL.
    +
    +Vespa provides a [CLI tool for generating model files](operations-selfhosted/vespa-cmdline-tools.html#vespa-significance) from Vespa documents.
    +It uses the same linguistics module as query processing to extract tokens and their document frequencies.
    +
    +### Model resolution
    +
    +Model resolution selects a model from the models specified in [services.xml](#significance-models-in-servicesxml) based on the language of the query.
    +The language can be either [explicitly tagged](reference/query-api-reference.html#model.language) or [implicitly detected](linguistics.html#query-language-detection).
    +
    +The resolution logic is as follows:
    +- When language is explicitly tagged
    +  - Select the last specified model that has the tagged language.
    +    Fail if none are available.
    +  - If the language is tagged as “un” (unknown), select the model for “un” first, then fall back to “en” (English).
    +    Fail if none are available.
    +- When language is implicitly detected
    +  - Select the last specified model with the detected language. If not available, try “un” and then “en” languages.
    +    Fail if none are available.
    diff --git a/en/streaming-search.html b/en/streaming-search.html
    index c86391b622..2ff7e13ada 100644
    --- a/en/streaming-search.html
    +++ b/en/streaming-search.html
    @@ -48,9 +48,9 @@ 
         
         
  • Since there are no indexes, the content nodes do not collect term statistics and average field length statistics.