diff --git a/_data/sidebar.yml b/_data/sidebar.yml
index 2d677c8bd3..f45f4f6241 100644
--- a/_data/sidebar.yml
+++ b/_data/sidebar.yml
@@ -125,10 +125,12 @@ docs:
         url: /en/xgboost.html
       - page: Ranking With LightGBM Models
         url: /en/lightgbm.html
-      - page: Stateless model evaluation
+      - page: Stateless Model Evaluation
         url: /en/stateless-model-evaluation.html
       - page: Ranking With BM25
         url: /en/reference/bm25.html
+      - page: Significance Model
+        url: /en/significance.html
       - page: Ranking With nativeRank
         url: /en/nativerank.html
       - page: Accelerated OR search using the WAND algorithm
diff --git a/en/operations-selfhosted/vespa-cmdline-tools.html b/en/operations-selfhosted/vespa-cmdline-tools.html
index f7099f87df..469bee9f1c 100644
--- a/en/operations-selfhosted/vespa-cmdline-tools.html
+++ b/en/operations-selfhosted/vespa-cmdline-tools.html
@@ -15,6 +15,14 @@
 use-cases we recommend the Vespa CLI which should work against most Vespa
 applications regardless of how they are deployed.' %}
+<p>
+You can run these tools in the Vespa Docker image:
+</p>
+<pre>
+docker run --entrypoint bash vespaengine/vespa /opt/vespa/bin/[tool] [args]
+</pre>
+
+
@@ -1908,6 +1916,69 @@
+<h2 id="vespa-significance">vespa-significance</h2>
+<p>
+  Generates a significance model file from Vespa documents.
+  Available in Vespa as of version 8.426.8.
+</p>
+<p>
+  The generated model uses the same tokenizer as the default query processor;
+  see linguistics in Vespa for details.
+  When using a custom tokenizer, the model generator needs to be modified accordingly.
+  Tokens are converted to lower-case without stemming.
+  This corresponds to how the model is applied to query terms.
+</p>
+<p>Synopsis: <code>vespa-significance generate [options]</code></p>
+<p>Example:</p>
+<pre>
+$ vespa-significance generate --in vespa-dump.jsonl --out en_model.json --field text --language en
+</pre>
+<p>
+  When running in Docker, it is useful to mount a folder holding the feed documents,
+  which is also where the generated model file is stored, e.g.:
+</p>
+<pre>
+$ podman run -it --entrypoint bash -v $PWD/data:/data -w /data vespaengine/vespa:latest /opt/vespa/bin/vespa-significance generate --in docs.jsonl --out model.zst --field text --language en
+</pre>
+<table>
+  <thead>
+    <tr><th>Option</th><th>Description</th></tr>
+  </thead>
+  <tbody>
+    <tr>
+      <td>-h, --help</td>
+      <td>Help text</td>
+    </tr>
+    <tr>
+      <td>-i, --in <input file></td>
+      <td>JSON Lines (JSONL) file where each line is a Vespa document in JSON format.</td>
+    </tr>
+    <tr>
+      <td>-o, --out <output file></td>
+      <td>Significance model file in JSON format.</td>
+    </tr>
+    <tr>
+      <td>-f, --field <field></td>
+      <td>Name of the text field to use for the significance model.</td>
+    </tr>
+    <tr>
+      <td>-l, --language <language></td>
+      <td>
+        Language of the text field, specified as a code, e.g. <code>en</code> for English.
+        It is used by the OpenNLP tokenizer; see the supported languages with codes here.
+      </td>
+    </tr>
+    <tr>
+      <td>--zst <compression></td>
+      <td>
+        If set to <code>true</code>, compresses the output file with zstandard.
+        Default: <code>false</code>.
+      </td>
+    </tr>
+  </tbody>
+</table>
+<p>
+  Enables or disables the use of significance models specified in services.xml.
+  Overrides <code>use-model</code> set in the rank profile.
+</p>
+<p>For more details, including examples, see Significance Model.</p>
+<h2 id="significance">significance</h2>
+<p>
+  Contained in rank-profile.
+  Configures a significance model.
+</p>
+<pre>
+significance {
+    use-model: true
+}
+</pre>
+<p>The body must contain:</p>
+<table>
+  <thead>
+    <tr><th>name</th><th>occurrence</th><th>description</th></tr>
+  </thead>
+  <tbody>
+    <tr>
+      <td>use-model</td>
+      <td>One</td>
+      <td>Enable or disable the use of significance models specified in services.xml.</td>
+    </tr>
+  </tbody>
+</table>
+<p>
+  For more details, see Significance Model.
+</p>
+<h2 id="significance">significance</h2>
+<p>
+  Contained in search.
+  Specifies one or more global significance models.
+</p>
+<pre>
+<significance>
+    <model model-id="significance-en-wikipedia-v1"/>
+    <model url="https://some/uri/my-model.multilingual.json"/>
+    <model path="models/my-model.no.json.zst"/>
+</significance>
+</pre>
+<p>
+  The models are either provided by Vespa or generated with the vespa-significance tool.
+  The order determines model precedence, with the last element having the highest priority.
+  To use these models, the schema needs to enable significance models in the rank-profile.
+</p>
+<p>Sub-elements:</p>
+<ul>
+  <li><a href="#model">model</a></li>
+</ul>
+<h3 id="model">model</h3>
+<p>
+  Contained in significance.
+  Specifies a global significance model.
+  Models are identified by <code>model-id</code>, or by providing a <code>url</code> to an external model file
+  or a <code>path</code> to a model file in the application package.
+</p>
+<p>
+  Models with <code>model-id</code> are provided by Vespa and listed
+  <a href="https://cloud.vespa.ai/en/model-hub#significance-models">here</a>.
+  Example with <code>model-id</code>:
+</p>
+<pre>
+<model model-id="significance-en-wikipedia-v1"/>
+</pre>
+<p>
+  A model specified with <code>url</code> or <code>path</code> is a JSON file,
+  which can also be compressed with zstandard.
+  Model files can be generated using the vespa-significance tool.
+  Example with <code>url</code>:
+</p>
+<pre>
+<model url="https://some/uri/mymodel.multilingual.json"/>
+</pre>
+<p>
+  Models with <code>path</code> should be placed in the application package.
+  The path is relative to the application package root.
+  Example with <code>path</code>:
+</p>
+<pre>
+<model path="models/mymodel.no.json.zst"/>
+</pre>
@@ -339,7 +395,7 @@
-Example which inherits from the built in vespa chain so that
+Example which inherits from the built in vespa chain so that the searcher can dispatch queries to the content clusters:
<chain id="common" inherits="vespa">
diff --git a/en/significance.md b/en/significance.md
new file mode 100644
index 0000000000..b0b349d7fa
--- /dev/null
+++ b/en/significance.md
@@ -0,0 +1,177 @@
+---
+# Copyright Vespa.ai. Licensed under the terms of the Apache 2.0 license. See LICENSE in the project root.
+title: "Significance Model"
+---
+
+*Significance* is a measure of how rare a term is in a collection of documents.
+Rare terms like "neurotransmitter" are weighted higher during ranking than common terms like "the".
+Significance is often calculated as the inverse document frequency (IDF):
+
+$$ IDF(t, N) = \log(\frac{N}{n_t}) $$
+
+where:
+- $$ N $$ is the total number of documents in the collection
+- $$ n_t $$ is the number of documents containing the term $$ t $$
+
+Variations of IDF are used in [bm25](reference/bm25.html) and [nativeRank](reference/nativerank.html).
+
+A *significance model* provides the data necessary to calculate IDF, i.e. $$ n_t $$ for each term and $$ N $$ for the document collection.
+We distinguish between *local* and *global* significance models.
+A local model is node-specific and a global model is shared across nodes.
+
+# Local significance model
+
+For `string` fields indexed with [bm25](reference/bm25.html) or [nativeRank](reference/nativerank.html),
+Vespa creates a local significance model on each content node.
+Each node uses its own local model for the queries it processes.
+
+Different nodes can have different significance values for the same term.
+In large collections, this difference is usually small and doesn’t affect ranking quality.
+
+One issue with local models is that ranking is non-deterministic in the following cases:
+1. When new documents are added, local models on the affected content nodes are updated.
+2. When the content cluster [redistributes documents](elasticity.html) across nodes, e.g. when adding or removing nodes for scaling or failure recovery, the models change on the nodes involved.
+3. When using [grouped distribution](elasticity.html#grouped-distribution),
+queries can return different results depending on which group processes them.
+
+Another issue is that local significance models are not available in [streaming search](streaming-search.html),
+because inverted indexes are not constructed, so IDF values can't be extracted.
+All significance values are set to 1, which is the default value for unknown terms.
+The lack of significance values may degrade ranking quality.
+
+A global significance model addresses these issues.
+
+# Global significance model
+
+In a *global significance model*, significance values are shared across nodes and don’t change when new documents are added. There are three ways to provide a global model:
+
+1. Include [significance values in a query](#significance-values-in-a-query).
+2. Set [significance values in a searcher](#significance-values-in-a-searcher).
+3. Specify [models in services.xml](#significance-models-in-servicesxml).
+
+## Significance values in a query
+
+Document frequency and document count can be specified in YQL, e.g.:
+```sql
+select * from example where content contains ({documentFrequency: {frequency: 13, count: 101}}"colors")
+```
+
+Alternatively, significance values can be specified directly in YQL and used instead of the computed IDF values, e.g.:
+```sql
+select * from example where content contains ({significance:0.9}"neurotransmitter")
+```
+
+## Significance values in a searcher
+
+Document frequency and significance values can also be set in a [custom searcher](https://docs.vespa.ai/en/searcher-development.html#writing-a-searcher):
+
+```java
+private void setDocumentFrequency(WordItem item, long frequency, long numDocuments) {
+    var word = item.getWord();
+    word.setDocumentFrequency(new DocumentFrequency(frequency, numDocuments));
+}
+
+private void setSignificance(WordItem item, float significance) {
+    var word = item.getWord();
+    word.setSignificance(significance);
+}
+```
+
+## Significance models in services.xml
+
+The [`significance` element in services.xml](reference/services-search.html#significance) specifies one or more models:
+
+```xml
+<significance>
+    <model model-id="significance-en-wikipedia-v1"/>
+    <model url="https://some/uri/mymodel.multilingual.json"/>
+    <model path="models/mymodel.no.json.zst"/>
+</significance>
+```
+
+Vespa Cloud users have access to [pre-built models](https://cloud.vespa.ai/en/model-hub#significance-models), identified by `model-id`.
+In addition, all users can specify their own models by providing a `url` to an external resource or a `path` to a model file within the application package.
+Vespa provides a [command line tool](operations-selfhosted/vespa-cmdline-tools.html#vespa-significance) to generate [model files](#significance-model-file) from documents.
+The order in which the models are specified determines the model precedence; see [model resolution](#model-resolution) for details.
+
+In addition to adding models in [services.xml](reference/services-search.html#significance),
+the `significance` feature must be enabled in the [`rank-profile` section of the schema](reference/schema-reference.html#significance), e.g.:
+
+```
+schema example {
+    document example {
+        field content type string {
+            indexing: index | summary
+            index: enable-bm25
+        }
+    }
+
+    rank-profile default {
+        significance {
+            use-model: true
+        }
+    }
+}
+```
+
+The model will be applied to all query terms except those that already have significance values from the query.
+
+Specifying significance models in services.xml is available in Vespa as of version 8.426.8.
+
+### Significance model file
+
+The significance model file is a JSON file that contains term document frequencies and the document count for one or more languages, e.g.:
+
+```json
+{
+  "version": 1,
+  "id": "wikipedia",
+  "description": "Some optional description",
+  "languages": {
+    "en": {
+      "description": "Some optional description for English model",
+      "document-count": 1000,
+      "document-frequencies": {
+        "and": 500,
+        "car": 100,
+        ...
+      }
+    },
+    "no": {
+      "description": "Some optional description for Norwegian model",
+      "document-count": 800,
+      "document-frequencies": {
+        "bil": 80,
+        "og": 400,
+        ...
+      }
+    }
+  }
+}
+```
+
+A significance model file can be compressed with [zstandard](https://facebook.github.io/zstd/)
+when included in the application package or made available via a URL.
+
+Vespa provides a [CLI tool for generating model files](operations-selfhosted/vespa-cmdline-tools.html#vespa-significance) from Vespa documents.
+It uses the same linguistics module as query processing to extract tokens and their document frequencies.
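+
+To make the numbers above concrete, the following is a minimal, illustrative sketch (plain Java, not a Vespa API) of how the entries in a model file map to IDF values with the formula from the introduction. The class and method names are hypothetical, and the natural logarithm is assumed:
+
+```java
+import java.util.Map;
+
+// Illustrative only: derives IDF values from the "document-count" and
+// "document-frequencies" entries of the example model file above,
+// using IDF(t, N) = log(N / n_t).
+public class IdfSketch {
+
+    static double idf(long documentCount, long documentFrequency) {
+        return Math.log((double) documentCount / documentFrequency);
+    }
+
+    public static void main(String[] args) {
+        long documentCount = 1000;                      // "document-count" for the "en" model
+        Map<String, Long> documentFrequencies = Map.of( // subset of "document-frequencies"
+                "and", 500L,
+                "car", 100L);
+
+        // A common term gets a low weight, a rarer term a higher one:
+        System.out.println(idf(documentCount, documentFrequencies.get("and"))); // log(1000/500) ≈ 0.69
+        System.out.println(idf(documentCount, documentFrequencies.get("car"))); // log(1000/100) ≈ 2.30
+    }
+}
+```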
+
+### Model resolution
+
+Model resolution selects a model from the models specified in [services.xml](#significance-models-in-servicesxml), based on the language of the query.
+The language can be either [explicitly tagged](reference/query-api-reference.html#model.language) or [implicitly detected](linguistics.html#query-language-detection).
+
+The resolution logic is as follows:
+- When the language is explicitly tagged:
+  - Select the last specified model that has the tagged language.
+    Fail if none is available.
+  - If the language is tagged as “un” (unknown), select the model for “un” first, falling back to “en” (English).
+    Fail if none are available.
+- When the language is implicitly detected:
+  - Select the last specified model with the detected language. If none is available, try “un” and then “en”.
+    Fail if none are available.
diff --git a/en/streaming-search.html b/en/streaming-search.html
index c86391b622..2ff7e13ada 100644
--- a/en/streaming-search.html
+++ b/en/streaming-search.html
@@ -48,9 +48,9 @@
 Differences in streaming search