# Vespa sample applications - Simple hybrid search with SPLADE

This semantic search application combines BM25 retrieval with SPLADE re-ranking. The sample app demonstrates the splade-embedder.

This sample application uses an Apache 2.0 licensed SPLADE model checkpoint, prithivida/Splade_PP_en_v1, since the original SPLADE repository and model checkpoints have restrictive licenses.

There is a growing number of independent open-source sparse encoder checkpoints that are compatible with the Vespa splade embedder implementation.

See the section below on exporting fill-mask language models to ONNX format.

Requires Vespa 8.320.68 or later.

## To try this application

Follow the Vespa getting started guide through the vespa deploy step, cloning the splade sample app instead of album-recommendation.
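
A minimal sketch of the clone and deploy steps, assuming the Vespa CLI is installed and a deployment target is configured (the directory name my-splade-app is an arbitrary choice):

$ vespa clone splade my-splade-app && cd my-splade-app
$ vespa deploy --wait 300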

## Indexing sample documents

$ vespa feed ext/*.json
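
To sanity-check the feed (an optional step, not part of the original walkthrough), the following query returns only the total document count:

$ vespa query 'yql=select * from doc where true' 'hits=0'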

## Query examples

We demonstrate queries using the vespa-cli tool; add -v to see the equivalent curl command against the HTTP query API.

$ vespa query 'query=stars' \
 'input.query(q)=embed(splade,@query)' \
 'presentation.format.tensors=short-value'

This produces the following hit output:

{
    "id": "id:doc:doc::3",
    "relevance": 32.3258056640625,
    "source": "text",
    "fields": {
        "matchfeatures": {
            "bm25(chunk)": 1.041708310095213,
            "bm25(title)": 0.9808292530117263,
            "query(q)": {
                "who": 1.1171875,
                "star": 2.828125,
                "stars": 2.875,
                "sky": 0.9375,
                "planet": 0.828125
            },
            "chunk_token_scores": {
                "star": 7.291259765625,
                "stars": 7.771484375,
                "planet": 0.7310791015625
            },
            "title_token_scores": {
                "star": 8.086669921875,
                "stars": 8.4453125
            }
        },
        "sddocname": "doc",
        "documentid": "id:doc:doc::3",
        "splade_chunk_embedding": {
            "the": 0.84375,
            "with": 1.3671875,
            "star": 2.578125,
            "stars": 2.703125,
            "filled": 2.171875,
            "planet": 0.8828125,
            "universe": 1.4296875,
            "fill": 2.03125,
            "filling": 1.5546875,
            "galaxy": 2.765625,
            "galaxies": 1.7265625
        },
        "splade_title_embedding": {
            "about": 1.984375,
            "star": 2.859375,
            "stars": 2.9375,
            "documents": 1.8671875,
            "starred": 0.81640625,
            "document": 2.671875,
            "concerning": 0.8671875
        },
        "title": "A document about stars",
        "chunk": "The galaxy is filled with stars"
    }
}

The rank-profile used here is default, specified in the schemas/doc.sd file.

It includes a match-features configuration specifying the rank features and tensors we want returned with each hit. Together with the document's summary fields, the output includes:

- bm25(title) - the bm25 score of the (query, title) pair
- bm25(chunk) - the bm25 score of the (query, chunk) pair
- query(q) - the SPLADE query tensor produced by the embedder, with all tokens and their corresponding weights
- splade_chunk_embedding - the mapped tensor produced by the embedder at indexing time (chunk)
- splade_title_embedding - the mapped tensor produced by the embedder at indexing time (title)
- chunk_token_scores - the non-zero overlap between the mapped query tensor and the mapped chunk tensor
- title_token_scores - same as above, but for the title

The last two outputs allow us to highlight the terms of the source text for explainability.
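
To make this concrete, here is a rough sketch of what such a rank profile could look like. The feature and field names are taken from the output above, but the tensor type and the phase expressions are assumptions; the actual profile in schemas/doc.sd is authoritative. Note that the splade_*_embedding fields appear in the output as document summary fields rather than as match-features.

rank-profile default {
    inputs {
        # The SPLADE query tensor produced by embed(splade, @query)
        query(q) tensor<float>(token{})
    }
    # Non-zero overlap between the query expansion and the stored chunk expansion
    function chunk_token_scores() {
        expression: query(q) * attribute(splade_chunk_embedding)
    }
    # Same overlap for the title field
    function title_token_scores() {
        expression: query(q) * attribute(splade_title_embedding)
    }
    first-phase {
        expression: bm25(title) + bm25(chunk)
    }
    second-phase {
        # SPLADE dot products used for re-ranking
        expression: sum(chunk_token_scores) + sum(title_token_scores)
    }
    match-features {
        bm25(title)
        bm25(chunk)
        query(q)
        chunk_token_scores
        title_token_scores
    }
}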

Note that this application sets a high term-score-threshold (a splade-embedder configuration parameter) to reduce output verbosity. This setting controls which tokens are retained and used in the dot product calculation(s).

A higher threshold increases sparseness, which reduces both computational cost and accuracy.

$ vespa query 'query=boats' \
 'input.query(q)=embed(splade,@query)' \
 'presentation.format.tensors=short-value'
$ vespa query 'query=humans talk a lot' \
 'input.query(q)=embed(splade,@query)' \
 'presentation.format.tensors=short-value'

## Retrieval versus ranking

Note that in this sample application, Vespa does not use the expanded sparse learned weights for retrieval (matching).

Instead, the SPLADE representation is used in a phased ranking pipeline, where we retrieve efficiently using Vespa's weakAnd algorithm with BM25 scoring.

This phased ranking pipeline considerably speeds up retrieval compared to matching on the full lexical expansion. It is also possible to retrieve using the wand Vespa query operator; see the example in the Vespa documentation on using wand.

We can also brute-force score and rank all documents that match a filter; this can be accelerated by using multiple search threads per query.

$ vespa query 'yql=select * from doc where true' \
 'input.query(q)=embed(splade, "night sky of stars")' \
 'presentation.format.tensors=short-value'
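
To use several search threads for such a brute-force query, the thread count can be requested per query with the ranking.matching.numThreadsPerSearch query API parameter (the value 4 below is just an example; the effective number of threads also depends on the content node configuration):

$ vespa query 'yql=select * from doc where true' \
 'input.query(q)=embed(splade, "night sky of stars")' \
 'ranking.matching.numThreadsPerSearch=4' \
 'presentation.format.tensors=short-value'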

For longer contexts using array inputs, see the tensor playground example for scoring options.

playground splade tensors in ranking

## Exporting fill-mask models to ONNX

To export a model trained with fill-mask (compatible with the splade-embedder):

$ pip3 install optimum onnx 

Export the model using optimum-cli with the fill-mask task, writing the result to the models directory:

$ optimum-cli export onnx --task fill-mask --model the-splade-model-id models
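
For example, for the checkpoint used by this sample application:

$ optimum-cli export onnx --task fill-mask --model prithivida/Splade_PP_en_v1 models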

Remove the exported model files that are not needed by Vespa:

$ find models/ -type f ! -name 'model.onnx' ! -name 'tokenizer.json' | xargs rm

## Terminate container

This is only relevant when running this sample application locally. Remove the container after use:

$ docker rm -f vespa