Uses N-grams to simulate substring search.
Requirements:
- Docker Desktop installed and running. At least 4 GB of memory available to Docker is recommended; refer to Docker memory for details and troubleshooting.
- Alternatively, deploy using Vespa Cloud
- Operating system: Linux, macOS or Windows 10 Pro (Docker requirement)
- Architecture: x86_64 or arm64
- Homebrew to install the Vespa CLI, or download a Vespa CLI release from GitHub releases.
- Java 17 installed.
- Apache Maven. This sample app uses custom Java components, and Maven is used to build the application.
Validate the environment; Docker must have at least 4 GB of memory:
$ docker info | grep "Total Memory"
or
$ podman info | grep "memTotal"
Install Vespa CLI:
$ brew install vespa-cli
For local deployment using the Docker image:
$ vespa config set target local
Pull and start the Vespa Docker container image:
$ docker pull vespaengine/vespa
$ docker run --detach --name vespa --hostname vespa-container \
  --publish 127.0.0.1:8080:8080 --publish 127.0.0.1:19071:19071 \
  vespaengine/vespa
Download this sample application:
$ vespa clone incremental-search/search-as-you-type myapp && cd myapp
Build the application package:
$ mvn clean package -U
Download feed file:
$ curl -L -o search-as-you-type-index.jsonl \
  https://data.vespa-cloud.com/sample-apps-data/search-as-you-type-index.jsonl
Verify that configuration service (deploy api) is ready:
$ vespa status deploy --wait 300
Deploy the application:
$ vespa deploy --wait 300
It is possible to deploy this app to Vespa Cloud.
Wait for the application endpoint to become available:
$ vespa status --wait 300
Run the Vespa System Tests, a set of basic tests that verify the application is working as expected:
$ vespa test src/test/application/tests/system-test/search-as-you-type-test.json
Feed documents:
$ while read -r line; do printf '%s\n' "$line" > tmp.json; vespa document tmp.json; done < search-as-you-type-index.jsonl
Run a query against the grams fieldset:
$ vespa query \
  'yql=select * from doc where ([{"defaultIndex":"grams"}]userInput(@query))' \
  'hits=10' \
  'query=xgb'
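The same query can also be sent over HTTP. The sketch below only builds the request URL; it assumes the container's standard /search/ endpoint on the port published above, and the helper name build_query_url is hypothetical, not part of the sample app.

```python
from urllib.parse import urlencode

# YQL taken verbatim from the vespa query command above.
YQL = 'select * from doc where ([{"defaultIndex":"grams"}]userInput(@query))'


def build_query_url(user_query, hits=10, endpoint="http://localhost:8080/search/"):
    """Build a search URL carrying the same parameters as the CLI query.

    The endpoint default is an assumption based on the port published
    by the docker run command; adjust it for other deployments.
    """
    params = {"yql": YQL, "hits": hits, "query": user_query}
    # urlencode percent-encodes the YQL so it is safe to place in a URL.
    return endpoint + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_query_url("xgb"))
```

The printed URL can be passed to curl or opened in a browser once the application is deployed and fed.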
Check out the website by opening http://localhost:8080/site/ in a browser, or fetch it with curl:
$ curl -s http://localhost:8080/site/
Shut down and remove the Docker container:
$ docker rm -f vespa
Substring searches are slow on large amounts of data. An N-gram search can serve as a faster but less precise substring-like search. The fields title and content are re-indexed to create the fields gram_title and gram_content, each with an N-gram index. In this example the gram size is set to 3, but other sizes can be used. A smaller gram size returns more hits, but more of them may be irrelevant.
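To make the trade-off concrete, here is a minimal Python sketch of N-gram matching. It is an illustration, not Vespa's implementation: it assumes text is lowercased and split on whitespace, that each word is cut into overlapping grams, and that a query matches when all of its grams occur in the document. The function names are hypothetical.

```python
def ngrams(text, n=3):
    """Return the overlapping character n-grams of each word in text."""
    grams = []
    for word in text.lower().split():
        if len(word) <= n:
            # Assumption: words no longer than n are kept whole.
            grams.append(word)
        else:
            grams.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return grams


def gram_match(query, document, n=3):
    """True if every gram of the query also occurs in the document."""
    doc_grams = set(ngrams(document, n))
    query_grams = ngrams(query, n)
    return bool(query_grams) and all(g in doc_grams for g in query_grams)


if __name__ == "__main__":
    print(ngrams("xgboost"))                      # grams of size 3
    print(gram_match("xgb", "XGBoost library"))   # substring-like hit
    print(gram_match("abcde", "abcd xcde"))       # hit without a true substring
    print(gram_match("bost", "boost", n=3))       # no hit with gram size 3
    print(gram_match("bost", "boost", n=2))       # hit with gram size 2
```

Under these assumptions, "abcde" matches "abcd xcde" even though it is not a substring, which shows the loss of precision, and "bost" matches "boost" only with the smaller gram size.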
A hit on a whole word is most likely more relevant than a hit on only part of a word. Therefore, we search through both the default fieldset and the grams fieldset, and we weight hits on the default fieldset higher than hits on the grams fieldset. These weights can be seen in the weighted_doc_rank rank profile.
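The weighting idea can be sketched in a few lines of Python. This is not the weighted_doc_rank profile itself: the weights 2.0 and 1.0, the linear combination, and the field names default_score and gram_score are assumptions chosen only to show how a higher weight on the default fieldset changes the ordering.

```python
# Hypothetical weights; the real values live in the weighted_doc_rank rank profile.
DEFAULT_WEIGHT = 2.0
GRAM_WEIGHT = 1.0


def weighted_score(default_score, gram_score):
    """Combine the two match scores, favoring whole-word (default) matches."""
    return DEFAULT_WEIGHT * default_score + GRAM_WEIGHT * gram_score


def rank_hits(hits):
    """Sort hits by weighted score, best first."""
    return sorted(
        hits,
        key=lambda h: weighted_score(h["default_score"], h["gram_score"]),
        reverse=True,
    )


if __name__ == "__main__":
    hits = [
        {"id": "partial", "default_score": 0.0, "gram_score": 0.8},
        {"id": "whole", "default_score": 0.5, "gram_score": 0.5},
    ]
    for hit in rank_hits(hits):
        print(hit["id"], weighted_score(hit["default_score"], hit["gram_score"]))
```

With these example numbers the whole-word hit ranks first even though the partial hit has the higher gram score.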
Text highlights are generated by including summary: dynamic in a field.
As searches on default and grams match different parts of the text, the highlights of these matches will also be different.
The line contentScore*highlightWeight >= gramContentScore in src/main/resources/site/js/main.js decides which of these highlights should be shown on the website.
The variable highlightWeight can be tweaked to prioritize default highlighting or grams highlighting.
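The effect of highlightWeight can be illustrated with a small Python sketch of the comparison. The actual logic lives in main.js; the function name choose_highlight, its parameters, and the example scores are hypothetical, and only the comparison itself is taken from the text above.

```python
def choose_highlight(content_score, gram_content_score,
                     default_highlight, gram_highlight,
                     highlight_weight=1.0):
    """Pick which highlight to show, mirroring the comparison
    contentScore*highlightWeight >= gramContentScore.

    highlight_weight=1.0 is an assumed default, not the value in main.js.
    """
    if content_score * highlight_weight >= gram_content_score:
        return default_highlight
    return gram_highlight


if __name__ == "__main__":
    default_hl = "default highlight"
    gram_hl = "grams highlight"
    # With equal weighting, the higher gram score wins.
    print(choose_highlight(0.4, 0.6, default_hl, gram_hl, highlight_weight=1.0))
    # Raising the weight tips the choice toward the default highlight.
    print(choose_highlight(0.4, 0.6, default_hl, gram_hl, highlight_weight=2.0))
```

A weight above 1 favors the default highlight, and a weight below 1 favors the grams highlight.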