Improve lucene-linguistics no-java app (#1402)
* Add testing of non-java Lucene linguistic sample app

* remove tuning
Jo Kristian Bergum authored Mar 8, 2024
1 parent ea537bc commit 3e0a71f
Showing 9 changed files with 122 additions and 60 deletions.
44 changes: 22 additions & 22 deletions README.md
Original file line number Diff line number Diff line change
@@ -16,23 +16,28 @@ For operational sample applications, see [examples/operations](examples/operatio
The [album-recommendation](album-recommendation/) is the intro application to Vespa.
Learn how to configure the schema for simple recommendation and search use cases.

### Simple semantic search
### Simple hybrid semantic search
The [simple semantic search](simple-semantic-search/)
application demonstrating indexed vector search using `HNSW`,
application demonstrates indexed vector search using `HNSW`,
creating embedding vectors from a transformer language model inside Vespa, and hybrid
text and semantic ranking. This app also demonstrates native embedders.
text and semantic ranking. This app also demonstrates using native Vespa embedders.

### Indexing multiple vectors per field
The [Vespa Multi-Vector Indexing with HNSW](multi-vector-indexing/) demonstrates how to
index multiple vectors per document field for better semantic search.
index multiple vectors per document field for better semantic search for longer documents.

### ColBERT multi-token level embeddings
The [colbert](colbert) demonstrates how to
### ColBERT token-level embeddings
The [colbert](colbert) application demonstrates how to
use the Vespa colbert-embedder for explainable semantic search with better accuracy than regular
text embedding models.

### ColBERT token-level embeddings for long documents
The [colbert-long](colbert-long) application demonstrates how to
use the Vespa colbert-embedder for explainable semantic search for longer documents.

### Multilingual semantic search
The [multilingual](multilingual-search) sample application demonstrates multilingual semantic search.
The [multilingual](multilingual-search) sample application demonstrates multilingual semantic search
with multilingual text embedding models.

### Customizing embeddings
The [custom-embeddings](custom-embeddings/) application demonstrates customizing frozen document embeddings for downstream tasks.
@@ -46,23 +51,21 @@ This application demonstrates basic search functionality.
It also demonstrates how to build a recommendation system
where the approximate nearest neighbor search in a shared user/item embedding space
is used to retrieve recommended content for a user.
This sample app also demonstrates using parent-child relationships.
This app also demonstrates using parent-child relationships.

### Billion-scale Image Search
This [billion-scale-image-search](billion-scale-image-search/) app demonstrates
billion-scale image search using CLIP retrieval. Features separation of compute from storage, and query time vector similarity de-duping. PCA dimension reduction and more.
billion-scale image search using CLIP retrieval. It features separation of compute from storage and query time vector similarity de-duping. PCA dimension reduction and more.

### State-of-the-art Text Ranking
This [msmarco-ranking](msmarco-ranking/) application demonstrates
how to represent state-of-the-art text ranking using Transformer (BERT) models.
It uses the MS Marco passage and document ranking datasets and features both
It uses the MS Marco passage ranking datasets and features
bi-encoders, cross-encoders, and late-interaction models (ColBERT).

The passage ranking part uses multiple state of the art pretrained language models
in a multiphase retrieval and ranking pipeline.
See also [Pretrained Transformer Models for Search](https://blog.vespa.ai/pretrained-transformer-language-models-for-search-part-1/) blog post series.
There is also a simpler ranking app also using the MS Marco relevancy dataset.
See [text-search](text-search) which uses traditional IR text matching with BM25/Vespa nativeRank.

See also the more simplistic [text-search](text-search) app that demonstrates
traditional text search using BM25/Vespa nativeRank.

### Next generation E-Commerce Search

@@ -75,17 +78,14 @@ learning-to-rank techniques (Including `XGBoost` and `LightGBM`) for improving p

### Extractive Question Answering
The [dense-passage-retrieval-with-ann](dense-passage-retrieval-with-ann/) application
demonstrates end to end question answering using Facebook's DPR (Dense passage Retriever) for Extractive Question Answering. Extractive question answering, extracts
the answer from the evidence passage(s).

This sample app uses Vespa's approximate nearest neighbor search to efficiently retrieve text passages
from a Wikipedia-based collection of 21M passages. A BERT-based reader component reads the top-ranking passages and produces the textual answer to the question.
demonstrates end-to-end question answering using Facebook's DPR (Dense passage Retriever) model.
The extractive answering part extracts an answer from the evidence passage(s).

See also [Efficient Open Domain Question Answering with Vespa](https://blog.vespa.ai/efficient-open-domain-question-answering-on-vespa/)
and [Scaling Question Answering with Vespa](https://blog.vespa.ai/from-research-to-production-scaling-a-state-of-the-art-machine-learning-system/).

### Search as you type and search suggestions
The [incremental-search](incremental-search/) application demonstrates search-as-you-type where for each keystroke of the user, retrieves matching documents.
### Search as you type and query suggestions
The [incremental-search](incremental-search/) application demonstrates search-as-you-type functionality, where for each keystroke of the user, it retrieves matching documents.
It also demonstrates search suggestions (query auto-completion).

### Vespa as ML inference server (model-inference)
69 changes: 49 additions & 20 deletions examples/lucene-linguistics/non-java/README.md
@@ -1,27 +1,56 @@
# Lucene Linguistics in non-Java Vespa applications

In non-java projects it is possible to use Lucene Linguistics as a jar bundle.
<!-- Copyright Yahoo. Licensed under the terms of the Apache 2.0 license. See LICENSE in the project root. -->

Download and add the Vespa bundle jar into the `components` directory:
```shell
(mkdir -p components && cd components && curl -L https://github.com/dainiusjocas/vespa-lucene-linguistics-bundle/releases/download/v.0.0.3/lucene-linguistics-bundle-0.0.3-deploy.jar --output lucene-linguistics-bundle-0.0.3-deploy.jar)
```
<picture>
<source media="(prefers-color-scheme: dark)" srcset="https://vespa.ai/assets/vespa-ai-logo-heather.svg">
<source media="(prefers-color-scheme: light)" srcset="https://vespa.ai/assets/vespa-ai-logo-rock.svg">
<img alt="#Vespa" width="200" src="https://vespa.ai/assets/vespa-ai-logo-rock.svg" style="margin-bottom: 25px;">
</picture>

Deploy the application package:
```shell
vespa deploy -w 100
```
# Vespa sample applications - Lucene Linguistics

Run a query:
```shell
vespa query 'query=Vespa' 'language=lt'
```
This app demonstrates using [Lucene Linguistics](https://docs.vespa.ai/en/lucene-linguistics.html).

The logs should contain record:
```text
[2023-08-16 11:21:04.847] INFO container Container.com.yahoo.language.lucene.AnalyzerFactory Analyzer for language=lt is from a list of default language analyzers.
```

Profit.
<p data-test="run-macro init-deploy examples/lucene-linguistics/non-java">
Requires at least Vespa 8.315.19
</p>

## To try this application

Follow [Vespa getting started](https://cloud.vespa.ai/en/getting-started)
through the <code>vespa deploy</code> step, cloning `examples/lucene-linguistics/non-java` instead of `album-recommendation`.

Feed 3 sample documents in Norwegian, Swedish, and Finnish:

<pre data-test="exec">
vespa feed ext/*.json
</pre>

Example queries:

<pre data-test="exec" data-test-assert-contains="id:no:doc::1">
vespa query 'yql=select * from doc where userQuery()' \
'language=no' 'summary=debug-text-tokens' \
'query=tips til utendørsaktiviteter'
</pre>

<pre data-test="exec" data-test-assert-contains="id:sv:doc::1">
vespa query 'yql=select * from doc where userQuery()' \
'language=sv' 'summary=debug-text-tokens' \
'query=tips til utomhusaktiviteter'
</pre>

<pre data-test="exec" data-test-assert-contains="id:fi:doc::1">
vespa query 'yql=select * from doc where userQuery()' \
'language=fi' 'summary=debug-text-tokens' \
'query=vinkkejä ulkoilma-aktiviteetteihin'
</pre>
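
The three queries above differ only in the `language` and `query` parameters. As a rough sketch, the same requests can be expressed as HTTP query URLs. The parameter names are taken from the commands above; the endpoint `http://localhost:8080/search/` is an assumption for a local deployment, and the helper names `build_query_url` and `decode_params` are hypothetical.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumption: a local Vespa container answering on the default query port.
DEFAULT_ENDPOINT = "http://localhost:8080/search/"


def build_query_url(language, query, endpoint=DEFAULT_ENDPOINT):
    """Build a query URL carrying the same parameters as the CLI examples."""
    params = {
        "yql": "select * from doc where userQuery()",
        "language": language,            # selects the language used for query analysis
        "summary": "debug-text-tokens",  # document summary defined in the schema
        "query": query,
    }
    # urlencode percent-encodes non-ASCII characters as UTF-8.
    return endpoint + "?" + urlencode(params)


def decode_params(url):
    """Decode the query string again, useful for checking what will be sent."""
    return {key: values[0] for key, values in parse_qs(urlsplit(url).query).items()}


if __name__ == "__main__":
    print(build_query_url("no", "tips til utendørsaktiviteter"))
```

Building the URL through `urlencode` avoids hand-escaping characters such as `ø` and `ä` in the query text.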

### Terminate container

Remove the container after use (Only relevant for local deployments)
<pre data-test="exec">
$ docker rm -f vespa
</pre>

The jar is hosted on [Github](https://github.com/dainiusjocas/vespa-lucene-linguistics-bundle/releases).
7 changes: 7 additions & 0 deletions examples/lucene-linguistics/non-java/ext/fi.json
@@ -0,0 +1,7 @@
{
"put": "id:fi:doc::1",
"fields": {
"text": "Tervetuloa retkeilemään! Tässä oppaassa jaamme vinkkejä retkeilyreitin suunnitteluun ja valmistautumiseen. Olipa suunnitelmissasi päiväretki lähiluontoon tai pidempi vaellusreissu kansallispuistossa, löydät täältä tarvittavat tiedot ja neuvoja unohtumattoman retken järjestämiseksi.",
"language": "fi"
}
}
7 changes: 7 additions & 0 deletions examples/lucene-linguistics/non-java/ext/no.json
@@ -0,0 +1,7 @@
{
"put": "id:no:doc::1",
"fields": {
"text": "Velkommen til naturopplevelser! I denne guiden deler vi tips om planlegging og forberedelser til utendørsaktiviteter. Enten du planlegger en dagstur i nærområdet eller en lengre fjelltur i nasjonalparken, finner du her nødvendig informasjon og råd for å arrangere en minneverdig tur.",
"language": "no"
}
}
7 changes: 7 additions & 0 deletions examples/lucene-linguistics/non-java/ext/sv.json
@@ -0,0 +1,7 @@
{
"put": "id:sv:doc::1",
"fields": {
"text": "Välkommen till naturäventyr! I den här guiden delar vi tips om planering och förberedelser inför utomhusaktiviteter. Oavsett om du planerar en dagsutflykt i närområdet eller en längre vandringsresa i nationalparken, hittar du här nödvändig information och råd för att arrangera en minnesvärd tur.",
"language": "sv"
}
}
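
The three feed files share one shape: a `put` operation whose document ID follows the `id:<namespace>:<document-type>:<key-values>:<user-specified>` pattern, plus `text` and `language` fields. A minimal sketch of a pre-feed sanity check follows; the names `DocumentId`, `parse_document_id`, and `check_feed_operation` are hypothetical and not part of the sample app.

```python
import json
from typing import NamedTuple


class DocumentId(NamedTuple):
    namespace: str
    document_type: str
    key_values: str
    user_specified: str


def parse_document_id(doc_id: str) -> DocumentId:
    """Split a document ID such as 'id:fi:doc::1' into its parts."""
    # The user-specified part may itself contain colons, so split at most 4 times.
    parts = doc_id.split(":", 4)
    if len(parts) != 5 or parts[0] != "id":
        raise ValueError(f"not a valid document id: {doc_id!r}")
    return DocumentId(*parts[1:])


def check_feed_operation(raw: str, expected_type: str = "doc") -> DocumentId:
    """Check one feed operation for the fields the 'doc' schema expects."""
    operation = json.loads(raw)
    doc_id = parse_document_id(operation["put"])
    if doc_id.document_type != expected_type:
        raise ValueError(f"unexpected document type: {doc_id.document_type!r}")
    fields = operation["fields"]
    for name in ("text", "language"):
        if not fields.get(name):
            raise ValueError(f"missing or empty field: {name}")
    return doc_id
```

The check deliberately does not require the namespace to equal the `language` field; that they match in these files appears to be a naming convention rather than a rule.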
27 changes: 27 additions & 0 deletions examples/lucene-linguistics/non-java/schemas/doc.sd
@@ -0,0 +1,27 @@
schema doc {

document doc {
field language type string {
indexing: set_language | summary | index
match: word
}
field text type string {
indexing: summary | index
index: enable-bm25
}
}

fieldset default {
fields: text
}
document-summary debug-text-tokens {
summary documentid {}
summary language {}
summary text {}
summary text_tokens {
source: text
tokens
}
from-disk
}
}
15 changes: 0 additions & 15 deletions examples/lucene-linguistics/non-java/schemas/lucene.sd

This file was deleted.

5 changes: 2 additions & 3 deletions examples/lucene-linguistics/non-java/services.xml
@@ -3,7 +3,7 @@
<container id="container" version="1.0">
<component id="linguistics"
class="com.yahoo.language.lucene.LuceneLinguistics"
bundle="lucene-linguistics-bundle">
bundle="lucene-linguistics">
<config name="com.yahoo.language.lucene.lucene-analysis"/>
</component>
<document-processing/>
@@ -13,8 +13,7 @@
<content id="content" version="1.0">
<min-redundancy>1</min-redundancy>
<documents>
<document type="lucene" mode="index"/>
<document-processing cluster="container"/>
<document type="doc" mode="index"/>
</documents>
</content>
</services>
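
This change renames the bundle to `lucene-linguistics` and the document type to `doc`, so `services.xml` and the schema must agree. The sketch below checks that consistency with Python's standard XML parser. The embedded XML is a reduced fragment assembled from the hunks above (the root element and anything outside the hunks are assumptions), and `check_services` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Reduced fragment assembled from the diff hunks; the <services> root element
# and the overall nesting outside the hunks are assumptions.
SERVICES_XML = """
<services version="1.0">
  <container id="container" version="1.0">
    <component id="linguistics"
               class="com.yahoo.language.lucene.LuceneLinguistics"
               bundle="lucene-linguistics">
      <config name="com.yahoo.language.lucene.lucene-analysis"/>
    </component>
    <document-processing/>
  </container>
  <content id="content" version="1.0">
    <min-redundancy>1</min-redundancy>
    <documents>
      <document type="doc" mode="index"/>
    </documents>
  </content>
</services>
"""


def check_services(xml_text, schema_name, bundle="lucene-linguistics"):
    """Return a list of mismatches between services.xml and the expected names."""
    root = ET.fromstring(xml_text)
    problems = []

    component = root.find("./container/component[@id='linguistics']")
    if component is None:
        problems.append("no linguistics component configured")
    elif component.get("bundle") != bundle:
        problems.append(f"unexpected bundle: {component.get('bundle')!r}")

    document_types = [d.get("type") for d in root.findall("./content/documents/document")]
    if schema_name not in document_types:
        problems.append(f"document type {schema_name!r} not listed in content cluster")

    return problems


if __name__ == "__main__":
    print(check_services(SERVICES_XML, "doc") or "services.xml matches the schema")
```

Running the check with the old schema name `lucene` reports the mismatch that deleting `schemas/lucene.sd` without updating `services.xml` would have caused.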
1 change: 1 addition & 0 deletions test/_test_config.yml
@@ -10,6 +10,7 @@ urls:
- colbert/README.md
- text-image-search/README.md
- text-search/README.md
- examples/lucene-linguistics/non-java//README.md
- examples/document-processing/README.md
- examples/predicate-fields/README.md
- examples/operations/multinode/README.md