Improve lucene-linguistics no-java app (#1402)

* Add testing of non-java Lucene linguistic sample app
* remove tuning
vespa-engine · Mar 8, 2024 · 3e0a71f
1 parent ea537bc
commit 3e0a71f

Showing 9 changed files with 122 additions and 60 deletions.
diff --git a/README.md b/README.md
@@ -16,23 +16,28 @@ For operational sample applications, see [examples/operations](examples/operatio
 The [album-recommendation](album-recommendation/) is the intro application to Vespa.
 Learn how to configure the schema for simple recommendation and search use cases.
 
-### Simple semantic search
+### Simple hybrid semantic search
 The [simple semantic search](simple-semantic-search/)
-application demonstrating indexed vector search using `HNSW`, 
+application demonstrates indexed vector search using `HNSW`, 
 creating embedding vectors from a transformer language model inside Vespa, and hybrid
-text and semantic ranking. This app also demonstrates native embedders. 
+text and semantic ranking. This app also demonstrates using native Vespa embedders. 
 
 ### Indexing multiple vectors per field
 The [Vespa Multi-Vector Indexing with HNSW](multi-vector-indexing/) demonstrates how to 
-index multiple vectors per document field for better semantic search. 
+index multiple vectors per document field for better semantic search for longer documents.  
 
-### ColBERT multi-token level embeddings
-The [colbert](colbert) demonstrates how to 
+### ColBERT token-level embeddings
+The [colbert](colbert) application demonstrates how to 
 use the Vespa colbert-embedder for explainable semantic search with better accuracy than regular
 text embedding models. 
 
+### ColBERT token-level embeddings for long documents
+The [colbert-long](colbert-long) application demonstrates how to 
+use the Vespa colbert-embedder for explainable semantic search for longer documents. 
+
 ### Multilingual semantic search
-The [multilingual](multilingual-search) sample application demonstrates multilingual semantic search. 
+The [multilingual](multilingual-search) sample application demonstrates multilingual semantic search 
+with multilingual text embedding models. 
 
 ### Customizing embeddings 
 The [custom-embeddings](custom-embeddings/) application demonstrates customizing frozen document embeddings for downstream tasks. 
@@ -46,23 +51,21 @@ This application demonstrates basic search functionality.
 It also demonstrates how to build a recommendation system
 where the approximate nearest neighbor search in a shared user/item embedding space
 is used to retrieve recommended content for a user.
-This sample app also demonstrates using parent-child relationships.
+This app also demonstrates using parent-child relationships.
 
 ### Billion-scale Image Search
 This [billion-scale-image-search](billion-scale-image-search/) app demonstrates 
-billion-scale image search using CLIP retrieval. Features separation of compute from storage, and query time vector similarity de-duping. PCA dimension reduction and more.
+billion-scale image search using CLIP retrieval. It features separation of compute from storage and query time vector similarity de-duping. PCA dimension reduction and more.
 
 ### State-of-the-art Text Ranking
 This [msmarco-ranking](msmarco-ranking/) application demonstrates 
 how to represent state-of-the-art text ranking using Transformer (BERT) models.
-It uses the MS Marco passage and document ranking datasets and features both
+It uses the MS Marco passage ranking datasets and features
 bi-encoders, cross-encoders, and late-interaction models (ColBERT).
 
-The passage ranking part uses multiple state of the art pretrained language models
-in a multiphase retrieval and ranking pipeline.
-See also [Pretrained Transformer Models for Search](https://blog.vespa.ai/pretrained-transformer-language-models-for-search-part-1/) blog post series.
-There is also a simpler ranking app also using the MS Marco relevancy dataset.
-See [text-search](text-search) which uses traditional IR text matching with BM25/Vespa nativeRank.
+
+See also the more simplistic [text-search](text-search) app that demonstrates 
+traditional text search using BM25/Vespa nativeRank.
 
 ### Next generation E-Commerce Search
 
@@ -75,17 +78,14 @@ learning-to-rank techniques (Including `XGBoost` and `LightGBM`) for improving p
 
 ### Extractive Question Answering
 The [dense-passage-retrieval-with-ann](dense-passage-retrieval-with-ann/) application
-demonstrates end to end question answering using Facebook's DPR (Dense passage Retriever) for Extractive Question Answering. Extractive question answering, extracts
-the answer from the evidence passage(s).
-
-This sample app uses Vespa's approximate nearest neighbor search to efficiently retrieve text passages
-from a Wikipedia-based collection of 21M passages. A BERT-based reader component reads the top-ranking passages and produces the textual answer to the question.
+demonstrates end-to-end question answering using Facebook's DPR (Dense passage Retriever) model. 
+The extractive answering part extracts an answer from the evidence passage(s).
 
 See also [Efficient Open Domain Question Answering with Vespa](https://blog.vespa.ai/efficient-open-domain-question-answering-on-vespa/)
 and [Scaling Question Answering with Vespa](https://blog.vespa.ai/from-research-to-production-scaling-a-state-of-the-art-machine-learning-system/).
 
-### Search as you type and search suggestions 
-The [incremental-search](incremental-search/) application demonstrates search-as-you-type where for each keystroke of the user, retrieves matching documents. 
+### Search as you type and query suggestions 
+The [incremental-search](incremental-search/) application demonstrates search-as-you-type functionality, where for each keystroke of the user, it retrieves matching documents. 
 It also demonstrates search suggestions (query auto-completion).
 
 ### Vespa as ML inference server (model-inference)

diff --git a/examples/lucene-linguistics/non-java/README.md b/examples/lucene-linguistics/non-java/README.md
@@ -1,27 +1,56 @@
-# Lucene Linguistics in non-Java Vespa applications
 
-In non-java projects it is possible to use Lucene Linguistics as a jar bundle.
+<!-- Copyright Yahoo. Licensed under the terms of the Apache 2.0 license. See LICENSE in the project root. -->
 
-Download and add the Vespa bundle jar into the `components` directory:
-```shell
-(mkdir -p components && cd components && curl -L https://github.com/dainiusjocas/vespa-lucene-linguistics-bundle/releases/download/v.0.0.3/lucene-linguistics-bundle-0.0.3-deploy.jar --output lucene-linguistics-bundle-0.0.3-deploy.jar)
-```
+<picture>
+  <source media="(prefers-color-scheme: dark)" srcset="https://vespa.ai/assets/vespa-ai-logo-heather.svg">
+  <source media="(prefers-color-scheme: light)" srcset="https://vespa.ai/assets/vespa-ai-logo-rock.svg">
+  <img alt="#Vespa" width="200" src="https://vespa.ai/assets/vespa-ai-logo-rock.svg" style="margin-bottom: 25px;">
+</picture>
 
-Deploy the application package:
-```shell
-vespa deploy -w 100
-```
+# Vespa sample applications - Lucene Linguistics 
 
-Run a query:
-```shell
-vespa query 'query=Vespa' 'language=lt'
-```
+This app demonstrates using [Lucene Linguistics](https://docs.vespa.ai/en/lucene-linguistics.html).
 
-The logs should contain record:
-```text
-[2023-08-16 11:21:04.847] INFO    container        Container.com.yahoo.language.lucene.AnalyzerFactory	Analyzer for language=lt is from a list of default language analyzers.
-```
 
-Profit.
+<p data-test="run-macro init-deploy examples/lucene-linguistics/non-java">
+Requires at least Vespa 8.315.19
+</p>
+
+## To try this application
+
+Follow [Vespa getting started](https://cloud.vespa.ai/en/getting-started)
+through the <code>vespa deploy</code> step, cloning `examples/lucene-linguistics/non-java` instead of `album-recommendation`.
+
+Feed 3 sample documents in Norwegian, Swedish, and Finnish: 
+
+<pre data-test="exec">
+vespa feed ext/*.json
+</pre>
+
+Example queries:
+
+<pre data-test="exec" data-test-assert-contains="id:no:doc::1">
+vespa query 'yql=select * from doc where userQuery()'\
+ 'language=no' 'summary=debug-text-tokens' \
+ 'query=tips til utendørsaktiviteter'
+</pre>
+
+<pre data-test="exec" data-test-assert-contains="id:sv:doc::1">
+vespa query 'yql=select * from doc where userQuery()'\
+ 'language=sv' 'summary=debug-text-tokens' \
+ 'query=tips til utomhusaktiviteter'
+</pre>
+
+<pre data-test="exec" data-test-assert-contains="id:fi:doc::1">
+vespa query 'yql=select * from doc where userQuery()'\
+ 'language=fi' 'summary=debug-text-tokens' \
+ 'query=vinkkejä ulkoilma-aktiviteetteihin'
+</pre>
+
+### Terminate container 
+
+Remove the container after use (Only relevant for local deployments)
+<pre data-test="exec">
+$ docker rm -f vespa
+</pre>
 
-The jar is hosted on [Github](https://github.com/dainiusjocas/vespa-lucene-linguistics-bundle/releases).
diff --git a/examples/lucene-linguistics/non-java/ext/fi.json b/examples/lucene-linguistics/non-java/ext/fi.json
@@ -0,0 +1,7 @@
+{
+	"put": "id:fi:doc::1", 
+	"fields": {
+		"text": "Tervetuloa retkeilemään! Tässä oppaassa jaamme vinkkejä retkeilyreitin suunnitteluun ja valmistautumiseen. Olipa suunnitelmissasi päiväretki lähiluontoon tai pidempi vaellusreissu kansallispuistossa, löydät täältä tarvittavat tiedot ja neuvoja unohtumattoman retken järjestämiseksi.",
+		"language": "fi"
+	}
+}
diff --git a/examples/lucene-linguistics/non-java/ext/no.json b/examples/lucene-linguistics/non-java/ext/no.json
@@ -0,0 +1,7 @@
+{
+	"put": "id:no:doc::1", 
+	"fields": {
+		"text": "Velkommen til naturopplevelser! I denne guiden deler vi tips om planlegging og forberedelser til utendørsaktiviteter. Enten du planlegger en dagstur i nærområdet eller en lengre fjelltur i nasjonalparken, finner du her nødvendig informasjon og råd for å arrangere en minneverdig tur.",
+		"language": "no"
+	}
+}
diff --git a/examples/lucene-linguistics/non-java/ext/sv.json b/examples/lucene-linguistics/non-java/ext/sv.json
@@ -0,0 +1,7 @@
+{
+	"put": "id:sv:doc::1", 
+	"fields": {
+		"text": "Välkommen till naturäventyr! I den här guiden delar vi tips om planering och förberedelser inför utomhusaktiviteter. Oavsett om du planerar en dagsutflykt i närområdet eller en längre vandringsresa i nationalparken, hittar du här nödvändig information och råd för att arrangera en minnesvärd tur.",
+		"language": "sv"
+	}
+}
diff --git a/examples/lucene-linguistics/non-java/schemas/doc.sd b/examples/lucene-linguistics/non-java/schemas/doc.sd
@@ -0,0 +1,27 @@
+schema doc {
+
+    document doc {
+        field language type string {
+            indexing: set_language | summary | index
+            match: word
+        }
+        field text type string {
+            indexing: summary | index
+            index: enable-bm25
+        }
+    }
+
+    fieldset default {
+        fields: text
+    }
+    document-summary debug-text-tokens {
+        summary documentid {}
+        summary language {}
+        summary text {}
+        summary text_tokens {
+            source: text
+            tokens
+        }
+        from-disk
+    }
+}
diff --git a/examples/lucene-linguistics/non-java/schemas/lucene.sd b/examples/lucene-linguistics/non-java/schemas/lucene.sd
diff --git a/examples/lucene-linguistics/non-java/services.xml b/examples/lucene-linguistics/non-java/services.xml
@@ -3,7 +3,7 @@
   <container id="container" version="1.0">
     <component id="linguistics"
                class="com.yahoo.language.lucene.LuceneLinguistics"
-               bundle="lucene-linguistics-bundle">
+               bundle="lucene-linguistics">
       <config name="com.yahoo.language.lucene.lucene-analysis"/>
     </component>
     <document-processing/>
@@ -13,8 +13,7 @@
   <content id="content" version="1.0">
     <min-redundancy>1</min-redundancy>
     <documents>
-      <document type="lucene" mode="index"/>
-      <document-processing cluster="container"/>
+      <document type="doc" mode="index"/>
     </documents>
   </content>
 </services>
diff --git a/test/_test_config.yml b/test/_test_config.yml
@@ -10,6 +10,7 @@ urls:
     - colbert/README.md
     - text-image-search/README.md
     - text-search/README.md
+    - examples/lucene-linguistics/non-java//README.md
     - examples/document-processing/README.md
     - examples/predicate-fields/README.md
     - examples/operations/multinode/README.md
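
The example queries added to `examples/lucene-linguistics/non-java/README.md` in this commit pass four parameters to `vespa query`: `yql`, `language`, `summary`, and `query`. As a minimal sketch, the same parameters could be sent to the HTTP query API directly. The endpoint `http://localhost:8080/search/` assumes a local deployment on the default port, and the helper name `build_query_url` is hypothetical, not part of the sample app. The sketch only builds the URLs and does not contact a Vespa instance.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumption: a local Vespa container listening on the default query port.
ENDPOINT = "http://localhost:8080/search/"


def build_query_url(language: str, query: str, endpoint: str = ENDPOINT) -> str:
    """Build a GET URL with the same parameters as the README's `vespa query` calls.

    Hypothetical helper for illustration only.
    """
    params = {
        "yql": "select * from doc where userQuery()",
        "language": language,            # selects the linguistics used for the query
        "summary": "debug-text-tokens",  # document-summary defined in schemas/doc.sd
        "query": query,
    }
    # urlencode percent-encodes non-ASCII characters as UTF-8.
    return endpoint + "?" + urlencode(params)


if __name__ == "__main__":
    examples = [
        ("no", "tips til utendørsaktiviteter"),
        ("sv", "tips til utomhusaktiviteter"),
        ("fi", "vinkkejä ulkoilma-aktiviteetteihin"),
    ]
    for lang, text in examples:
        url = build_query_url(lang, text)
        # Round-trip check: decoding the URL gives back the original query text.
        decoded = parse_qs(urlsplit(url).query)["query"][0]
        print(lang, decoded == text, url)
```

Any HTTP client can fetch the resulting URLs. Against a deployment fed with the three documents under `ext/`, the README's test assertions expect the responses to contain `id:no:doc::1`, `id:sv:doc::1`, and `id:fi:doc::1` respectively.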