Merge pull request #1264 from dainiusjocas/demo-lucene-linguistics

Sample app with the LuceneLinguistics
vespa-engine · Sep 20, 2023 · c2b2b2d
2 parents c651e9d + 3d5d127
commit c2b2b2d

Showing 43 changed files with 1,054 additions and 0 deletions.
diff --git a/examples/README.md b/examples/README.md
@@ -43,6 +43,11 @@ Generic [request-response](https://docs.vespa.ai/en/jdisc/processing.html) proce
 <!-- ToDo: FIXME -->
 
 
+### Lucene Linguistics
+The [lucene-linguistics](lucene-linguistics) directory contains two sample application packages:
+1. A bare-minimum app.
+2. An app that shows advanced configuration of the Lucene-based `Linguistics` implementation.
+
 ----
 
 Note: Applications with _pom.xml_ are Java/Maven projects and must be built before being deployed.

diff --git a/examples/lucene-linguistics/README.md b/examples/lucene-linguistics/README.md
@@ -0,0 +1,195 @@
+<!-- Copyright Yahoo. Licensed under the terms of the Apache 2.0 license. See LICENSE in the project root. -->
+
+![Vespa logo](https://vespa.ai/assets/vespa-logo-color.png)
+
+# Vespa LuceneLinguistics Demos
+
+A couple of examples of how to get started with `lucene-linguistics`:
+
+- `non-java`: an absolute minimum to get started;
+- `minimal`: minimal Java based project using Lucene Linguistics;
+- `advanced-configuration`: demonstrates the configurability;
+- `going-crazy`: demonstrates the advanced setup;
+
+## Getting started
+
+The procedure is the same for all application packages:
+go to the application package directory and run the following commands:
+
+```shell
+# Make sure that the Docker daemon is running
+# Make sure that the Vespa CLI is installed
+brew install vespa-cli
+# Maven must be 3.6+
+brew install maven
+
+docker run --rm --detach \
+  --name vespa \
+  --hostname vespa-container \
+  --publish 8080:8080 \
+  --publish 19071:19071 \
+  --publish 19050:19050 \
+  vespaengine/vespa:8.224.19
+
+# To observe the LuceneLinguistics logs, run this in a separate terminal
+docker logs vespa -f | grep -i "lucene"
+
+vespa status deploy --wait 300
+
+(mvn clean package && vespa deploy -w 100)
+
+vespa feed src/main/application/ext/document.json
+vespa query 'yql=select * from lucene where default contains "dogs"' \
+  'model.locale=en'
+
+# After this query, a log entry like this should appear:
+# [2023-08-02 19:57:12.106] INFO    container        Container.com.yahoo.language.lucene.AnalyzerFactory	Analyzer for language=en is from a list of default language analyzers.
+```
+
+The query should return:
+```json
+{
+  "root": {
+    "id": "toplevel",
+    "relevance": 1.0,
+    "fields": {
+      "totalCount": 1
+    },
+    "coverage": {
+      "coverage": 100,
+      "documents": 1,
+      "full": true,
+      "nodes": 1,
+      "results": 1,
+      "resultsFull": 1
+    },
+    "children": [
+      {
+        "id": "id:mynamespace:lucene::mydocid",
+        "relevance": 0.16343879032006287,
+        "source": "content",
+        "fields": {
+          "sddocname": "lucene",
+          "documentid": "id:mynamespace:lucene::mydocid",
+          "mytext": "Cats and Dogs"
+        }
+      }
+    ]
+  }
+}
+```
+
+### Observing query rewrites
+
+```shell
+vespa query 'yql=select * from lucene where default contains "dogs"' \
+  'model.locale=en' \
+  'trace.level=2' | jq '.trace.children | last | .children[] | select(.message) | select(.message | test("YQL.*")) | .message'
+```
+Output:
+```shell
+"YQL+ query parsed: [select * from lucene where default contains \"dog\" timeout 10000]"
+```
+Note that `dogs` was rewritten as `dog`.
+
+Change `model.locale` to another language, change the query, and observe how the analysis differs.
+
+### Observing the indexed tokens
+
+It is possible to explore the tokens directly in the index.
+To do that you can run these commands **inside** the running Vespa Docker container.
+
+```shell
+# Enter the Vespa Docker container
+docker exec -it vespa bash
+# Trigger a flush to disk
+vespa-proton-cmd --local triggerFlush
+
+# Show the posting lists
+vespa-index-inspect showpostings \
+          --indexdir  /opt/vespa/var/db/vespa/search/cluster.content/n0/documents/lucene/0.ready/index/$(ls /opt/vespa/var/db/vespa/search/cluster.content/n0/documents/lucene/0.ready/index/)/ \
+          --field mytext --transpose
+# =>
+# docId = 1
+# field = 0 "mytext"
+#  element = 0, elementLen = 2, elementWeight = 1
+#   pos = 0, word = "cat"
+#   pos = 1, word = "dog"
+
+# Show the tokens
+vespa-index-inspect dumpwords \
+          --indexdir  /opt/vespa/var/db/vespa/search/cluster.content/n0/documents/lucene/0.ready/index/$(ls /opt/vespa/var/db/vespa/search/cluster.content/n0/documents/lucene/0.ready/index/)/ \
+          --wordnum \
+          --field mytext
+# =>
+# 1	cat	1
+# 2	dog	1
+```
+
+Have fun!
+
+## Common Issues
+
+The `lucene-linguistics` component is highly configurable.
+It has an optional `configDir` configuration parameter of type `path`.
+`configDir` is a directory for linguistics resources, e.g. stopword dictionaries, and is given relative to the root directory of the Vespa application package (VAP).
+
+There are several known problems that might happen when `configDir` is misconfigured.
+
+### `configDir` is specified but doesn't exist
+
+If `configDir` doesn't exist, `vespa deploy` fails with an error like this:
+
+```shell
+Uploading application package ... failed
+Error: invalid application package (400 Bad Request)
+Invalid application:
+Unable to send file specified in com.yahoo.language.lucene.lucene-analysis:
+/opt/vespa/var/db/vespa/config_server/serverdb/tenants/default/sessions/4/lucene (No such file or directory)
+```
+
+### An empty directory can't be referenced
+
+If `configDir` is set to `foo` and that directory exists but is empty, deployment fails with a misleading error message:
+```shell
+Uploading application package ... failed
+Error: invalid application package (400 Bad Request)
+Invalid application:
+Unable to send file specified in com.yahoo.language.lucene.lucene-analysis:
+/opt/vespa/var/db/vespa/config_server/serverdb/tenants/default/sessions/8/foo (No such file or directory)
+```
+
+### Application package root cannot be used as `configDir`
+
+If you set `<configDir>.</configDir>`, the application package deploys(!) but the services
+do not converge, and `vespa deploy` ends with the following error:
+```shell
+Uploading application package ... done
+
+Success: Deployed target/application.zip
+WARNING Jar file 'vespa-lucene-linguistics-poc-0.0.1-deploy.jar' uses non-public Vespa APIs: [com.yahoo.language.simple]
+
+Waiting up to 1m40s for query service to become available ...
+Error: service 'query' is unavailable: services have not converged
+```
+
+The Vespa logs then fill up with warnings like these:
+```shell
+[2023-08-02 20:30:47.675] WARNING configproxy      stderr	Exception in thread "Rpc executorpool-6-thread-5" java.lang.RuntimeException: More than one file reference found for file 'fbcf5c3dc81d9540'
+[2023-08-02 20:30:47.675] WARNING configproxy      stderr	\tat com.yahoo.vespa.filedistribution.FileDownloader.getFileFromFileSystem(FileDownloader.java:109)
+[2023-08-02 20:30:47.675] WARNING configproxy      stderr	\tat com.yahoo.vespa.filedistribution.FileDownloader.getFileFromFileSystem(FileDownloader.java:100)
+[2023-08-02 20:30:47.675] WARNING configproxy      stderr	\tat com.yahoo.vespa.filedistribution.FileDownloader.getFutureFile(FileDownloader.java:80)
+[2023-08-02 20:30:47.675] WARNING configproxy      stderr	\tat com.yahoo.vespa.filedistribution.FileDownloader.getFile(FileDownloader.java:70)
+[2023-08-02 20:30:47.675] WARNING configproxy      stderr	\tat com.yahoo.vespa.config.proxy.filedistribution.FileDistributionRpcServer.downloadFile(FileDistributionRpcServer.java:109)
+[2023-08-02 20:30:47.675] WARNING configproxy      stderr	\tat com.yahoo.vespa.config.proxy.filedistribution.FileDistributionRpcServer.lambda$getFile$0(FileDistributionRpcServer.java:84)
+[2023-08-02 20:30:47.675] WARNING configproxy      stderr	\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
+[2023-08-02 20:30:47.675] WARNING configproxy      stderr	\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
+[2023-08-02 20:30:47.675] WARNING configproxy      stderr	\tat java.base/java.lang.Thread.run(Thread.java:833)
+```
+
+### Harmless warning
+`vespa deploy` always warns with:
+```shell
+WARNING Jar file 'vespa-lucene-linguistics-poc-0.0.1-deploy.jar' uses non-public Vespa APIs: [com.yahoo.language.simple]
+```
+You can ignore this warning.
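The three `configDir` pitfalls described under Common Issues can be caught before running `vespa deploy`. The following is a minimal sketch in plain Java and is not part of Vespa or the Vespa CLI: the class `ConfigDirCheck`, its `Result` values, and the default arguments `src/main/application` and `lucene` are assumptions made for illustration. It resolves `configDir` against the application package root and reports whether the directory is the root itself, missing, or empty.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

// Hypothetical pre-deploy check; not part of Vespa or the Vespa CLI.
final class ConfigDirCheck {

    enum Result { OK, APP_ROOT, MISSING, EMPTY }

    // appRoot: application package root; configDir: the <configDir> value, relative to appRoot.
    static Result check(Path appRoot, String configDir) {
        Path root = appRoot.toAbsolutePath().normalize();
        Path dir = root.resolve(configDir).normalize();
        if (dir.equals(root)) {
            return Result.APP_ROOT;   // deploys, but services do not converge
        }
        if (!Files.isDirectory(dir)) {
            return Result.MISSING;    // deploy fails with "No such file or directory"
        }
        try (Stream<Path> entries = Files.list(dir)) {
            // An empty directory is rejected with the same message as a missing one.
            return entries.findAny().isPresent() ? Result.OK : Result.EMPTY;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Defaults are assumptions; pass your own root and configDir as arguments.
        Path appRoot = Path.of(args.length > 0 ? args[0] : "src/main/application");
        String configDir = args.length > 1 ? args[1] : "lucene";
        System.out.println(configDir + ": " + check(appRoot, configDir));
    }
}
```

With Java 11 or newer this can be run directly as `java ConfigDirCheck.java <application-root> <configDir>`.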
diff --git a/examples/lucene-linguistics/advanced-configuration/README.md b/examples/lucene-linguistics/advanced-configuration/README.md
@@ -0,0 +1,66 @@
+# Vespa Lucene Linguistics
+
+This Vespa application package (VAP) demonstrates the configuration options of the `lucene-linguistics` package.
+The main benefit of `LuceneLinguistics` over other `Linguistics` implementations is probably its configurability.
+
+## Custom Lucene Analyzers
+
+There are multiple ways to use a Lucene `Analyzer` for a language.
+Each analyzer is identified by a language key, e.g. `en` for English.
+The analyzer types, in descending order of priority, are:
+1. Created through the `Linguistics` component configuration.
+2. An `Analyzer` wrapped into a Vespa `<component>`.
+3. A list of [default Analyzers](https://github.com/vespa-engine/vespa/blob/5d26801bc63c35705e708d3cc7086f0b0103e909/lucene-linguistics/src/main/java/com/yahoo/language/lucene/DefaultAnalyzers.java) per language.
+4. The `StandardAnalyzer`.
+
+### Add a Lucene Analyzer component
+
+Vespa provides a `ComponentRegistry` mechanism.
+`LuceneLinguistics` accepts a `ComponentRegistry<Analyzer>` in its constructor.
+At startup, the Vespa container automatically collects all components of the `Analyzer` type.
+
+To declare such components:
+```xml
+<component id="en"
+           class="org.apache.lucene.analysis.core.SimpleAnalyzer"
+           bundle="vespa-lucene-linguistics-poc" />
+```
+Where:
+- `id` should contain a language code.
+- `class` should be the implementing class.
+  Note that this one comes straight from the Lucene library.
+  You can also create an `Analyzer` class inside your VAP and refer to it.
+- `bundle` must be your application package `artifactId` as specified in the `pom.xml`.
+
+There are two types of `Analyzer` components:
+1. Those that don't require any setup.
+2. Those that require setup (e.g. a constructor with arguments).
+
+The previous component declaration is an example of type (1).
+
+Type (2) requires a bit more work.
+
+Create a class (e.g. for the Polish language):
+```java
+package ai.vespa.linguistics.pl;
+
+import com.yahoo.container.di.componentgraph.Provider;
+import org.apache.lucene.analysis.Analyzer;
+
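+// Wraps the Lucene PolishAnalyzer so that it can be declared as a component.
+public class PolishAnalyzer implements Provider<Analyzer> {
+    @Override
+    public Analyzer get() {
+        // Fully qualified because this class shares its simple name with the Lucene class.
+        return new org.apache.lucene.analysis.pl.PolishAnalyzer();
+    }
+    @Override
+    public void deconstruct() {} // Nothing to clean up.
+}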
+```
+
+Add a component declaration to the `services.xml` file:
+```xml
+<component id="pl"
+           class="ai.vespa.linguistics.pl.PolishAnalyzer"
+           bundle="vespa-lucene-linguistics-poc" />
+```
+Polish language handling is now available.
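The priority order listed under Custom Lucene Analyzers can be pictured as a chain of lookups keyed by language code. The sketch below is an illustration only and is not the actual `LuceneLinguistics` code: the class `AnalyzerLookup` is hypothetical, plain strings stand in for `Analyzer` instances, and the map contents in `main` are made up.

```java
import java.util.Map;

// Hypothetical illustration; strings stand in for Lucene Analyzer instances.
final class AnalyzerLookup {
    private final Map<String, String> configured; // 1. Linguistics component configuration
    private final Map<String, String> components; // 2. Analyzer components from services.xml
    private final Map<String, String> defaults;   // 3. default analyzers per language

    AnalyzerLookup(Map<String, String> configured,
                   Map<String, String> components,
                   Map<String, String> defaults) {
        this.configured = configured;
        this.components = components;
        this.defaults = defaults;
    }

    // Returns the first match in descending priority for the given language key.
    String resolve(String languageKey) {
        if (configured.containsKey(languageKey)) {
            return configured.get(languageKey);
        }
        if (components.containsKey(languageKey)) {
            return components.get(languageKey);
        }
        if (defaults.containsKey(languageKey)) {
            return defaults.get(languageKey);
        }
        return "StandardAnalyzer"; // 4. fallback when nothing else matches
    }

    public static void main(String[] args) {
        // Made-up registrations, only to exercise each priority level.
        AnalyzerLookup lookup = new AnalyzerLookup(
                Map.of(),
                Map.of("en", "SimpleAnalyzer", "pl", "PolishAnalyzer"),
                Map.of("en", "EnglishAnalyzer", "de", "GermanAnalyzer"));
        for (String language : new String[] {"en", "pl", "de", "xx"}) {
            System.out.println(language + " -> " + lookup.resolve(language));
        }
    }
}
```

The point of the ordering is that a more specific registration shadows a more general one: a component declared with `id="en"` wins over the default English analyzer, and an unknown language key still gets an analyzer.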
diff --git a/examples/lucene-linguistics/advanced-configuration/pom.xml b/examples/lucene-linguistics/advanced-configuration/pom.xml
@@ -0,0 +1,85 @@
+<?xml version="1.0"?>
+<!-- Copyright Yahoo. Licensed under the terms of the Apache 2.0 license. See LICENSE in the project root. -->
+<project xmlns="http://maven.apache.org/POM/4.0.0"
+         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
+         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
+                             http://maven.apache.org/xsd/maven-4.0.0.xsd">
+  <modelVersion>4.0.0</modelVersion>
+  <groupId>ai.vespa</groupId>
+  <artifactId>vespa-lucene-linguistics-poc</artifactId>
+  <version>0.0.2</version>
+  <packaging>container-plugin</packaging>
+  <properties>
+    <bundle-plugin.failOnWarnings>false</bundle-plugin.failOnWarnings>
+    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
+    <test.hide>true</test.hide>
+    <vespa.version>8.227.41</vespa.version>
+    <junit.version>5.7.1</junit.version>
+  </properties>
+
+  <dependencies>
+    <dependency>
+      <groupId>org.apache.lucene</groupId>
+      <artifactId>lucene-analysis-stempel</artifactId>
+      <version>9.7.0</version>
+    </dependency>
+    <dependency>
+      <groupId>com.yahoo.vespa</groupId>
+      <artifactId>lucene-linguistics</artifactId>
+      <version>${vespa.version}</version>
+    </dependency>
+    <dependency>
+      <groupId>com.yahoo.vespa</groupId>
+      <artifactId>linguistics</artifactId>
+      <version>${vespa.version}</version>
+      <scope>provided</scope>
+    </dependency>
+    <dependency>
+      <groupId>com.yahoo.vespa</groupId>
+      <artifactId>application</artifactId>
+      <version>${vespa.version}</version>
+      <scope>provided</scope>
+    </dependency>
+    <dependency>
+      <groupId>org.junit.jupiter</groupId>
+      <artifactId>junit-jupiter</artifactId>
+      <version>${junit.version}</version>
+      <scope>test</scope>
+    </dependency>
+  </dependencies>
+
+  <build>
+    <plugins>
+      <plugin>
+        <groupId>com.yahoo.vespa</groupId>
+        <artifactId>bundle-plugin</artifactId>
+        <version>${vespa.version}</version>
+        <extensions>true</extensions>
+        <configuration>
+          <failOnWarnings>false</failOnWarnings>
+        </configuration>
+      </plugin>
+      <plugin>
+        <groupId>com.yahoo.vespa</groupId>
+        <artifactId>vespa-application-maven-plugin</artifactId>
+        <version>${vespa.version}</version>
+        <executions>
+          <execution>
+            <goals>
+              <goal>packageApplication</goal>
+            </goals>
+          </execution>
+        </executions>
+      </plugin>
+      <plugin>
+        <groupId>org.apache.maven.plugins</groupId>
+        <artifactId>maven-compiler-plugin</artifactId>
+        <version>3.11.0</version>
+        <configuration>
+          <source>17</source>
+          <target>17</target>
+        </configuration>
+      </plugin>
+    </plugins>
+  </build>
+</project>
diff --git a/examples/lucene-linguistics/advanced-configuration/src/main/application/ext/document.json b/examples/lucene-linguistics/advanced-configuration/src/main/application/ext/document.json
@@ -0,0 +1,7 @@
+{
+  "put": "id:mynamespace:lucene::mydocid",
+  "fields": {
+    "language": "en",
+    "mytext": "Cats and Dogs"
+  }
+}
diff --git a/examples/lucene-linguistics/advanced-configuration/src/main/application/schemas/lucene.sd b/examples/lucene-linguistics/advanced-configuration/src/main/application/schemas/lucene.sd
@@ -0,0 +1,15 @@
+schema lucene {
+
+    document lucene {
+        field language type string {
+            indexing: set_language
+        }
+        field mytext type string {
+            indexing: summary | index
+        }
+    }
+
+    fieldset default {
+        fields: mytext
+    }
+}