Merge pull request #1264 from dainiusjocas/demo-lucene-linguistics
Sample app with the LuceneLinguistics
Showing 43 changed files with 1,054 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,195 @@ | ||
<!-- Copyright Yahoo. Licensed under the terms of the Apache 2.0 license. See LICENSE in the project root. --> | ||
|
||
![Vespa logo](https://vespa.ai/assets/vespa-logo-color.png) | ||
|
||
# Vespa LuceneLinguistics Demos | ||
|
||
A couple of examples of how to get started with `lucene-linguistics`: | ||
|
||
- `non-java`: an absolute minimum to get started; | ||
- `minimal`: minimal Java based project using Lucene Linguistics; | ||
- `advanced-configuration`: demonstrates the configurability; | ||
- `going-crazy`: demonstrates the advanced setup; | ||
|
||
## Getting started | ||
|
||
The procedure is the same for all application packages: | ||
go to the application package directory and run the following commands: | ||
|
||
```shell | ||
# Make sure that your Docker daemon is running | ||
# Make sure that the Vespa CLI is installed | ||
brew install vespa-cli | ||
# Maven must be 3.6+ | ||
brew install maven | ||
|
||
docker run --rm --detach \ | ||
--name vespa \ | ||
--hostname vespa-container \ | ||
--publish 8080:8080 \ | ||
--publish 19071:19071 \ | ||
--publish 19050:19050 \ | ||
vespaengine/vespa:8.224.19 | ||
|
||
# To observe the logs from LuceneLinguistics, run this in a separate terminal | ||
docker logs vespa -f | grep -i "lucene" | ||
|
||
vespa status deploy --wait 300 | ||
|
||
(mvn clean package && vespa deploy -w 100) | ||
|
||
vespa feed src/main/application/ext/document.json | ||
vespa query 'yql=select * from lucene where default contains "dogs"' \ | ||
'model.locale=en' | ||
|
||
# After this query, a log entry like this should appear: | ||
# [2023-08-02 19:57:12.106] INFO container Container.com.yahoo.language.lucene.AnalyzerFactory Analyzer for language=en is from a list of default language analyzers. | ||
``` | ||
|
||
The query should return: | ||
```json | ||
{ | ||
"root": { | ||
"id": "toplevel", | ||
"relevance": 1.0, | ||
"fields": { | ||
"totalCount": 1 | ||
}, | ||
"coverage": { | ||
"coverage": 100, | ||
"documents": 1, | ||
"full": true, | ||
"nodes": 1, | ||
"results": 1, | ||
"resultsFull": 1 | ||
}, | ||
"children": [ | ||
{ | ||
"id": "id:mynamespace:lucene::mydocid", | ||
"relevance": 0.16343879032006287, | ||
"source": "content", | ||
"fields": { | ||
"sddocname": "lucene", | ||
"documentid": "id:mynamespace:lucene::mydocid", | ||
"mytext": "Cats and Dogs" | ||
} | ||
} | ||
] | ||
} | ||
} | ||
``` | ||
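The `vespa query` command above sends its parameters to the container's HTTP query API on the published port 8080. As a rough sketch, assuming the default `/search/` endpoint on `localhost` (the class and method names are hypothetical and not part of Vespa), the equivalent request URL could be built in plain Java like this:

```java
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper, for illustration only: builds a query URL for the
// Vespa container started above.
final class VespaQueryUrl {

    // Joins the URL-encoded parameters and appends them to <baseUrl>/search/.
    static URI build(String baseUrl, Map<String, String> params) {
        String query = params.entrySet().stream()
                .map(e -> encode(e.getKey()) + "=" + encode(e.getValue()))
                .collect(Collectors.joining("&"));
        return URI.create(baseUrl + "/search/?" + query);
    }

    private static String encode(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the parameters in insertion order.
        Map<String, String> params = new LinkedHashMap<>();
        params.put("yql", "select * from lucene where default contains \"dogs\"");
        params.put("model.locale", "en");
        // Assumption: the container is reachable on localhost:8080.
        System.out.println(build("http://localhost:8080", params));
    }
}
```

The resulting URI can be passed to any HTTP client, for example `java.net.http.HttpClient`, to fetch the same JSON response.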
|
||
### Observing query rewrites | ||
|
||
```shell | ||
vespa query 'yql=select * from lucene where default contains "dogs"' \ | ||
'model.locale=en' \ | ||
'trace.level=2' | jq '.trace.children | last | .children[] | select(.message) | select(.message | test("YQL.*")) | .message' | ||
``` | ||
Output: | ||
```shell | ||
"YQL+ query parsed: [select * from lucene where default contains \"dog\" timeout 10000]" | ||
``` | ||
Note that `dogs` is rewritten as `dog`. | ||
|
||
Change `model.locale` to another language, change the query, and observe how the analysis differs. | ||
|
||
### Observing the indexed tokens | ||
|
||
It is possible to explore the tokens directly in the index. | ||
To do that you can run these commands **inside** the running Vespa Docker container. | ||
|
||
```shell | ||
# Enter the Vespa Docker container | ||
docker exec -it vespa bash | ||
# Trigger a flush to disk | ||
vespa-proton-cmd --local triggerFlush | ||
|
||
# Show the posting lists | ||
vespa-index-inspect showpostings \ | ||
--indexdir /opt/vespa/var/db/vespa/search/cluster.content/n0/documents/lucene/0.ready/index/$(ls /opt/vespa/var/db/vespa/search/cluster.content/n0/documents/lucene/0.ready/index/)/ \ | ||
--field mytext --transpose | ||
# => | ||
# docId = 1 | ||
# field = 0 "mytext" | ||
# element = 0, elementLen = 2, elementWeight = 1 | ||
# pos = 0, word = "cat" | ||
# pos = 1, word = "dog" | ||
|
||
# Show the tokens | ||
vespa-index-inspect dumpwords \ | ||
--indexdir /opt/vespa/var/db/vespa/search/cluster.content/n0/documents/lucene/0.ready/index/$(ls /opt/vespa/var/db/vespa/search/cluster.content/n0/documents/lucene/0.ready/index/)/ \ | ||
--wordnum \ | ||
--field mytext | ||
# => | ||
# 1 cat 1 | ||
# 2 dog 1 | ||
``` | ||
|
||
Have fun! | ||
|
||
## Common Issues | ||
|
||
The `lucene-linguistics` component is highly configurable. | ||
It has an optional `configDir` configuration parameter of type `path`. | ||
`configDir` is a directory that stores linguistics resources, e.g. stopword dictionaries, and is relative to the root directory of the Vespa application package (VAP). | ||
|
||
There are several known problems that might happen when `configDir` is misconfigured. | ||
|
||
### `configDir` is specified but doesn't exist | ||
|
||
If `configDir` doesn't exist, `vespa deploy` fails with an error like this: | ||
|
||
```shell | ||
Uploading application package ... failed | ||
Error: invalid application package (400 Bad Request) | ||
Invalid application: | ||
Unable to send file specified in com.yahoo.language.lucene.lucene-analysis: | ||
/opt/vespa/var/db/vespa/config_server/serverdb/tenants/default/sessions/4/lucene (No such file or directory) | ||
``` | ||
|
||
### An empty directory cannot be used as `configDir` | ||
|
||
If `configDir` is set to `foo`, a directory that exists but is empty, deployment fails with a misleading error message: | ||
```shell | ||
Uploading application package ... failed | ||
Error: invalid application package (400 Bad Request) | ||
Invalid application: | ||
Unable to send file specified in com.yahoo.language.lucene.lucene-analysis: | ||
/opt/vespa/var/db/vespa/config_server/serverdb/tenants/default/sessions/8/foo (No such file or directory) | ||
``` | ||
|
||
### Application package root cannot be used as `configDir` | ||
|
||
If you try to be clever and set `<configDir>.</configDir>`, the application package is deployed(!) but | ||
does not converge, and the following error appears: | ||
```shell | ||
Uploading application package ... done | ||
|
||
Success: Deployed target/application.zip | ||
WARNING Jar file 'vespa-lucene-linguistics-poc-0.0.1-deploy.jar' uses non-public Vespa APIs: [com.yahoo.language.simple] | ||
|
||
Waiting up to 1m40s for query service to become available ... | ||
Error: service 'query' is unavailable: services have not converged | ||
``` | ||
|
||
The Vespa logs then fill up with warnings like these: | ||
```shell | ||
[2023-08-02 20:30:47.675] WARNING configproxy stderr Exception in thread "Rpc executorpool-6-thread-5" java.lang.RuntimeException: More than one file reference found for file 'fbcf5c3dc81d9540' | ||
[2023-08-02 20:30:47.675] WARNING configproxy stderr \tat com.yahoo.vespa.filedistribution.FileDownloader.getFileFromFileSystem(FileDownloader.java:109) | ||
[2023-08-02 20:30:47.675] WARNING configproxy stderr \tat com.yahoo.vespa.filedistribution.FileDownloader.getFileFromFileSystem(FileDownloader.java:100) | ||
[2023-08-02 20:30:47.675] WARNING configproxy stderr \tat com.yahoo.vespa.filedistribution.FileDownloader.getFutureFile(FileDownloader.java:80) | ||
[2023-08-02 20:30:47.675] WARNING configproxy stderr \tat com.yahoo.vespa.filedistribution.FileDownloader.getFile(FileDownloader.java:70) | ||
[2023-08-02 20:30:47.675] WARNING configproxy stderr \tat com.yahoo.vespa.config.proxy.filedistribution.FileDistributionRpcServer.downloadFile(FileDistributionRpcServer.java:109) | ||
[2023-08-02 20:30:47.675] WARNING configproxy stderr \tat com.yahoo.vespa.config.proxy.filedistribution.FileDistributionRpcServer.lambda$getFile$0(FileDistributionRpcServer.java:84) | ||
[2023-08-02 20:30:47.675] WARNING configproxy stderr \tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) | ||
[2023-08-02 20:30:47.675] WARNING configproxy stderr \tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) | ||
[2023-08-02 20:30:47.675] WARNING configproxy stderr \tat java.base/java.lang.Thread.run(Thread.java:833) | ||
``` | ||
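The three `configDir` problems above can be caught locally before running `vespa deploy`. The following is a minimal sketch in plain Java; the class name, method name, and messages are hypothetical, and it assumes you pass the application package root (for the Maven-based examples, probably `src/main/application`) together with the `configDir` value from your configuration:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;
import java.util.stream.Stream;

// Hypothetical pre-deployment check, not part of Vespa or the Vespa CLI.
final class ConfigDirCheck {

    // Returns a description of the problem, or empty if configDir looks usable.
    static Optional<String> problem(Path appRoot, String configDir) throws IOException {
        Path root = appRoot.toAbsolutePath().normalize();
        Path dir = root.resolve(configDir).normalize();

        // Case: <configDir>.</configDir> points at the package root itself.
        if (dir.equals(root)) {
            return Optional.of("configDir must not be the application package root");
        }
        // Case: the directory is missing (or is not a directory).
        if (!Files.isDirectory(dir)) {
            return Optional.of("configDir does not exist: " + dir);
        }
        // Case: the directory exists but contains nothing.
        try (Stream<Path> entries = Files.list(dir)) {
            if (entries.findAny().isEmpty()) {
                return Optional.of("configDir is empty: " + dir);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) throws IOException {
        // Usage: java ConfigDirCheck.java <application-root> <configDir>
        Optional<String> result = problem(Path.of(args[0]), args[1]);
        System.out.println(result.orElse("configDir looks fine"));
    }
}
```

The check only mirrors the failure modes listed in this section; it does not validate the contents of the directory.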
|
||
### Harmless warning | ||
`vespa deploy` always warns with: | ||
```shell | ||
WARNING Jar file 'vespa-lucene-linguistics-poc-0.0.1-deploy.jar' uses non-public Vespa APIs: [com.yahoo.language.simple] | ||
``` | ||
You can ignore this warning. |
examples/lucene-linguistics/advanced-configuration/README.md (66 additions, 0 deletions)
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,66 @@ | ||
# Vespa Lucene Linguistics | ||
|
||
This Vespa application package (VAP) previews the configuration options of the `lucene-linguistics` package. | ||
Configurability is probably the main benefit of `LuceneLinguistics` compared to other `Linguistics` implementations. | ||
|
||
## Custom Lucene Analyzers | ||
|
||
There are multiple ways to provide a Lucene `Analyzer` for a language. | ||
Each analyzer is identified by a language key, e.g. `en` for English. | ||
An analyzer is picked from the following sources, in order of descending priority: | ||
1. Created through the `Linguistics` component configuration. | ||
2. An `Analyzer` wrapped into a Vespa `<component>`. | ||
3. A list of [default Analyzers](https://github.com/vespa-engine/vespa/blob/5d26801bc63c35705e708d3cc7086f0b0103e909/lucene-linguistics/src/main/java/com/yahoo/language/lucene/DefaultAnalyzers.java) per language. | ||
4. The `StandardAnalyzer`. | ||
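To make the priority order concrete, here is a small sketch in plain Java that models the lookup with ordinary maps. It is not the actual Vespa implementation: the class name, field names, and the use of strings in place of real `Analyzer` instances are all assumptions made for illustration.

```java
import java.util.Map;

// Hypothetical model of the lookup order described above.
// Strings stand in for Analyzer instances to keep the sketch dependency-free.
final class AnalyzerLookup {

    static final String FALLBACK = "StandardAnalyzer";

    private final Map<String, String> configured; // 1. from the Linguistics component configuration
    private final Map<String, String> components; // 2. Analyzers wrapped into Vespa components
    private final Map<String, String> defaults;   // 3. default analyzers per language

    AnalyzerLookup(Map<String, String> configured,
                   Map<String, String> components,
                   Map<String, String> defaults) {
        this.configured = configured;
        this.components = components;
        this.defaults = defaults;
    }

    // Returns the first match for the language key, walking the sources by priority.
    String resolve(String languageKey) {
        if (configured.containsKey(languageKey)) return configured.get(languageKey);
        if (components.containsKey(languageKey)) return components.get(languageKey);
        if (defaults.containsKey(languageKey)) return defaults.get(languageKey);
        return FALLBACK; // 4. nothing registered for this language
    }

    public static void main(String[] args) {
        AnalyzerLookup lookup = new AnalyzerLookup(
                Map.of("en", "configured-en"),
                Map.of("en", "component-en", "pl", "component-pl"),
                Map.of("en", "default-en", "de", "default-de"));
        for (String key : new String[] {"en", "pl", "de", "xx"}) {
            System.out.println(key + " -> " + lookup.resolve(key));
        }
    }
}
```

In this model a configured analyzer for `en` shadows both a component and a default analyzer registered under the same key, which is the behaviour the list above describes.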
|
||
### Add a Lucene Analyzer component | ||
|
||
Vespa provides a `ComponentRegistry` mechanism. | ||
`LuceneLinguistics` accepts a `ComponentRegistry<Analyzer>` in its constructor. | ||
At start time, the Vespa container automatically collects all components of the `Analyzer` type and passes them in. | ||
|
||
To declare such components: | ||
```xml | ||
<component id="en" | ||
class="org.apache.lucene.analysis.core.SimpleAnalyzer" | ||
bundle="vespa-lucene-linguistics-poc" /> | ||
``` | ||
Where: | ||
- `id` should contain a language code. | ||
- `class` should be the implementing class. | ||
Note that it is a class straight from the Lucene library. | ||
You can also create an `Analyzer` class inside your VAP and refer to it. | ||
- `bundle` must be your application package `artifactId` as specified in the `pom.xml`. | ||
|
||
There are two types of `Analyzer` components: | ||
1. Those that don't require any setup. | ||
2. Those that require setup (e.g. a constructor with arguments). | ||
|
||
The previous component declaration example is of type (1). | ||
|
||
The (2) type requires a bit more work. | ||
|
||
Create a class (e.g. for the Polish language): | ||
```java | ||
package ai.vespa.linguistics.pl; | ||
|
||
import com.yahoo.container.di.componentgraph.Provider; | ||
import org.apache.lucene.analysis.Analyzer; | ||
|
||
public class PolishAnalyzer implements Provider<Analyzer> { | ||
@Override | ||
public Analyzer get() { | ||
return new org.apache.lucene.analysis.pl.PolishAnalyzer(); | ||
} | ||
@Override | ||
public void deconstruct() {} | ||
} | ||
``` | ||
|
||
Add a component declaration into the `services.xml` file: | ||
```xml | ||
<component id="pl" | ||
class="ai.vespa.linguistics.pl.PolishAnalyzer" | ||
bundle="vespa-lucene-linguistics-poc" /> | ||
``` | ||
And now you have the handling of the Polish language available. |
examples/lucene-linguistics/advanced-configuration/pom.xml (85 additions, 0 deletions)
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,85 @@ | ||
<?xml version="1.0"?> | ||
<!-- Copyright Yahoo. Licensed under the terms of the Apache 2.0 license. See LICENSE in the project root. --> | ||
<project xmlns="http://maven.apache.org/POM/4.0.0" | ||
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" | ||
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 | ||
http://maven.apache.org/xsd/maven-4.0.0.xsd"> | ||
<modelVersion>4.0.0</modelVersion> | ||
<groupId>ai.vespa</groupId> | ||
<artifactId>vespa-lucene-linguistics-poc</artifactId> | ||
<version>0.0.2</version> | ||
<packaging>container-plugin</packaging> | ||
<properties> | ||
<bundle-plugin.failOnWarnings>false</bundle-plugin.failOnWarnings> | ||
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding> | ||
<test.hide>true</test.hide> | ||
<vespa.version>8.227.41</vespa.version> | ||
<junit.version>5.7.1</junit.version> | ||
</properties> | ||
|
||
<dependencies> | ||
<dependency> | ||
<groupId>org.apache.lucene</groupId> | ||
<artifactId>lucene-analysis-stempel</artifactId> | ||
<version>9.7.0</version> | ||
</dependency> | ||
<dependency> | ||
<groupId>com.yahoo.vespa</groupId> | ||
<artifactId>lucene-linguistics</artifactId> | ||
<version>${vespa.version}</version> | ||
</dependency> | ||
<dependency> | ||
<groupId>com.yahoo.vespa</groupId> | ||
<artifactId>linguistics</artifactId> | ||
<version>${vespa.version}</version> | ||
<scope>provided</scope> | ||
</dependency> | ||
<dependency> | ||
<groupId>com.yahoo.vespa</groupId> | ||
<artifactId>application</artifactId> | ||
<version>${vespa.version}</version> | ||
<scope>provided</scope> | ||
</dependency> | ||
<dependency> | ||
<groupId>org.junit.jupiter</groupId> | ||
<artifactId>junit-jupiter</artifactId> | ||
<version>${junit.version}</version> | ||
<scope>test</scope> | ||
</dependency> | ||
</dependencies> | ||
|
||
<build> | ||
<plugins> | ||
<plugin> | ||
<groupId>com.yahoo.vespa</groupId> | ||
<artifactId>bundle-plugin</artifactId> | ||
<version>${vespa.version}</version> | ||
<extensions>true</extensions> | ||
<configuration> | ||
<failOnWarnings>false</failOnWarnings> | ||
</configuration> | ||
</plugin> | ||
<plugin> | ||
<groupId>com.yahoo.vespa</groupId> | ||
<artifactId>vespa-application-maven-plugin</artifactId> | ||
<version>${vespa.version}</version> | ||
<executions> | ||
<execution> | ||
<goals> | ||
<goal>packageApplication</goal> | ||
</goals> | ||
</execution> | ||
</executions> | ||
</plugin> | ||
<plugin> | ||
<groupId>org.apache.maven.plugins</groupId> | ||
<artifactId>maven-compiler-plugin</artifactId> | ||
<version>3.11.0</version> | ||
<configuration> | ||
<source>17</source> | ||
<target>17</target> | ||
</configuration> | ||
</plugin> | ||
</plugins> | ||
</build> | ||
</project> |
examples/lucene-linguistics/advanced-configuration/src/main/application/ext/document.json (7 additions, 0 deletions)
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,7 @@ | ||
{ | ||
"put": "id:mynamespace:lucene::mydocid", | ||
"fields": { | ||
"language": "en", | ||
"mytext": "Cats and Dogs" | ||
} | ||
} |
examples/lucene-linguistics/advanced-configuration/src/main/application/schemas/lucene.sd (15 additions, 0 deletions)
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,15 @@ | ||
schema lucene { | ||
|
||
document lucene { | ||
field language type string { | ||
indexing: set_language | ||
} | ||
field mytext type string { | ||
indexing: summary | index | ||
} | ||
} | ||
|
||
fieldset default { | ||
fields: mytext | ||
} | ||
} |