Merge pull request #1264 from dainiusjocas/demo-lucene-linguistics
Sample app with the LuceneLinguistics
kkraune authored Sep 20, 2023
2 parents c651e9d + 3d5d127 commit c2b2b2d
Showing 43 changed files with 1,054 additions and 0 deletions.
5 changes: 5 additions & 0 deletions examples/README.md
@@ -43,6 +43,11 @@ Generic [request-response](https://docs.vespa.ai/en/jdisc/processing.html) proce
<!-- ToDo: FIXME -->


### Lucene Linguistics
The [lucene-linguistics](lucene-linguistics) directory contains two sample application packages:
1. A bare minimal app.
2. An app that shows advanced configuration of the Lucene-based `Linguistics` implementation.

----

Note: Applications with _pom.xml_ are Java/Maven projects and must be built before being deployed.
195 changes: 195 additions & 0 deletions examples/lucene-linguistics/README.md
@@ -0,0 +1,195 @@
<!-- Copyright Yahoo. Licensed under the terms of the Apache 2.0 license. See LICENSE in the project root. -->

![Vespa logo](https://vespa.ai/assets/vespa-logo-color.png)

# Vespa LuceneLinguistics Demos

A couple of examples of how to get started with `lucene-linguistics`:

- `non-java`: an absolute minimum to get started;
- `minimal`: minimal Java based project using Lucene Linguistics;
- `advanced-configuration`: demonstrates the configurability;
- `going-crazy`: demonstrates the advanced setup;

## Getting started

For all application packages the procedure is the same:
go to the application package directory and play with the following commands:

```shell
# Of course make sure that your Docker daemon is running
# make sure that Vespa CLI is installed
brew install vespa-cli
# Maven must be 3.6+
brew install maven

docker run --rm --detach \
--name vespa \
--hostname vespa-container \
--publish 8080:8080 \
--publish 19071:19071 \
--publish 19050:19050 \
vespaengine/vespa:8.224.19

# To observe the logs from LuceneLinguistics run in a separate terminal
docker logs vespa -f | grep -i "lucene"

vespa status deploy --wait 300

(mvn clean package && vespa deploy -w 100)

vespa feed src/main/application/ext/document.json
vespa query 'yql=select * from lucene where default contains "dogs"' \
'model.locale=en'

# After this query, a log entry like this should appear:
# [2023-08-02 19:57:12.106] INFO container Container.com.yahoo.language.lucene.AnalyzerFactory Analyzer for language=en is from a list of default language analyzers.
```

The query should return:
```json
{
    "root": {
        "id": "toplevel",
        "relevance": 1.0,
        "fields": {
            "totalCount": 1
        },
        "coverage": {
            "coverage": 100,
            "documents": 1,
            "full": true,
            "nodes": 1,
            "results": 1,
            "resultsFull": 1
        },
        "children": [
            {
                "id": "id:mynamespace:lucene::mydocid",
                "relevance": 0.16343879032006287,
                "source": "content",
                "fields": {
                    "sddocname": "lucene",
                    "documentid": "id:mynamespace:lucene::mydocid",
                    "mytext": "Cats and Dogs"
                }
            }
        ]
    }
}
```

### Observing query rewrites

```shell
vespa query 'yql=select * from lucene where default contains "dogs"' \
'model.locale=en' \
'trace.level=2' | jq '.trace.children | last | .children[] | select(.message) | select(.message | test("YQL.*")) | .message'
```
Output:
```shell
"YQL+ query parsed: [select * from lucene where default contains \"dog\" timeout 10000]"
```
Note that `dogs` is rewritten as `dog`.

Change `model.locale` to another language, change the query, and observe how the analysis differs.

### Observing the indexed tokens

It is possible to explore the tokens directly in the index.
To do that you can run these commands **inside** the running Vespa Docker container.

```shell
# Open a shell in the Vespa Docker container
docker exec -it vespa bash
# Trigger a flush to disk
vespa-proton-cmd --local triggerFlush

# Show the posting lists
vespa-index-inspect showpostings \
--indexdir /opt/vespa/var/db/vespa/search/cluster.content/n0/documents/lucene/0.ready/index/$(ls /opt/vespa/var/db/vespa/search/cluster.content/n0/documents/lucene/0.ready/index/)/ \
--field mytext --transpose
# =>
# docId = 1
# field = 0 "mytext"
# element = 0, elementLen = 2, elementWeight = 1
# pos = 0, word = "cat"
# pos = 1, word = "dog"

# Show the tokens
vespa-index-inspect dumpwords \
--indexdir /opt/vespa/var/db/vespa/search/cluster.content/n0/documents/lucene/0.ready/index/$(ls /opt/vespa/var/db/vespa/search/cluster.content/n0/documents/lucene/0.ready/index/)/ \
--wordnum \
--field mytext
# =>
# 1 cat 1
# 2 dog 1
```
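
The two indexed tokens (`cat`, `dog`) for the text `Cats and Dogs` suggest that the analyzer lowercases the input, drops the stop word `and`, and stems plurals. The same chain applied at query time is why `dogs` is rewritten to `dog` and matches. The following is only a toy sketch of such a chain in plain Java. It is not Lucene's `EnglishAnalyzer`; the class name `ToyEnglishAnalysis`, the stop list, and the plural rule are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy stand-in for an English analysis chain; NOT Lucene's EnglishAnalyzer.
final class ToyEnglishAnalysis {

    // Hypothetical stop list; Lucene ships its own.
    private static final Set<String> STOP_WORDS = Set.of("a", "an", "and", "of", "the");

    // Lowercase, split on non-letters, drop stop words, strip a trailing "s".
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (raw.isEmpty() || STOP_WORDS.contains(raw)) {
                continue;
            }
            // Crude plural stripping; a real stemmer handles far more cases.
            boolean plural = raw.length() > 3 && raw.endsWith("s") && !raw.endsWith("ss");
            tokens.add(plural ? raw.substring(0, raw.length() - 1) : raw);
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(analyze("Cats and Dogs")); // index-time tokens: [cat, dog]
        System.out.println(analyze("dogs"));          // query-time token:  [dog]
    }
}
```

Because the document text and the query term go through the same steps, both end up as `dog`, which is what makes the match possible.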

Have fun!

## Common Issues

The `lucene-linguistics` component is highly configurable.
It has an optional `configDir` configuration parameter of type `path`.
`configDir` is a directory that stores linguistics resources, e.g. stopword dictionaries, and is relative to the root directory of the Vespa application package (VAP).

There are several known problems that might happen when `configDir` is misconfigured.

### `configDir` is specified but doesn't exist

If `configDir` doesn't exist, `vespa deploy` fails with an error like this:

```shell
Uploading application package ... failed
Error: invalid application package (400 Bad Request)
Invalid application:
Unable to send file specified in com.yahoo.language.lucene.lucene-analysis:
/opt/vespa/var/db/vespa/config_server/serverdb/tenants/default/sessions/4/lucene (No such file or directory)
```

### An empty directory can't be referenced

If `configDir` is set to `foo` and that directory is empty, deployment fails with a misleading error message:
```shell
Uploading application package ... failed
Error: invalid application package (400 Bad Request)
Invalid application:
Unable to send file specified in com.yahoo.language.lucene.lucene-analysis:
/opt/vespa/var/db/vespa/config_server/serverdb/tenants/default/sessions/8/foo (No such file or directory)
```

### Application package root cannot be used as `configDir`

If you set `<configDir>.</configDir>`, the application package is deployed(!) but
the services never converge, and `vespa deploy` ends with the following error:
```shell
Uploading application package ... done

Success: Deployed target/application.zip
WARNING Jar file 'vespa-lucene-linguistics-poc-0.0.1-deploy.jar' uses non-public Vespa APIs: [com.yahoo.language.simple]

Waiting up to 1m40s for query service to become available ...
Error: service 'query' is unavailable: services have not converged
```

The Vespa logs are then filled with warnings like these:
```shell
[2023-08-02 20:30:47.675] WARNING configproxy stderr Exception in thread "Rpc executorpool-6-thread-5" java.lang.RuntimeException: More than one file reference found for file 'fbcf5c3dc81d9540'
[2023-08-02 20:30:47.675] WARNING configproxy stderr \tat com.yahoo.vespa.filedistribution.FileDownloader.getFileFromFileSystem(FileDownloader.java:109)
[2023-08-02 20:30:47.675] WARNING configproxy stderr \tat com.yahoo.vespa.filedistribution.FileDownloader.getFileFromFileSystem(FileDownloader.java:100)
[2023-08-02 20:30:47.675] WARNING configproxy stderr \tat com.yahoo.vespa.filedistribution.FileDownloader.getFutureFile(FileDownloader.java:80)
[2023-08-02 20:30:47.675] WARNING configproxy stderr \tat com.yahoo.vespa.filedistribution.FileDownloader.getFile(FileDownloader.java:70)
[2023-08-02 20:30:47.675] WARNING configproxy stderr \tat com.yahoo.vespa.config.proxy.filedistribution.FileDistributionRpcServer.downloadFile(FileDistributionRpcServer.java:109)
[2023-08-02 20:30:47.675] WARNING configproxy stderr \tat com.yahoo.vespa.config.proxy.filedistribution.FileDistributionRpcServer.lambda$getFile$0(FileDistributionRpcServer.java:84)
[2023-08-02 20:30:47.675] WARNING configproxy stderr \tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
[2023-08-02 20:30:47.675] WARNING configproxy stderr \tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
[2023-08-02 20:30:47.675] WARNING configproxy stderr \tat java.base/java.lang.Thread.run(Thread.java:833)
```
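
The three `configDir` problems above can be caught locally before running `vespa deploy`. Below is a hedged sketch of such a pre-flight check using only the Java standard library. `ConfigDirCheck` is a hypothetical helper and not part of Vespa; the defaults `src/main/application` (package root) and `lucene` (directory name) are assumptions based on the paths shown in this README.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;
import java.util.stream.Stream;

// Hypothetical pre-deploy check for the configDir issues; not a Vespa API.
final class ConfigDirCheck {

    // Returns a problem description, or empty if none of the three known issues applies.
    static Optional<String> check(Path packageRoot, String configDir) throws IOException {
        Path root = packageRoot.toAbsolutePath().normalize();
        Path dir = root.resolve(configDir).normalize();
        if (dir.equals(root)) {
            return Optional.of("configDir must not be the application package root");
        }
        if (!Files.isDirectory(dir)) {
            return Optional.of("configDir does not exist: " + dir);
        }
        try (Stream<Path> entries = Files.list(dir)) {
            if (entries.findAny().isEmpty()) {
                return Optional.of("configDir is empty: " + dir);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) throws IOException {
        // Assumed defaults; pass your own package root and configDir as arguments.
        Path root = Path.of(args.length > 0 ? args[0] : "src/main/application");
        String configDir = args.length > 1 ? args[1] : "lucene";
        System.out.println(check(root, configDir).orElse("configDir looks fine"));
    }
}
```

The root check comes first because `.` always resolves to an existing, usually non-empty directory, so the other two checks would not flag it.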

### Harmless warning
`vespa deploy` always warns with:
```shell
WARNING Jar file 'vespa-lucene-linguistics-poc-0.0.1-deploy.jar' uses non-public Vespa APIs: [com.yahoo.language.simple]
```
You can ignore this warning.
66 changes: 66 additions & 0 deletions examples/lucene-linguistics/advanced-configuration/README.md
@@ -0,0 +1,66 @@
# Vespa Lucene Linguistics

This Vespa application package (VAP) demonstrates the configuration options of the `lucene-linguistics` package.
The main benefit of `LuceneLinguistics` over other `Linguistics` implementations is probably its configurability.

## Custom Lucene Analyzers

There are multiple ways to provide a Lucene `Analyzer` for a language.
Each analyzer is identified by a language key, e.g. `en` for English.
The analyzer for a language is taken from the following sources, in descending order of priority:
1. Created through the `Linguistics` component configuration.
2. An `Analyzer` wrapped into a Vespa `<component>`.
3. A list of [default Analyzers](https://github.com/vespa-engine/vespa/blob/5d26801bc63c35705e708d3cc7086f0b0103e909/lucene-linguistics/src/main/java/com/yahoo/language/lucene/DefaultAnalyzers.java) per language.
4. The `StandardAnalyzer`.
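
The lookup order can be pictured as a chain of maps consulted in turn, with a final fallback. This is a minimal sketch that assumes analyzers are represented by plain strings; `AnalyzerLookupSketch` and `resolve` are hypothetical names and not part of `lucene-linguistics`.

```java
import java.util.List;
import java.util.Map;

// Sketch of the priority order only; names are hypothetical, not Vespa's API.
final class AnalyzerLookupSketch {

    // Each map goes from language key to analyzer name; list order = priority.
    static String resolve(String language, List<Map<String, String>> sourcesByPriority, String fallback) {
        for (Map<String, String> source : sourcesByPriority) {
            String analyzer = source.get(language);
            if (analyzer != null) {
                return analyzer; // first source that knows the language wins
            }
        }
        return fallback; // nothing matched: fall back to the StandardAnalyzer
    }

    public static void main(String[] args) {
        List<Map<String, String>> sources = List.of(
                Map.of("en", "from-config"),                            // 1. component configuration
                Map.of("en", "from-component", "pl", "from-component"), // 2. <component> analyzers
                Map.of("en", "from-defaults", "de", "from-defaults"));  // 3. default analyzers
        System.out.println(resolve("en", sources, "StandardAnalyzer")); // from-config
        System.out.println(resolve("pl", sources, "StandardAnalyzer")); // from-component
        System.out.println(resolve("de", sources, "StandardAnalyzer")); // from-defaults
        System.out.println(resolve("lt", sources, "StandardAnalyzer")); // StandardAnalyzer
    }
}
```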

### Add a Lucene Analyzer component

Vespa provides a `ComponentRegistry` mechanism.
`LuceneLinguistics` accepts a `ComponentRegistry<Analyzer>` in its constructor.
At start time, the Vespa container automatically collects all components of the `Analyzer` type.

To declare such components:
```xml
<component id="en"
class="org.apache.lucene.analysis.core.SimpleAnalyzer"
bundle="vespa-lucene-linguistics-poc" />
```
Where:
- `id` should contain a language code.
- `class` should be the implementing class.
  Note that it is a class straight from the Lucene library.
  You can also create an `Analyzer` class inside your VAP and refer to it.
- `bundle` must be your application package `artifactId` as specified in the `pom.xml`.

There are two types of `Analyzer` components:
1. Those that don't require any setup.
2. Those that require setup (e.g. a constructor with arguments).

The previous component declaration example is of type (1).

Type (2) requires a bit more work.

Create a class (e.g. for the Polish language):
```java
package ai.vespa.linguistics.pl;

import com.yahoo.container.di.componentgraph.Provider;
import org.apache.lucene.analysis.Analyzer;

public class PolishAnalyzer implements Provider<Analyzer> {
    @Override
    public Analyzer get() {
        return new org.apache.lucene.analysis.pl.PolishAnalyzer();
    }

    @Override
    public void deconstruct() {}
}
```

Add a component declaration into the `services.xml` file:
```xml
<component id="pl"
class="ai.vespa.linguistics.pl.PolishAnalyzer"
bundle="vespa-lucene-linguistics-poc" />
```
Polish text is now handled by the `PolishAnalyzer`.
85 changes: 85 additions & 0 deletions examples/lucene-linguistics/advanced-configuration/pom.xml
@@ -0,0 +1,85 @@
<?xml version="1.0"?>
<!-- Copyright Yahoo. Licensed under the terms of the Apache 2.0 license. See LICENSE in the project root. -->
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
                             http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>ai.vespa</groupId>
    <artifactId>vespa-lucene-linguistics-poc</artifactId>
    <version>0.0.2</version>
    <packaging>container-plugin</packaging>
    <properties>
        <bundle-plugin.failOnWarnings>false</bundle-plugin.failOnWarnings>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <test.hide>true</test.hide>
        <vespa.version>8.227.41</vespa.version>
        <junit.version>5.7.1</junit.version>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-analysis-stempel</artifactId>
            <version>9.7.0</version>
        </dependency>
        <dependency>
            <groupId>com.yahoo.vespa</groupId>
            <artifactId>lucene-linguistics</artifactId>
            <version>${vespa.version}</version>
        </dependency>
        <dependency>
            <groupId>com.yahoo.vespa</groupId>
            <artifactId>linguistics</artifactId>
            <version>${vespa.version}</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>com.yahoo.vespa</groupId>
            <artifactId>application</artifactId>
            <version>${vespa.version}</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.junit.jupiter</groupId>
            <artifactId>junit-jupiter</artifactId>
            <version>${junit.version}</version>
            <scope>test</scope>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <groupId>com.yahoo.vespa</groupId>
                <artifactId>bundle-plugin</artifactId>
                <version>${vespa.version}</version>
                <extensions>true</extensions>
                <configuration>
                    <failOnWarnings>false</failOnWarnings>
                </configuration>
            </plugin>
            <plugin>
                <groupId>com.yahoo.vespa</groupId>
                <artifactId>vespa-application-maven-plugin</artifactId>
                <version>${vespa.version}</version>
                <executions>
                    <execution>
                        <goals>
                            <goal>packageApplication</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.11.0</version>
                <configuration>
                    <source>17</source>
                    <target>17</target>
                </configuration>
            </plugin>
        </plugins>
    </build>
</project>
@@ -0,0 +1,7 @@
{
    "put": "id:mynamespace:lucene::mydocid",
    "fields": {
        "language": "en",
        "mytext": "Cats and Dogs"
    }
}
@@ -0,0 +1,15 @@
schema lucene {

    document lucene {
        field language type string {
            indexing: set_language
        }
        field mytext type string {
            indexing: summary | index
        }
    }

    fieldset default {
        fields: mytext
    }
}