diff --git a/examples/README.md b/examples/README.md
index e2b7ef2c8..6a4048333 100644
--- a/examples/README.md
+++ b/examples/README.md
@@ -43,6 +43,11 @@ Generic [request-response](https://docs.vespa.ai/en/jdisc/processing.html) proce
+### Lucene Linguistics
+The [lucene-linguistics](lucene-linguistics) directory contains several sample application packages, for example:
+1. A bare minimal app.
+2. An app showing advanced configuration of the Lucene-based `Linguistics` implementation.
+
 ----
 Note: Applications with _pom.xml_ are Java/Maven projects and must be built before being deployed.
diff --git a/examples/lucene-linguistics/README.md b/examples/lucene-linguistics/README.md
new file mode 100644
index 000000000..413c2d527
--- /dev/null
+++ b/examples/lucene-linguistics/README.md
@@ -0,0 +1,195 @@
+
+![Vespa logo](https://vespa.ai/assets/vespa-logo-color.png)
+
+# Vespa LuceneLinguistics Demos
+
+A few examples of how to get started with `lucene-linguistics`:
+
+- `non-java`: the absolute minimum to get started;
+- `minimal`: a minimal Java-based project using Lucene Linguistics;
+- `advanced-configuration`: demonstrates the configurability;
+- `going-crazy`: demonstrates an advanced setup.
+
+## Getting started
+
+The procedure is the same for all application packages:
+go to the application package directory and run the following commands:
+
+```shell
+# Make sure that your Docker daemon is running
+# and that the Vespa CLI is installed
+brew install vespa-cli
+# Maven must be 3.6+
+brew install maven
+
+docker run --rm --detach \
+  --name vespa \
+  --hostname vespa-container \
+  --publish 8080:8080 \
+  --publish 19071:19071 \
+  --publish 19050:19050 \
+  vespaengine/vespa:8.224.19
+
+# To observe the logs from LuceneLinguistics, run this in a separate terminal
+docker logs vespa -f | grep -i "lucene"
+
+vespa status deploy --wait 300
+
+(mvn clean package && vespa deploy -w 100)
+
+vespa feed src/main/application/ext/document.json
+vespa query 'yql=select * from lucene where default contains "dogs"' \
+  'model.locale=en'
+
+# After this query, a log entry like this should appear:
+[2023-08-02 19:57:12.106] INFO container Container.com.yahoo.language.lucene.AnalyzerFactory Analyzer for language=en is from a list of default language analyzers.
+```
+
+The query should return:
+```json
+{
+  "root": {
+    "id": "toplevel",
+    "relevance": 1.0,
+    "fields": {
+      "totalCount": 1
+    },
+    "coverage": {
+      "coverage": 100,
+      "documents": 1,
+      "full": true,
+      "nodes": 1,
+      "results": 1,
+      "resultsFull": 1
+    },
+    "children": [
+      {
+        "id": "id:mynamespace:lucene::mydocid",
+        "relevance": 0.16343879032006287,
+        "source": "content",
+        "fields": {
+          "sddocname": "lucene",
+          "documentid": "id:mynamespace:lucene::mydocid",
+          "mytext": "Cats and Dogs"
+        }
+      }
+    ]
+  }
+}
+```
+
+### Observing query rewrites
+
+```shell
+vespa query 'yql=select * from lucene where default contains "dogs"' \
+  'model.locale=en' \
+  'trace.level=2' | jq '.trace.children | last | .children[] | select(.message) | select(.message | test("YQL.*")) | .message'
+```
+Output:
+```shell
+"YQL+ query parsed: [select * from lucene where default contains \"dog\" timeout 10000]"
+```
+Note that `dogs` was rewritten as `dog`.
+
+Change the `model.locale` to another language, change the query, and observe the differences in analysis.
+
+### Observing the indexed tokens
+
+It is possible to explore the tokens directly in the index.
+To do that, run the following commands **inside** the running Vespa Docker container.
+
+```shell
+# Enter the Vespa Docker container
+docker exec -it vespa bash
+# Trigger a flush to disk
+vespa-proton-cmd --local triggerFlush
+
+# Show the posting lists
+vespa-index-inspect showpostings \
+  --indexdir /opt/vespa/var/db/vespa/search/cluster.content/n0/documents/lucene/0.ready/index/$(ls /opt/vespa/var/db/vespa/search/cluster.content/n0/documents/lucene/0.ready/index/)/ \
+  --field mytext --transpose
+# =>
+# docId = 1
+# field = 0 "mytext"
+# element = 0, elementLen = 2, elementWeight = 1
+# pos = 0, word = "cat"
+# pos = 1, word = "dog"
+
+# Show the tokens
+vespa-index-inspect dumpwords \
+  --indexdir /opt/vespa/var/db/vespa/search/cluster.content/n0/documents/lucene/0.ready/index/$(ls /opt/vespa/var/db/vespa/search/cluster.content/n0/documents/lucene/0.ready/index/)/ \
+  --wordnum \
+  --field mytext
+# =>
+# 1 cat 1
+# 2 dog 1
+```
+
+Have fun!
+
+## Common Issues
+
+The `lucene-linguistics` component is highly configurable.
+It has an optional `configDir` configuration parameter of type `path`.
+`configDir` is a directory that stores linguistics resources (e.g. dictionaries with stopwords) and is relative to the application package root directory.
+
+Several known problems can occur when `configDir` is misconfigured.
+
+### `configDir` is specified but doesn't exist
+
+If the `configDir` doesn't exist, `vespa deploy` fails with an error like this:
+
+```shell
+Uploading application package ... failed
+Error: invalid application package (400 Bad Request)
+Invalid application:
+Unable to send file specified in com.yahoo.language.lucene.lucene-analysis:
+/opt/vespa/var/db/vespa/config_server/serverdb/tenants/default/sessions/4/lucene (No such file or directory)
+```
+
+### An empty directory cannot be referenced
+
+If `configDir` is set to an empty directory `foo`, deployment fails with a misleading error message:
+```shell
+Uploading application package ... failed
+Error: invalid application package (400 Bad Request)
+Invalid application:
+Unable to send file specified in com.yahoo.language.lucene.lucene-analysis:
+/opt/vespa/var/db/vespa/config_server/serverdb/tenants/default/sessions/8/foo (No such file or directory)
+```
+
+### Application package root cannot be used as `configDir`
+
+If you try to be clever and set it to `.`, the application package is deployed(!) but
+does not converge, and fails with the following error:
+```shell
+Uploading application package ... done
+
+Success: Deployed target/application.zip
+WARNING Jar file 'vespa-lucene-linguistics-poc-0.0.1-deploy.jar' uses non-public Vespa APIs: [com.yahoo.language.simple]
+
+Waiting up to 1m40s for query service to become available ...
Error: service 'query' is unavailable: services have not converged
+```
+
+The Vespa logs will also be filled with warnings like these:
+```shell
+[2023-08-02 20:30:47.675] WARNING configproxy stderr Exception in thread "Rpc executorpool-6-thread-5" java.lang.RuntimeException: More than one file reference found for file 'fbcf5c3dc81d9540'
+[2023-08-02 20:30:47.675] WARNING configproxy stderr \tat com.yahoo.vespa.filedistribution.FileDownloader.getFileFromFileSystem(FileDownloader.java:109)
+[2023-08-02 20:30:47.675] WARNING configproxy stderr \tat com.yahoo.vespa.filedistribution.FileDownloader.getFileFromFileSystem(FileDownloader.java:100)
+[2023-08-02 20:30:47.675] WARNING configproxy stderr \tat com.yahoo.vespa.filedistribution.FileDownloader.getFutureFile(FileDownloader.java:80)
+[2023-08-02 20:30:47.675] WARNING configproxy stderr \tat com.yahoo.vespa.filedistribution.FileDownloader.getFile(FileDownloader.java:70)
+[2023-08-02 20:30:47.675] WARNING configproxy stderr \tat com.yahoo.vespa.config.proxy.filedistribution.FileDistributionRpcServer.downloadFile(FileDistributionRpcServer.java:109)
+[2023-08-02 20:30:47.675] WARNING configproxy stderr \tat com.yahoo.vespa.config.proxy.filedistribution.FileDistributionRpcServer.lambda$getFile$0(FileDistributionRpcServer.java:84)
+[2023-08-02 20:30:47.675] WARNING configproxy stderr \tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
+[2023-08-02 20:30:47.675] WARNING configproxy stderr \tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
+[2023-08-02 20:30:47.675] WARNING configproxy stderr \tat java.base/java.lang.Thread.run(Thread.java:833)
+```
+
+### Harmless warning
+`vespa deploy` always prints this warning:
+```shell
+WARNING Jar file 'vespa-lucene-linguistics-poc-0.0.1-deploy.jar' uses non-public Vespa APIs: [com.yahoo.language.simple]
+```
+You can ignore this warning.
diff --git a/examples/lucene-linguistics/advanced-configuration/README.md b/examples/lucene-linguistics/advanced-configuration/README.md
new file mode 100644
index 000000000..be49d1ceb
--- /dev/null
+++ b/examples/lucene-linguistics/advanced-configuration/README.md
@@ -0,0 +1,66 @@
+# Vespa Lucene Linguistics
+
+This Vespa application package (VAP) demonstrates the configuration options of the `lucene-linguistics` package.
+The main benefit of `LuceneLinguistics` compared to other `Linguistics` implementations is probably its configurability.
+
+## Custom Lucene Analyzers
+
+There are multiple ways to use a Lucene `Analyzer` for a language.
+Each analyzer is identified by a language key, e.g. `en` for English.
+The `Analyzer` for a language is resolved in the following order of descending priority:
+1. Created through the `Linguistics` component configuration.
+2. An `Analyzer` wrapped into a Vespa `<component>`.
+3. A list of [default Analyzers](https://github.com/vespa-engine/vespa/blob/5d26801bc63c35705e708d3cc7086f0b0103e909/lucene-linguistics/src/main/java/com/yahoo/language/lucene/DefaultAnalyzers.java) per language.
+4. The `StandardAnalyzer`.
+
+### Add a Lucene Analyzer component
+
+Vespa provides a `ComponentRegistry` mechanism:
+`LuceneLinguistics` accepts a `ComponentRegistry` in its constructor, and at start-up the Vespa container automatically collects all components of the `Analyzer` type.
+
+To declare such a component (illustrative example; any Lucene `Analyzer` class can be used):
+```xml
+<!-- Illustrative values: id is the language code, bundle is this package's artifactId -->
+<component id="en"
+           class="org.apache.lucene.analysis.en.EnglishAnalyzer"
+           bundle="vespa-lucene-linguistics-poc" />
+```
+Where:
+- `id` should contain a language code.
+- `class` should be the implementing class. Note that it is a class straight from the Lucene library.
+  Also, you can create an `Analyzer` class inside your VAP and refer to it.
+- `bundle` must be your application package `artifactId`, as specified in `pom.xml`.
+
+There are two types of `Analyzer` components:
+1. Those that don't require any setup.
+2. Those that require a setup (e.g. a constructor with arguments).
+
+The previous component declaration example is of type (1).
+
+Type (2) requires a bit more work.
+
+Create a class (e.g. for the Polish language):
+```java
+package ai.vespa.linguistics.pl;
+
+import com.yahoo.container.di.componentgraph.Provider;
+import org.apache.lucene.analysis.Analyzer;
+
+public class PolishAnalyzer implements Provider<Analyzer> {
+    @Override
+    public Analyzer get() {
+        return new org.apache.lucene.analysis.pl.PolishAnalyzer();
+    }
+    @Override
+    public void deconstruct() {}
+}
+```
+
+Add a component declaration to the `services.xml` file:
+```xml
+<!-- Illustrative values: id is the language code, bundle is this package's artifactId -->
+<component id="pl"
+           class="ai.vespa.linguistics.pl.PolishAnalyzer"
+           bundle="vespa-lucene-linguistics-poc" />
+```
+And now the handling of the Polish language is available.
diff --git a/examples/lucene-linguistics/advanced-configuration/pom.xml b/examples/lucene-linguistics/advanced-configuration/pom.xml
new file mode 100644
index 000000000..c98a31c90
--- /dev/null
+++ b/examples/lucene-linguistics/advanced-configuration/pom.xml
@@ -0,0 +1,85 @@
+ + + 4.0.0 + ai.vespa + vespa-lucene-linguistics-poc + 0.0.2 + container-plugin + + false + UTF-8 + true + 8.227.41 + 5.7.1 + + + + + org.apache.lucene + lucene-analysis-stempel + 9.7.0 + + + com.yahoo.vespa + lucene-linguistics + ${vespa.version} + + + com.yahoo.vespa + linguistics + ${vespa.version} + provided + + + com.yahoo.vespa + application + ${vespa.version} + provided + + + org.junit.jupiter + junit-jupiter + ${junit.version} + test + + + + + + + com.yahoo.vespa + bundle-plugin + ${vespa.version} + true + + false + + + + com.yahoo.vespa + vespa-application-maven-plugin + ${vespa.version} + + + + packageApplication + + + + + + org.apache.maven.plugins + maven-compiler-plugin + 3.11.0 + + 17 + 17 + + + + +
diff --git a/examples/lucene-linguistics/advanced-configuration/src/main/application/ext/document.json b/examples/lucene-linguistics/advanced-configuration/src/main/application/ext/document.json
new file mode 100644
index 000000000..f7733dd3a
--- /dev/null
+++ b/examples/lucene-linguistics/advanced-configuration/src/main/application/ext/document.json
@@ -0,0 +1,7 @@
+{
+  "put": "id:mynamespace:lucene::mydocid",
+  "fields": {
+    "language": "en",
+    "mytext": "Cats and Dogs"
+  }
+}
diff --git a/examples/lucene-linguistics/advanced-configuration/src/main/application/schemas/lucene.sd b/examples/lucene-linguistics/advanced-configuration/src/main/application/schemas/lucene.sd
new file mode 100644
index 000000000..86f7b7c3c
--- /dev/null
+++ b/examples/lucene-linguistics/advanced-configuration/src/main/application/schemas/lucene.sd
@@ -0,0 +1,15 @@
+schema lucene {
+
+    document lucene {
+        field language type string {
+            indexing: set_language
+        }
+        field mytext type string {
+            indexing: summary | index
+        }
+    }
+
+    fieldset default {
+        fields: mytext
+    }
+}
diff --git a/examples/lucene-linguistics/advanced-configuration/src/main/application/services.xml b/examples/lucene-linguistics/advanced-configuration/src/main/application/services.xml
new file mode 100644
index 000000000..8df239609
--- /dev/null
+++ b/examples/lucene-linguistics/advanced-configuration/src/main/application/services.xml
@@ -0,0 +1,25 @@
+ + + + + + + + + + + + + + 1 + + + + + +
diff --git a/examples/lucene-linguistics/advanced-configuration/src/main/java/ai/vespa/linguistics/pl/PolishAnalyzer.java b/examples/lucene-linguistics/advanced-configuration/src/main/java/ai/vespa/linguistics/pl/PolishAnalyzer.java
new file mode 100644
index 000000000..f2697b4bc
--- /dev/null
+++ b/examples/lucene-linguistics/advanced-configuration/src/main/java/ai/vespa/linguistics/pl/PolishAnalyzer.java
@@ -0,0 +1,14 @@
+package ai.vespa.linguistics.pl;
+
+import com.yahoo.container.di.componentgraph.Provider;
+import org.apache.lucene.analysis.Analyzer;
+
+public class PolishAnalyzer implements Provider<Analyzer> {
+    @Override
+    public Analyzer get() {
+        return new org.apache.lucene.analysis.pl.PolishAnalyzer();
+    }
+
+    @Override
+    public void deconstruct() {}
+}
diff --git a/examples/lucene-linguistics/going-crazy/README.md b/examples/lucene-linguistics/going-crazy/README.md
new file mode 100644
index 000000000..6d5b77282
--- /dev/null
+++ b/examples/lucene-linguistics/going-crazy/README.md
@@ -0,0 +1,32 @@
+# Vespa Lucene Linguistics: Going Crazy
+
+## TL;DR
+
+Search problems get really complicated when you need to deal with multilingual aspects.
+Lucene has a battle-tested, standards-compliant set of libraries to help you solve them.
+
+## Context
+
+The goals of this application package are to:
+- set up OpenNLP tokenizers;
+- set up Lemmagen token filters with sample resource files;
+- construct an analyzer entirely in Java code and register it as a component.
+
+## Analysis components
+
+Lucene has plenty of analysis components [available](https://lucene.apache.org/core/9_7_0/index.html).
+One of them is [`analysis-opennlp`](https://lucene.apache.org/core/9_7_0/analysis/opennlp/index.html).
+
+### OpenNLP
+
+The OpenNLP module adds one tokenizer, identified as `openNlp`, and three token filters:
+`openNlpLemmatizer`, `openNlpChunker`, and `openNlpPOS`.
+
+Let's set up an `org.apache.lucene.analysis.opennlp.OpenNLPTokenizerFactory` tokenizer and
+an `org.apache.lucene.analysis.snowball.SnowballPorterFilterFactory` token filter.
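+
+Below is a sketch of how such an analysis chain for German can be expressed in this package's
+`services.xml`, under the `com.yahoo.language.lucene.lucene-analysis` config.
+The `sentenceModel`, `tokenizerModel`, and `language` keys are the standard Lucene factory parameters,
+but the exact XML layout shown here is an assumption; the `services.xml` in this directory is authoritative.
+```xml
+<config name="com.yahoo.language.lucene.lucene-analysis">
+  <configDir>linguistics</configDir>
+  <analysis>
+    <item key="de">
+      <tokenizer>
+        <name>openNLP</name>
+        <conf>
+          <item key="sentenceModel">de/opennlp-de-ud-gsd-sentence-1.0-1.9.3.bin</item>
+          <item key="tokenizerModel">de/opennlp-de-ud-gsd-tokens-1.0-1.9.3.bin</item>
+        </conf>
+      </tokenizer>
+      <tokenFilters>
+        <item>
+          <name>snowballPorter</name>
+          <conf>
+            <item key="language">German2</item>
+          </conf>
+        </item>
+      </tokenFilters>
+    </item>
+  </analysis>
+</config>
+```
+The same pattern repeats for the other languages, and the custom `lemmagen` filter declared in this
+package is wired in the same way, with its `lexicon` parameter pointing at `sk/mlteast-sk.lem`.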
+ +### Feed Documents + +```shell +vespa feed src/main/application/ext/documents/* +``` diff --git a/examples/lucene-linguistics/going-crazy/pom.xml b/examples/lucene-linguistics/going-crazy/pom.xml new file mode 100644 index 000000000..38792c9a0 --- /dev/null +++ b/examples/lucene-linguistics/going-crazy/pom.xml @@ -0,0 +1,112 @@ + + + + 4.0.0 + ai.vespa + vespa-lucene-linguistics-crazy + 0.0.2 + container-plugin + + false + UTF-8 + true + 8.227.41 + 5.7.1 + 9.7.0 + + + + + com.yahoo.vespa + lucene-linguistics + ${vespa.version} + + + org.apache.lucene + lucene-core + ${lucene.version} + + + org.apache.lucene + lucene-analysis-common + ${lucene.version} + + + org.apache.lucene + lucene-analysis-opennlp + ${lucene.version} + + + org.apache.lucene + lucene-analysis-stempel + ${lucene.version} + + + eu.hlavki.text + jlemmagen + 1.0 + + + org.slf4j + slf4j-api + + + + + com.yahoo.vespa + linguistics + ${vespa.version} + provided + + + com.yahoo.vespa + application + ${vespa.version} + provided + + + org.junit.jupiter + junit-jupiter + ${junit.version} + test + + + + + + + com.yahoo.vespa + bundle-plugin + ${vespa.version} + true + + false + + + + com.yahoo.vespa + vespa-application-maven-plugin + ${vespa.version} + + + + packageApplication + + + + + + org.apache.maven.plugins + maven-compiler-plugin + 3.11.0 + + 17 + 17 + + + + + diff --git a/examples/lucene-linguistics/going-crazy/src/main/application/ext/documents/de-doc.json b/examples/lucene-linguistics/going-crazy/src/main/application/ext/documents/de-doc.json new file mode 100644 index 000000000..c65025fd6 --- /dev/null +++ b/examples/lucene-linguistics/going-crazy/src/main/application/ext/documents/de-doc.json @@ -0,0 +1,7 @@ +{ + "put": "id:mynamespace:lucene::de-doc", + "fields": { + "language": "de", + "mytext": "Katzen und Hunde" + } +} diff --git a/examples/lucene-linguistics/going-crazy/src/main/application/ext/documents/en-doc.json b/examples/lucene-linguistics/going-crazy/src/main/application/ext/documents/en-doc.json new file mode 100644 index 000000000..a8dbb4e03 --- /dev/null +++ b/examples/lucene-linguistics/going-crazy/src/main/application/ext/documents/en-doc.json @@ -0,0 +1,7 @@ +{ + "put": "id:mynamespace:lucene::en-doc", + "fields": { + "language": "en", + "mytext": "Cats and Dogs" + } +} diff --git a/examples/lucene-linguistics/going-crazy/src/main/application/ext/documents/fr-doc.json b/examples/lucene-linguistics/going-crazy/src/main/application/ext/documents/fr-doc.json new file mode 100644 index 000000000..f7f1d4303 --- /dev/null +++ b/examples/lucene-linguistics/going-crazy/src/main/application/ext/documents/fr-doc.json @@ -0,0 +1,7 @@ +{ + "put": "id:mynamespace:lucene::fr-doc", + "fields": { + "language": "fr", + "mytext": "Les chats et les chiens" + } +} diff --git a/examples/lucene-linguistics/going-crazy/src/main/application/ext/documents/it-doc.json b/examples/lucene-linguistics/going-crazy/src/main/application/ext/documents/it-doc.json new file mode 100644 index 000000000..bad2d467f --- /dev/null +++ b/examples/lucene-linguistics/going-crazy/src/main/application/ext/documents/it-doc.json @@ -0,0 +1,7 @@ +{ + "put": "id:mynamespace:lucene::it-doc", + "fields": { + "language": "it", + "mytext": "Cani e gatti" + } +} diff --git a/examples/lucene-linguistics/going-crazy/src/main/application/ext/documents/nl-doc.json b/examples/lucene-linguistics/going-crazy/src/main/application/ext/documents/nl-doc.json new file mode 100644 index 000000000..0257c0509 --- /dev/null +++ 
b/examples/lucene-linguistics/going-crazy/src/main/application/ext/documents/nl-doc.json @@ -0,0 +1,7 @@ +{ + "put": "id:mynamespace:lucene::nl-doc", + "fields": { + "language": "nl", + "mytext": "Katten en honden" + } +} diff --git a/examples/lucene-linguistics/going-crazy/src/main/application/ext/documents/pl-doc.json b/examples/lucene-linguistics/going-crazy/src/main/application/ext/documents/pl-doc.json new file mode 100644 index 000000000..950d858bd --- /dev/null +++ b/examples/lucene-linguistics/going-crazy/src/main/application/ext/documents/pl-doc.json @@ -0,0 +1,7 @@ +{ + "put": "id:mynamespace:lucene::pl-doc", + "fields": { + "language": "pl", + "mytext": "Koty i psy" + } +} diff --git a/examples/lucene-linguistics/going-crazy/src/main/application/ext/documents/sk-doc.json b/examples/lucene-linguistics/going-crazy/src/main/application/ext/documents/sk-doc.json new file mode 100644 index 000000000..7fc79ad6b --- /dev/null +++ b/examples/lucene-linguistics/going-crazy/src/main/application/ext/documents/sk-doc.json @@ -0,0 +1,7 @@ +{ + "put": "id:mynamespace:lucene::sk-doc", + "fields": { + "language": "sk", + "mytext": "Mačky a psy" + } +} diff --git a/examples/lucene-linguistics/going-crazy/src/main/application/linguistics/de/opennlp-de-ud-gsd-sentence-1.0-1.9.3.bin b/examples/lucene-linguistics/going-crazy/src/main/application/linguistics/de/opennlp-de-ud-gsd-sentence-1.0-1.9.3.bin new file mode 100644 index 000000000..9e8dfa5bc Binary files /dev/null and b/examples/lucene-linguistics/going-crazy/src/main/application/linguistics/de/opennlp-de-ud-gsd-sentence-1.0-1.9.3.bin differ diff --git a/examples/lucene-linguistics/going-crazy/src/main/application/linguistics/de/opennlp-de-ud-gsd-tokens-1.0-1.9.3.bin b/examples/lucene-linguistics/going-crazy/src/main/application/linguistics/de/opennlp-de-ud-gsd-tokens-1.0-1.9.3.bin new file mode 100644 index 000000000..eb7d7708f Binary files /dev/null and b/examples/lucene-linguistics/going-crazy/src/main/application/linguistics/de/opennlp-de-ud-gsd-tokens-1.0-1.9.3.bin differ diff --git a/examples/lucene-linguistics/going-crazy/src/main/application/linguistics/en/opennlp-en-ud-ewt-sentence-1.0-1.9.3.bin b/examples/lucene-linguistics/going-crazy/src/main/application/linguistics/en/opennlp-en-ud-ewt-sentence-1.0-1.9.3.bin new file mode 100644 index 000000000..d3a277923 Binary files /dev/null and b/examples/lucene-linguistics/going-crazy/src/main/application/linguistics/en/opennlp-en-ud-ewt-sentence-1.0-1.9.3.bin differ diff --git a/examples/lucene-linguistics/going-crazy/src/main/application/linguistics/en/opennlp-en-ud-ewt-tokens-1.0-1.9.3.bin b/examples/lucene-linguistics/going-crazy/src/main/application/linguistics/en/opennlp-en-ud-ewt-tokens-1.0-1.9.3.bin new file mode 100644 index 000000000..10c7d02d2 Binary files /dev/null and b/examples/lucene-linguistics/going-crazy/src/main/application/linguistics/en/opennlp-en-ud-ewt-tokens-1.0-1.9.3.bin differ diff --git a/examples/lucene-linguistics/going-crazy/src/main/application/linguistics/fr/opennlp-1.0-1.9.3fr-ud-ftb-sentence-1.0-1.9.3.bin b/examples/lucene-linguistics/going-crazy/src/main/application/linguistics/fr/opennlp-1.0-1.9.3fr-ud-ftb-sentence-1.0-1.9.3.bin new file mode 100644 index 000000000..7ca04d3d2 Binary files /dev/null and b/examples/lucene-linguistics/going-crazy/src/main/application/linguistics/fr/opennlp-1.0-1.9.3fr-ud-ftb-sentence-1.0-1.9.3.bin differ diff --git 
a/examples/lucene-linguistics/going-crazy/src/main/application/linguistics/fr/opennlp-fr-ud-ftb-tokens-1.0-1.9.3.bin b/examples/lucene-linguistics/going-crazy/src/main/application/linguistics/fr/opennlp-fr-ud-ftb-tokens-1.0-1.9.3.bin new file mode 100644 index 000000000..3343de95a Binary files /dev/null and b/examples/lucene-linguistics/going-crazy/src/main/application/linguistics/fr/opennlp-fr-ud-ftb-tokens-1.0-1.9.3.bin differ diff --git a/examples/lucene-linguistics/going-crazy/src/main/application/linguistics/it/opennlp-it-ud-vit-sentence-1.0-1.9.3.bin b/examples/lucene-linguistics/going-crazy/src/main/application/linguistics/it/opennlp-it-ud-vit-sentence-1.0-1.9.3.bin new file mode 100644 index 000000000..446a3a4ec Binary files /dev/null and b/examples/lucene-linguistics/going-crazy/src/main/application/linguistics/it/opennlp-it-ud-vit-sentence-1.0-1.9.3.bin differ diff --git a/examples/lucene-linguistics/going-crazy/src/main/application/linguistics/it/opennlp-it-ud-vit-tokens-1.0-1.9.3.bin b/examples/lucene-linguistics/going-crazy/src/main/application/linguistics/it/opennlp-it-ud-vit-tokens-1.0-1.9.3.bin new file mode 100644 index 000000000..9f58f8d61 Binary files /dev/null and b/examples/lucene-linguistics/going-crazy/src/main/application/linguistics/it/opennlp-it-ud-vit-tokens-1.0-1.9.3.bin differ diff --git a/examples/lucene-linguistics/going-crazy/src/main/application/linguistics/nl/opennlp-nl-ud-alpino-sentence-1.0-1.9.3.bin b/examples/lucene-linguistics/going-crazy/src/main/application/linguistics/nl/opennlp-nl-ud-alpino-sentence-1.0-1.9.3.bin new file mode 100644 index 000000000..f5f28f0f6 Binary files /dev/null and b/examples/lucene-linguistics/going-crazy/src/main/application/linguistics/nl/opennlp-nl-ud-alpino-sentence-1.0-1.9.3.bin differ diff --git a/examples/lucene-linguistics/going-crazy/src/main/application/linguistics/nl/opennlp-nl-ud-alpino-tokens-1.0-1.9.3.bin b/examples/lucene-linguistics/going-crazy/src/main/application/linguistics/nl/opennlp-nl-ud-alpino-tokens-1.0-1.9.3.bin new file mode 100644 index 000000000..b721a04c8 Binary files /dev/null and b/examples/lucene-linguistics/going-crazy/src/main/application/linguistics/nl/opennlp-nl-ud-alpino-tokens-1.0-1.9.3.bin differ diff --git a/examples/lucene-linguistics/going-crazy/src/main/application/linguistics/sk/mlteast-sk.lem b/examples/lucene-linguistics/going-crazy/src/main/application/linguistics/sk/mlteast-sk.lem new file mode 100644 index 000000000..dc1d57820 Binary files /dev/null and b/examples/lucene-linguistics/going-crazy/src/main/application/linguistics/sk/mlteast-sk.lem differ diff --git a/examples/lucene-linguistics/going-crazy/src/main/application/schemas/lucene.sd b/examples/lucene-linguistics/going-crazy/src/main/application/schemas/lucene.sd new file mode 100644 index 000000000..86f7b7c3c --- /dev/null +++ b/examples/lucene-linguistics/going-crazy/src/main/application/schemas/lucene.sd @@ -0,0 +1,15 @@ +schema lucene { + + document lucene { + field language type string { + indexing: set_language + } + field mytext type string { + indexing: summary | index + } + } + + fieldset default { + fields: mytext + } +} diff --git a/examples/lucene-linguistics/going-crazy/src/main/application/services.xml b/examples/lucene-linguistics/going-crazy/src/main/application/services.xml new file mode 100644 index 000000000..1240c8236 --- /dev/null +++ b/examples/lucene-linguistics/going-crazy/src/main/application/services.xml @@ -0,0 +1,125 @@ + + + + + + + + linguistics + + + + openNLP + + 
de/opennlp-de-ud-gsd-sentence-1.0-1.9.3.bin + de/opennlp-de-ud-gsd-tokens-1.0-1.9.3.bin + + + + + snowballPorter + + German2 + + + + + + + openNLP + + en/opennlp-en-ud-ewt-sentence-1.0-1.9.3.bin + en/opennlp-en-ud-ewt-tokens-1.0-1.9.3.bin + + + + + snowballPorter + + English + + + + + + + openNLP + + fr/opennlp-1.0-1.9.3fr-ud-ftb-sentence-1.0-1.9.3.bin + fr/opennlp-fr-ud-ftb-tokens-1.0-1.9.3.bin + + + + + snowballPorter + + French + + + + + + + openNLP + + it/opennlp-it-ud-vit-sentence-1.0-1.9.3.bin + it/opennlp-it-ud-vit-tokens-1.0-1.9.3.bin + + + + + snowballPorter + + Italian + + + + + + + openNLP + + nl/opennlp-nl-ud-alpino-sentence-1.0-1.9.3.bin + nl/opennlp-nl-ud-alpino-tokens-1.0-1.9.3.bin + + + + + snowballPorter + + Dutch + + + reversestring + + + + + + lemmagen + + + sk/mlteast-sk.lem + + + + + + + + + + + + + + 1 + + + + + + diff --git a/examples/lucene-linguistics/going-crazy/src/main/java/ai/vespa/linguistics/lemmagen/LemmagenTokenFilter.java b/examples/lucene-linguistics/going-crazy/src/main/java/ai/vespa/linguistics/lemmagen/LemmagenTokenFilter.java new file mode 100644 index 000000000..754b34052 --- /dev/null +++ b/examples/lucene-linguistics/going-crazy/src/main/java/ai/vespa/linguistics/lemmagen/LemmagenTokenFilter.java @@ -0,0 +1,49 @@ +package ai.vespa.linguistics.lemmagen; + +import eu.hlavki.text.lemmagen.api.Lemmatizer; +import org.apache.lucene.analysis.TokenFilter; +import org.apache.lucene.analysis.TokenStream; +import org.apache.lucene.analysis.tokenattributes.CharTermAttribute; +import org.apache.lucene.analysis.tokenattributes.KeywordAttribute; + +import java.io.IOException; + +/** + * Code is loosely based on + * https://github.com/vhyza/elasticsearch-analysis-lemmagen/blob/master/src/main/java/org/elasticsearch/index/analysis/LemmagenFilter.java + */ +public final class LemmagenTokenFilter extends TokenFilter { + + private final CharTermAttribute termAttr = addAttribute(CharTermAttribute.class); + private final KeywordAttribute keywordAttr = addAttribute(KeywordAttribute.class); + private final Lemmatizer lemmatizer; + + public LemmagenTokenFilter(final TokenStream input, final Lemmatizer lemmatizer) { + super(input); + this.lemmatizer = lemmatizer; + } + + public boolean incrementToken() throws IOException { + if (!input.incrementToken()) { + return false; + } + CharSequence lemma = lemmatizer.lemmatize(termAttr); + if (!keywordAttr.isKeyword() && !equalCharSequences(lemma, termAttr)) { + termAttr.setEmpty().append(lemma); + } + return true; + } + + private boolean equalCharSequences(CharSequence s1, CharSequence s2) { + int len1 = s1.length(); + int len2 = s2.length(); + if (len1 != len2) + return false; + for (int i = len1; --i >= 0;) { + if (s1.charAt(i) != s2.charAt(i)) { + return false; + } + } + return true; + } +} diff --git a/examples/lucene-linguistics/going-crazy/src/main/java/ai/vespa/linguistics/lemmagen/LemmagenTokenFilterFactory.java b/examples/lucene-linguistics/going-crazy/src/main/java/ai/vespa/linguistics/lemmagen/LemmagenTokenFilterFactory.java new file mode 100644 index 000000000..6eb8ef1c2 --- /dev/null +++ b/examples/lucene-linguistics/going-crazy/src/main/java/ai/vespa/linguistics/lemmagen/LemmagenTokenFilterFactory.java @@ -0,0 +1,62 @@ +package ai.vespa.linguistics.lemmagen; + +import eu.hlavki.text.lemmagen.LemmatizerFactory; +import eu.hlavki.text.lemmagen.api.Lemmatizer; +import org.apache.lucene.analysis.TokenFilterFactory; +import org.apache.lucene.analysis.TokenStream; +import org.apache.lucene.util.ResourceLoader; +import 
org.apache.lucene.util.ResourceLoaderAware;
+
+import java.io.IOException;
+import java.io.InputStream;
+import java.util.Map;
+
+/**
+ * https://lucene.apache.org/core/9_7_0/
+ * https://github.com/vhyza/elasticsearch-analysis-lemmagen
+ * Loosely based on
+ * https://github.com/vhyza/elasticsearch-analysis-lemmagen/blob/master/src/main/java/org/elasticsearch/index/analysis/LemmagenFilterFactory.java
+ * Also inspired by
+ * https://github.com/hlavki/jlemmagen-lucene/blob/master/src/main/java/org/apache/lucene/analysis/lemmagen/LemmagenFilterFactory.java
+ */
+public class LemmagenTokenFilterFactory extends TokenFilterFactory
+        implements ResourceLoaderAware {
+
+    // SPI name
+    public static final String NAME = "lemmagen";
+
+    // Configuration key
+    private static final String LEXICON_KEY = "lexicon";
+    private Lemmatizer lemmatizer = null;
+    private final String lexiconPath;
+
+    /** Creates a new LemmagenTokenFilterFactory */
+    public LemmagenTokenFilterFactory(Map<String, String> args) {
+        super(args);
+        lexiconPath = require(args, LEXICON_KEY);
+        if (!args.isEmpty()) {
+            throw new IllegalArgumentException("Unknown parameters: " + args);
+        }
+    }
+
+    private Lemmatizer createLemmatizer(InputStream lexiconInputStream) {
+        try {
+            return LemmatizerFactory.read(lexiconInputStream);
+        } catch (IOException e) {
+            throw new RuntimeException(e);
+        }
+    }
+
+    @Override
+    public void inform(ResourceLoader loader) throws IOException {
+        this.lemmatizer = createLemmatizer(loader.openResource(lexiconPath));
+    }
+
+    public LemmagenTokenFilterFactory() {
+        throw defaultCtorException();
+    }
+
+    public TokenStream create(TokenStream input) {
+        return new LemmagenTokenFilter(input, lemmatizer);
+    }
+}
diff --git a/examples/lucene-linguistics/going-crazy/src/main/java/ai/vespa/linguistics/pl/PolishAnalyzer.java b/examples/lucene-linguistics/going-crazy/src/main/java/ai/vespa/linguistics/pl/PolishAnalyzer.java
new file mode 100644
index 000000000..f2697b4bc
--- /dev/null
+++ b/examples/lucene-linguistics/going-crazy/src/main/java/ai/vespa/linguistics/pl/PolishAnalyzer.java
@@ -0,0 +1,14 @@
+package ai.vespa.linguistics.pl;
+
+import com.yahoo.container.di.componentgraph.Provider;
+import org.apache.lucene.analysis.Analyzer;
+
+public class PolishAnalyzer implements Provider<Analyzer> {
+    @Override
+    public Analyzer get() {
+        return new org.apache.lucene.analysis.pl.PolishAnalyzer();
+    }
+
+    @Override
+    public void deconstruct() {}
+}
diff --git a/examples/lucene-linguistics/going-crazy/src/main/resources/META-INF/services/org.apache.lucene.analysis.TokenFilterFactory b/examples/lucene-linguistics/going-crazy/src/main/resources/META-INF/services/org.apache.lucene.analysis.TokenFilterFactory
new file mode 100644
index 000000000..39ee6fc5d
--- /dev/null
+++ b/examples/lucene-linguistics/going-crazy/src/main/resources/META-INF/services/org.apache.lucene.analysis.TokenFilterFactory
@@ -0,0 +1 @@
+ai.vespa.linguistics.lemmagen.LemmagenTokenFilterFactory
diff --git a/examples/lucene-linguistics/minimal/README.md b/examples/lucene-linguistics/minimal/README.md
new file mode 100644
index 000000000..cdb2673cc
--- /dev/null
+++ b/examples/lucene-linguistics/minimal/README.md
@@ -0,0 +1,3 @@
+# Minimal `lucene-linguistics` setup
+
+This application package contains a bare minimal setup to get started with `lucene-linguistics`.
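+
+The essential piece is declaring `LuceneLinguistics` as the linguistics component in `services.xml`.
+A sketch is shown below; the component id `com.yahoo.language.lucene.LuceneLinguistics` and the bundle
+name (this package's `artifactId`) are assumptions, and the `services.xml` in this package is authoritative.
+```xml
+<!-- Sketch: this element goes inside <container> in services.xml -->
+<component id="com.yahoo.language.lucene.LuceneLinguistics"
+           bundle="lucene-linguistics-minimal"/>
+```
+Build and deploy as described in the parent README: `mvn clean package && vespa deploy -w 100`.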
diff --git a/examples/lucene-linguistics/minimal/pom.xml b/examples/lucene-linguistics/minimal/pom.xml new file mode 100644 index 000000000..4ece49073 --- /dev/null +++ b/examples/lucene-linguistics/minimal/pom.xml @@ -0,0 +1,73 @@ + + + + 4.0.0 + ai.vespa + lucene-linguistics-minimal + 0.0.1 + container-plugin + + false + UTF-8 + true + 8.227.41 + + + + + com.yahoo.vespa + lucene-linguistics + ${vespa.version} + + + com.yahoo.vespa + linguistics + ${vespa.version} + provided + + + com.yahoo.vespa + application + ${vespa.version} + provided + + + + + + + com.yahoo.vespa + bundle-plugin + ${vespa.version} + true + + false + + + + com.yahoo.vespa + vespa-application-maven-plugin + ${vespa.version} + + + + packageApplication + + + + + + org.apache.maven.plugins + maven-compiler-plugin + 3.11.0 + + 17 + 17 + + + + + diff --git a/examples/lucene-linguistics/minimal/src/main/application/ext/document.json b/examples/lucene-linguistics/minimal/src/main/application/ext/document.json new file mode 100644 index 000000000..f7733dd3a --- /dev/null +++ b/examples/lucene-linguistics/minimal/src/main/application/ext/document.json @@ -0,0 +1,7 @@ +{ + "put": "id:mynamespace:lucene::mydocid", + "fields": { + "language": "en", + "mytext": "Cats and Dogs" + } +} diff --git a/examples/lucene-linguistics/minimal/src/main/application/schemas/lucene.sd b/examples/lucene-linguistics/minimal/src/main/application/schemas/lucene.sd new file mode 100644 index 000000000..86f7b7c3c --- /dev/null +++ b/examples/lucene-linguistics/minimal/src/main/application/schemas/lucene.sd @@ -0,0 +1,15 @@ +schema lucene { + + document lucene { + field language type string { + indexing: set_language + } + field mytext type string { + indexing: summary | index + } + } + + fieldset default { + fields: mytext + } +} diff --git a/examples/lucene-linguistics/minimal/src/main/application/services.xml b/examples/lucene-linguistics/minimal/src/main/application/services.xml new file mode 100644 index 000000000..9de5c9879 --- /dev/null +++ b/examples/lucene-linguistics/minimal/src/main/application/services.xml @@ -0,0 +1,22 @@ + + + + + + + + + + + + + + 1 + + + + + + diff --git a/examples/lucene-linguistics/non-java/.gitignore b/examples/lucene-linguistics/non-java/.gitignore new file mode 100644 index 000000000..44be31f2f --- /dev/null +++ b/examples/lucene-linguistics/non-java/.gitignore @@ -0,0 +1 @@ +components diff --git a/examples/lucene-linguistics/non-java/README.md b/examples/lucene-linguistics/non-java/README.md new file mode 100644 index 000000000..a17745528 --- /dev/null +++ b/examples/lucene-linguistics/non-java/README.md @@ -0,0 +1,27 @@ +# Lucene Linguistics in non-Java Vespa applications + +In non-java projects it is possible to use Lucene Linguistics as a jar bundle. + +Download and add the Vespa bundle jar into the `components` directory: +```shell +(mkdir -p components && cd components && curl -L https://github.com/dainiusjocas/vespa-lucene-linguistics-bundle/releases/download/v0.0.2/lucene-linguistics-bundle-0.0.2-deploy.jar --output lucene-linguistics-bundle-0.0.2-deploy.jar) +``` + +Deploy the application package: +```shell +vespa deploy -w 100 +``` + +Run a query: +```shell +vespa query 'query=Vespa' 'language=lt' +``` + +The logs should contain record: +```text +[2023-08-16 11:21:04.847] INFO container Container.com.yahoo.language.lucene.AnalyzerFactory Analyzer for language=lt is from a list of default language analyzers. +``` + +Profit. 
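+
+For reference, the downloaded bundle is wired up with a `<component>` declaration in `services.xml`.
+A sketch is shown below; the bundle symbolic name `lucene-linguistics-bundle` is an assumption that must
+match the downloaded jar, and the `services.xml` in this directory is authoritative.
+```xml
+<!-- Sketch: this element goes inside <container> in services.xml -->
+<component id="com.yahoo.language.lucene.LuceneLinguistics"
+           bundle="lucene-linguistics-bundle"/>
+```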
+ +The jar is hosted on [Github](https://github.com/dainiusjocas/vespa-lucene-linguistics-bundle/releases). diff --git a/examples/lucene-linguistics/non-java/schemas/lucene.sd b/examples/lucene-linguistics/non-java/schemas/lucene.sd new file mode 100644 index 000000000..86f7b7c3c --- /dev/null +++ b/examples/lucene-linguistics/non-java/schemas/lucene.sd @@ -0,0 +1,15 @@ +schema lucene { + + document lucene { + field language type string { + indexing: set_language + } + field mytext type string { + indexing: summary | index + } + } + + fieldset default { + fields: mytext + } +} diff --git a/examples/lucene-linguistics/non-java/services.xml b/examples/lucene-linguistics/non-java/services.xml new file mode 100644 index 000000000..662398418 --- /dev/null +++ b/examples/lucene-linguistics/non-java/services.xml @@ -0,0 +1,20 @@ + + + + + + + + + + + + 1 + + + + + +