
Lucene/Solr 9; Java 17; Tomcat 10 #526

Draft · wants to merge 53 commits into base: dev

Conversation

@jan-niestadt (Member) commented Jul 11, 2024

Thanks to @eduarddrenth for this branch, which updates our previous experiment and solves more issues.

CURRENT STATUS: working, but experimental. Will probably be merged after the upcoming v4 release.

Old comments:

The main issue now seems to be that indexes created with Lucene 8 cannot be read by this Lucene 9 version. Normally, Lucene 9 should have no trouble reading Lucene 8 indexes, but our custom Codec probably interferes with this. You can see this when running the tests, some of which run against a test index precreated with Lucene 8.

KCMertens and others added 27 commits November 7, 2023 15:10
This contains the more "interesting" migrations (more than just a removed function or import path)

- poms updated to use the latest versions of jackson/jersey/xml-bind (might still miss a runtime dependency). Servlet-api is still on v4 because the Jetty server packaged in Solr is stuck on v4.
- implemented most of the simple (one-liner) migrations; these should be double-checked
- one remaining compilation error in BLSpanOrQuery that still needs to be solved.
This gives it access to the package-private class used by SpanOrQuery.

Might seem like a hack, but these classes are so intertwined with how
Lucene works internally that they might as well live in the same package.

Arguably most subclasses of BLSpanQuery and BLSpans should be moved to
the same package eventually.

We may still need to double-check that BLSpanTermQuery and BLSpanMultiTermQueryWrapper
are up to date with their Lucene 9 counterparts.
1 solr test failing
exclude solr module until a jakarta-compatible version is out
# Conflicts:
#	engine/pom.xml
#	engine/src/main/java/nl/inl/blacklab/indexers/config/saxon/XPathFinder.java
#	engine/src/main/java/nl/inl/blacklab/search/lucene/SpansCaptureRelationsWithinSpan.java
no try/catch needed
LoggingWatcher
…riment/tomcat-10

# Conflicts:
#	engine/pom.xml
#	engine/src/main/java/nl/inl/blacklab/codec/BlackLab40StoredFieldsReader.java
#	engine/src/main/java/nl/inl/blacklab/search/SingleDocIdFilter.java
#	engine/src/main/java/nl/inl/blacklab/search/lucene/BLConjunctionSpans.java
#	engine/src/main/java/nl/inl/blacklab/search/lucene/BLConjunctionSpansInBuckets.java
#	engine/src/main/java/nl/inl/blacklab/search/lucene/BLFilterDocsSpans.java
#	engine/src/main/java/nl/inl/blacklab/search/lucene/BLSpanMultiTermQueryWrapper.java
#	engine/src/main/java/nl/inl/blacklab/search/lucene/BLSpanQuery.java
#	engine/src/main/java/nl/inl/blacklab/search/lucene/SpanQueryAnyToken.java
#	engine/src/main/java/nl/inl/blacklab/search/lucene/SpanQueryNoHits.java
#	engine/src/main/java/nl/inl/blacklab/search/lucene/SpanQuerySequence.java
#	engine/src/main/java/nl/inl/blacklab/search/lucene/SpansAnd.java
#	engine/src/main/java/nl/inl/blacklab/search/lucene/SpansFiltered.java
#	engine/src/main/java/nl/inl/blacklab/search/results/HitsInternal.java
#	engine/src/main/java/nl/inl/util/DocValuesUtil.java
#	engine/src/main/java/org/apache/lucene/queries/spans/BLSpanOrQuery.java
#	pom.xml
#	proxy/jaxb/src/main/java/org/ivdnt/blacklab/proxy/representation/ParsePatternResponse.java
#	proxy/jaxb/src/main/java/org/ivdnt/blacklab/proxy/representation/SummaryTextPattern.java
#	solr/pom.xml
#	solr/src/main/java/org/ivdnt/blacklab/solr/DocSetFilter.java
#	wslib/src/main/java/nl/inl/blacklab/server/exceptions/BadRequest.java
#	wslib/src/main/java/nl/inl/blacklab/server/lib/Response.java
correct analyzers version
solr.XSLTResponseWriter not found
correct analyzers version
solr.XSLTResponseWriter not found
full classname for xsltresponsewriter
new lucene version => rebuild index in test resources
jakarta needed in solr tests
@jan-niestadt marked this pull request as draft on July 11, 2024 09:42
@jan-niestadt (Member, Author) commented:

After making sure BlackLab40Codec uses Lucene87 as the delegate codec (previously it requested the default codec from Lucene, which is obviously different in Lucene 9), we still have a problem reading back our custom terms file.

The cause seems to be that DataInput/DataOutput have been switched to little-endian. Presumably we need to update our custom codec to explicitly read big-endian, which is how these indexes were written.

We should also add a new codec version, e.g. BlackLab41, that uses the new 9.x Lucene codec as its delegate (or is even adaptive: it uses the default codec when creating an index, and records which delegate codec was used so it can be reinstantiated when reading later). That codec version can then use little-endian for the custom terms files as well.

When reading an index written using Lucene 8, we need to
make sure we use EndiannessReverserUtil when opening the file,
because Lucene 9 switched to little-endian for DataInput/Output.
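The endianness problem can be shown with plain java.nio, without any Lucene code (this is a minimal, self-contained sketch of the symptom, not Lucene's actual EndiannessReverserUtil):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class EndianDemo {
    // Decode the same four bytes using the given byte order.
    static int decode(byte[] bytes, ByteOrder order) {
        return ByteBuffer.wrap(bytes).order(order).getInt();
    }

    public static void main(String[] args) {
        // An int as a Lucene 8 era codec would have written it (big-endian).
        byte[] bytes = {0x00, 0x00, 0x00, 0x2A};
        System.out.println(decode(bytes, ByteOrder.BIG_ENDIAN));    // 42 (what was written)
        System.out.println(decode(bytes, ByteOrder.LITTLE_ENDIAN)); // 704643072 (misread)
    }
}
```

A little-endian reader silently produces wrong values rather than failing, which is why the custom terms files must be opened with an endianness-reversing wrapper when reading a Lucene 8 era index.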
@jan-niestadt force-pushed the experiment/fa-tomcat-10 branch from 9dfade3 to 7efd318 on July 11, 2024 14:01
The previous codec, BlackLab40Codec, can still be used to read
indexes created with BlackLab 4.x / Lucene 8.x, but you cannot
create new corpora or add to existing ones with this codec.

The new (default) BlackLab50Codec is essentially the same, except
it uses Lucene99 as its base codec, including Lucene's changed
endianness.
@jan-niestadt (Member, Author) commented Jul 16, 2024

Everything seems to work now. There's a new codec (BlackLab50Codec) that is essentially identical to the previous one, except it's based on Lucene99 instead of Lucene87, so you can write indexes with it. The Solr cache was switched to CaffeineCache ((Fast)LRU was removed). Dockerfiles have been updated to Tomcat 10.

@jan-niestadt changed the title from "Another attempt to upgrade to Java 17 and Lucene/Solr 9" to "Lucene/Solr 9; Java 17; Tomcat 10" on Jul 16, 2024
jan-niestadt and others added 19 commits July 25, 2024 10:10
…ches.

The optional alignment operator should always include all matches for the
source clause (the left side of the operator), regardless of whether there
are matches for the target clause.
Before, neither []* []* nor _ _ would be recognized and simplified
to a single []*, which would cause bad performance.

It's easy to mistake _ for meaning "any single token" (it actually matches 0
or more tokens), so users are likely to construct parallel queries
like "de" "koe" ==>en _ _ . This optimization makes sure that doesn't
cause performance issues.
The exception would only be thrown during the matching process, which
seemed to create other issues. We should look into handling exceptions
during matching better (immediately stopping all matching instead of trying
to keep going).

The unknown function issue is now detected before matching starts.
Improvements:
- Saxon indexer no longer scans the file twice, but determines character
  positions while parsing
- xi:include doesn't read the base file into memory.
  NOTE: you must now set fileTypeOptions.enableXInclude to true in .blf.yaml
- FileReference is used to delay deciding how to read a file, e.g.
  read it into byte[] or use a Reader.
- several pre-existing indexing bugs were found and fixed, e.g.
  using linkedDocuments (OpenSonar/CGN) from a Saxon indexer.
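The enableXInclude option mentioned above would be set in the corpus's .blf.yaml like this (the option path follows the fileTypeOptions.enableXInclude name given above; any surrounding keys are omitted):

```yaml
fileTypeOptions:
  enableXInclude: true
```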

Reasoning:
How files were handled during indexing was fairly inefficient, e.g.
reading the whole file into a byte array, then converting it to a
char array, then wrapping in a Reader and parsing using Saxon.
This cost memory and CPU time.

The code was also quite verbose and repetitive, because various parts
of the indexing code had separate cases to deal with files, inputstreams,
readers, byte arrays, and so on.

Now FileReference is used to abstract an input file that may be in any
of those forms. The decision how to actually process the content is
delayed until we can actually decide the best way. E.g. VTD indexers
request a byte array because that is what VTD uses, but Saxon requests
a Reader.

All in all, this makes the indexing code simpler, more flexible,
and more resource-efficient. Big changes though, so thorough
testing is required.
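A minimal sketch of the idea behind FileReference (illustrative; not BlackLab's actual interface): each indexer asks for the representation it needs, and no conversion happens until that point of use.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

// Hypothetical version of the FileReference idea: the caller decides
// which representation it wants, and conversion is deferred until then.
interface FileRef {
    byte[] getBytes();             // e.g. what a VTD indexer would request
    Reader getReader();            // e.g. what a Saxon indexer would request
}

class BytesFileRef implements FileRef {
    private final byte[] contents;

    BytesFileRef(byte[] contents) { this.contents = contents; }

    public byte[] getBytes() { return contents; }

    public Reader getReader() {
        // Decode lazily: no char[] copy exists until a Reader is requested.
        return new InputStreamReader(new ByteArrayInputStream(contents), StandardCharsets.UTF_8);
    }
}

public class FileRefDemo {
    public static void main(String[] args) throws IOException {
        FileRef ref = new BytesFileRef("<doc>hello</doc>".getBytes(StandardCharsets.UTF_8));
        try (BufferedReader reader = new BufferedReader(ref.getReader())) {
            System.out.println(reader.readLine()); // prints: <doc>hello</doc>
        }
    }
}
```

The design choice is the same one described above: the byte[] is the single source of truth, and each consumer pays only for the conversion it actually performs.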

===

Squashed commit of the following:

commit a0abb66
Author: Jan Niestadt <[email protected]>
Date:   Mon Aug 12 14:41:42 2024 +0200

    DocumentReference* rename to XmlDocRef*.

commit fd1465f
Author: Jan Niestadt <[email protected]>
Date:   Mon Aug 12 14:35:02 2024 +0200

    Bugfixes. enableXInclude option. defaultXmlProcessor flag.

    After indexing small test datasets with several configs, several
    bugs were found and fixed.

    A new fileTypeOptions setting, enableXInclude, can be set in the .blf.yaml
    file to enable our (primitive) xi:include handling. Disabling it by
    default improves performance a bit.

    The feature flag defaultXmlProcessor (can be set via env. var or
    config file) allows you to make Saxon the default XML processor,
    which is what we eventually want to do. But perhaps we will only
    do this as part of a new .blf.yaml format version.

commit 17581af
Author: Jan Niestadt <[email protected]>
Date:   Fri Aug 9 12:55:59 2024 +0200

    Eliminate withGetTextContent().

commit f301fa0
Author: Jan Niestadt <[email protected]>
Date:   Fri Aug 9 12:34:29 2024 +0200

    Eliminate createInputStream().

commit 24ec564
Author: Jan Niestadt <[email protected]>
Date:   Fri Aug 9 11:31:59 2024 +0200

    FileReference.getSinglePassReader().

commit 2bd4437
Author: Jan Niestadt <[email protected]>
Date:   Fri Aug 9 11:17:07 2024 +0200

    Improvements to FileReference.

commit e8df998
Author: Jan Niestadt <[email protected]>
Date:   Thu Aug 8 15:30:52 2024 +0200

    Cleanup.

commit 5519d03
Author: Jan Niestadt <[email protected]>
Date:   Thu Aug 8 13:57:39 2024 +0200

    WIP use FileReference everywhere.

commit 59d4d5f
Author: Jan Niestadt <[email protected]>
Date:   Thu Aug 8 12:48:09 2024 +0200

    PositionTrackingReader works.

commit 941dde0
Author: Jan Niestadt <[email protected]>
Date:   Thu Aug 8 09:54:38 2024 +0200

    Small changes to FileReference.

commit b90e18e
Author: Jan Niestadt <[email protected]>
Date:   Wed Aug 7 15:24:27 2024 +0200

    Add FileReference similar to DocumentReference.

commit b06e44e
Author: Jan Niestadt <[email protected]>
Date:   Wed Aug 7 13:50:11 2024 +0200

    Work on DocumentReference. XIncludeResolverReader works.

commit 48e9183
Author: Jan Niestadt <[email protected]>
Date:   Tue Aug 6 16:34:36 2024 +0200

    Refactoring DocumentReference, XIncludeResolver. WIP.

commit 39aac0e
Author: Jan Niestadt <[email protected]>
Date:   Tue Aug 6 13:49:41 2024 +0200

    Comment XIncludeResolver classes.
FileProcessor reads files from archives into byte[] by default, because
it assumes we'll convert the files in multiple threads, and the InputStream
might not be available by the time the file is handled.

However, if we're only using 1 thread, we can use the InputStream directly.
This means we can read it into a char[] for the Saxon indexer for example,
saving the overhead of an InputStreamReader.
A char[] often uses about twice as much memory as a UTF-8 byte[]. On the
other hand, with byte[], the character decoding overhead applies each time
the file is processed, e.g. when parsing the XML and when retrieving the
document contents to store in the content store.

For now, we've chosen to prefer byte[], as memory is often a
bottleneck during indexing.
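The roughly 2x figure holds for mostly-ASCII corpora: UTF-8 stores ASCII characters in one byte each, while every element of a Java char[] is two bytes. A quick illustration (the sample text is hypothetical):

```java
import java.nio.charset.StandardCharsets;

public class ByteVsChar {
    public static void main(String[] args) {
        String text = "The quick brown fox"; // ASCII-only sample: 19 characters
        int utf8Bytes = text.getBytes(StandardCharsets.UTF_8).length; // 19 bytes
        int charArrayBytes = text.toCharArray().length * 2;           // 38 bytes
        System.out.println(utf8Bytes);
        System.out.println(charArrayBytes);
    }
}
```

For text with many non-ASCII characters the ratio shrinks, since UTF-8 then uses two or more bytes per character as well.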
Adds rfield(..., fieldName) to BCQL to get the hits from a
specific parallel field/version. This is useful to e.g. highlight
one of the versions with the hits from a parallel query.

Adds searchfield parameter to document contents operation,
to indicate that a different field should be used as the
main (source) query field. In this case, the query should use the
above rfield() function to make sure the resulting hits are
in the correct field.

E.g. /docs/ID/contents?field=nl&searchfield=en&patt=rfield("dog" ==>nl _, 'nl')

The above example will show the Dutch (nl) version of a document with translations
of the word "dog" highlighted. The query searches the en field for "dog", finds
alignments in the nl field, and then returns only the nl hits.

Squashed commit of the following:

commit a51bba3
Author: Jan Niestadt <[email protected]>
Date:   Mon Aug 19 15:28:28 2024 +0200

    searchfield param. rfield() fixes.

commit d667890
Author: Jan Niestadt <[email protected]>
Date:   Mon Aug 19 12:39:58 2024 +0200

    Small changes to CharPosTrackingReader.

commit 8e2dd4e
Author: Jan Niestadt <[email protected]>
Date:   Fri Aug 16 12:17:38 2024 +0200

    Bugfixes, rfield() works.

commit 13ecd26
Author: Jan Niestadt <[email protected]>
Date:   Thu Aug 15 15:39:45 2024 +0200

    WIP rfield() function to get hits in other fields.
3 participants