Lucene/Solr 9; Java 17; Tomcat 10 #526
base: dev
Conversation
This contains the more "interesting" migrations (more than just a removed function or changed import path):

- POMs updated to use the latest versions of jackson/jersey/xml-bind (a runtime dependency might still be missing).
- Servlet API still on v4, because the Jetty server packaged with Solr is stuck on v4.
- Most simple (one-liner) migrations implemented; these should be double-checked.
- One remaining compilation error in BLSpanOrQuery that still needs to be solved.
Moving BLSpanOrQuery into Lucene's org.apache.lucene.queries.spans package gives it access to the package-private class used by SpanOrQuery. It might seem like a hack, but these classes are so intertwined with how Lucene works internally that they might as well live in the same package. Arguably, most subclasses of BLSpanQuery and BLSpans should eventually be moved to the same package. We may still need to double-check that BLSpanTermQuery and BLSpanMultiTermQueryWrapper are up to date with their Lucene 9 counterparts.
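A minimal sketch of the package trick, for illustration only (the real BLSpanOrQuery contains the actual query logic):

```java
// Hypothetical sketch: a class declared inside Lucene's own package can
// see package-private types and members of that package, which would be
// inaccessible from nl.inl.blacklab.search.lucene.
package org.apache.lucene.queries.spans;

public class BLSpanOrQuerySketch {
    // Code here may reference the package-private helpers that
    // SpanOrQuery itself uses.
}
```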
update
1 solr test failing
exclude solr module until a Jakarta-compatible version is out
# Conflicts:
#	engine/pom.xml
#	engine/src/main/java/nl/inl/blacklab/indexers/config/saxon/XPathFinder.java
#	engine/src/main/java/nl/inl/blacklab/search/lucene/SpansCaptureRelationsWithinSpan.java
no try/catch needed in LoggingWatcher
…riment/tomcat-10

# Conflicts:
#	engine/pom.xml
#	engine/src/main/java/nl/inl/blacklab/codec/BlackLab40StoredFieldsReader.java
#	engine/src/main/java/nl/inl/blacklab/search/SingleDocIdFilter.java
#	engine/src/main/java/nl/inl/blacklab/search/lucene/BLConjunctionSpans.java
#	engine/src/main/java/nl/inl/blacklab/search/lucene/BLConjunctionSpansInBuckets.java
#	engine/src/main/java/nl/inl/blacklab/search/lucene/BLFilterDocsSpans.java
#	engine/src/main/java/nl/inl/blacklab/search/lucene/BLSpanMultiTermQueryWrapper.java
#	engine/src/main/java/nl/inl/blacklab/search/lucene/BLSpanQuery.java
#	engine/src/main/java/nl/inl/blacklab/search/lucene/SpanQueryAnyToken.java
#	engine/src/main/java/nl/inl/blacklab/search/lucene/SpanQueryNoHits.java
#	engine/src/main/java/nl/inl/blacklab/search/lucene/SpanQuerySequence.java
#	engine/src/main/java/nl/inl/blacklab/search/lucene/SpansAnd.java
#	engine/src/main/java/nl/inl/blacklab/search/lucene/SpansFiltered.java
#	engine/src/main/java/nl/inl/blacklab/search/results/HitsInternal.java
#	engine/src/main/java/nl/inl/util/DocValuesUtil.java
#	engine/src/main/java/org/apache/lucene/queries/spans/BLSpanOrQuery.java
#	pom.xml
#	proxy/jaxb/src/main/java/org/ivdnt/blacklab/proxy/representation/ParsePatternResponse.java
#	proxy/jaxb/src/main/java/org/ivdnt/blacklab/proxy/representation/SummaryTextPattern.java
#	solr/pom.xml
#	solr/src/main/java/org/ivdnt/blacklab/solr/DocSetFilter.java
#	wslib/src/main/java/nl/inl/blacklab/server/exceptions/BadRequest.java
#	wslib/src/main/java/nl/inl/blacklab/server/lib/Response.java
correct analyzers version; solr.XSLTResponseWriter not found
use full classname for XSLTResponseWriter; new Lucene version => rebuild index in test resources; Jakarta needed in Solr tests
After making sure BlackLab40Codec uses Lucene87 as the delegate codec (previously it requested the default codec from Lucene, which is obviously different in Lucene 9), we still have a problem reading back our custom terms file. The cause seems to be that DataInput/DataOutput have been switched to little-endian. Presumably we need to update our custom codec to explicitly read big-endian, which is how these indexes were written. We should also add a new codec version, e.g. BlackLab41, that will use the new 9.x Lucene codec as its delegate (or even adapts: using the default codec when creating an index, and recording which delegate codec was used so it can be reinstantiated when reading later). That version of the codec can use little-endian for the custom terms files as well.
When reading an index written using Lucene 8, we need to make sure we use EndiannessReverserUtil when opening the file, because Lucene 9 switched to little-endian for DataInput/Output.
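A minimal sketch of that fix, assuming Lucene 9's lucene-backward-codecs module is on the classpath (the file name here is hypothetical):

```java
import java.io.IOException;

import org.apache.lucene.backward_codecs.store.EndiannessReverserUtil;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.IOContext;
import org.apache.lucene.store.IndexInput;

class LegacyTermsFileSketch {
    /** Open a file that the Lucene 8 based codec wrote in big-endian order. */
    static IndexInput openBigEndian(Directory dir, String fileName) throws IOException {
        // Wraps the IndexInput so multi-byte reads are byte-swapped,
        // restoring the big-endian order the file was written in.
        return EndiannessReverserUtil.openInput(dir, fileName, IOContext.READ);
    }
}
```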
The previous codec, BlackLab40Codec, can still be used to read indexes created with BlackLab 4.x / Lucene 8.x, but you cannot create new corpora or add to existing ones with this codec. The new (default) BlackLab50Codec is essentially the same, except it uses Lucene99 as its base codec and therefore follows Lucene's new little-endian byte order.
Everything seems to work now. There's a new codec (BlackLab50Codec) that is essentially identical to the previous one, except it's based on Lucene99 instead of Lucene87, so you can write indexes with it. The Solr cache was switched to CaffeineCache ((Fast)LRU was removed). Dockerfiles have been updated to Tomcat 10.
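For illustration, a minimal sketch of the delegation pattern this implies; the real BlackLab50Codec overrides more than shown here:

```java
import org.apache.lucene.codecs.FilterCodec;
import org.apache.lucene.codecs.StoredFieldsFormat;
import org.apache.lucene.codecs.lucene99.Lucene99Codec;

// Note: Lucene looks codecs up by name via SPI, so the name would also
// need to be registered under META-INF/services.
public class BlackLab50CodecSketch extends FilterCodec {
    public BlackLab50CodecSketch() {
        super("BlackLab50Sketch", new Lucene99Codec());
    }

    @Override
    public StoredFieldsFormat storedFieldsFormat() {
        // The real codec would wrap the delegate's format with
        // BlackLab-specific handling here; this sketch just delegates.
        return delegate.storedFieldsFormat();
    }
}
```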
…ches. The optional alignment operator should always include all matches for the source clause (the left side of the operator), regardless of whether there are matches for the target clause.
Before, neither []* []* nor _ _ would be recognized and simplified to a single []*, which caused bad performance. It's easy to mistake _ for meaning "any single token" (it actually means zero or more tokens), so users are likely to construct parallel queries like "de" "koe" ==>en _ _ . This optimization makes sure that doesn't cause performance issues.
The exception would only be thrown during the matching process, which seemed to create other issues. We should look into handling exceptions during matching better (immediately stopping all matching instead of trying to keep going). The unknown-function issue is now detected before matching starts.
Improvements:

- The Saxon indexer no longer scans the file twice, but determines character positions as parsing happens.
- xi:include no longer reads the base file into memory. NOTE: you must now set fileTypeOptions.enableXInclude to true in .blf.yaml.
- FileReference is used to delay deciding how to read a file, e.g. into a byte[] or via a Reader (see the sketch after the commit list below).
- Several pre-existing indexing bugs were found and fixed, e.g. using linkedDocuments (OpenSonar/CGN) from a Saxon indexer.

Reasoning: how files were handled during indexing was fairly inefficient, e.g. reading the whole file into a byte array, then converting it to a char array, then wrapping it in a Reader and parsing it using Saxon. This cost memory and CPU time. The code was also quite verbose and repetitive, because various parts of the indexing code had separate cases for files, input streams, readers, byte arrays, and so on.

Now FileReference is used to abstract an input file that may come in any of those forms. The decision how to actually process the content is delayed until we can choose the best way. E.g. VTD indexers request a byte array because that is what VTD uses, but Saxon requests a Reader. All in all, this makes the indexing code simpler, more flexible, and more resource-efficient. These are big changes though, so thorough testing is required.

=== Squashed commit of the following (all by Jan Niestadt <[email protected]>):

- commit a0abb66 (Mon Aug 12 14:41:42 2024 +0200): DocumentReference* renamed to XmlDocRef*.
- commit fd1465f (Mon Aug 12 14:35:02 2024 +0200): Bugfixes. enableXInclude option. defaultXmlProcessor flag. After indexing small test datasets with several configs, several bugs were fixed. A new fileTypeOptions setting, enableXInclude, can be set in the .blf.yaml file to enable our (primitive) xi:include handling. Disabling it by default improves performance a bit. The feature flag defaultXmlProcessor (can be set via env. var or config file) allows you to switch to Saxon as the default XML processor, which we want to do eventually. But perhaps we will only do this as part of a new .blf.yaml format version.
- commit 17581af (Fri Aug 9 12:55:59 2024 +0200): Eliminate withGetTextContent().
- commit f301fa0 (Fri Aug 9 12:34:29 2024 +0200): Eliminate createInputStream().
- commit 24ec564 (Fri Aug 9 11:31:59 2024 +0200): FileReference.getSinglePassReader().
- commit 2bd4437 (Fri Aug 9 11:17:07 2024 +0200): Improvements to FileReference.
- commit e8df998 (Thu Aug 8 15:30:52 2024 +0200): Cleanup.
- commit 5519d03 (Thu Aug 8 13:57:39 2024 +0200): WIP: use FileReference everywhere.
- commit 59d4d5f (Thu Aug 8 12:48:09 2024 +0200): PositionTrackingReader works.
- commit 941dde0 (Thu Aug 8 09:54:38 2024 +0200): Small changes to FileReference.
- commit b90e18e (Wed Aug 7 15:24:27 2024 +0200): Add FileReference, similar to DocumentReference.
- commit b06e44e (Wed Aug 7 13:50:11 2024 +0200): Work on DocumentReference. XIncludeResolverReader works.
- commit 48e9183 (Tue Aug 6 16:34:36 2024 +0200): Refactoring DocumentReference, XIncludeResolver. WIP.
- commit 39aac0e (Tue Aug 6 13:49:41 2024 +0200): Comment XIncludeResolver classes.
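A hypothetical sketch of the FileReference idea described above; method names are illustrative, not BlackLab's actual API:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.Reader;
import java.nio.charset.Charset;

/**
 * Abstracts an input file that may be backed by a file on disk, a
 * byte[], an InputStream, etc. Callers request the representation that
 * suits them, deferring the decision how to materialize the content.
 */
public interface FileReferenceSketch {
    /** E.g. what a VTD-based indexer would request. */
    byte[] getBytes() throws IOException;

    /** E.g. what the Saxon indexer would request. */
    Reader getReader(Charset charset) throws IOException;

    /** Stream access, for when single-pass processing suffices. */
    InputStream getInputStream() throws IOException;
}
```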
FileProcessor reads files from archives into a byte[] by default, because it assumes we'll convert the files in multiple threads, and the InputStream might no longer be available by the time the file is handled. However, if we're only using one thread, we can use the InputStream directly. This means we can read it into a char[] for the Saxon indexer, for example, saving the overhead of an InputStreamReader.
A char[] often uses about twice as much memory. On the other hand, with a byte[], the character-decoding overhead applies each time the file is processed, e.g. when parsing the XML and when retrieving the document contents to store in the content store. For now, we've chosen to prefer byte[], as memory is often a bottleneck during indexing.
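A minimal sketch of that trade-off, with hypothetical names (the real FileProcessor API differs):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.function.Consumer;

class ArchiveEntrySketch {
    /**
     * With multiple indexing threads, the archive's stream may have
     * advanced by the time a worker runs, so we snapshot the entry's
     * bytes first; with a single thread we can hand the stream to the
     * indexer directly and skip the copy.
     */
    static void handleEntry(InputStream entry, int numThreads,
                            Consumer<InputStream> indexDocument) throws IOException {
        if (numThreads > 1) {
            byte[] contents = entry.readAllBytes(); // safe copy for later
            indexDocument.accept(new ByteArrayInputStream(contents));
        } else {
            indexDocument.accept(entry); // stream still valid: use directly
        }
    }
}
```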
Adds rfield(..., fieldName) to BCQL to get the hits from a specific parallel field/version. This is useful to e.g. highlight one of the versions with the hits from a parallel query.

Adds a searchfield parameter to the document contents operation, to indicate that a different field should be used as the main (source) query field. In that case, the query should use the above rfield() function to make sure the resulting hits are in the correct field. E.g.:

/docs/ID/contents?field=nl&searchfield=en&patt=rfield("dog" ==>nl _, 'nl')

The above example will show the Dutch version of a document with translations of the word "dog" highlighted. This is done with a query that searches the en field for "dog" and finds alignments in the nl field, then returns only those nl hits.

Squashed commit of the following (all by Jan Niestadt <[email protected]>):

- commit a51bba3 (Mon Aug 19 15:28:28 2024 +0200): searchfield param. rfield() fixes.
- commit d667890 (Mon Aug 19 12:39:58 2024 +0200): Small changes to CharPosTrackingReader.
- commit 8e2dd4e (Fri Aug 16 12:17:38 2024 +0200): Bugfixes; rfield() works.
- commit 13ecd26 (Thu Aug 15 15:39:45 2024 +0200): WIP rfield() function to get hits in other fields.
Thanks to @eduarddrenth for this branch, which updates our previous experiment and solves more issues.
CURRENT STATUS: working, experimental. Will probably be merged in soon after v4 is released.
Old comments: