
Lucene/Solr 9; Java 17; Tomcat 10 #526

Draft · wants to merge 53 commits into base: dev

Conversation

@jan-niestadt (Member) commented Jul 11, 2024

Thanks to @eduarddrenth for this branch, which updates our previous experiment and solves more issues.

CURRENT STATUS: working, but experimental. Will probably be merged after the upcoming v4 release.

Old comments:

The main issue now seems to be that indexes created with Lucene 8 cannot be read by this Lucene 9 version. Normally, Lucene 9 should have no trouble reading Lucene 8 indexes, but our custom Codec probably interferes with this. You can see this when running the tests, some of which run against a test index precreated with Lucene 8.

KCMertens and others added 27 commits November 7, 2023 15:10
This contains the more "interesting" migrations (more than just a removed function or import path)

- poms updated to use the latest versions of jackson/jersey/xml-bind (might still miss a runtime dependency). Servlet-api is still on v4 because the Jetty server packaged in Solr is stuck on v4.
- implemented most of the simple (one-liner) migrations; these should be double-checked
- one remaining compilation error in BLSpanOrQuery that still needs to be solved.
This gives it access to the package-private class used by SpanOrQuery.

Might seem like a hack, but these classes are so intertwined with how
Lucene works internally that they might as well live in the same package.

Arguably most subclasses of BLSpanQuery and BLSpans should be moved to
the same package eventually.

We may still need to double-check that BLSpanTermQuery and BLSpanMultiTermQueryWrapper
are up to date with their Lucene 9 counterparts.
1 solr test failing
exclude solr module until a jakarta-compatible version is out
# Conflicts:
#	engine/pom.xml
#	engine/src/main/java/nl/inl/blacklab/indexers/config/saxon/XPathFinder.java
#	engine/src/main/java/nl/inl/blacklab/search/lucene/SpansCaptureRelationsWithinSpan.java
no try/catch needed
LoggingWatcher
…riment/tomcat-10

# Conflicts:
#	engine/pom.xml
#	engine/src/main/java/nl/inl/blacklab/codec/BlackLab40StoredFieldsReader.java
#	engine/src/main/java/nl/inl/blacklab/search/SingleDocIdFilter.java
#	engine/src/main/java/nl/inl/blacklab/search/lucene/BLConjunctionSpans.java
#	engine/src/main/java/nl/inl/blacklab/search/lucene/BLConjunctionSpansInBuckets.java
#	engine/src/main/java/nl/inl/blacklab/search/lucene/BLFilterDocsSpans.java
#	engine/src/main/java/nl/inl/blacklab/search/lucene/BLSpanMultiTermQueryWrapper.java
#	engine/src/main/java/nl/inl/blacklab/search/lucene/BLSpanQuery.java
#	engine/src/main/java/nl/inl/blacklab/search/lucene/SpanQueryAnyToken.java
#	engine/src/main/java/nl/inl/blacklab/search/lucene/SpanQueryNoHits.java
#	engine/src/main/java/nl/inl/blacklab/search/lucene/SpanQuerySequence.java
#	engine/src/main/java/nl/inl/blacklab/search/lucene/SpansAnd.java
#	engine/src/main/java/nl/inl/blacklab/search/lucene/SpansFiltered.java
#	engine/src/main/java/nl/inl/blacklab/search/results/HitsInternal.java
#	engine/src/main/java/nl/inl/util/DocValuesUtil.java
#	engine/src/main/java/org/apache/lucene/queries/spans/BLSpanOrQuery.java
#	pom.xml
#	proxy/jaxb/src/main/java/org/ivdnt/blacklab/proxy/representation/ParsePatternResponse.java
#	proxy/jaxb/src/main/java/org/ivdnt/blacklab/proxy/representation/SummaryTextPattern.java
#	solr/pom.xml
#	solr/src/main/java/org/ivdnt/blacklab/solr/DocSetFilter.java
#	wslib/src/main/java/nl/inl/blacklab/server/exceptions/BadRequest.java
#	wslib/src/main/java/nl/inl/blacklab/server/lib/Response.java
correct analyzers version
solr.XSLTResponseWriter not found
correct analyzers version
solr.XSLTResponseWriter not found
full classname for xsltresponsewriter
new lucene version => rebuild index in test resources
jakarta needed in solr tests
@jan-niestadt marked this pull request as draft on July 11, 2024 09:42
@jan-niestadt (Member, Author) commented:

After making sure BlackLab40Codec uses Lucene87 as the delegate codec (previously it requested the default codec from Lucene, which is obviously different in Lucene 9), we still have a problem reading back our custom terms file.

The cause seems to be that DataInput/DataOutput have been switched to little-endian. Presumably we need to update our custom codec to explicitly read big-endian, which is how these indexes were written.

We should also add a new codec version, e.g. BlackLab41, that uses the new 9.x Lucene codec as its delegate (or is even adaptive: it uses the default codec when creating an index, and records which delegate codec was used so it can be reinstantiated when reading later). That codec version can then use little-endian for the custom terms files as well.

When reading an index written using Lucene 8, we need to
make sure we use EndiannessReverserUtil when opening the file,
because Lucene 9 switched to little-endian for DataInput/Output.
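The endianness problem can be shown with plain java.nio, without any Lucene code (this is a minimal, self-contained sketch of the symptom, not Lucene's actual EndiannessReverserUtil):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class EndianDemo {
    // Decode the same four bytes using the given byte order.
    static int decode(byte[] bytes, ByteOrder order) {
        return ByteBuffer.wrap(bytes).order(order).getInt();
    }

    public static void main(String[] args) {
        // An int as a Lucene 8 era codec would have written it (big-endian).
        byte[] bytes = {0x00, 0x00, 0x00, 0x2A};
        System.out.println(decode(bytes, ByteOrder.BIG_ENDIAN));    // 42 (what was written)
        System.out.println(decode(bytes, ByteOrder.LITTLE_ENDIAN)); // 704643072 (misread)
    }
}
```

A little-endian reader silently produces wrong values rather than failing, which is why the custom terms files must be opened with an endianness-reversing wrapper when reading a Lucene 8 era index.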
@jan-niestadt force-pushed the experiment/fa-tomcat-10 branch from 9dfade3 to 7efd318 on July 11, 2024 14:01
The previous codec, BlackLab40Codec, can still be used to read
indexes created with BlackLab 4.x / Lucene 8.x, but you cannot
create new corpora or add to existing ones with this codec.

The new (default) BlackLab50Codec is essentially the same, except
it uses Lucene99 as its base codec, including Lucene's changed
endianness.
@jan-niestadt (Member, Author) commented Jul 16, 2024

Everything seems to work now. There's a new codec (BlackLab50Codec) that is essentially identical to the previous one, except it's based on Lucene99 instead of Lucene87, so you can write indexes with it. The Solr cache was switched to CaffeineCache ((Fast)LRU was removed). Dockerfiles have been updated to Tomcat 10.

@jan-niestadt changed the title from "Another attempt to upgrade to Java 17 and Lucene/Solr 9" to "Lucene/Solr 9; Java 17; Tomcat 10" on Jul 16, 2024
jan-niestadt and others added 19 commits July 25, 2024 10:10
…ches.

The optional alignment operator should always include all matches for the
source clause (the left side of the operator), regardless of whether there
are matches for the target clause.
Before, neither []* []* nor _ _ would be recognized and simplified
to a single []*, which would cause bad performance.

It's easy to mistake _ for meaning "any single token" (it actually matches 0
or more tokens), so users are likely to construct parallel queries
like "de" "koe" ==>en _ _ . This optimization makes sure that doesn't
cause performance issues.
The exception would only be thrown during the matching process, which
seemed to create other issues. We should look into handling exceptions
during matching better (immediately stopping all matching instead of trying
to keep going).

The unknown function issue is now detected before matching starts.
Improvements:
- Saxon indexer no longer scans the file twice, but determines character
  positions while parsing
- xi:include doesn't read the base file into memory.
  NOTE: you must now set fileTypeOptions.enableXInclude to true in .blf.yaml
- FileReference is used to delay deciding how to read a file, e.g.
  read it into byte[] or use a Reader.
- several pre-existing indexing bugs were found and fixed, e.g.
  using linkedDocuments (OpenSonar/CGN) from a Saxon indexer.
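The enableXInclude option mentioned above would be set in the corpus's .blf.yaml like this (the option path follows the fileTypeOptions.enableXInclude name given above; any surrounding keys are omitted):

```yaml
fileTypeOptions:
  enableXInclude: true
```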

Reasoning:
How files were handled during indexing was fairly inefficient, e.g.
reading the whole file into a byte array, then converting it to a
char array, then wrapping in a Reader and parsing using Saxon.
This cost memory and CPU time.

The code was also quite verbose and repetitive, because various parts
of the indexing code had separate cases to deal with files, inputstreams,
readers, byte arrays, and so on.

Now FileReference is used to abstract an input file that may be in any
of those forms. The decision how to actually process the content is
delayed until we can actually decide the best way. E.g. VTD indexers
request a byte array because that is what VTD uses, but Saxon requests
a Reader.

All in all, this makes the indexing code simpler, more flexible,
and more resource-efficient. Big changes though, so thorough
testing is required.
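A minimal sketch of the idea behind FileReference (illustrative; not BlackLab's actual interface): each indexer asks for the representation it needs, and no conversion happens until that point of use.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

// Hypothetical version of the FileReference idea: the caller decides
// which representation it wants, and conversion is deferred until then.
interface FileRef {
    byte[] getBytes();             // e.g. what a VTD indexer would request
    Reader getReader();            // e.g. what a Saxon indexer would request
}

class BytesFileRef implements FileRef {
    private final byte[] contents;

    BytesFileRef(byte[] contents) { this.contents = contents; }

    public byte[] getBytes() { return contents; }

    public Reader getReader() {
        // Decode lazily: no char[] copy exists until a Reader is requested.
        return new InputStreamReader(new ByteArrayInputStream(contents), StandardCharsets.UTF_8);
    }
}

public class FileRefDemo {
    public static void main(String[] args) throws IOException {
        FileRef ref = new BytesFileRef("<doc>hello</doc>".getBytes(StandardCharsets.UTF_8));
        try (BufferedReader reader = new BufferedReader(ref.getReader())) {
            System.out.println(reader.readLine()); // prints: <doc>hello</doc>
        }
    }
}
```

The design choice is the same one described above: the byte[] is the single source of truth, and each consumer pays only for the conversion it actually performs.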

===

Squashed commit of the following:

commit a0abb66
Author: Jan Niestadt <[email protected]>
Date:   Mon Aug 12 14:41:42 2024 +0200

    DocumentReference* rename to XmlDocRef*.

commit fd1465f
Author: Jan Niestadt <[email protected]>
Date:   Mon Aug 12 14:35:02 2024 +0200

    Bugfixes. enableXInclude option. defaultXmlProcessor flag.

    After indexing small test datasets with several configs, several
    bugs were found and fixed.

    A new fileTypeOptions setting, enableXInclude, can be set in the .blf.yaml
    file to enable our (primitive) xi:include handling. Disabling it by
    default improves performance a bit.

    The feature flag defaultXmlProcessor (can be set via env. var or
    config file) allows you to make Saxon the default XML processor,
    which is what we eventually want to do. But perhaps we will only
    do this as part of a new .blf.yaml format version.

commit 17581af
Author: Jan Niestadt <[email protected]>
Date:   Fri Aug 9 12:55:59 2024 +0200

    Eliminate withGetTextContent().

commit f301fa0
Author: Jan Niestadt <[email protected]>
Date:   Fri Aug 9 12:34:29 2024 +0200

    Eliminate createInputStream().

commit 24ec564
Author: Jan Niestadt <[email protected]>
Date:   Fri Aug 9 11:31:59 2024 +0200

    FileReference.getSinglePassReader().

commit 2bd4437
Author: Jan Niestadt <[email protected]>
Date:   Fri Aug 9 11:17:07 2024 +0200

    Improvements to FileReference.

commit e8df998
Author: Jan Niestadt <[email protected]>
Date:   Thu Aug 8 15:30:52 2024 +0200

    Cleanup.

commit 5519d03
Author: Jan Niestadt <[email protected]>
Date:   Thu Aug 8 13:57:39 2024 +0200

    WIP use FileReference everywhere.

commit 59d4d5f
Author: Jan Niestadt <[email protected]>
Date:   Thu Aug 8 12:48:09 2024 +0200

    PositionTrackingReader works.

commit 941dde0
Author: Jan Niestadt <[email protected]>
Date:   Thu Aug 8 09:54:38 2024 +0200

    Small changes to FileReference.

commit b90e18e
Author: Jan Niestadt <[email protected]>
Date:   Wed Aug 7 15:24:27 2024 +0200

    Add FileReference similar to DocumentReference.

commit b06e44e
Author: Jan Niestadt <[email protected]>
Date:   Wed Aug 7 13:50:11 2024 +0200

    Work on DocumentReference. XIncludeResolverReader works.

commit 48e9183
Author: Jan Niestadt <[email protected]>
Date:   Tue Aug 6 16:34:36 2024 +0200

    Refactoring DocumentReference, XIncludeResolver. WIP.

commit 39aac0e
Author: Jan Niestadt <[email protected]>
Date:   Tue Aug 6 13:49:41 2024 +0200

    Comment XIncludeResolver classes.
FileProcessor reads files from archives into byte[] by default, because
it assumes we'll convert the files in multiple threads, and the InputStream
might not be available by the time the file is handled.

However, if we're only using 1 thread, we can use the InputStream directly.
This means we can read it into a char[] for the Saxon indexer for example,
saving the overhead of an InputStreamReader.
A char[] often uses about twice as much memory as a UTF-8 byte[]. On the
other hand, with byte[], the character decoding overhead applies each time
the file is processed, e.g. when parsing the XML and when retrieving the
document contents to store in the content store.

For now, we've chosen to prefer byte[], as memory is often a
bottleneck during indexing.
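The roughly 2x figure holds for mostly-ASCII corpora: UTF-8 stores ASCII characters in one byte each, while every element of a Java char[] is two bytes. A quick illustration (the sample text is hypothetical):

```java
import java.nio.charset.StandardCharsets;

public class ByteVsChar {
    public static void main(String[] args) {
        String text = "The quick brown fox"; // ASCII-only sample: 19 characters
        int utf8Bytes = text.getBytes(StandardCharsets.UTF_8).length; // 19 bytes
        int charArrayBytes = text.toCharArray().length * 2;           // 38 bytes
        System.out.println(utf8Bytes);
        System.out.println(charArrayBytes);
    }
}
```

For text with many non-ASCII characters the ratio shrinks, since UTF-8 then uses two or more bytes per character as well.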
Adds rfield(..., fieldName) to BCQL to get the hits from a
specific parallel field/version. This is useful to e.g. highlight
one of the versions with the hits from a parallel query.

Adds searchfield parameter to document contents operation,
to indicate that a different field should be used as the
main (source) query field. In this case, the query should use the
above rfield() function to make sure the resulting hits are
in the correct field.

E.g. /docs/ID/contents?field=nl&searchfield=en&patt=rfield("dog" ==>nl _, 'nl')

The above example will show the Dutch (nl) version of a document with translations
of the word "dog" highlighted. The query searches the en field for "dog", finds
alignments in the nl field, and then returns only the nl hits.

Squashed commit of the following:

commit a51bba3
Author: Jan Niestadt <[email protected]>
Date:   Mon Aug 19 15:28:28 2024 +0200

    searchfield param. rfield() fixes.

commit d667890
Author: Jan Niestadt <[email protected]>
Date:   Mon Aug 19 12:39:58 2024 +0200

    Small changes to CharPosTrackingReader.

commit 8e2dd4e
Author: Jan Niestadt <[email protected]>
Date:   Fri Aug 16 12:17:38 2024 +0200

    Bugfixes, rfield() works.

commit 13ecd26
Author: Jan Niestadt <[email protected]>
Date:   Thu Aug 15 15:39:45 2024 +0200

    WIP rfield() function to get hits in other fields.
3 participants