Move to ElasticSearch and drop Solr/MongoDB #33
Open: markwoodhall wants to merge 157 commits into develop from elastic
As per the logic for a works resource, we should extract the long DOI. This commit also includes minor updates to the transform assertion data.
This commit also regenerates the journal-based assertion data to account for minor changes.
Since we now have corpus tests to verify scoring across citation matching, we no longer need to assert against score in the other works-related tests. This helps because score varies between test runs with the Elastic implementation.
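The idea of checking citation matching against a threshold, rather than asserting exact scores, can be sketched roughly as follows. This is an illustrative Python sketch, not the project's Clojure test: `match_rate`, `check_corpus`, `toy_matcher`, the sample corpus and the DOIs are all hypothetical stand-ins.

```python
from typing import Callable, Iterable, Optional, Tuple

Corpus = Iterable[Tuple[str, str]]          # (citation text, expected DOI)
Matcher = Callable[[str], Optional[str]]    # returns the best-matching DOI, if any


def match_rate(corpus: Corpus, matcher: Matcher) -> float:
    """Fraction of citations whose top match is the expected DOI."""
    pairs = list(corpus)
    if not pairs:
        raise ValueError("corpus must not be empty")
    hits = sum(1 for citation, expected in pairs if matcher(citation) == expected)
    return hits / len(pairs)


def check_corpus(corpus: Corpus, matcher: Matcher, threshold: float) -> bool:
    """Pass when the match rate meets the threshold; raw scores are never compared."""
    return match_rate(corpus, matcher) >= threshold


# Hypothetical matcher: a real one would query the search index instead.
def toy_matcher(citation: str) -> Optional[str]:
    known = {"smith 2010 widgets": "10.5555/a", "jones 2012 gadgets": "10.5555/b"}
    return known.get(citation.lower())


corpus = [
    ("Smith 2010 Widgets", "10.5555/a"),
    ("Jones 2012 Gadgets", "10.5555/b"),
    ("Brown 2015 Gizmos", "10.5555/c"),
    ("Green 2018 Doodads", "10.5555/d"),
]
```

Because only the identity of the top match is compared, a test of this shape stays stable even when the underlying relevance scores drift between runs.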
As with the other indexes, we will use one shard for the work index. This is also the closest match to the existing Solr setup.
Since the Elastic version does not transform sub types, we should use the common versions during the index phase.
It is possible to call index-journals directly, so load-test-journals is a little redundant.
Set cr-funder-registry at the first available opportunity. Port the fix that starts the core only once per process, rather than once per test fixture.
Indexing was failing due to a self-reference in the ingest RDF file. Once that was fixed, the funder route for works was broken, e.g. /funders/100006151/works
The test now covers the funders/####/works route.
The query.clj change to assoc-in is unnecessary.
issue-36 funder route
To make this work and avoid a mapping explosion, I have changed the underlying structure of the coverage index so that coverage by type is what is actually indexed. From this we can calculate an overview of the coverage, coverage counts by type, and coverage type.
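One way to picture deriving an overview from per-type coverage documents is sketched below. This is illustrative Python only: the document shape and the field names (`type`, `total`, `with`, `abstracts`, `orcids`) are assumptions made for the example and are not taken from the actual index mapping.

```python
from collections import defaultdict

# Hypothetical per-type coverage documents, one per work type.
coverage_by_type = [
    {"type": "journal-article", "total": 80, "with": {"abstracts": 40, "orcids": 20}},
    {"type": "book-chapter", "total": 20, "with": {"abstracts": 10, "orcids": 0}},
]


def counts_by_type(docs):
    """Number of works for each type."""
    return {d["type"]: d["total"] for d in docs}


def overview(docs):
    """Overall coverage per field: works carrying the field / all works."""
    total = sum(d["total"] for d in docs)
    if total == 0:
        return {}
    sums = defaultdict(int)
    for d in docs:
        for field, n in d["with"].items():
            sums[field] += n
    return {field: n / total for field, n in sums.items()}
```

Keeping a small, fixed set of fields per type document, rather than one field per type and metric combination, is the kind of layout that keeps the mapping from growing with every new type.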
WIP PR
Purpose
This pull request migrates away from Solr and MongoDB to ElasticSearch.
Highlights
Solr and MongoDB have been removed in favour of ElasticSearch for all data storage. ElasticSearch indexes exist for all of the core data types:
The configuration for docker-compose has been adjusted to start ElasticSearch; all references to Solr and MongoDB have been removed.
A new "corpus test" has been created; see cayenne.corpus-test. This test can work against a corpus of varying size and proves that citation matching is working within a known threshold. An almost identical version of this test is included in a PR to the Solr version so a direct comparison can be made. I've attached a scoring comparison of citation matching below. A more complete comparison can be found here.
Index settings are configured to closely match Solr; in particular, the number of shards used by the work index matches the Solr production deployment.
There is scope to change this in the future, but it is worth keeping in mind that scoring is shard local, so the number of shards directly impacts scoring. In theory this should even out over a large enough corpus.
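To illustrate why shard-local scoring matters, the sketch below compares the inverse document frequency of a term computed over one shard with the same corpus split across two shards. It is Python for illustration only and assumes BM25-style scoring with the Lucene IDF formula; the document counts are made up, and the similarity actually configured for this deployment may differ.

```python
import math


def bm25_idf(doc_count: int, doc_freq: int) -> float:
    """Lucene-style BM25 IDF: ln(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


# Hypothetical corpus: 1000 documents, 20 of which contain the query term.
single_shard = bm25_idf(1000, 20)

# The same corpus split into two shards of 500 documents, with the term
# unevenly distributed: 15 matching documents on shard A, 5 on shard B.
shard_a = bm25_idf(500, 15)
shard_b = bm25_idf(500, 5)

# An even split (10 matching documents per shard) lands close to the
# single-shard value, which is the "evens out" case.
even_split = bm25_idf(500, 10)

print(f"single shard: {single_shard:.3f}")
print(f"shard A: {shard_a:.3f}, shard B: {shard_b:.3f}")
print(f"even split: {even_split:.3f}")
```

With an uneven distribution, the same term is weighted differently depending on which shard a document lives on. Using a single shard for the work index avoids the effect.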
Index Structures
Much of the underlying index structure was already in place in the elastic branch; I have only changed it where doing so fixed an issue.
year to be non numeric here. The reasons for this are explained in the commit message.
Concerns
The changes in this PR are somewhat wider ranging than just swapping in ElasticSearch; as the highlights above show, there has been a general clean up and removal of "old code". A large portion of the functionality is proven by the existing high-level automated tests passing. However, there may be untested areas that will require testing after deployment.