Move to ElasticSearch and drop Solr/MongoDB #33

Open
wants to merge 157 commits into
base: develop


markwoodhall
Contributor

@markwoodhall markwoodhall commented May 11, 2018

WIP PR

Purpose

This pull request migrates away from Solr and MongoDB to ElasticSearch.

Highlights

  1. Solr and MongoDB have been removed in favour of ElasticSearch for all data storage. ElasticSearch indexes exist for all of the core data types:

    (def index-settings
      {"work"     {:number_of_shards 1  :number_of_replicas 3}
       "member"   {:number_of_shards 1  :number_of_replicas 3}
       "funder"   {:number_of_shards 1  :number_of_replicas 3}
       "subject"  {:number_of_shards 1  :number_of_replicas 3}
       "coverage" {:number_of_shards 1  :number_of_replicas 3}
       "journal"  {:number_of_shards 1  :number_of_replicas 3}})
  2. The docker-compose configuration has been adjusted to start ElasticSearch, and all references to Solr and MongoDB have been removed.

  3. A new "corpus test" has been created; see cayenne.corpus-test. This test can run against a corpus of varying size and verifies that citation matching works within a known threshold. An almost identical version of this test is included in a PR against the Solr version so that a direct comparison can be made. A scoring comparison of citation matching is attached below. A more complete comparison can be found here

| Original DOI | Matched DOI | Elastic | Solr |
|---|---|---|---|
| 10.1002/erv.2485 | 10.1002/erv.2485 | 100.273125 | 90.04283 |
| 10.1002/jnr.23820 | 10.1002/jnr.23820 | 81.95269 | 61.502754 |
| 10.1002/jnr.23992 | 10.1002/jnr.23992 | 104.20477 | 88.88707 |
| 10.1002/nur.21773 | 10.1002/nur.21773 | 95.3643 | 85.251755 |
| 10.1007/s00125-016-4154-6 | 10.1007/s00125-016-4154-6 | 92.42219 | 83.01552 |
| 10.1007/s00213-016-4480-x | 10.1007/s00213-016-4480-x | 93.45497 | 76.72064 |
| 10.1007/s10964-016-0591-2 | 10.1007/s10964-016-0591-2 | 92.13747 | 82.10659 |
| 10.1007/s11302-016-9551-2 | 10.1007/s11302-016-9551-2 | 96.21273 | 75.00612 |
| 10.1007/s11682-016-9638-y | 10.1007/s11682-016-9638-y | 100.7581 | 86.417206 |
| 10.1007/s13318-016-0388-4 | 10.1007/s13318-016-0388-4 | 120.53948 | 105.28446 |
| 10.1016/j.alcohol.2016.08.008 | 10.1016/j.alcohol.2016.08.008 | 90.90022 | 78.548706 |
| 10.1016/j.bbi.2016.10.007 | 10.1016/j.bbi.2016.10.007 | 91.129654 | 84.56702 |
| 10.1016/j.bbr.2016.10.035 | 10.1016/j.bbr.2016.10.035 | 101.12494 | 90.45204 |
| 10.1016/j.biopsycho.2016.12.010 | 10.1016/j.biopsycho.2016.12.010 | 88.34703 | 75.23803 |
| 10.1016/j.bmc.2016.10.035 | 10.1016/j.bmc.2016.10.035 | 110.07812 | 94.04622 |
| 10.1016/j.explore.2016.10.009 | 10.1016/j.explore.2016.10.009 | 85.96247 | 69.95195 |
| 10.1016/j.infbeh.2016.09.006 | 10.1016/j.infbeh.2016.09.006 | 100.5378 | 86.74484 |
| 10.1016/j.jad.2016.10.035 | 10.1016/j.jad.2016.10.035 | 61.279423 | 53.282127 |
| 10.1016/j.jad.2016.11.036 | 10.1016/j.jad.2016.11.036 | 90.15741 | 81.84434 |
| 10.1016/j.jad.2016.11.046 | 10.1016/j.jad.2016.11.046 | 123.41971 | 103.81766 |
| 10.1016/j.joms.2016.10.033 | 10.1016/j.joms.2016.10.033 | 85.54165 | 76.20327 |
| 10.1016/j.neubiorev.2016.12.003 | 10.1016/j.neubiorev.2016.12.003 | 75.98824 | 72.51518 |
| 10.1016/j.neubiorev.2016.12.006 | 10.1016/j.neubiorev.2016.12.006 | 117.57448 | 91.75167 |
| 10.1016/j.neubiorev.2016.12.013 | 10.1016/j.neubiorev.2016.12.013 | 97.39974 | 87.65282 |
| 10.1016/j.neulet.2016.11.064 | 10.1016/j.neulet.2016.11.064 | 108.19604 | 93.43173 |
| 10.1016/j.neuro.2016.11.006 | 10.1016/j.neuro.2016.11.006 | 97.35805 | 82.715225 |
| 10.1016/j.neurobiolaging.2016.11.014 | 10.1016/j.neurobiolaging.2016.11.014 | 100.97907 | 89.42957 |
| 10.1016/j.neuroimage.2016.12.046 | 10.1016/j.neuroimage.2016.12.046 | 85.69545 | 74.715836 |
| 10.1016/j.neuron.2016.09.039 | 10.1016/j.neuron.2016.09.039 | 67.21429 | 61.09139 |
| 10.1016/j.nicl.2016.11.014 | 10.1016/j.nicl.2016.11.014 | 86.846924 | 76.55675 |
| 10.1016/j.nlm.2016.10.006 | 10.1016/j.nlm.2016.10.006 | 106.93617 | 95.132774 |
| 10.1016/j.nlm.2016.11.008 | 10.1016/j.nlm.2016.11.008 | 65.49764 | 58.382977 |
| 10.1016/j.peptides.2016.11.001 | 10.1016/j.peptides.2016.11.001 | 97.45492 | 82.56204 |
| 10.1016/j.physbeh.2016.10.010 | 10.1016/j.physbeh.2016.10.010 | 87.95236 | 70.85148 |
| 10.1016/j.physbeh.2016.11.030 | 10.1016/j.physbeh.2016.11.030 | 99.73418 | 85.511086 |
| 10.1016/j.physbeh.2016.12.004 | 10.1016/j.physbeh.2016.12.004 | 116.00987 | 93.99558 |
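
As a rough illustration of what "working within a known threshold" could mean for a corpus test, here is a minimal sketch. It is not the code in cayenne.corpus-test: the function names, the row shape, and the 0.9 threshold are all assumptions made for this example, and the three sample rows are copied from the comparison table.

```clojure
;; Hypothetical row shape; the three rows are taken from the comparison table.
(def sample-results
  [{:original "10.1002/erv.2485"          :matched "10.1002/erv.2485"
    :elastic 100.273125 :solr 90.04283}
   {:original "10.1002/jnr.23820"         :matched "10.1002/jnr.23820"
    :elastic 81.95269   :solr 61.502754}
   {:original "10.1016/j.jad.2016.10.035" :matched "10.1016/j.jad.2016.10.035"
    :elastic 61.279423  :solr 53.282127}])

(defn match-rate
  "Fraction of rows whose matched DOI equals the original DOI.
  Returns 0.0 for an empty corpus."
  [rows]
  (if (empty? rows)
    0.0
    (/ (double (count (filter #(= (:original %) (:matched %)) rows)))
       (count rows))))

(defn passes-threshold?
  "True when the match rate is at or above threshold (an assumed value,
  e.g. 0.9; the PR does not state the real one)."
  [rows threshold]
  (>= (match-rate rows) threshold))
```

Because the check depends only on the ratio of correct matches, the same function can be pointed at a corpus of any size, and at results from either backend.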
  4. There has been a lot of code clean-up, most of it done in the early phase of the elastic branch. A few removals worth mentioning:
  • OAI harvester
  • Datomic-backed graph API
  • HTML landing page interrogation
  • Datacite XML parser
  • DOI metadata quality checker
  • Web of Knowledge parser
  • Resolution URL checker
  • Citation analysis
  • DOAJ code
  • Old patent deposit code (now handled by event data)
  • Deposits API
  • /licenses route (in favour of license facet)
  • Old code for citation checking
  5. Index settings are configured to closely match Solr. In particular, the number of shards used by the work index matches the Solr production deployment:

    (def index-settings
      {"work"     {:number_of_shards 1  :number_of_replicas 3}
       ...})

    There is scope to change this in the future, but keep in mind that scoring is shard-local, so the number of shards directly affects scores. In theory this should even out over a large enough corpus.
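
To make the shard-local scoring point concrete, here is a small worked sketch. It assumes the default BM25 similarity and uses the Lucene-style BM25 IDF formula; the corpus sizes and term counts are invented for illustration and do not come from this PR.

```clojure
(defn bm25-idf
  "Lucene-style BM25 inverse document frequency for a term that occurs in
  doc-freq of the doc-count documents visible to a single shard."
  [doc-count doc-freq]
  (Math/log (+ 1.0 (/ (+ (- doc-count doc-freq) 0.5)
                      (+ doc-freq 0.5)))))

;; Hypothetical corpus: 1000 documents, 10 of which contain the term.
(def single-shard-idf (bm25-idf 1000 10))

;; The same corpus split over two shards of 500 documents each,
;; with the term unevenly distributed (9 documents on A, 1 on B).
(def shard-a-idf (bm25-idf 500 9))
(def shard-b-idf (bm25-idf 500 1))

;; An even split (5 documents per shard) stays close to the
;; single-shard value, which is the "evens out" case.
(def even-split-idf (bm25-idf 500 5))
```

With the uneven split, the same term is weighted lower on shard A and higher on shard B than it would be on a single shard, so identical documents can score differently depending on where they live. The even split gives a value close to the single-shard one, which is why the effect tends to fade as the corpus grows.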

Index Structures

Much of the underlying index structure was already in place in the elastic branch; I have only changed it where doing so fixed an issue.

  1. Changed year to be non-numeric here. The reasons are explained in the commit message.
  2. Ported the mappings required for new master features, e.g. peer reviews and ISBN types.

Concerns

The changes in this PR range somewhat wider than just swapping in ElasticSearch: as the highlights above show, there has been a general clean-up and removal of "old code". A large portion of the functionality is proven by the existing high-level automated tests passing; however, there may be untested areas that will require testing after deployment.

As per the logic for a works resource we should extract the long doi.

This commit also includes minor updates to the transform assertion data.
This commit also includes a regeneration of the journal-based assertion data for minor changes.
Since we now have corpus tests to verify scoring across citation matching, we can get away without asserting against score in all other works-related tests. This is helpful because the score varies per test run with the elastic implementation.
As with the other indexes, we will use one shard for the work index; this also happens to be the closest match to the existing Solr setup.
Since the elastic version has no transformation of sub types, we should use common versions during the index phase.
It is possible to just call index-journals, so load-test-journals is a little redundant.
Set cr-funder-registry at the first available opportunity, port fix to enable starting the core only once per process, rather than once per test fixture.
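
One of the commit messages above describes dropping score assertions from the works-related tests because scores vary per run. A minimal sketch of that idea follows. The helper name and the `:DOI` / `:score` keys are assumptions for illustration, not necessarily the keys or helpers used in the actual test suite.

```clojure
(defn without-scores
  "Removes the :score key from every item so that results can be compared
  without depending on run-to-run score variation."
  [items]
  (map #(dissoc % :score) items))

;; Hypothetical response items from two test runs: same work, different scores.
(def run-1 [{:DOI "10.1002/erv.2485" :score 100.27}])
(def run-2 [{:DOI "10.1002/erv.2485" :score 99.81}])
```

Comparing `(without-scores run-1)` with `(without-scores run-2)` succeeds even though the raw responses differ, leaving score verification to the corpus tests.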
@markwoodhall markwoodhall changed the title Elastic Move to ElasticSearch and drop Solr/MongoDB May 11, 2018
MikeYalter and others added 9 commits July 11, 2018 07:49
The indexing was failing due to a self reference in the ingest RDF
file. Once that was fixed, the funder route for works was broken e.g.:
/funders/100006151/works
The test now covers the funders/####/works route.
query.clj change to assoc-in is unnecessary.
In order to make this work and avoid a mapping explosion, I have changed the underlying structure of the coverage index so that coverage by type is actually indexed. From this we can calculate an overview of the coverage, coverage counts by type, and coverage type.
7 participants