Move to ElasticSearch and drop Solr/MongoDB #33

markwoodhall · 2018-05-11T09:03:34Z

WIP PR

Purpose

This pull request migrates away from Solr and MongoDB to ElasticSearch.

Highlights

Solr and MongoDB have been removed in favour of ElasticSearch for all data storage. ElasticSearch indexes exist for all of the core data types:

(def index-settings
  {"work"     {:number_of_shards 1  :number_of_replicas 3}
   "member"   {:number_of_shards 1  :number_of_replicas 3}
   "funder"   {:number_of_shards 1  :number_of_replicas 3}
   "subject"  {:number_of_shards 1  :number_of_replicas 3}
   "coverage" {:number_of_shards 1  :number_of_replicas 3}
   "journal"  {:number_of_shards 1  :number_of_replicas 3}})

The configuration for docker-compose has been adjusted to start ElasticSearch, and all references to Solr and MongoDB have been removed.
A new "corpus test" has been created; see cayenne.corpus-test. This test can run against a corpus of varying size and checks that citation matching works within a known threshold. An almost identical version of this test is included in a PR against the Solr version so that a direct comparison can be made. A scoring comparison of citation matching is attached below. A more complete comparison can be found here.

Original DOI	Matched DOI	Elastic score	Solr score
10.1002/erv.2485	10.1002/erv.2485	100.273125	90.04283
10.1002/jnr.23820	10.1002/jnr.23820	81.95269	61.502754
10.1002/jnr.23992	10.1002/jnr.23992	104.20477	88.88707
10.1002/nur.21773	10.1002/nur.21773	95.3643	85.251755
10.1007/s00125-016-4154-6	10.1007/s00125-016-4154-6	92.42219	83.01552
10.1007/s00213-016-4480-x	10.1007/s00213-016-4480-x	93.45497	76.72064
10.1007/s10964-016-0591-2	10.1007/s10964-016-0591-2	92.13747	82.10659
10.1007/s11302-016-9551-2	10.1007/s11302-016-9551-2	96.21273	75.00612
10.1007/s11682-016-9638-y	10.1007/s11682-016-9638-y	100.7581	86.417206
10.1007/s13318-016-0388-4	10.1007/s13318-016-0388-4	120.53948	105.28446
10.1016/j.alcohol.2016.08.008	10.1016/j.alcohol.2016.08.008	90.90022	78.548706
10.1016/j.bbi.2016.10.007	10.1016/j.bbi.2016.10.007	91.129654	84.56702
10.1016/j.bbr.2016.10.035	10.1016/j.bbr.2016.10.035	101.12494	90.45204
10.1016/j.biopsycho.2016.12.010	10.1016/j.biopsycho.2016.12.010	88.34703	75.23803
10.1016/j.bmc.2016.10.035	10.1016/j.bmc.2016.10.035	110.07812	94.04622
10.1016/j.explore.2016.10.009	10.1016/j.explore.2016.10.009	85.96247	69.95195
10.1016/j.infbeh.2016.09.006	10.1016/j.infbeh.2016.09.006	100.5378	86.74484
10.1016/j.jad.2016.10.035	10.1016/j.jad.2016.10.035	61.279423	53.282127
10.1016/j.jad.2016.11.036	10.1016/j.jad.2016.11.036	90.15741	81.84434
10.1016/j.jad.2016.11.046	10.1016/j.jad.2016.11.046	123.41971	103.81766
10.1016/j.joms.2016.10.033	10.1016/j.joms.2016.10.033	85.54165	76.20327
10.1016/j.neubiorev.2016.12.003	10.1016/j.neubiorev.2016.12.003	75.98824	72.51518
10.1016/j.neubiorev.2016.12.006	10.1016/j.neubiorev.2016.12.006	117.57448	91.75167
10.1016/j.neubiorev.2016.12.013	10.1016/j.neubiorev.2016.12.013	97.39974	87.65282
10.1016/j.neulet.2016.11.064	10.1016/j.neulet.2016.11.064	108.19604	93.43173
10.1016/j.neuro.2016.11.006	10.1016/j.neuro.2016.11.006	97.35805	82.715225
10.1016/j.neurobiolaging.2016.11.014	10.1016/j.neurobiolaging.2016.11.014	100.97907	89.42957
10.1016/j.neuroimage.2016.12.046	10.1016/j.neuroimage.2016.12.046	85.69545	74.715836
10.1016/j.neuron.2016.09.039	10.1016/j.neuron.2016.09.039	67.21429	61.09139
10.1016/j.nicl.2016.11.014	10.1016/j.nicl.2016.11.014	86.846924	76.55675
10.1016/j.nlm.2016.10.006	10.1016/j.nlm.2016.10.006	106.93617	95.132774
10.1016/j.nlm.2016.11.008	10.1016/j.nlm.2016.11.008	65.49764	58.382977
10.1016/j.peptides.2016.11.001	10.1016/j.peptides.2016.11.001	97.45492	82.56204
10.1016/j.physbeh.2016.10.010	10.1016/j.physbeh.2016.10.010	87.95236	70.85148
10.1016/j.physbeh.2016.11.030	10.1016/j.physbeh.2016.11.030	99.73418	85.511086
10.1016/j.physbeh.2016.12.004	10.1016/j.physbeh.2016.12.004	116.00987	93.99558
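
As a rough illustration of what "working within a known threshold" could mean, the sketch below computes the fraction of corpus entries whose best match is the original DOI and compares it with a minimum rate. This is not the code in cayenne.corpus-test: `sample-matches`, `doi-match?`, `match-rate`, `within-threshold?` and the 0.9 threshold are hypothetical names and values, and the three rows are copied from the table above.

```clojure
(require '[clojure.string :as str])

;; Hypothetical sample: [original-doi matched-doi elastic-score],
;; copied from three rows of the comparison table.
(def sample-matches
  [["10.1002/erv.2485" "10.1002/erv.2485" 100.273125]
   ["10.1002/jnr.23820" "10.1002/jnr.23820" 81.95269]
   ["10.1016/j.jad.2016.10.035" "10.1016/j.jad.2016.10.035" 61.279423]])

(defn doi-match?
  "True when the matched DOI equals the original.
  DOIs are compared case-insensitively."
  [[original matched _score]]
  (= (str/lower-case original) (str/lower-case matched)))

(defn match-rate
  "Fraction (0.0 to 1.0) of rows whose best match is the original DOI."
  [rows]
  (if (empty? rows)
    0.0
    (double (/ (count (filter doi-match? rows)) (count rows)))))

(defn within-threshold?
  "True when the match rate meets the (assumed) minimum threshold."
  [rows threshold]
  (>= (match-rate rows) threshold))

;; Example call with an assumed threshold of 0.9.
(within-threshold? sample-matches 0.9)
```

Checking the match rate rather than exact scores keeps such a test stable when absolute scores differ between search backends, as they do in the table.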

There has been a lot of "code clean up", most of which was done in the early phase of the elastic branch. A few removals worth mentioning:

OAI harvester
Datomic-backed graph API
HTML landing page interrogation
Datacite XML parser
DOI metadata quality checker
Web of Knowledge parser
Resolution URL checker
Citation analysis
DOAJ code
Old patent deposit code (now handled by event data)
Deposits API
/licenses route (in favour of license facet)
Old code for citation checking

Index settings are configured to closely match Solr. In particular, the number of shards used by the work index matches the Solr production deployment:
```
(def index-settings
  {"work"     {:number_of_shards 1  :number_of_replicas 3}
   ...})
```
There is scope to change this in the future, but it is worth keeping in mind that scoring is shard-local, so the number of shards directly affects scoring. In theory this should even out over a large enough corpus.
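
To see why the shard count affects scores, consider the inverse document frequency (IDF) part of BM25, the default similarity in recent ElasticSearch versions. By default each shard computes IDF from its own document counts. The sketch below applies the Lucene BM25 IDF formula to made-up document counts; `bm25-idf` and all of the numbers are illustrative assumptions, not values taken from this PR.

```clojure
(defn bm25-idf
  "IDF as computed by Lucene's BM25 similarity:
  ln(1 + (N - n + 0.5) / (n + 0.5)), where N is the number of documents
  in the shard and n is how many of them contain the term."
  [doc-count doc-freq]
  (Math/log (+ 1.0 (/ (+ (- doc-count doc-freq) 0.5)
                      (+ doc-freq 0.5)))))

;; Made-up corpus: 2000 documents, 110 of which contain the term.
(def single-shard-idf (bm25-idf 2000 110))

;; The same corpus split across two shards of 1000 documents each,
;; with the term unevenly distributed between them.
(def shard-a-idf (bm25-idf 1000 10))   ; term is rare on this shard
(def shard-b-idf (bm25-idf 1000 100))  ; term is common on this shard
```

With these numbers the term weighs more on shard A than on shard B, so two equally good matches could score differently depending on where they are stored. With a single shard, every document is scored against the same statistics.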

Index Structures

Much of the underlying index structure was already in place in the elastic branch; I have only changed this structure where doing so fixed an issue.

Changed year to be non-numeric here. The reasons for this are explained in the commit message.
I also ported the mappings required for new master features, e.g. peer reviews and ISBN types.

Concerns

The changes in this PR are somewhat wider ranging than just swapping in ElasticSearch; as the highlights above show, there has been a general clean-up and removal of "old code". A large portion of the functionality is proven by the existing high-level automated tests passing. However, there may be untested areas that will require testing after deployment.

Since we now have corpus tests to verify scoring across citation matching, we can get away without asserting against score in all other works-related tests. This is helpful because score does vary per test run with the elastic implementation.
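
One way to compare responses while ignoring scores is to strip the score from each item before asserting equality. This is a minimal sketch, assuming a parsed response with keyword keys shaped like the REST API's `message.items` list; `without-scores` is a hypothetical helper, not a function from this PR.

```clojure
(defn without-scores
  "Removes :score from every item so that responses can be compared
  without depending on relevance scores. Assumes (hypothetically) a
  response shaped like {:message {:items [{:DOI ... :score ...}]}}."
  [response]
  (update-in response [:message :items]
             (fn [items] (mapv #(dissoc % :score) items))))

;; Two runs that differ only in score compare equal once scores are removed.
(def run-1 {:message {:items [{:DOI "10.1002/erv.2485" :score 100.273125}]}})
(def run-2 {:message {:items [{:DOI "10.1002/erv.2485" :score 99.5}]}})

(= (without-scores run-1) (without-scores run-2))
```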

In order to make this work and avoid a mapping explosion, I have changed the underlying structure of the coverage index so that coverage by type is actually indexed. From this we can calculate an overview of the coverage, coverage counts by type, and coverage type.
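
The idea of deriving the overview from per-type documents can be sketched as follows. The document shape, the field names (`:type`, `:total`, `:with-references`) and the helper functions are all assumptions made for illustration; the actual coverage mapping in this PR may look quite different.

```clojure
;; Hypothetical per-type coverage documents for a single member.
(def coverage-by-type
  [{:type "journal-article" :total 80 :with-references 60}
   {:type "book-chapter"    :total 20 :with-references 15}])

(defn counts-by-type
  "Map of work type to the number of works of that type."
  [docs]
  (into {} (map (juxt :type :total) docs)))

(defn overall-coverage
  "Fraction of all works, across every type, that carry the given field."
  [docs field]
  (let [total (reduce + (map :total docs))]
    (if (zero? total)
      0.0
      (double (/ (reduce + (map field docs)) total)))))

;; Overview figures derived from the per-type documents.
(counts-by-type coverage-by-type)
(overall-coverage coverage-by-type :with-references)
```

Storing one small document per type keeps the set of mapped fields fixed, while the overview is computed from those documents instead of being stored under a separate field per type.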

kmeddings added the in progress label May 11, 2018

kjw added 29 commits May 11, 2018 10:05

Start of structure for elastic indexing

4c2269e

Switch to one type per index, for ES 6+ compatibility

96bfe23

Use keyword and text instead of string type, for ES 5+ compatibility

fe52879

Disable _all field (not present in ES 6+ anyway)

8bde6e2

More complete mappings

bc6a595

Rework member loading to load into ES

24241c9

index-publishers renamed to index-members

832f59b

Index journals into ES

402d2fb

Remove some fields from category definition

0423a02

Index subjects into ES

4009eac

Rewrite task entry points and user functions

ac496ac

Start of the ES work indexer

a0d9a2e

Connect parse actions to ES indexing

022928a

More complete work indexing for ES

560b698

Mostly complete ES work indexer

d14bfa4

Complete ES work indexer

b6bc1ff

Include unstructured field in references and various indexer fixes

85f11fd

Insert work documents into ES

2807569

Basic queries and DOI look up via elastic search

0a20fe8

Basic filters for ES queries

8795ce4

Mostly working ES filters, dates in ES to citeproc convertion

0fce1c8

Nested filters for nested document fields

037d1cd

Additional metadata in ES doc to citeproc conversion

17bb9ca

Add journal id to work mapping

3638f7e

Working select parameter for ES

da4be75

Sorting for ES queries and re-enable debug parameter

060b534

Report elastic client config on debug=true

ecb1344

Port work filters to ES

440897a

Working work facets in ES

e020d6c

markwoodhall added 15 commits May 11, 2018 10:12

Update works assertion data

2bab369

Extract long doi for transform resources

f38ad2e

As per the logic for a works resource we should extract the long doi. This commit also includes minor updates to the transform assertion data.

expose issn type as :type not :kind

74e1441

This commit also includes a regeneration of the journal based assertion data for minor changes

Use one shard for work index

4d966e4

As per other indexes we will use one shard for the work index, this also happens to be the most similar to the existing solr setup.

Add helper function to pretty print resource to path

8764d79

Remove score comparison from journal work test

a0094ac

Add breakdowns to member coverage

0beee15

Update subtypes for books

09832ad

Since the elastic version has no transformation of sub types we should use common versions during the index phase.

Ignore score during member works test

013a334

Update assertion data to reflect isbn and type fixes

5fcf8d4

Remove load-test-journals

06bd6b9

It is possible to just call index-journals so load-test-journals is a little redundant

Set cr-funder-registry on start

52d6b43

Set cr-funder-registry at the first available opportunity, port fix to enable starting the core only once per process, rather than once per test fixture.

Delay to give coverage time to index

50322a5

Update test corpus after rebase

4354aaa

markwoodhall force-pushed the elastic branch from df78d13 to 4354aaa May 11, 2018 09:19

Remove solr and outdated test documentation from readme

3e1454c

markwoodhall changed the title ~~Elastic~~ Move to ElasticSearch and drop Solr/MongoDB May 11, 2018

Fix typo

aabdb84

pdavis8 assigned ppolischuk Jul 4, 2018

pdavis8 added the deployment label Jul 4, 2018

MikeYalter and others added 9 commits July 11, 2018 07:49

issue-36 funder route

b29dca4

The indexing was failing due to a self reference in the ingest RDF file. Once that was fixed, the funder route for works was broken e.g.: /funders/100006151/works

allow nested multiple filters

ef97a66

added tests to funders-test

7c49dd8

test now tests for the funders/####/works route

comments and revert

104c100

query.clj change to assoc-in is unnecessary.

Add initial count-types implementation as per master PREP 55

5f34ea5

added some comments

a740615

Merge pull request #42 from CrossRef/funder-route

7294966

issue-36 funder route

Fix failing funder works test

56c4bc6
