Move to ElasticSearch and drop Solr/MongoDB #33

Open
wants to merge 157 commits into
base: develop
157 commits
4c2269e
Start of structure for elastic indexing
kjw Oct 13, 2017
96bfe23
Switch to one type per index, for ES 6+ compatibility
kjw Oct 16, 2017
fe52879
Use keyword and text instead of string type, for ES 5+ compatibility
kjw Oct 16, 2017
8bde6e2
Disable _all field (not present in ES 6+ anyway)
kjw Oct 17, 2017
bc6a595
More complete mappings
kjw Oct 18, 2017
24241c9
Rework member loading to load into ES
kjw Oct 19, 2017
832f59b
index-publishers renamed to index-members
kjw Oct 19, 2017
402d2fb
Index journals into ES
kjw Oct 20, 2017
0423a02
Remove some fields from category definition
kjw Oct 27, 2017
4009eac
Index subjects into ES
kjw Oct 27, 2017
ac496ac
Rewrite task entry points and user functions
kjw Oct 27, 2017
a0d9a2e
Start of the ES work indexer
kjw Oct 30, 2017
022928a
Connect parse actions to ES indexing
kjw Oct 30, 2017
560b698
More complete work indexing for ES
kjw Oct 30, 2017
d14bfa4
Mostly complete ES work indexer
kjw Oct 31, 2017
b6bc1ff
Complete ES work indexer
kjw Oct 31, 2017
85f11fd
Include unstructured field in references and various indexer fixes
kjw Oct 31, 2017
2807569
Insert work documents into ES
kjw Oct 31, 2017
0a20fe8
Basic queries and DOI look up via elastic search
kjw Oct 31, 2017
8795ce4
Basic filters for ES queries
kjw Nov 1, 2017
0fce1c8
Mostly working ES filters, dates in ES to citeproc conversion
kjw Nov 1, 2017
037d1cd
Nested filters for nested document fields
kjw Nov 2, 2017
17bb9ca
Additional metadata in ES doc to citeproc conversion
kjw Nov 2, 2017
3638f7e
Add journal id to work mapping
kjw Nov 2, 2017
da4be75
Working select parameter for ES
kjw Nov 7, 2017
060b534
Sorting for ES queries and re-enable debug parameter
kjw Nov 7, 2017
ecb1344
Report elastic client config on debug=true
kjw Nov 7, 2017
440897a
Port work filters to ES
kjw Nov 7, 2017
e020d6c
Working work facets in ES
kjw Nov 9, 2017
4fc22ab
Include published-year work field for published facet
kjw Nov 9, 2017
cad75d5
Make all work fields keywords except those ending in .text
kjw Nov 9, 2017
ba7ae13
Specific field queries for ES
kjw Nov 9, 2017
820d2c4
Reverse work look ups for ES
kjw Nov 9, 2017
5c80b03
Working sample parameter for ES
kjw Nov 10, 2017
afbf962
Include _seq_no field for random sort (without field is deprecated)
kjw Nov 10, 2017
dc39e04
API cursors using ES scroll function
kjw Nov 10, 2017
642972f
Store member and journal coverage data in ES
kjw Nov 10, 2017
e22d924
Rename published to issued and published-year to issued-year
kjw Nov 10, 2017
7b2202b
Index subjects into journal mappings
kjw Nov 13, 2017
6abf57e
Handle case of no print publication date
kjw Nov 13, 2017
73353bc
Query for and display members from ES
kjw Nov 13, 2017
baa10d2
Journal query and display for ES
kjw Nov 13, 2017
715c98a
Prefix search and display for ES
kjw Nov 13, 2017
9898fe2
Remove old code for citation checking
kjw Nov 13, 2017
95ab0d2
Citation display decision fed by ES member data
kjw Nov 13, 2017
51a02e8
Remove old doaj code
kjw Nov 13, 2017
e8e2916
Remove old patent deposit code (now handled by event data)
kjw Nov 13, 2017
2eb646c
Remove mongo configuration
kjw Nov 13, 2017
c60260f
Use ES to serve /types/:type-id/works
kjw Nov 13, 2017
b65511c
Index feed documents into ES
kjw Nov 13, 2017
4859b6e
Removal of old, unused code
kjw Nov 14, 2017
43fb731
Set minimum_should_match for ISSN matching
kjw Nov 14, 2017
a19b5c4
Auto complete fields for member, journal and funder names
kjw Nov 14, 2017
7e16ff4
Complete ES to citeproc mapping
kjw Nov 14, 2017
02bc7d3
Display contributors correctly
kjw Nov 14, 2017
c6157b7
Index funders into ES. Now a single pass load
kjw Nov 14, 2017
5c04d57
Get funder data from ES for /funders
kjw Nov 15, 2017
974a516
Fixes for fundref namespace removal
kjw Nov 15, 2017
2775ed2
Remove deposits API and all mongo code
kjw Nov 15, 2017
29952f4
Reference count and DOI updates for ES
kjw Nov 15, 2017
7343baf
Remove /licenses route (in favour of license facet)
kjw Nov 15, 2017
22420fb
Store some fields in work parent for aggregation at work level
kjw Nov 16, 2017
bbec6c1
Add link application facet
kjw Nov 16, 2017
0755fb3
Use copy_to to make parent aggregation fields
kjw Nov 17, 2017
2ace269
Index citation-id and book-id crm items
kjw Nov 17, 2017
3a4f349
New dates in selection of issued date
kjw Nov 17, 2017
5786bc2
Use most recent deposit date if a DOI has absolutely no other date
kjw Nov 17, 2017
8386551
Don't add nil relation or property values
kjw Nov 20, 2017
ac61715
Support query clauses that need bool must_not occurrence
kjw Nov 21, 2017
2bdd708
Use prefix reference distribution flag
kjw Nov 29, 2017
21f81a9
Fix reference distribution element name
kjw Nov 29, 2017
b1f9d0a
Public references are labelled "open", not "public"
kjw Nov 29, 2017
1f1906d
Projectile config to avoid searching test-data
kjw Dec 5, 2017
f4f53da
Minimal commits required after rebasing
markwoodhall Apr 4, 2018
595ec68
Update config to match master
markwoodhall Apr 6, 2018
81de840
Minor formatting fix
markwoodhall Apr 6, 2018
ae5ae9b
Remove tasks no longer used by elastic implementation
markwoodhall Apr 6, 2018
f134f05
Add assoc-exists from master
markwoodhall Apr 6, 2018
be05ce0
Parse coverage stats as doubles
markwoodhall Apr 6, 2018
e376513
Add some missing index mappings
markwoodhall Apr 6, 2018
5ba8ffa
Get hour, min, and second where applicable
markwoodhall Apr 6, 2018
372523f
Consider same dates as master for issued date, store published date
markwoodhall Apr 6, 2018
19bef44
Store contributor sequence
markwoodhall Apr 6, 2018
8b19f16
Include contributor sequence in citeproc
markwoodhall Apr 6, 2018
c1722a8
Store journal language and item source
markwoodhall Apr 6, 2018
aaf560f
Replace many uses of assoc with assoc-exists, to match master
markwoodhall Apr 6, 2018
e98ae2b
Unparse date with no milliseconds
markwoodhall Apr 6, 2018
3e6273d
Make deposited and first-deposited implementation the same as master
markwoodhall Apr 6, 2018
65b3176
Update member name, prefix, and location to match master
markwoodhall Apr 6, 2018
b31bce3
Add support for member coverage when fetching more than one member
markwoodhall Apr 6, 2018
536cbfa
Add breakdown, subjects defaults to empty vector, coverage support wh…
markwoodhall Apr 6, 2018
2de8b6c
Use should instead of filter to influence score
markwoodhall Apr 6, 2018
4cf973a
Don't wait forever when indexing if counts don't match
markwoodhall Apr 6, 2018
ffaf913
docker-compose down before starting system
markwoodhall Apr 6, 2018
c9834f2
Renames and formatting
markwoodhall Apr 6, 2018
9aa1ee9
Formatting and whitespace fixes
markwoodhall Apr 6, 2018
cc15748
Port peer review from master
markwoodhall Apr 10, 2018
bf4dc5d
Port institutions from master
markwoodhall Apr 10, 2018
0d83e63
Port journal-issue from master
markwoodhall Apr 10, 2018
d134b26
Port free-to-read to master
markwoodhall Apr 10, 2018
3092136
published-print should be explicit or default
markwoodhall Apr 10, 2018
92b3e3d
ORCID should include URI as per master, optionally include authentica…
markwoodhall Apr 10, 2018
b60cb32
Remove rel from citeproc relations
markwoodhall Apr 10, 2018
8aab425
Port crossmark-unaware content-domains from master
markwoodhall Apr 10, 2018
c4f9ffd
Add missing institution index mapping
markwoodhall Apr 10, 2018
cfd54c1
Only get date-parts for various published dates
markwoodhall Apr 10, 2018
05c0842
Default titles so output matches master, always assoc publisher
markwoodhall Apr 10, 2018
46c42ef
Integer is not applicable for reference year
markwoodhall Apr 10, 2018
e4a9755
Use match query for field queries
markwoodhall Apr 19, 2018
6259ba3
Add score to the list of selectable scores
markwoodhall Apr 19, 2018
8153ba0
Formatting
markwoodhall Apr 19, 2018
62ee432
Add sorter to api-get
markwoodhall Apr 19, 2018
581f733
Keep waiting for indexing while doc count increases
markwoodhall Apr 23, 2018
2ea8b79
Remove extra fields from title text
markwoodhall Apr 23, 2018
e84ac55
Make loading different corpus data easier
markwoodhall Apr 24, 2018
000d797
Add citation matching tests
markwoodhall Apr 24, 2018
c229868
Formatting
markwoodhall Apr 24, 2018
d490b83
Update assertion data for minor elastic variations
markwoodhall Apr 24, 2018
2d6522f
Institution and assertion don't need to be nested
markwoodhall Apr 24, 2018
c1b37ac
Disable xpack security
markwoodhall Apr 24, 2018
13edad5
Add a simple test to prove sample still works
markwoodhall Apr 25, 2018
7f1b29b
Debug message
markwoodhall Apr 27, 2018
a04dd01
Port missing agency fix from master
markwoodhall Apr 27, 2018
0bd3b78
Bring funder indexing and api in line with master
markwoodhall Apr 27, 2018
8a691ab
Formatting
markwoodhall Apr 27, 2018
a30260d
Formatting
markwoodhall Apr 27, 2018
ff6544d
Fix typo in rdf node name
markwoodhall Apr 30, 2018
c6bc719
Support hierarchy and hierarchy names as per master
markwoodhall Apr 30, 2018
379342b
Fully implement funder hierarchy
markwoodhall May 4, 2018
4183bad
Wait for funder to be indexed
markwoodhall May 4, 2018
dffdac6
Sort funder data for deterministic compare
markwoodhall May 4, 2018
2bab369
Update works assertion data
markwoodhall May 4, 2018
f38ad2e
Extract long doi for transform resources
markwoodhall May 8, 2018
74e1441
expose issn type as :type not :kind
markwoodhall May 8, 2018
f906b47
Remove score comparison from certain works tests
markwoodhall May 8, 2018
4d966e4
Use one shard for work index
markwoodhall May 8, 2018
8764d79
Add helper function to pretty print resource to path
markwoodhall May 8, 2018
a0094ac
Remove score comparison from journal work test
markwoodhall May 8, 2018
0beee15
Add breakdowns to member coverage
markwoodhall May 9, 2018
09832ad
Update subtypes for books
markwoodhall May 9, 2018
013a334
Ignore score during member works test
markwoodhall May 9, 2018
5fcf8d4
Update assertion data to reflect isbn and type fixes
markwoodhall May 10, 2018
06bd6b9
Remove load-test-journals
markwoodhall May 10, 2018
52d6b43
Set cr-funder-registry on start
markwoodhall May 10, 2018
50322a5
Delay to give coverage time to index
markwoodhall May 10, 2018
4354aaa
Update test corpus after rebase
markwoodhall May 11, 2018
3e1454c
Remove solr and outdated test documentation from readme
markwoodhall May 11, 2018
aabdb84
Fix typo
markwoodhall May 11, 2018
b29dca4
issue-36 funder route
MikeYalter Jul 11, 2018
ef97a66
allow nested multiple filters
MikeYalter Jul 11, 2018
7c49dd8
added tests to funders-test
MikeYalter Jul 12, 2018
104c100
comments and revert
MikeYalter Jul 12, 2018
5f34ea5
Add initial count-types implementation as per master PREP 55
markwoodhall Jul 12, 2018
a740615
added some comments
MikeYalter Jul 13, 2018
7294966
Merge pull request #42 from CrossRef/funder-route
afandian Jul 13, 2018
d2a37d8
Add complete count-types and coverage-type implementation
markwoodhall Jul 24, 2018
56c4bc6
Fix failing funder works test
markwoodhall Jul 25, 2018
2 changes: 2 additions & 0 deletions .gitignore
@@ -6,6 +6,8 @@
 /doc
 /tmp/*
 /log
+/dev-resources/feeds/feed-processed
+/.idea
 pom.xml
 .#*
 *.class
6 changes: 6 additions & 0 deletions .projectile
@@ -0,0 +1,6 @@
+/src
+/test
+/README.md
+/Dockerfile
+/project.clj
+/docker-compose.yml
23 changes: 7 additions & 16 deletions README.md
@@ -11,7 +11,7 @@ Make sure you have these dependencies installed within your development environm
 - Leiningen
 - Docker

-The tests require Docker to spin up service dependencies on development machine (SOLR and MongoDB). Download Docker for Mac from https://www.docker.com/docker-mac and confirm that it's installed with `docker-compose --version`.
+The tests require Docker to spin up service dependencies on development machine (ElasticSearch). Download Docker for Mac from https://www.docker.com/docker-mac and confirm that it's installed with `docker-compose --version`.

 ### Preparing CSL Resources

@@ -51,7 +51,6 @@ Run as a production service with some profiles:
 - :graph-api - Must be specified along with :api and :graph. Enables the graph API. Requires datomic leiningen profile.
 - :feed-api - Must be specified along with :api. Enables the feed API for real-time metadata ingest.
 - :process-feed-files - Run async processing of incoming feed files. Should be enabled with :feed-api.
-- :solr-inserts - Run solr inserts. Should be enabled with :feed-api or instances perform OAI-PMH harvesting.

 ## Run as a Daemon

@@ -79,27 +78,19 @@ Create a docker image:

 $ lein uberimage

-
 ## Running tests

-Running with `lein test` should take care of creating any required infrastructure, typically MongoDB and Solr.
+Running with `lein test` should take care of creating any required infrastructure, typically ElasticSearch.

-The Solr instance will be created using docker image `crossref/cayenne-solr`, this docker image is available in docker hub but
-can also be created locally by cloning `https://github.com/crossref/cayenne-solr` and running `docker image build ./ -t crossref/cayenne-solr`, building
-the image locally is useful if you want to make changes to the Solr schema.
+The ElasticSearch instance will be created using docker image `docker.elastic.co/elasticsearch/elasticsearch:6.2.3`.

-In order for the tests to pass there must be a specific set of feed files present in the feed input directory, these feed files
-are not currently in this repository because of distribution issues but this will be addressed. For now, if the expected number of feed files is not
-present an exception will be thrown:
+The default corpus loaded into ElasticSearch is located in `dev-resources/feeds/corpus`, you can switch to a different corpus using:

+```
+CAYENNE_API_TEST_CORPUS=/large-corpus lein test cayenne.corpus-test
 ```
-actual: java.lang.Exception: The number of feed input files is not as expected. Expected to find 174 files in /home/markwoodhall/src/crossref/cayenne/dev-resources/feeds/source
-```

-Note. Occasionally HTTP Kit will hold onto port 3000 after starting the API, this can sometimes cause problems with multiple
-test runs, running a subset, e.g. `lein test cayenne.works-test` is more reliable.
-
-Running tests from the REPL will also work.
+The example above switches to a larger corpus located in `dev-resources/feeds/large-corpus` for the specific test run. Keep in mind that many of tests rely on a specific corpus being loaded into ElasticSearch.

 ## Reference Visibility
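
A note on the corpus switch in the "Running tests" change above: the README text suggests that `CAYENNE_API_TEST_CORPUS` is appended to `dev-resources/feeds`, with `/corpus` as the default. The sketch below wraps that in a small shell helper that checks the directory exists before starting a test run. The function names `corpus_dir` and `run_corpus_test` are hypothetical and not part of the repository, and the path mapping is inferred from the README wording rather than from the test code.

```shell
#!/bin/sh
# Hypothetical helpers, not part of the cayenne repository.
# Assumption (inferred from the README text only): the corpus directory is
# dev-resources/feeds followed by the value of CAYENNE_API_TEST_CORPUS,
# and an unset or empty variable means the default /corpus.
corpus_dir() {
  printf '%s\n' "dev-resources/feeds${CAYENNE_API_TEST_CORPUS:-/corpus}"
}

# Refuse to start a test run against a corpus directory that does not exist.
run_corpus_test() {
  dir=$(corpus_dir)
  if [ ! -d "$dir" ]; then
    echo "corpus directory not found: $dir" >&2
    return 1
  fi
  lein test cayenne.corpus-test
}
```

From the repository root, this could be used as `export CAYENNE_API_TEST_CORPUS=/large-corpus` followed by `run_corpus_test`. The directory check only guards against a mistyped corpus name. It does not verify that the corpus matches what the tests expect.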
