462 simplify oai pmh process #489
base: master
Conversation
It builds locally but the tests fail.
Tests are now working.
I'll put this PR on hold until after the next deployment.
Force-pushed from 65cfc83 to ae6fd45.
Get rid of the XML splitting process and only use a single OAI-PMH interval step, due to the newer pica bulks.
Force-pushed from 082ac9e to 780d575.
I updated this branch and got rid of all conflicts. @dr0i, could you have a look?
In #462 (comment), @fsteeg hinted that the OAI-PMH data was saved so that the transformation could be rerun without fetching the data from OAI-PMH again. With the new pica binary bulk, fetching all new Sigel updates plus the transformation only takes 3 minutes. Would this be sufficient?
This will change over time - after 11 months or so, if there are more changes, one cannot be sure. Also, I don't know what happens if the download of updates fails while we want to reindex everything. Would all the updates then be missing?
But then there will be a new batch file, and we will need to exchange the old one and update the starting date for the updates. We will get an email reminder for that, see: #504
At the moment the index is built from scratch every night: fetch ALL data via OAI-PMH -> split it -> transform it -> index it. The source data is only kept to rerun the transformation manually. Or am I missing something? We could add a step that saves the OAI-PMH data in an XML file, like here: https://gitlab.com/oersi/oersi-etl/-/blob/master/data/production/duepublico/duepublico-to-oersi.flux?ref_type=heads#L7-15 and test whether the new data is bigger than or equal to the old. I am not sure how to do this in Java.
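For illustration, a minimal sketch of such a save-and-compare step in plain Java NIO (the file names are hypothetical, not what the build actually uses):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class KeepLargerHarvest {

	public static void main(String[] args) throws IOException {
		// Hypothetical file names; the "new" file is what the current harvest just wrote.
		Path oldDump = Paths.get("oai-pmh-data.xml");
		Path newDump = Paths.get("oai-pmh-data-new.xml");

		long oldSize = Files.exists(oldDump) ? Files.size(oldDump) : 0;
		long newSize = Files.size(newDump);

		// Only replace the saved dump if the new harvest is at least as big,
		// so a partial or failed harvest does not overwrite good data.
		if (newSize >= oldSize) {
			Files.move(newDump, oldDump, StandardCopyOption.REPLACE_EXISTING);
		} else {
			throw new IllegalStateException("New OAI-PMH dump (" + newSize
					+ " bytes) is smaller than the old one (" + oldSize + " bytes)");
		}
	}
}
```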
Oh, right - so we at least have something. I think this is sufficient.
Thus, you will have to specify two parameters in @conf/application.conf@: (1) the date from which the updates start (usually the date of the base dump creation, e.g. 2013-06-01) and (2) the interval size in days (must not be too large).
Thus, you will have to specify one parameter in @conf/application.conf@: the date from which the updates start (usually the date of the base dump creation, e.g. 2013-06-01).
Are you sure that, if I understand correctly, 365 days (the max?!) won't be "too large"?
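If a single request should indeed turn out to be too large, one generic way to keep intervals is to chunk the update range into fixed-size harvest windows. A minimal sketch with plain java.time (not part of this PR; all names are made up):

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

public class HarvestWindows {

	// Split [start, end] into consecutive windows of at most `days` days each.
	static List<LocalDate[]> windows(LocalDate start, LocalDate end, int days) {
		List<LocalDate[]> result = new ArrayList<>();
		LocalDate from = start;
		while (!from.isAfter(end)) {
			LocalDate until = from.plusDays(days - 1);
			if (until.isAfter(end)) {
				until = end;
			}
			result.add(new LocalDate[] { from, until });
			from = until.plusDays(1);
		}
		return result;
	}

	public static void main(String[] args) {
		// Harvest each window separately instead of one request spanning a whole year.
		for (LocalDate[] w : windows(LocalDate.parse("2013-06-01"), LocalDate.parse("2014-05-31"), 30)) {
			System.out.println("harvest from=" + w[0] + " until=" + w[1]);
		}
	}
}
```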
app/transformation/TransformAll.java
Outdated
TransformDbs.process(dbsOutput, geoServer, wikidataLookupFilename); // Start process DBS data.

String sigelBulkOutput = outputPath + "-sigelBulk";
String sigelUpdatesOutput = outputPath + "-sigelUpdates";
There are superfluous tabs at the end of the line.
app/transformation/TransformAll.java
Outdated
writeAll(sigelBulkOutput, resultWriter);
if (startOfUpdates != "") { // exclude updates for the tests, which set startOfUpdates to ""
	writeAll(sigelUpdatesOutput, resultWriter);
}
There are superfluous tabs at the end of the line.
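A side note on the guard quoted above: in Java, != on strings compares references, not content. It may work here because the tests pass the literal "", but a content check is more robust. A possible variant, just as an illustration and not part of this PR:

```java
// Content-based check instead of reference comparison; robust even if
// startOfUpdates is read from configuration rather than passed as a "" literal.
if (startOfUpdates != null && !startOfUpdates.isEmpty()) {
	writeAll(sigelUpdatesOutput, resultWriter);
}
```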
JsonEncoder encodeJson = new JsonEncoder();
encodeJson.setPrettyPrinting(true);
ObjectWriter objectWriter = new ObjectWriter<>(outputPath);
objectWriter.setAppendIfFileExists(true);
splitFileOpener//
sigelOaiPmhUpdates//
Superfluous tabs at end of line.
@dr0i Could you review it again?
@@ -61,15 +59,15 @@ public static void tearDown() {

@Test
public void multiLangAlternateName() throws IOException {
assertThat(new String(
assertThat(new String(
Superfluous tabulator.
Files.readAllBytes(Paths.get(TransformAll.DATA_OUTPUT_FILE))))
.as("transformation output with multiLangAlternateName")
.contains("Leibniz Institute").contains("Berlin SBB");
}

@Test
public void separateUrlAndProvidesFields() throws IOException {
assertThat(new String(
assertThat(new String(
Superfluous tabulator.
@@ -78,7 +76,7 @@ public void separateUrlAndProvidesFields() throws IOException {

@Test
public void preferSigelData() throws IOException {
assertThat(new String(
assertThat(new String(
Files.readAllBytes(Paths.get(TransformAll.DATA_OUTPUT_FILE))))
Superfluous tabulator.
The README seems to be far outdated (mentioning Metafacture 4.0.0 etc.). I will provide an update of the README and, by the way, test whether the data is updated as expected.
@dr0i I have adjusted the formatting.
Related to #462
This PR does four things:
Context:
The old process was overly complex since it set intervals for harvesting and split all files into single XMLs.
The intervals were probably needed because there were a lot of updates due to the old pica XML bulk from 2013, which we do not use anymore. It also dates from a time when all records were merged: https://github.com/search?q=repo%3Ahbz%2Flobid-organisations+interval&type=pullrequests
The splitting into single XMLs is not needed anymore since it was only used for the updates.
Three questions need to be answered:
How do we update the tests? They still depend on the old XML splitting process.
I skip the OAI-PMH harvest with a fake date and a zero interval. Is there a way to skip the OAI-PMH process if the OAI-PMH endpoint returns an error? ae6fd45
Do we want to save the OAI-PMH harvested data as an XML file?
Do we need a fallback if we cannot reach the OAI-PMH endpoint? (See the sketch below for one option.)
This would also fix #487.
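Regarding the fallback question above: one rough option would be to catch a failed harvest and reuse the updates file from the last successful run instead of aborting the whole nightly rebuild. A minimal sketch, where the class, method, and file names are hypothetical:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class HarvestWithFallback {

	// Hypothetical harvest step: in the real code this would be the OAI-PMH
	// harvest plus transformation writing its result to `updatesOutput`.
	static void harvestUpdates(Path updatesOutput) throws IOException {
		// ... run the OAI-PMH harvest; throws IOException if the endpoint is unreachable
	}

	public static void main(String[] args) {
		Path updatesOutput = Paths.get("output-sigelUpdates");
		try {
			harvestUpdates(updatesOutput);
		} catch (IOException e) {
			// Fallback: reuse the updates from the last successful run instead of
			// failing the whole nightly rebuild.
			if (Files.exists(updatesOutput)) {
				System.err.println("OAI-PMH unreachable, reusing " + updatesOutput + ": " + e);
			} else {
				throw new RuntimeException("OAI-PMH unreachable and no previous updates found", e);
			}
		}
	}
}
```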