462 simplify oai pmh process #489
base: master
Conversation
It builds locally but the tests fail.
Tests are now working.
I'll put this PR on hold until after the next deployment.
Force-pushed from 65cfc83 to ae6fd45.
Get rid of the XML splitting process and only use a single OAI-PMH interval step, due to the newer pica bulks.
Force-pushed from 082ac9e to 780d575.
I updated this branch and got rid of all conflicts. @dr0i, could you have a look?
In #462 (comment), @fsteeg hinted that the OAI-PMH data was saved so that the transformation could be rerun without fetching the data from OAI-PMH again. With the new pica binary bulk, fetching all new Sigel updates plus the transformation only takes 3 minutes. Would this be sufficient?
This will change over time - after 11 months or so, if there are more changes, one cannot be sure. Also, I don't know what happens if the download of updates fails while we want to reindex everything. Would all the updates then be missing?
But then there will be a new batch file, and we will need to exchange the old one and update the starting date for the updates. We will get an email reminder for that, see: #504
At the moment the index is built from scratch every night: fetch ALL data via OAI-PMH -> split it -> transform it -> index it. The source data is only kept to rerun the transformation manually. Or am I missing something? We could add a step that saves the OAI-PMH data in an XML file, like here: https://gitlab.com/oersi/oersi-etl/-/blob/master/data/production/duepublico/duepublico-to-oersi.flux?ref_type=heads#L7-15 and test whether the new data is bigger than or equal to the old. I am not sure how to do this in Java.
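For illustration, a minimal sketch of such a save-and-compare step in plain Java NIO (the file names are hypothetical, not what the build actually uses):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class KeepLargerHarvest {

	public static void main(String[] args) throws IOException {
		// Hypothetical file names; the "new" file is what the current harvest just wrote.
		Path oldDump = Paths.get("oai-pmh-data.xml");
		Path newDump = Paths.get("oai-pmh-data-new.xml");

		long oldSize = Files.exists(oldDump) ? Files.size(oldDump) : 0;
		long newSize = Files.size(newDump);

		// Only replace the saved dump if the new harvest is at least as big,
		// so a partial or failed harvest does not overwrite good data.
		if (newSize >= oldSize) {
			Files.move(newDump, oldDump, StandardCopyOption.REPLACE_EXISTING);
		} else {
			throw new IllegalStateException("New OAI-PMH dump (" + newSize
					+ " bytes) is smaller than the old one (" + oldSize + " bytes)");
		}
	}
}
```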
Oh, right - so we at least have something. I think this is sufficient.
Thus, you will have to specify two parameters in @conf/application.conf@: (1) the date from which the updates start (usually the date of the base dump creation, e.g. 2013-06-01) and (2) the interval size in days (must not be too large).
Thus, you will have to specify one parameter in @conf/application.conf@: the date from which the updates start (usually the date of the base dump creation, e.g. 2013-06-01).
Are you sure that, if I understand correctly, 365 days (the max?!) won't be "too large"?
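If a single request should indeed turn out to be too large, one generic way to keep intervals is to chunk the update range into fixed-size harvest windows. A minimal sketch with plain java.time (not part of this PR; all names are made up):

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

public class HarvestWindows {

	// Split [start, end] into consecutive windows of at most `days` days each.
	static List<LocalDate[]> windows(LocalDate start, LocalDate end, int days) {
		List<LocalDate[]> result = new ArrayList<>();
		LocalDate from = start;
		while (!from.isAfter(end)) {
			LocalDate until = from.plusDays(days - 1);
			if (until.isAfter(end)) {
				until = end;
			}
			result.add(new LocalDate[] { from, until });
			from = until.plusDays(1);
		}
		return result;
	}

	public static void main(String[] args) {
		// Harvest each window separately instead of one request spanning a whole year.
		for (LocalDate[] w : windows(LocalDate.parse("2013-06-01"), LocalDate.parse("2014-05-31"), 30)) {
			System.out.println("harvest from=" + w[0] + " until=" + w[1]);
		}
	}
}
```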
app/transformation/TransformAll.java
Outdated
TransformDbs.process(dbsOutput, geoServer, wikidataLookupFilename); // Start process DBS data.

String sigelBulkOutput = outputPath + "-sigelBulk";
String sigelUpdatesOutput = outputPath + "-sigelUpdates";
There are superfluous tabs at the end of the line.
app/transformation/TransformAll.java
Outdated
writeAll(sigelBulkOutput, resultWriter);
if (startOfUpdates != "") { // exclude updates for the tests, which set startOfUpdates to ""
	writeAll(sigelUpdatesOutput, resultWriter);
}
There are superfluous tabs at the end of the line.
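A side note on the guard quoted above: in Java, != on strings compares references, not content. It may work here because the tests pass the literal "", but a content check is more robust. A possible variant, just as an illustration and not part of this PR:

```java
// Content-based check instead of reference comparison; robust even if
// startOfUpdates is read from configuration rather than passed as a "" literal.
if (startOfUpdates != null && !startOfUpdates.isEmpty()) {
	writeAll(sigelUpdatesOutput, resultWriter);
}
```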
JsonEncoder encodeJson = new JsonEncoder();
encodeJson.setPrettyPrinting(true);
ObjectWriter objectWriter = new ObjectWriter<>(outputPath);
objectWriter.setAppendIfFileExists(true);
splitFileOpener//
sigelOaiPmhUpdates//
Superfluous tabs at end of line.
@dr0i Could you review it again?
@@ -61,15 +59,15 @@ public static void tearDown() {

@Test
public void multiLangAlternateName() throws IOException {
assertThat(new String(
assertThat(new String(
Superfluous tabulator.
Files.readAllBytes(Paths.get(TransformAll.DATA_OUTPUT_FILE))))
.as("transformation output with multiLangAlternateName")
.contains("Leibniz Institute").contains("Berlin SBB");
}

@Test
public void separateUrlAndProvidesFields() throws IOException {
assertThat(new String(
assertThat(new String(
Superfluous tabulator.
@@ -78,7 +76,7 @@ public void separateUrlAndProvidesFields() throws IOException {

@Test
public void preferSigelData() throws IOException {
assertThat(new String(
assertThat(new String(
Files.readAllBytes(Paths.get(TransformAll.DATA_OUTPUT_FILE))))
Superfluous tabulator.
The README seems to be far outdated (mentioning Metafacture 4.0.0 etc.). I will provide an update of the README and, by the way, test whether the data is updated as expected.
@dr0i I have adjusted the formatting.
Related to #462
This PR does four things:
Context:
The old process was overly complex since it set intervals for harvesting and split all files into single XMLs.
The intervals were probably needed because there were a lot of updates due to the old pica XML bulk from 2013, which we do not use anymore. It also dates from a time when all records were merged: https://github.com/search?q=repo%3Ahbz%2Flobid-organisations+interval&type=pullrequests
The splitting into single XMLs is not needed anymore since it was only used for the updates.
Three questions need to be answered:
How do we update the tests? They still depend on the old XML splitting process.
I skip the OAI-PMH harvest with a fake date and a zero interval. Is there a way to skip the OAI-PMH process if the OAI-PMH endpoint returns an error? ae6fd45
Do we want to save the OAI-PMH harvested data as an XML file?
Do we need a fallback if we cannot reach the OAI-PMH endpoint? (See the sketch below for one option.)
This would also fix #487.
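Regarding the fallback question above: one rough option would be to catch a failed harvest and reuse the updates file from the last successful run instead of aborting the whole nightly rebuild. A minimal sketch, where the class, method, and file names are hypothetical:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class HarvestWithFallback {

	// Hypothetical harvest step: in the real code this would be the OAI-PMH
	// harvest plus transformation writing its result to `updatesOutput`.
	static void harvestUpdates(Path updatesOutput) throws IOException {
		// ... run the OAI-PMH harvest; throws IOException if the endpoint is unreachable
	}

	public static void main(String[] args) {
		Path updatesOutput = Paths.get("output-sigelUpdates");
		try {
			harvestUpdates(updatesOutput);
		} catch (IOException e) {
			// Fallback: reuse the updates from the last successful run instead of
			// failing the whole nightly rebuild.
			if (Files.exists(updatesOutput)) {
				System.err.println("OAI-PMH unreachable, reusing " + updatesOutput + ": " + e);
			} else {
				throw new RuntimeException("OAI-PMH unreachable and no previous updates found", e);
			}
		}
	}
}
```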