Discrepancy in full-text search results #134

Open
kintopp opened this issue Jul 22, 2024 · 6 comments

@kintopp
Member

kintopp commented Jul 22, 2024

I've come across an unexpected discrepancy between local (i.e. based on a downloaded text version of a Globalise inventory) and online search results in the Transcriptions Viewer.

As an example, I’ve attached 9945.txt. If you do a whole word search in it using your favourite text editor, you’ll find 37 instances of "reuk" (i.e. as a whole word, not a compound). But if you carry out the same search online in the transcriptions viewer, you only get 13 hits:

Now take a look at line 4177 of the attached text file. Here, you’ll find this line, which is on page 77 of that inventory:

droog en met een benaude Reuk, dierhalven niet

Page 77 never shows up at all in the results (linked above) of the Transcriptions viewer. But if you navigate to inventory 9945, page 0077 in the viewer, you can see it's right there, a few lines down from the top: https://transcriptions.globalise.huygens.knaw.nl/detail/urn:globalise:NL-HaNA_1.04.02_9945_0077

So what is going on? At first, I thought, aha, the Transcriptions viewer has suddenly (due to a misconfiguration?) become case sensitive, since page 77 says, “een benaude Reuk”. But that’s not the issue. If I limit my own full-text search on the downloaded file to “Reuk” I get 7 results for inv. 9945. That’s still not the same as the Transcription viewer’s 13 hits. Moreover, if you look again at what it is finding, you can see quite clearly that it’s also highlighting upper-case instances of the word (see the results for page 78, for example).

I tried another test. I asked the transcriptions viewer to find “droog en met een benaude Reuk, dierhalven niet” (without the quotes) limited to inv. 9945. Now it did find that line, and even put it where it belonged, right at the top of the search results. I then tried the same thing again (with the quotes, as an exact phrase) and this also worked correctly. Finally, you see the discrepancy without the addition of the filter too. I don't have the exact figures with me now, but if I search for "reuk" locally, across all text inventory files, I get significantly more results than if I carry out the same search in the Globalise Transcriptions Viewer.
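For anyone who wants to reproduce the local counts, here is a minimal sketch of a whole-word count, assuming the downloaded inventory is plain text. The function name `count_whole_word` and the second sample line are made up for illustration; only the first sample line comes from 9945.txt. A text editor's whole-word search may treat word boundaries slightly differently from the regex `\b` used here.

```python
import re

def count_whole_word(text: str, word: str, ignore_case: bool = True) -> int:
    """Count whole-word occurrences of `word`, roughly as an editor's whole-word search would."""
    flags = re.IGNORECASE if ignore_case else 0
    # \b matches between a word character and a non-word character,
    # so "Reuk," counts as a hit but "reukwerk" does not.
    pattern = re.compile(rf"\b{re.escape(word)}\b", flags)
    return len(pattern.findall(text))

sample = (
    "droog en met een benaude Reuk, dierhalven niet\n"  # line 4177 of 9945.txt
    "een reuk van reukwerk\n"                           # hypothetical line for illustration
)

insensitive = count_whole_word(sample, "reuk")                   # matches "Reuk" and "reuk"
sensitive = count_whole_word(sample, "Reuk", ignore_case=False)  # matches "Reuk" only
print(insensitive, sensitive)
```

To run it on a real file, read the file with `open("9945.txt", encoding="utf-8").read()` and pass the result as `text`.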

@kintopp added the bug and globalise labels Jul 22, 2024
@kintopp
Member Author

kintopp commented Aug 4, 2024

Here's another example. If I search just for bandar (with or without quotes) in the Transcriptions Viewer, I get 415 results. If I search the same HTRv2 files locally on my Mac for bandar, whole-word, case-insensitive, I get 543 results. The three inventories with the highest number of instances are:

1103 (21)
1112 (21)
1111 (19)

If I search for bandar in the Transcriptions Viewer, each time limiting my search to these three inventory numbers, I get:

1103 (17)
1112 (15)
1111 (17)

So let's look at the other end of the scale. Based on my local copy of inv. 9893, I should find one instance of bandar there. In the Transcriptions Viewer, I do not. But if I look at the found instance in my text file, I see it looks like this:

badde pattoe aan rdijejewickreme Bandar„

In the HTR this can be found here. The reading order is bad on this page, but if I adjust the position of the scan relative to the transcription, Bandar can be clearly seen:

[Image: Screenshot 2024-08-04 at 13 43 07]

So one issue, at least, is that ES is treating the „ at the end of Bandar„ as being part of the word, and thus if you're looking for Bandar it's not finding it. This also applies to other trailing (and perhaps leading?) punctuation characters such as ¬ that are 'attached' to words. Take a look at inv. 1108 (the numbers after the 1108 are the line numbers of the text file from Dataverse of that inventory):

1108:6063: Cargatoen, ende den Coopman Gerrit Corsz naer bandar¬
1108:14414: leijden uijt Bandar Gamcon weder om naer
1108:14687: en Stadt Ogle bandar leggende opde reviere
1108:43173: ende met soo een Notabelen parthije zyde In Bandar Gamron
1108:43183: In Bandar Conge: ganderen, Door welcken abundanten toeboer
1108:43310: wat marct uwe coopmanschappen, Jn Bandar Gamron, Caelen
1108:43370: ende tydelyck In Bandar Gamron, affcomen, om soo veel vande
1108:43456: alder buijtersten vlt:o Novembr, In Bandar Gamron, om
1108:43688: In Bandar Gamron, ofte op apparentie van avantagienser
1108:49571: Bengala, Cogle Bandar), gelegen aende reviere ganges getu„
1108:49582: andere oock ogle Bandar met onse t achten te bevaren, ende
1108:50306: onse aengebrachte cargezoenen meest in Bandar gamron tegens
1108:55447: desselfs bandar, ende op de sijne stroomen
1108:57449: Coopluijden van menichvuldige plaetsen in Bandar Gemoon comen om haere

There are 14 results here, but the Transcriptions Viewer lists 10. Three of the four it did not appear to find are:

bandar¬
Bandar),
bandar,

But all the remaining instances of bandar look completely normal. So what is going on? There is a small, additional factor at play here. If you look again at the Transcriptions Viewer results for bandar in inv. 1108 you'll see that it says it found 10 results but actually shows 11 results in that inventory (14 minus the three examples above with trailing non-letter characters). That, in turn, may be because on one page (1108:0952) the word shows up twice, but is only counted once in the results.

In any case, it seems clear that we need to offer our users a more detailed and fuller explanation of what can be searched for and what cannot, and consequently what counts as a 'result'. And then suggest workarounds, insofar as these are available.
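The effect of attached punctuation can be illustrated with a rough sketch. It assumes the index splits text on whitespace only and lowercases tokens; both are assumptions about the viewer's configuration, not a description of it. The helper names are hypothetical, and the sample lines are taken from the inv. 1108 listing above.

```python
from fnmatch import fnmatchcase

def whitespace_tokens(line: str) -> list[str]:
    """Split on whitespace only and lowercase; punctuation stays attached to the token."""
    return [tok.lower() for tok in line.split()]

def exact_hits(line: str, term: str) -> list[str]:
    """Tokens that are exactly equal to the search term."""
    return [tok for tok in whitespace_tokens(line) if tok == term]

def wildcard_hits(line: str, pattern: str) -> list[str]:
    """Tokens matching a shell-style wildcard pattern such as 'bandar*'."""
    return [tok for tok in whitespace_tokens(line) if fnmatchcase(tok, pattern)]

lines = [
    "Cargatoen, ende den Coopman Gerrit Corsz naer bandar¬",
    "Bengala, Cogle Bandar), gelegen aende reviere ganges getu„",
    "desselfs bandar, ende op de sijne stroomen",
    "leijden uijt Bandar Gamcon weder om naer",
]

exact_counts = [len(exact_hits(line, "bandar")) for line in lines]
wildcard_counts = [len(wildcard_hits(line, "bandar*")) for line in lines]
print(exact_counts, wildcard_counts)
```

In this sketch only the last line yields an exact hit, while a trailing wildcard picks up all four. A trailing wildcard would not help with punctuation attached to the front of a word.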

@svandaalen
Collaborator

svandaalen commented Aug 4, 2024

Pardon my late reply. It's hectic at the moment with two deadlines approaching :).

As stated in my DM last week, the issue is twofold:

  1. The first issue is the ES tokeniser we are using. We use the whitespace tokeniser, which leaves the punctuation in the token, as the example in the ES documentation shows. The ES standard tokeniser strips the punctuation from the token (again, see the example in the ES documentation). Switching to the standard tokeniser will (most likely) fix the problem where words with attached punctuation are not found. We will experiment with this later this year. For now, this can be sort of 'fixed' by searching for something like reuk*, but this will not work for everything.

  2. The second issue is the way ES returns hits. ES works with documents: when a user searches for something, ES returns all the documents in which that query has a hit. For Globalise, an ES document equals a page from an inventory. If a word occurs more than once on that page, it's still one hit for ES, because the document is the hit, not each individual matching word in that document. If I change your reuk query from above to reuk* to circumvent the punctuation issue (https://transcriptions.globalise.huygens.knaw.nl/?indexName=docs-2024-03-18&fragmentSize=100&from=0&size=10&sortBy=_score&sortOrder=desc&query=eyJkYXRlRnJvbSI6IjE1MDAtMDEtMDEiLCJkYXRlVG8iOiIxODAwLTAxLTAxIiwicmFuZ2VGcm9tIjoiMCIsInJhbmdlVG8iOiIzMDAwMCIsImZ1bGxUZXh0IjoicmV1ayoiLCJ0ZXJtcyI6eyJpbnZOciI6WyI5OTQ1Il19fQ%3D%3D), you see that reuk* occurs on 16 pages. As you can also see, reuk* often occurs more than once on a page. If you count all the hits of reuk* in TAV, you will count 37 individual hits, so TAV does find all instances of reuk* in this inventory. I am currently unsure whether ES can return all individual hits. We might make it clearer for the user by tackling "Revise wording for search results counter to distinguish pages from instances" (#72). This issue will be worked on later this year.
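The difference between page-level hits and individual occurrences in the second point can be sketched as follows. This is an illustration only: the page texts for 0078 and 0079 are hypothetical, and `hits_per_page` is a made-up helper that uses a plain regex rather than anything ES does internally.

```python
import re
from collections import Counter

def hits_per_page(pages: dict[str, str], word: str) -> Counter:
    """Map each page id to its number of whole-word matches, omitting pages without a match."""
    pattern = re.compile(rf"\b{re.escape(word)}\b", re.IGNORECASE)
    counts: Counter = Counter()
    for page_id, text in pages.items():
        n = len(pattern.findall(text))
        if n:
            counts[page_id] = n
    return counts

pages = {
    "0077": "droog en met een benaude Reuk, dierhalven niet",  # from inv. 9945
    "0078": "de reuk was sterk, en de Reuk bleef",              # hypothetical text
    "0079": "een pagina zonder het woord",                      # hypothetical text
}

counts = hits_per_page(pages, "reuk")
page_hits = len(counts)             # what a document-based result counter reports
occurrences = sum(counts.values())  # what a text editor's search reports
print(page_hits, occurrences)
```

Here two pages match but the word occurs three times, which is the same kind of gap as 16 pages versus 37 instances for reuk* in inv. 9945.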

If this issue has a high priority, please take it up with Hennie so he can try to fit it into our tight schedule :). I am available again on Tuesday.

@kintopp
Member Author

kintopp commented Aug 4, 2024

Thanks, Sebastiaan – no need to reply on the weekend! And super ironically, I just see now, consulting the Transcription Viewer's Help, that we'd already identified this back when the viewer launched in October last year, but that I'd forgotten about it. We'll discuss this some more inside Globalise and get back to you.

@marijnkoolen

I'd like to bump the priority of this, as I'm fairly sure that few users expect punctuation to make a difference. It has a significant impact for REPUBLIC as well.

I don't know what the reason is for choosing the whitespace tokenizer, but hyphenated words (where I can see the benefit of leaving in 'punctuation') are rare in the resolutions, and punctuation is fairly common. Also, sometimes words are accidentally concatenated with punctuation in between (e.g. "doen.Is") by improper merging of lines, so I think that for REPUBLIC it is safe and almost always beneficial to switch to the ES default tokenizer.

I'll let Hennie know.

@kintopp assigned kintopp and hayco and unassigned kintopp Nov 18, 2024
@kintopp
Member Author

kintopp commented Nov 18, 2024

@hennie I didn't see this on the team text backlog yet (I think). Could it please be added? In effect, this is the same as https://github.com/knaw-huc/team-text-backlog/issues/89

@hayco

hayco commented Nov 18, 2024

See also https://github.com/knaw-huc/team-text-backlog/issues/89 and https://github.com/knaw-huc/team-text-backlog/issues/75.
If, as you noted, this is the same as those, it should be solved once we do the migration + reindex using the (default) tokenizer.
It is definitely worth checking afterwards, to see if it did indeed solve any discrepancies you noted.
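A check after the reindex could be as simple as the sketch below, which compares the per-inventory counts for bandar reported earlier in this thread. The `discrepancies` helper is hypothetical, and the viewer figures would need to be collected again after the migration. Note that the local figures count individual instances while the viewer counts pages, so a remaining gap does not by itself mean the tokeniser problem persists.

```python
# Counts for "bandar" as reported earlier in this thread (before any reindex).
local_counts = {"1103": 21, "1112": 21, "1111": 19}   # local whole-word search, instances
viewer_counts = {"1103": 17, "1112": 15, "1111": 17}  # Transcriptions Viewer results

def discrepancies(local: dict[str, int], viewer: dict[str, int]) -> dict[str, int]:
    """Return local minus viewer count for every inventory where the two differ."""
    return {
        inv: local[inv] - viewer.get(inv, 0)
        for inv in local
        if local[inv] != viewer.get(inv, 0)
    }

gaps = discrepancies(local_counts, viewer_counts)
print(gaps)
```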

4 participants