Discrepancy in full-text search results #134
Comments
Here's another example.
If I search just for 1103 (21)
If I search for 1103 (17)
So let's look at the other end of the scale. Based on my local copy of inv. 9893, I should find one instance of
In the HTR this can be found here. The reading order is bad on this page, but if I adjust the position of the scan relative to the transcription, Bandar can be clearly seen:
So one issue, at least, is that ES is treating the
1108:6063: Cargatoen, ende den Coopman Gerrit Corsz naer
There are 14 results here, but the Transcriptions Viewer lists 10. Three of the four it did not appear to find are:
But all the remaining instances of
In any case, it seems clear that we need to offer our users a fuller and more detailed explanation of what can and cannot be searched for, and consequently of what counts as a 'result'. We should then suggest workarounds, insofar as these are available.
Pardon my late reply. It's hectic at the moment with two nearing deadlines :). As stated in my DM last week, the issue is twofold:
If this issue has a high priority, please take it up with Hennie so he can try to fit it into our tight schedule :). I am available again on Tuesday.
Thanks, Sebastiaan – no need to reply on the weekend! And super ironically, I see only now, consulting the Transcriptions Viewer's Help, that we'd already identified this back when the viewer launched in October last year, but that I'd forgotten about it. We'll discuss this some more inside Globalise and get back to you.
I'd like to bump the priority of this, as I'm fairly sure that few users expect punctuation to make a difference. It has a significant impact for REPUBLIC as well. I don't know what the reason is for choosing the whitespace tokenizer, but hyphenated words (where I can see the benefit of leaving in 'punctuation') are rare in the resolutions, and punctuation is fairly common. Also, sometimes words are accidentally concatenated with punctuation in between (e.g. "doen.Is") by improper merging of lines, so I think that for REPUBLIC it is safe and almost always beneficial to switch to the ES default tokenizer. I'll let Hennie know.
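To make the punctuation effect concrete, here is a minimal Python sketch. It does not call Elasticsearch: `whitespace_tokens` mimics a whitespace tokenizer followed by lowercasing, and `word_tokens` is only a rough regex stand-in for a tokenizer that separates punctuation from words (it is not the Unicode segmentation that the ES default tokenizer actually uses, so edge cases will differ). Both function names are made up for illustration, and the lowercasing step is an assumption based on the viewer matching upper-case instances.

```python
import re

def whitespace_tokens(text):
    # Split on whitespace only, so punctuation stays attached to the word.
    return [t.lower() for t in text.split()]

def word_tokens(text):
    # Rough stand-in for a punctuation-aware tokenizer: keep runs of
    # letters, digits and underscores, drop everything else.
    return [t.lower() for t in re.findall(r"\w+", text)]

line = "droog en met een benaude Reuk, dierhalven niet"

# With whitespace splitting the indexed token is "reuk," (with the comma),
# so a query for "reuk" has nothing to match on this line.
print("reuk" in whitespace_tokens(line))
print("reuk," in whitespace_tokens(line))

# With punctuation separated, "reuk" is a token of its own.
print("reuk" in word_tokens(line))
```

If the index behaves like the first function, any instance of a word directly followed by a comma or full stop would be missed by a single-word query, which would fit the pattern of missing hits described in this issue.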
@hennie I didn't see this on the team text backlog yet (I think). Could it please be added? In effect, this is the same as https://github.com/knaw-huc/team-text-backlog/issues/89
See also https://github.com/knaw-huc/team-text-backlog/issues/89 and https://github.com/knaw-huc/team-text-backlog/issues/75.
I've come across an unexpected discrepancy between local (i.e. based on a downloaded text version of a Globalise inventory) and online search results in the Transcriptions Viewer.
As an example, I’ve attached 9945.txt. If you do a whole word search in it using your favourite text editor, you’ll find 37 instances of "reuk" (i.e. as a whole word, not a compound). But if you carry out the same search online in the transcriptions viewer, you only get 13 hits:
Now take a look at line 4177 of the attached text file. Here, you’ll find this line, which is on page 77 of that inventory:
Page 77 never shows up at all in the results (linked above) of the Transcriptions viewer. But if you navigate to inventory 9945, page 0077 in the viewer, you can see it's right there, a few lines down from the top: https://transcriptions.globalise.huygens.knaw.nl/detail/urn:globalise:NL-HaNA_1.04.02_9945_0077
So what is going on? At first, I thought, aha, the Transcriptions viewer has suddenly (due to a misconfiguration?) become case sensitive, since page 77 says, “een benaude Reuk”. But that’s not the issue. If I limit my own full-text search on the downloaded file to “Reuk” I get 7 results for inv. 9945. That’s still not the same as the Transcription viewer’s 13 hits. Moreover, if you look again at what it is finding, you can see quite clearly that it’s also highlighting upper-case instances of the word (see the results for page 78, for example).
I tried another test. I asked the transcriptions viewer to find “droog en met een benaude Reuk, dierhalven niet” (without the quotes) limited to inv. 9945. Now it did find that line, and even put it where it belonged, right at the top of the search results. I then tried the same thing again (with the quotes, as an exact phrase) and this also worked correctly. Finally, you see the discrepancy without the addition of the filter too. I don't have the exact figures with me now, but if I search for "reuk" locally, across all text inventory files, I get significantly more results than if I carry out the same search in the Globalise Transcriptions Viewer.
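For anyone who wants to reproduce the local counts without a text editor, a whole-word count along these lines should do. This is a sketch under assumptions: `count_whole_word` is a hypothetical helper, the `sample` text is invented apart from the phrase quoted from page 77, and "whole word" is taken to mean the regex word boundary `\b`, which may not match every editor's definition.

```python
import re

def count_whole_word(text, word, ignore_case=True):
    # \b keeps compounds such as "reukwerk" out of the count, while a word
    # followed by punctuation ("Reuk,") still counts as a match.
    flags = re.IGNORECASE if ignore_case else 0
    pattern = r"\b" + re.escape(word) + r"\b"
    return len(re.findall(pattern, text, flags))

# Invented sample text; replace with the contents of a downloaded inventory.
sample = "een benaude Reuk, dierhalven\nde reuk was sterk\nreukwerk en Reuk."

print(count_whole_word(sample, "reuk"))                     # any case
print(count_whole_word(sample, "Reuk", ignore_case=False))  # capitalised only
```

To run it on a real file, read the file into a string first, for example `open("9945.txt", encoding="utf-8").read()`, and pass that as `text`.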