File name parsing for Metron search problems #21
Comments
Yeah, I think I mentioned in #17 that sanitizing the query string would greatly help with matching.
Possibly using SearchVector/SearchRank or Trigram Similarity in Django/PostgreSQL may improve results, but there is a significant performance penalty to be paid. Might be worth testing to see if it gives improved results.
I was all set to tell you it does sanitise, and then I checked and it does not. I wonder what the reason was... I think it may have been because Metron was returning some tests without it, but obviously there is still some sanitising needed.
Yeah, it's much easier on my end to see what metron_talker is submitting for a series name. 😉
I tried using the same as CT but that causes the
Seems to work better for some titles!
Looked into this a bit more today, and played around with PostgreSQL's Trigram Similarity matching, which helps deal with the apostrophe matching issues. Unfortunately, Django's support isn't implemented with Transform, so I'd lose support for Unaccent and other lookup options unless I made some hand-crafted artisanal SQL statements (which I don't really have the time to do), so this isn't a great solution.
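For anyone following along, here is a rough pure-Python sketch of why trigram matching is forgiving about a dropped apostrophe. It only approximates how the pg_trgm extension extracts trigrams (lowercase, split on non-alphanumerics, pad each word with two leading spaces and one trailing space) and is not what Metron actually runs; the function names `trigrams` and `similarity` are made up for illustration.

```python
import re


def trigrams(text: str) -> set[str]:
    """Approximate pg_trgm extraction: lowercase, split into alphanumeric
    words, pad each word with two leading spaces and one trailing space."""
    grams: set[str] = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = f"  {word} "
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def similarity(a: str, b: str) -> float:
    """Shared trigrams divided by all distinct trigrams (Jaccard index)."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


with_apostrophe = similarity("World's Finest", "Worlds Finest")
unrelated = similarity("Batman", "Saga")
```

Under this approximation "World's Finest" and "Worlds Finest" share 12 of 17 distinct trigrams (about 0.7), well above pg_trgm's default 0.3 threshold, while the unrelated pair shares none.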
I've been a bit slow on the talker front, but if you have any examples of problem titles I'll see what I can do (and put in some tests).
I think the classic problem title in this vein is "Batman/Superman: World's Finest", which includes a slash, a colon, and an apostrophe. Filenames might have the slash and colon replaced with a space or a "-". The apostrophe often seems to just be removed. (I can't remember if that's a problem character on Windows filesystems?) That gives something like "Batman - Superman- Worlds Finest". I think in general the apostrophe (in English anyway) is the most problematic for filename-to-search, since it tends to be replaced with nothing rather than a space in some filenames.
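To make the example concrete, here is a minimal sketch of a comparison key that treats the title and the filename form as the same string. It assumes both sides can be normalized the same way (for instance when comparing returned series names against the parsed filename locally); `match_key` is a hypothetical helper, not something in CT or Metron.

```python
import re


def match_key(title: str) -> str:
    """Build a loose comparison key for a series title or parsed filename."""
    # Drop apostrophes (straight and curly) outright, so "World's"
    # and "Worlds" end up identical instead of "world s" vs "worlds".
    key = re.sub(r"['\u2019]", "", title.casefold())
    # Collapse every other run of punctuation/whitespace into one space.
    key = re.sub(r"[\W_]+", " ", key)
    return key.strip()


from_metron = match_key("Batman/Superman: World's Finest")
from_filename = match_key("Batman - Superman- Worlds Finest")
```

Both calls produce `batman superman worlds finest`, so the slash, colon, dash, and apostrophe differences all disappear in the key.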
Looks like Metron search based on a parsed file name (auto-tagging) is failing in some cases.
A series title with colons (`:`) and slashes (`/`) will often have those replaced with space-minus-space (` - `) in a filename. A great example is "Batman / Superman: World's Finest", which might have a filename with `Batman - Superman - World's Finest` (or even one without the `'` in it). The minuses cause the Metron search to fail.

Probably CT just needs to remove minus/dash characters (`-`) from the search string before submitting it.

Another issue I think is probably server-side, and I should probably make an issue with @bpepple over on his repo. Metron doesn't like a missing apostrophe in some cases.
So when the search string is:
`Cory Doctorow's Futuristic Tales of the Here and Now` works
`Cory Doctorow s Futuristic Tales of the Here and Now` also works
`Cory Doctorow Futuristic Tales of the Here and Now` also works
but for some reason `Cory Doctorows Futuristic Tales of the Here and Now` fails.

Unfortunately for auto-tagging, it's pretty common to see the dropped apostrophe. I can't think of a good client-side solution for that one, though, but maybe you all have an idea.
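For the dash problem, something along these lines is what I have in mind. It is only a sketch: `strip_separator_dashes` is a made-up name, and keeping hyphens that sit between two word characters (as in "Spider-Man") is my own assumption; removing every dash may turn out to work just as well.

```python
import re

# Hyphen, en dash, and em dash. A dash counts as a separator when it is
# NOT sandwiched between two word characters.
_SEPARATOR_DASH = re.compile(r"(?<!\w)[-\u2013\u2014]|[-\u2013\u2014](?!\w)")


def strip_separator_dashes(query: str) -> str:
    """Replace free-standing dashes with spaces and collapse whitespace."""
    cleaned = _SEPARATOR_DASH.sub(" ", query)
    return " ".join(cleaned.split())


spaced = strip_separator_dashes("Batman - Superman - World's Finest")
trailing = strip_separator_dashes("Batman - Superman- Worlds Finest")
hyphenated = strip_separator_dashes("The Amazing Spider-Man")
```

The first two become `Batman Superman World's Finest` and `Batman Superman Worlds Finest`, while `The Amazing Spider-Man` is left alone. This does nothing for the missing apostrophe, which still needs handling elsewhere.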
Thanks!