Unidic design flaw #118
Comments
In your particular example, if I ask for the two best results, I get とう instead of ちち. Here I'm using MeCab/Unidic, but it should be the same with Kuromoji:
Isn't this one of the big reasons why these parsers give you N-best results, N>1?
unidic 2.3.0 solves this problem for the specific case of this set of words.
Could you indicate more precisely what you mean by "unidic 2.3.0"? Do you have a URL you can share? Thanks.
It's only listed on the "back-number" page:
Unidic's lex data doesn't have enough information for the Viterbi algorithm to distinguish, in context, words that have the same written form and the same word type but different readings. So お父さん is always interpreted as お・ちち・さん, instead of お・とう・さん like it should be.
The two entries are otherwise identical, but the ちち reading has a lower cost, so it always wins when the word is in the kanji form. Basically, unidic's segment features don't have a way to distinguish these. It's easy to write a script that looks for segments that are identical in surface form and feature list, to see what problematic matches there are.
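A minimal sketch of such a script, assuming the lexicon is in MeCab-style CSV (surface, left context ID, right context ID, cost, then feature columns). The function name `find_ambiguous_entries` and the `key_feature_count` parameter are made up for illustration, and how many leading feature columns count as "the same word type" depends on the unidic version, so treat that number as something to adjust. The sample rows are invented, not copied from unidic.

```python
import csv
from collections import defaultdict


def find_ambiguous_entries(lines, key_feature_count=6):
    """Group lexicon rows that look identical to the lattice search.

    Assumes MeCab-style CSV rows: surface, left_id, right_id, cost, features...
    key_feature_count is a hypothetical knob: how many leading feature
    columns (POS, conjugation type/form) count as "the same word type".
    """
    groups = defaultdict(list)
    for row in csv.reader(lines):
        if len(row) < 4:
            continue  # skip blank or malformed rows
        surface, left_id, right_id, cost = row[0], row[1], row[2], int(row[3])
        features = row[4:]
        key = (surface, left_id, right_id, tuple(features[:key_feature_count]))
        groups[key].append((cost, features[key_feature_count:]))
    # Keep only keys whose rivals differ in the remaining features
    # (e.g. the reading); entries are listed lowest cost first.
    return {
        key: sorted(entries, key=lambda e: e[0])
        for key, entries in groups.items()
        if len({tuple(rest) for _, rest in entries}) > 1
    }


if __name__ == "__main__":
    # Invented rows for illustration only; real unidic rows have many more columns.
    sample = [
        "父,10,10,500,名詞,普通名詞,チチ",
        "父,10,10,900,名詞,普通名詞,トウ",
        "母,10,10,600,名詞,普通名詞,ハハ",
    ]
    for key, entries in find_ambiguous_entries(sample, key_feature_count=2).items():
        print(key[0], entries)
```

In each reported group the first entry is the one that will always win on cost, so the remaining entries are the readings that can never surface for that written form.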
This is basically impossible to fix on kuromoji's side without adding a list of segments that act differently than their features indicate, which would be ridiculous. On the other hand, one of kuromoji's implicit goals is to not be worse than other morphological analyzers, so this is a problem worth posting about.
To gloss over this problem, I added a bunch of お父 etc. entries to my user dictionary (for unidic-kanaaccent STAGING), with the お・御 prepended.
(The weights are for illustration; I think they're too high to catch all intended cases.)