Tokenizing text in Hiragana character set #105
Comments
This is a common issue because when the tokenization model is trained, it looks at the surface feature (and POS, base form, conjugation form, etc.) rather than its reading. One way to handle this is to add だいがくせい to the user dictionary.
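For reference, a Kuromoji user dictionary is a plain CSV file. A sketch of an entry for だいがくせい is shown below; the exact column layout depends on the Kuromoji version, and this follows the simplified four-column format (surface, segmentation, reading, part-of-speech) used by the Atilika Kuromoji and Lucene JapaneseTokenizer user dictionaries:

```csv
# surface,segmentation,reading,part-of-speech
だいがくせい,だいがくせい,ダイガクセイ,カスタム名詞
```

With such an entry loaded, だいがくせい should come out as a single token instead of being split, matching the segmentation of 大学生.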
Nate, could you share more information on your use case and what you would like to accomplish?
Reading extraction is such an awesome feature. :) However, the feature works only if <大学生> and <だいがくせい> are tokenized the same way, doesn't it? For example, I'd like to be able to find a document that contains the word in kanji <大学生> using a search query with the same word in hiragana <だいがくせい>. I am finding cases where such searches do not work even with reading normalization, due to the difference in tokenizations. Thanks for the help!
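The matching idea here can be sketched without Kuromoji: readings are katakana, so a hiragana query term can be normalized to katakana and compared against a token's reading. A minimal sketch (the reading ダイガクセイ for 大学生 is assumed as the dictionary reading; the helper name is illustrative):

```python
def hira_to_kata(text: str) -> str:
    """Shift hiragana code points (U+3041-U+3096) to katakana (U+30A1-U+30F6)."""
    return "".join(
        chr(ord(ch) + 0x60) if "\u3041" <= ch <= "\u3096" else ch
        for ch in text
    )

# If BOTH inputs tokenize as one token, reading normalization bridges the scripts:
kanji_reading = "ダイガクセイ"   # reading a tokenizer would attach to 大学生
hira_query = "だいがくせい"      # hiragana search query
print(hira_to_kata(hira_query) == kanji_reading)  # True
```

This is exactly why the mismatch in this issue matters: the comparison above only works when だいがくせい survives as a single token, which is what the user dictionary recommendation addresses.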
Let's try not to confuse the "feature" used for the machine learning models with the "feature" of Kuromoji. The word "feature" has a special meaning in the context of machine learning, so I prefer not to use it any other way in this context. I do not have any additional comments other than recommending the user dictionary feature to make the tokenization consistent. Christian should have some additional comments.
Sorry if my question wasn't clear. Let me know if you need me to clarify anything. @cmoen Does the use case sound reasonable to you?
Tokenizing a sentence "寿司が美味しい。" produces the following tokens:
<寿司>,<が>,<美味しい>,<。>
Tokenizing the same sentence written entirely in hiragana exhibits identical behavior, which is great.
<すし>,<が>,<おいしい>,<。>
However, for some other words, tokenization behavior depends on the input character set.
For example, for "大学生":
The word is correctly tokenized into <大学生> when the input is in kanji.
When the input is in hiragana, "だいがくせい", the same word produces the following tokens:
<だい>,<が>,<くせ>,<い>.
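The consequence of this split can be shown with the token sequences above. A sketch, assuming a token-level match as in an inverted index, and treating each hiragana token's reading as simply its katakana form (which holds for these tokens):

```python
def hira_to_kata(text: str) -> str:
    """Shift hiragana code points (U+3041-U+3096) to katakana (U+30A1-U+30F6)."""
    return "".join(
        chr(ord(ch) + 0x60) if "\u3041" <= ch <= "\u3096" else ch
        for ch in text
    )

# Token surfaces reported in this issue for the two inputs:
kanji_tokens = ["大学生"]                    # one token, reading ダイガクセイ
hira_tokens = ["だい", "が", "くせ", "い"]   # four tokens

# Even after hiragana-to-katakana reading normalization, the sequences differ,
# so the word never appears as a single indexed token on the hiragana side:
normalized = [hira_to_kata(t) for t in hira_tokens]
print(normalized)                      # ['ダイ', 'ガ', 'クセ', 'イ']
print("ダイガクセイ" in normalized)     # False
```

This is why reading normalization alone cannot make the kanji and hiragana spellings match here: the segmentation has to agree first.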
Is this a known issue? Is there any configuration I could tweak so that the two cases behave the same way regardless of the input character set?
Thanks in advance for your help!