Tokenizing text in Hiragana character set #105

Open
mhko opened this issue Jun 8, 2016 · 5 comments

@mhko

mhko commented Jun 8, 2016

Tokenizing a sentence "寿司が美味しい。" produces the following tokens:
<寿司>,<が>,<美味しい>,<。>

Tokenizing the same sentence written only in hiragana characters exhibits identical behavior, which is great.
<すし>,<が>,<おいしい>,<。>

However, for some other words, tokenization behavior depends on the input character set.

For example, for "大学生":

The word is correctly tokenized as <大学生> when the input is written in kanji.

When the input is written in hiragana, "だいがくせい", the same word produces the following tokens:
<だい>,<が>,<くせ>,<い>

Is this a known issue? Is there any configuration I could tweak so that the two cases behave the same way regardless of the input character set?

Thanks in advance for your help!

@akkikiki
Contributor

akkikiki commented Jun 9, 2016

This is a common issue: the tokenization model is trained on surface-form features (along with POS, base form, conjugation form, etc.) rather than on readings.
だいがくせい is a fairly easy example, but in general a sentence written entirely in hiragana, or in any single script, is hard to tokenize because there are more ambiguous ways to segment it.
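
To illustrate the ambiguity, here is a minimal Python sketch that enumerates every dictionary-consistent split of だいがくせい. The lexicon is a hypothetical toy set chosen for this example, not Kuromoji's dictionary, and the function name `segmentations` is made up. A real tokenizer scores the candidate paths with learned costs and keeps the cheapest one; this sketch only lists the candidates.

```python
def segmentations(text, lexicon):
    """Return every way to split `text` into words that all appear in `lexicon`."""
    if not text:
        return [[]]  # one way to segment the empty string: no words
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in lexicon:
            for tail in segmentations(text[end:], lexicon):
                results.append([head] + tail)
    return results


# Hypothetical toy lexicon (an assumption for illustration only).
HIRAGANA_LEXICON = {
    "だいがくせい", "だいがく", "がくせい",
    "だい", "が", "くせ", "せい", "い",
}

hiragana_paths = segmentations("だいがくせい", HIRAGANA_LEXICON)

for path in hiragana_paths:
    print(" | ".join(path))
```

With this toy lexicon, the split reported in this issue (だい, が, くせ, い) is one of several valid candidates, so which one wins depends entirely on the costs assigned to each path.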

One way to handle this is to add だいがくせい to the user dictionary.
For example:

だいがくせい,1285,1285,some integer,名詞,一般,*,*,*,*,だいがくせい,ダイガクセイ,ダイガクセイ
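
To make the columns of the entry above easier to read, here is a small Python sketch that labels each field. The column names assume the MeCab IPADIC CSV convention (surface, context IDs, cost, part-of-speech fields, base form, reading, pronunciation). That is an assumption about what the entry intends; the user dictionary file format your Kuromoji version accepts may differ, so check its documentation. `parse_entry` is a hypothetical helper, and the "some integer" cost placeholder is kept exactly as written.

```python
import csv
import io

# Labels follow the MeCab IPADIC CSV convention (an assumption for this
# sketch; these are not names from any Kuromoji API).
IPADIC_COLUMNS = [
    "surface", "left_id", "right_id", "cost",
    "pos", "pos_sub1", "pos_sub2", "pos_sub3",
    "conjugation_type", "conjugation_form",
    "base_form", "reading", "pronunciation",
]


def parse_entry(line):
    """Split one CSV dictionary line and attach a label to each column."""
    row = next(csv.reader(io.StringIO(line)))
    if len(row) != len(IPADIC_COLUMNS):
        raise ValueError(
            f"expected {len(IPADIC_COLUMNS)} columns, got {len(row)}"
        )
    return dict(zip(IPADIC_COLUMNS, row))


entry = parse_entry(
    "だいがくせい,1285,1285,some integer,名詞,一般,*,*,*,*,"
    "だいがくせい,ダイガクセイ,ダイガクセイ"
)

for name, value in entry.items():
    print(f"{name}: {value}")
```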

@cmoen
Member

cmoen commented Jun 9, 2016

Nate, could you share more information on your use-case and what you would like to accomplish?

@mhko
Author

mhko commented Jun 10, 2016

Reading extraction is such an awesome feature. :) However, the feature works only if <大学生> and <だいがくせい> are tokenized the same way, doesn't it? For example, I'd like to be able to find a document that contains the word in kanji, <大学生>, using a search query with the same word in hiragana, <だいがくせい>. I am finding cases where such searches do not work even with reading normalization, because the two forms are tokenized differently.

Thanks for the help!
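
The mismatch described above can be sketched in a few lines of Python. The token lists are hard-coded from the outputs reported in this thread; the `READINGS` table and the helper names are hypothetical stand-ins for what a tokenizer's reading extraction would return. Hiragana is mapped to katakana by shifting code points by 0x60, which is how the two Unicode blocks line up.

```python
def hiragana_to_katakana(text):
    """Map hiragana (U+3041..U+3096) to the corresponding katakana."""
    return "".join(
        chr(ord(ch) + 0x60) if "\u3041" <= ch <= "\u3096" else ch
        for ch in text
    )


# Hypothetical reading lookup for kanji surfaces (assumption for this sketch).
READINGS = {"大学生": "ダイガクセイ"}


def reading_terms(surfaces):
    """Normalize each token to a katakana reading, one term per token."""
    return [READINGS.get(s, hiragana_to_katakana(s)) for s in surfaces]


# Token surfaces as reported earlier in this thread.
document_tokens = ["大学生"]
query_tokens = ["だい", "が", "くせ", "い"]

doc_terms = reading_terms(document_tokens)
query_terms = reading_terms(query_tokens)

# Term-by-term lookup fails: none of the query terms equals the document term.
term_match = all(term in doc_terms for term in query_terms)

# The concatenated readings are identical, so only the token boundaries differ.
joined_match = "".join(query_terms) == "".join(doc_terms)

print(doc_terms, query_terms, term_match, joined_match)
```

Comparing concatenated readings is only a way to show where the mismatch comes from, not a general fix: reading normalization operates per token, so it cannot help when the token boundaries themselves disagree.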

@akkikiki
Contributor

akkikiki commented Jun 10, 2016

Let's try not to confuse the "feature" used in machine learning models with a "feature" of Kuromoji. The word "feature" has a special meaning in machine learning, so I prefer not to use it any other way in this context.

I do not have any comments beyond recommending the user dictionary feature to make the tokenization consistent. Christian may have more to add.

@mhko
Author

mhko commented Jun 16, 2016

Sorry if my question wasn't clear. Let me know if you need me to clarify anything.

@cmoen Does the use case sound reasonable to you?
