Tokenizing text in Hiragana character set #105

mhko · 2016-06-08T16:11:56Z

Tokenizing the sentence "寿司が美味しい。" produces the following tokens:
<寿司>,<が>,<美味しい>,<。>

Tokenizing the same sentence written only in hiragana produces the equivalent segmentation, which is great.
<すし>,<が>,<おいしい>,<。>

However, for some other words, tokenization behavior depends on the input character set.

For example, for "大学生":

The word is correctly tokenized into <大学生> when the input is written in Kanji.

When the input is written in Hiragana, "だいがくせい", the same word produces the following tokens.
<だい>,<が>,<くせ>,<い>.

Is this a known issue? Is there any configuration I could tweak so that the two cases behave the same way regardless of the input character set?

Thanks in advance for your help!

akkikiki · 2016-06-09T01:48:18Z

This is a common issue: when the tokenization model is trained, it looks at the surface form (plus POS, base form, conjugation form, etc.) rather than the reading.
The example of だいがくせい is quite easy, but in general, a sentence written entirely in hiragana, or in any single script type, is hard to tokenize because a single script increases the ambiguity of the segmentation.
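
To make the ambiguity concrete, here is a minimal sketch that enumerates every way a hiragana string can be split using a small word list. The lexicon is a hand-picked, hypothetical set of entries for illustration only; it is not Kuromoji's dictionary, and a real tokenizer scores candidates with costs rather than listing them all.

```python
# Hypothetical toy lexicon (NOT Kuromoji's dictionary): a few hiragana
# strings that could plausibly be words on their own.
TOY_LEXICON = {"だい", "が", "くせ", "い", "がく", "せい", "だいがく"}


def segmentations(text, lexicon):
    """Return every way to split `text` into words found in `lexicon`."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in lexicon:
            for rest in segmentations(text[end:], lexicon):
                results.append([head] + rest)
    return results


candidates = segmentations("だいがくせい", TOY_LEXICON)
for candidate in candidates:
    print(" | ".join(candidate))
```

Even with seven entries there are several competing splits, and the one reported in this issue (だい | が | くせ | い) is among them. The model has to pick one using surface-level costs only.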

One way to handle this is to add だいがくせい to the user dictionary.
E.g.

だいがくせい,1285,1285,some integer,名詞,一般,*,*,*,*,だいがくせい,ダイガクセイ,ダイガクセイ
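
As a rough illustration of why a dictionary entry helps, the sketch below runs a minimum-cost segmentation over a toy lexicon, first without and then with an entry for だいがくせい. All words and costs here are made-up assumptions, and the scoring uses word costs only; a real lattice tokenizer also uses connection costs, so treat this as a simplified model rather than Kuromoji's actual algorithm.

```python
# Hypothetical word costs (lower is better); NOT taken from any real dictionary.
BASE_LEXICON = {"だい": 3, "が": 1, "くせ": 3, "い": 2}
# Assumed cost for the added entry, chosen lower than the 4-token path (3+1+3+2).
USER_ENTRIES = {"だいがくせい": 5}


def tokenize(text, lexicon):
    """Minimum-total-cost segmentation using word costs only."""
    inf = float("inf")
    # best[i] = (lowest cost to reach position i, start index of the last word)
    best = [(inf, None)] * (len(text) + 1)
    best[0] = (0, None)
    for end in range(1, len(text) + 1):
        for start in range(end):
            word = text[start:end]
            if word in lexicon:
                cost = best[start][0] + lexicon[word]
                if cost < best[end][0]:
                    best[end] = (cost, start)
    if best[-1][0] == inf:
        raise ValueError("no segmentation found for: " + text)
    # Walk back from the end to recover the chosen words.
    tokens, end = [], len(text)
    while end > 0:
        start = best[end][1]
        tokens.append(text[start:end])
        end = start
    return tokens[::-1]


without_entry = tokenize("だいがくせい", BASE_LEXICON)
with_entry = tokenize("だいがくせい", {**BASE_LEXICON, **USER_ENTRIES})
print(without_entry)
print(with_entry)
```

Once the whole word exists as a single cheap entry, it beats the path made of short fragments, which is the effect the user dictionary entry is meant to have.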

cmoen · 2016-06-09T02:39:52Z

Nate, could you share more information on your use-case and what you would like to accomplish?

mhko · 2016-06-10T01:03:45Z

Reading extraction is such an awesome feature. :) However, the feature works only if <大学生> and <だいがくせい> are tokenized the same way, doesn't it? For example, I'd like to be able to find a document that contains the word in Kanji <大学生> using a search query with the same word in Hiragana <だいがくせい>. I am finding cases where such searches do not work even with reading normalization, because the two inputs are tokenized differently.
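
A small sketch of the mismatch, assuming the index stores one katakana reading per token. The token lists are copied from the outputs earlier in this thread, the reading table is a hypothetical stand-in for what the tokenizer would return, and `to_katakana` is a helper written for this example.

```python
def to_katakana(text):
    """Shift hiragana (U+3041..U+3096) to the matching katakana code points."""
    return "".join(
        chr(ord(ch) + 0x60) if "\u3041" <= ch <= "\u3096" else ch
        for ch in text
    )


# Hypothetical reading lookup for the Kanji token (stand-in for the tokenizer).
KANJI_READINGS = {"大学生": "ダイガクセイ"}

# Document side: one Kanji token, indexed by its reading.
doc_terms = {KANJI_READINGS[token] for token in ["大学生"]}

# Query side: the hiragana input split as reported in this issue.
split_query_terms = {to_katakana(t) for t in ["だい", "が", "くせ", "い"]}

# Query side if the hiragana input were kept as a single token.
whole_query_terms = {to_katakana(t) for t in ["だいがくせい"]}

print(doc_terms & split_query_terms)  # no shared term, so no match
print(doc_terms & whole_query_terms)  # shared reading, so the query matches
```

The readings only line up when both sides produce the same token boundaries, which is why normalizing readings alone does not fix the search.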

Thanks for the help!

akkikiki · 2016-06-10T01:59:15Z

Let's try not to confuse the "feature" used in machine learning models with a "feature" of Kuromoji. The word "feature" has a special meaning in the context of machine learning, so I prefer not to use it in any other way in this context.

I do not have anything to add beyond recommending the user dictionary feature to make the tokenization consistent. Christian should have some additional comments.

mhko · 2016-06-16T01:20:33Z

Sorry if my question wasn't clear. Let me know if you need me to clarify anything.

@cmoen Does the use case sound reasonable to you?
