Tested with kuromoji-core-1.0-SNAPSHOT and kuromoji-ipadic-1.0-SNAPSHOT (built from master on 2017/3/8).
When the user dictionary is
くろも,くろも,くろも,カスタム名詞
ろ,ろ,ろ,カスタム名詞
, the string "くろもじ" is tokenized into
くろも	カスタム名詞,*,*,*,*,*,*,くろも,*
じ	助動詞,*,*,*,不変化型,基本形,じ,ジ,ジ
which is fine.
When the user dictionary is
クロモ,クロモ,クロモ,カスタム名詞
ロ,ロ,ロ,カスタム名詞
, the string "クロモジ" is tokenized into
ク	名詞,一般,*,*,*,*,ク,ク,ク
ロ	カスタム名詞,*,*,*,*,*,*,ロ,*
モ	*,*,*,*,*,*,*,*,*
ジ	*,*,*,*,*,*,*,*,*
which is not fine.
I expected the following:
クロモ	カスタム名詞,*,*,*,*,*,*,クロモ,*
ジ	*,*,*,*,*,*,*,*,*
What should I do to get the expected result?
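Sample code I used:
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.util.Arrays;
import java.util.List;

import com.atilika.kuromoji.ipadic.Token;
import com.atilika.kuromoji.ipadic.Tokenizer;

public class UserDictionarySample {
    public static void main(String[] args) throws Exception {
        // Hiragana case (tokenized as expected):
        // String target = "くろもじ";
        // List<String> dictionaryList = Arrays.asList("くろも,くろも,くろも,カスタム名詞", "ろ,ろ,ろ,カスタム名詞");

        // Katakana case (not tokenized as expected):
        String target = "クロモジ";
        List<String> dictionaryList = Arrays.asList("クロモ,クロモ,クロモ,カスタム名詞", "ロ,ロ,ロ,カスタム名詞");

        // One user dictionary entry per line.
        String dictionary = String.join(System.lineSeparator(), dictionaryList);

        Tokenizer.Builder builder = new Tokenizer.Builder();
        // Let any failure while loading the dictionary surface instead of swallowing it.
        InputStream inputStream = new ByteArrayInputStream(dictionary.getBytes("utf-8"));
        builder.userDictionary(inputStream);

        Tokenizer tokenizer = builder.build();
        List<Token> tokens = tokenizer.tokenize(target);
        tokens.forEach(token -> System.out.println(token.getSurface() + "\t" + token.getAllFeatures()));
    }
}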
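To state the expectation precisely, here is a small sketch that does not use Kuromoji at all. It applies a greedy longest match over the user dictionary surfaces, so that "クロモ" wins over the shorter entry "ロ". This is only an illustration of the segmentation I expected, not a description of how Kuromoji resolves user dictionary entries internally; the class `LongestMatchSketch` and the method `segment` are hypothetical names made up for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class LongestMatchSketch {

    // Hypothetical helper (not part of Kuromoji): at each position, take the
    // longest user dictionary surface that matches; if none matches, emit a
    // single-character token and move on.
    static List<String> segment(String text, List<String> surfaces) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            String best = null;
            for (String surface : surfaces) {
                if (!surface.isEmpty() && text.startsWith(surface, pos)
                        && (best == null || surface.length() > best.length())) {
                    best = surface;
                }
            }
            if (best == null) {
                best = text.substring(pos, pos + 1);
            }
            tokens.add(best);
            pos += best.length();
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Surfaces taken from the katakana user dictionary above.
        List<String> surfaces = Arrays.asList("クロモ", "ロ");
        System.out.println(segment("クロモジ", surfaces)); // prints [クロモ, ジ]
    }
}
```

With the hiragana dictionary ("くろも", "ろ") and the input "くろもじ", the same sketch gives [くろも, じ], which matches what Kuromoji already returns for that case.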