You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Please take this report with a big pinch of salt : I am not even a kuromoji user and I did not profile the code thoroughly.
In ViterbiBuilder, kuromoji uses an fst to search for all possible prefix of a given string that are within a dictionary (encoded as the fst).
The successive call to lookup however, restart from the rootnode of the fst. It would be advisable to get all of the prefix in a single browse of the fst.
The headroom is valuable, but not massive. Around 15% of the time is spent in Fst.lookup. One can hope to cut this bit in half.
The text was updated successfully, but these errors were encountered:
The largest optimization opportunity is probably isKanjiOnly. (40% is spent in this function and it is possible to bring it down to 0%).
The useless call to substring may also be wasting CPU Time (> 5%). It is hard to measure accurately however as they have hidden impact as well. (it does copy the original string since Java 7.0)
@cmoen Might be of your interest.
I do not know the current Lucene's implementation on it, but it is doable by e.g., having a separate function in FST.java that returns the indices of all prefixes which exist in the dictionary.
Please take this report with a big pinch of salt : I am not even a kuromoji user and I did not profile the code thoroughly.
In ViterbiBuilder, kuromoji uses an fst to search for all possible prefix of a given string that are within a dictionary (encoded as the fst).
The successive call to lookup however, restart from the rootnode of the fst. It would be advisable to get all of the prefix in a single browse of the fst.
The headroom is valuable, but not massive. Around 15% of the time is spent in
Fst.lookup
. One can hope to cut this bit in half.The text was updated successfully, but these errors were encountered: