Tokenizer is not serializable for Apache Spark #85
Thanks a lot, Fujikawa-san. Instantiating Kuromoji takes a bit of time since it reads fairly large dictionaries into memory. Could you clarify how making the classes serializable would help with this in the context of Spark? I don't know the detailed mechanisms, and I'd appreciate it if you could explain. Thanks!
Spark serializes the whole class at the beginning, and then each machine processes it in parallel.
I've tried to make the kuromoji-core classes Serializable, but I was not able to serialize Tokenizer because java.nio.HeapByteBuffer is not serializable. This work may take a lot of effort.
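One common way around a non-serializable field such as a ByteBuffer is to mark it transient and copy its bytes by hand in writeObject/readObject. The sketch below is hypothetical and not from the kuromoji codebase; BufferHolder stands in for a class that, like Tokenizer, holds dictionary data in a ByteBuffer.

```java
import java.io.*;
import java.nio.ByteBuffer;
import java.util.Arrays;

// Hypothetical sketch: a class holding a ByteBuffer (as Kuromoji's dictionary
// classes do) made serializable by writing the buffer's bytes manually,
// since java.nio.HeapByteBuffer itself does not implement Serializable.
class BufferHolder implements Serializable {
    private transient ByteBuffer buffer; // transient: skipped by default serialization

    BufferHolder(byte[] data) {
        this.buffer = ByteBuffer.wrap(data);
    }

    byte[] contents() {
        ByteBuffer dup = buffer.duplicate(); // duplicate so we don't disturb the position
        dup.rewind();
        byte[] copy = new byte[dup.remaining()];
        dup.get(copy);
        return copy;
    }

    // Custom serialization: write the buffer length, then its raw bytes.
    private void writeObject(ObjectOutputStream out) throws IOException {
        out.defaultWriteObject();
        byte[] data = contents();
        out.writeInt(data.length);
        out.write(data);
    }

    // Custom deserialization: read the bytes back and rebuild the buffer.
    private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
        in.defaultReadObject();
        byte[] data = new byte[in.readInt()];
        in.readFully(data);
        buffer = ByteBuffer.wrap(data);
    }
}
```

The same pattern would have to be applied to every class in the object graph that holds a buffer, which is why the commenter notes the work is substantial.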
These are the changes I made (sorry, an unnecessary whitespace diff is included) using my tool.
I was looking at the "Tuning Spark" document for Spark 1.2.0, and there is a section mentioning that using serialization helps reduce memory usage on Spark. Interestingly, there is also a downside to this:
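For context, the memory-related advice in "Tuning Spark" is usually applied by switching the serializer rather than changing library classes. The property name below is from the Spark documentation; the application jar and class name are placeholders.

```shell
# Sketch: enable Kryo serialization for a Spark job.
# spark.serializer is a real Spark property; TokenizeJob and app.jar are placeholders.
spark-submit \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --class com.example.TokenizeJob \
  app.jar
```

The downside the document warns about is the CPU cost: serialized storage trades extra (de)serialization time for lower memory usage.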
@akkikiki By the way, I think we all understand Japanese, so it should be no problem to write in Japanese, right?
Sorry, I'm not familiar with Kuromoji, but I think Kuromoji reads the dictionary file during processing, which is not well suited to Serializable. If Kuromoji had a new mode that kept all the data in memory, it could become Serializable, I think.
On Apache Spark, instances must be serializable for parallel processing, but Kuromoji tokenizers are not, so they must be initialized each time.
If tokenizers were serializable, we could decrease processing time.
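An alternative to making the tokenizer serializable is to ship only a small serializable wrapper and build the heavy object lazily, once per JVM (i.e. once per Spark executor rather than once per record). The sketch below is hypothetical and uses a plain Object in place of a Kuromoji Tokenizer so it runs without Spark or Kuromoji on the classpath.

```java
import java.io.Serializable;

// Hypothetical workaround sketch: the wrapper is cheap to serialize;
// the expensive tokenizer is created lazily and cached per JVM.
class LazyTokenizerWrapper implements Serializable {
    // Static field: never serialized; lives once per JVM (per Spark executor).
    private static Object tokenizer;
    static int buildCount = 0; // counts expensive initializations, for illustration

    Object get() {
        synchronized (LazyTokenizerWrapper.class) {
            if (tokenizer == null) {
                buildCount++;             // the expensive dictionary load would
                tokenizer = new Object(); // happen here, e.g. new Tokenizer()
            }
            return tokenizer;
        }
    }
}
```

With this pattern, closures passed to Spark capture only the wrapper, and each executor pays the dictionary-loading cost once instead of once per task or record.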