Normalized surface in user dictionary. #126
Comments
Thanks! Could you give an example of what kind of normalisations you'd like to see? I'm wondering if we might already support it in the full/expanded user dictionary format in 1.0-SNAPSHOT.
The Token class has the getBaseForm method, which, as you know well, I regard as a kind of surface normalization. With this method we can get a normalized surface, provided a base form is registered for each morpheme.
However, in the current implementation of the ipadic user dictionary, there seems to be no way to register a base form for a user dictionary word. Instead of a base form, we can only register the reading and the split surface.
In the current implementation of UserDictionary, SIMPLE_USERDICT_FIELDS is set to 4, as follows:
The simple userdict fields are the following:
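The snippet and field list are not preserved in this thread. As a rough sketch only, assuming the four fields are surface, segmentation, readings, and part of speech (the layout used in Kuromoji's documented simple user dictionary examples), a line could be parsed like this. The class `SimpleUserDictLine` and its `parse` method are hypothetical names for illustration, not Kuromoji's actual code:

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration; not the real Kuromoji UserDictionary class.
class SimpleUserDictLine {
    // Assumed to match the constant discussed above.
    static final int SIMPLE_USERDICT_FIELDS = 4;

    final String surface;             // field 0: the word as it appears in text
    final List<String> segmentation;  // field 1: space-separated split of the surface
    final List<String> readings;      // field 2: space-separated reading per segment
    final String partOfSpeech;        // field 3: part-of-speech label

    SimpleUserDictLine(String surface, List<String> segmentation,
                       List<String> readings, String partOfSpeech) {
        this.surface = surface;
        this.segmentation = segmentation;
        this.readings = readings;
        this.partOfSpeech = partOfSpeech;
    }

    // Splits one comma-separated line into the four assumed fields.
    static SimpleUserDictLine parse(String line) {
        String[] fields = line.split(",", -1);
        if (fields.length != SIMPLE_USERDICT_FIELDS) {
            throw new IllegalArgumentException(
                "Expected " + SIMPLE_USERDICT_FIELDS + " fields but got " + fields.length);
        }
        return new SimpleUserDictLine(
            fields[0],
            Arrays.asList(fields[1].split(" ")),
            Arrays.asList(fields[2].split(" ")),
            fields[3]);
    }

    public static void main(String[] args) {
        SimpleUserDictLine entry =
            parse("日本経済新聞,日本 経済 新聞,ニホン ケイザイ シンブン,カスタム名詞");
        System.out.println(entry.surface + " -> " + entry.segmentation);
    }
}
```

Under this assumed layout there is no slot for a base form, which is the limitation described above.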
I think a base form is needed more often in typical use cases. Alternatively, the segmentationValue could be allowed to carry the base form.
Must the length of the segmentationValue without spaces equal the length of the surface, because term splitting is done using the offset and length of each split word?
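To make the question concrete, here is a minimal sketch of what offset-based splitting would imply, assuming each segment's offset is the running sum of the preceding segment lengths. `SegmentationOffsets` and `offsets` are hypothetical names, and this is not taken from Kuromoji's source:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of offset/length-based splitting.
class SegmentationOffsets {
    // Returns one {offset, length} pair per segment. Throws if the segments,
    // with spaces removed, do not cover exactly the length of the surface.
    static List<int[]> offsets(String surface, String segmentationValue) {
        List<int[]> result = new ArrayList<>();
        int offset = 0;
        for (String segment : segmentationValue.split(" ")) {
            result.add(new int[] {offset, segment.length()});
            offset += segment.length();
        }
        if (offset != surface.length()) {
            throw new IllegalArgumentException(
                "Segments cover " + offset + " chars but surface has " + surface.length());
        }
        return result;
    }

    public static void main(String[] args) {
        for (int[] pair : offsets("日本経済新聞", "日本 経済 新聞")) {
            System.out.println("offset=" + pair[0] + " length=" + pair[1]);
        }
    }
}
```

If splitting works this way, a segmentationValue holding a base form whose length differs from the surface would fail the check, so it could not double as a normalization field without further changes.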
In the current ipadic implementation, there seems to be no way to normalize a surface in the user dictionary.
Is this right?
I think this functionality is very useful and is needed in common situations.
So I plan to extend the user dictionary to support normalizing a word's surface while keeping the current specification of the user dictionary resource format.
What do you think about this?