Word prefix, stem, suffix splitter #7

Open
aodhan-domhnaill opened this issue May 26, 2018 · 0 comments


Hey,

I'm learning Latvian, and I scrape stories off the internet to read for practice. I wrote a word splitter that breaks each word into prefix, stem, and suffix. It's a rough hack that maximizes the log likelihood of the cut locations for each word in the corpus. I wrote it to autogenerate vocab lists for myself.

It has various weaknesses, notably that it only splits off one prefix, so words like "neaiziet" might not get split well. It also scales poorly, roughly O(n^2), because I made the number of parameters grow with the number of words. That could be fixed with a better cutting model (e.g. an RNN or CNN) that has a fixed number of parameters.

You can see it does reasonably well. For example, it gave these splits:

(('mācīt', 'āj', 's'), -49.39630998974996, 5, 7)
(('no', 'nes', 'īšot'), -58.99538681815128, 2, 5)
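
For anyone who wants to try the idea, below is a minimal Python sketch of a likelihood-based three-way splitter. It is not the code behind the results above, which isn't included in this issue. The function names (`candidate_splits`, `split_corpus`), the add-one smoothing with an assumed `vocab` size, the hard-EM style refitting loop, and the toy word list are all assumptions made for illustration. The output tuples mimic the layout above, read here as ((prefix, stem, suffix), log likelihood, first cut, second cut).

```python
import math
from collections import Counter


def candidate_splits(word):
    """Yield cut pairs (i, j): prefix=word[:i], stem=word[i:j], suffix=word[j:]."""
    n = len(word)
    for i in range(n):                 # the prefix may be empty
        for j in range(i + 1, n + 1):  # the stem is non-empty; the suffix may be empty
            yield i, j


def pieces(word, i, j):
    return word[:i], word[i:j], word[j:]


def log_prob(counts, total, piece, alpha=1.0, vocab=1000):
    # Add-alpha smoothing; `vocab` is an assumed constant, not a fitted value.
    return math.log((counts[piece] + alpha) / (total + alpha * vocab))


def score(word, i, j, tables):
    """Log likelihood of one split: sum of the three piece log probabilities."""
    return sum(
        log_prob(counts, total, piece)
        for (counts, total), piece in zip(tables, pieces(word, i, j))
    )


def split_corpus(words, iterations=5):
    words = sorted({w for w in words if w})  # unique, non-empty, deterministic order

    # Initialise the prefix/stem/suffix counts from every possible split.
    counts = [Counter(), Counter(), Counter()]
    for w in words:
        for i, j in candidate_splits(w):
            for c, p in zip(counts, pieces(w, i, j)):
                c[p] += 1

    best = {}
    for _ in range(iterations):
        tables = [(c, sum(c.values())) for c in counts]
        # Pick the highest-scoring cut pair for each word under the current counts.
        best = {
            w: max(candidate_splits(w), key=lambda ij: score(w, ij[0], ij[1], tables))
            for w in words
        }
        # Refit the counts using only the chosen splits (hard EM).
        counts = [Counter(), Counter(), Counter()]
        for w, (i, j) in best.items():
            for c, p in zip(counts, pieces(w, i, j)):
                c[p] += 1

    tables = [(c, sum(c.values())) for c in counts]
    return [(pieces(w, i, j), score(w, i, j, tables), i, j) for w, (i, j) in best.items()]


if __name__ == "__main__":
    # Toy word list for illustration only; a real run would use a scraped corpus.
    toy = ["iet", "noiet", "aiziet", "ieiet", "nest", "nonest", "aiznest", "ienest"]
    for row in split_corpus(toy):
        print(row)
```

Like the splitter described above, this sketch allows only one prefix per word. Trying every cut pair also costs O(L^2) per word of length L, so it is only meant for small experiments.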