Hey,
I'm learning Latvian, and I scrape stories off the internet to read for practice. I wrote a word stem splitter that breaks words up into prefix, stem, and suffix. It's a rough hack that maximizes the log likelihood of the cut locations for each word in the corpus. I wrote it to autogenerate vocab lists for myself.
It has various weaknesses, notably that it only splits off a single prefix, so words like "neaiziet" might not get split well. It also scales poorly, like O(n^2), because I made the parameters scale with the number of words; that could be fixed with a better cutting model (e.g. an RNN or CNN) with a fixed number of parameters.
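To give a flavor of what I mean by maximizing the log likelihood of the cut locations, here's a minimal sketch of a hard-EM loop in that spirit. This is not my actual code: the function names, the add-one smoothing, and the initialization from all candidate cuts are just illustrative choices.

```python
import math
from collections import Counter

def candidate_splits(word):
    """All (prefix, stem, suffix) cuts with a nonempty stem."""
    n = len(word)
    return [(word[:i], word[i:j], word[j:])
            for i in range(n) for j in range(i + 1, n + 1)]

def best_split(word, prefixes, stems, suffixes):
    """Pick the cut maximizing summed log counts (add-one smoothed)."""
    def score(cut):
        p, s, e = cut
        return (math.log(prefixes[p] + 1)
                + math.log(stems[s] + 1)
                + math.log(suffixes[e] + 1))
    return max(candidate_splits(word), key=score)

def split_corpus(words, iterations=5):
    """Hard EM: count pieces under the current cuts, then re-cut each word."""
    # Initialize counts from every candidate cut (one vote per distinct piece
    # per word), so shared affixes start out with higher counts.
    prefixes, stems, suffixes = Counter(), Counter(), Counter()
    for w in words:
        cuts = candidate_splits(w)
        prefixes.update({p for p, _, _ in cuts})
        stems.update({s for _, s, _ in cuts})
        suffixes.update({e for _, _, e in cuts})
    splits = {}
    for _ in range(iterations):
        splits = {w: best_split(w, prefixes, stems, suffixes) for w in words}
        prefixes, stems, suffixes = Counter(), Counter(), Counter()
        for p, s, e in splits.values():
            prefixes[p] += 1
            stems[s] += 1
            suffixes[e] += 1
    return splits
```

Because the piece counts grow with the vocabulary, scoring every candidate cut against them is where the quadratic-ish scaling comes from; a fixed-parameter scorer would replace the Counter lookups.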
You can see it does reasonably well. It gave the splittings: