Word prefix, stem, suffix splitter #7

aodhan-domhnaill · 2018-05-26T16:07:17Z

Hey,

I'm learning Latvian, and I scrape stories off the internet to read for practice. I wrote a word splitter that breaks each word into prefix, stem, and suffix. It's a rough hack that picks the cut locations for each word in the corpus by maximizing the log likelihood. I wrote it to autogenerate vocab lists for myself.
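To make the idea concrete, here is a minimal sketch of a likelihood-based splitter. It is not the actual code, just an illustration under simple assumptions: every possible pair of cut points is counted once per word, the three pieces are scored independently with add-one smoothing, and the names `candidate_splits`, `fit_counts`, and `best_split` are made up for this example.

```python
import math
from collections import Counter


def candidate_splits(word):
    """Yield every (i, j) with 0 <= i < j <= len(word).

    The split is prefix=word[:i], stem=word[i:j], suffix=word[j:];
    prefix and suffix may be empty, the stem may not.
    """
    n = len(word)
    for i in range(n):
        for j in range(i + 1, n + 1):
            yield i, j


def fit_counts(corpus):
    """Count how often each piece appears over all candidate splits."""
    prefixes, stems, suffixes = Counter(), Counter(), Counter()
    for word in corpus:
        for i, j in candidate_splits(word):
            prefixes[word[:i]] += 1
            stems[word[i:j]] += 1
            suffixes[word[j:]] += 1
    return prefixes, stems, suffixes


def best_split(word, counts):
    """Return ((prefix, stem, suffix), log_likelihood, i, j) for the best cut."""
    totals = [sum(c.values()) for c in counts]

    def score(i, j):
        parts = (word[:i], word[i:j], word[j:])
        # Add-one smoothing keeps unseen pieces from producing log(0).
        return sum(
            math.log((c[p] + 1) / (t + len(c)))
            for p, c, t in zip(parts, counts, totals)
        )

    i, j = max(candidate_splits(word), key=lambda ij: score(*ij))
    return (word[:i], word[i:j], word[j:]), score(i, j), i, j


# Toy corpus, far too small to give meaningful splits.
corpus = ["nonest", "noiet", "aiziet", "aiznest"]
counts = fit_counts(corpus)
result = best_split("nonest", counts)
```

A real version would need a much larger corpus and probably some re-estimation of the counts from the chosen cuts; this sketch only shows the scoring and the search over cut positions.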

It has various weaknesses, notably that it only splits off one prefix, so words like "neaiziet" might not get split well. It also scales poorly, roughly O(n^2), because I made the number of parameters grow with the number of words, but that could be fixed with a better cutting model (e.g. an RNN or CNN) with a fixed number of parameters.

You can see it does reasonably well. For example, it gave these splittings:

(('mācīt', 'āj', 's'), -49.39630998974996, 5, 7)
(('no', 'nes', 'īšot'), -58.99538681815128, 2, 5)
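Reading those tuples: the first element is the (prefix, stem, suffix) triple, and the two trailing integers look like the cut positions in the word, with the float presumably being the log likelihood. That reading is an assumption based on these two examples, but it is easy to check, since slicing at those positions reproduces the pieces (the helper name `split_at` is only for illustration):

```python
def split_at(word, i, j):
    """Cut a word into (prefix, stem, suffix) at character positions i and j."""
    return word[:i], word[i:j], word[j:]


first = split_at("mācītājs", 5, 7)
second = split_at("nonesīšot", 2, 5)
```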
