Incorrect tokenization with a custom Tibetan tokenizer #13671
ykyogoku asked this question in Help: Other Questions
I trained a pipeline (tok2vec, morphologizer) for Tibetan using a custom tokenizer based on botok. However, the trained pipeline mis-tokenizes certain Tibetan sentences. For example, the following sentence is segmented incorrectly:
དེ་ནི་སྙན་ངག་གསར་རྩོམ་ལ་འཇུག་པའི་སྤྱིའི་ཐབས་ཚུལ་ཡང་ཡིན།
(transliteration: de ni snyan ngag gsar rtsom la 'jug pa'i spyi'i thabs tshul yang yin |)
The words la (Eng. "to") and 'jug pa (Eng. "apply/applied") should be separated: they are always separate in the training dataset, and botok, the Tibetan tokenizer integrated into the pipeline, tokenizes this sentence correctly on its own. Similarly, de (Eng. "that") and ni (topic particle), which should also be separated, are not split correctly.
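To make the mismatch concrete, a minimal comparison of the two segmentations looks roughly like this (the model path is a placeholder, and the WordTokenizer call follows botok's documented API rather than my exact wrapper):

```python
# Compare botok's segmentation with the trained spaCy pipeline's tokens.
# The model path is a placeholder; WordTokenizer usage follows botok's docs.
import spacy
from botok import WordTokenizer

text = "དེ་ནི་སྙན་ངག་གསར་རྩོམ་ལ་འཇུག་པའི་སྤྱིའི་ཐབས་ཚུལ་ཡང་ཡིན།"

# Reference segmentation straight from botok
wt = WordTokenizer()
botok_tokens = [t.text for t in wt.tokenize(text)]

# Segmentation produced by the trained pipeline
nlp = spacy.load("./output/model-best")  # placeholder path
spacy_tokens = [t.text for t in nlp(text)]

print("botok:", botok_tokens)
print("spaCy:", spacy_tokens)
```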
I wonder if something went wrong in the spaCy training process. Can anyone help identify the cause? I checked similar issues, such as this one, but I could not find one that matches this specific case.
My Environment
I tested pipelines trained in different environments:
spaCy 3.6.x and spaCy 3.2.x (both under Python 3.7.x)
The configuration file is as follows:
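(Only the [nlp] block is reproduced here as a sketch; the registered tokenizer name botok_tokenizer and the language code are placeholders standing in for my actual values.)

```ini
# Sketch of the [nlp] block only; the registry name and language code are placeholders.
[nlp]
lang = "bo"
pipeline = ["tok2vec","morphologizer"]
tokenizer = {"@tokenizers":"botok_tokenizer"}
```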