Skip to content

Identify if each of the words in a Persian sentence need a kasr-e-ezafeh tag or not.

License

Notifications You must be signed in to change notification settings

mahvash-siavashpour/BERT-Token-Classification-for-Persian-Kasr-e-Ezafeh

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

10 Commits
 
 
 
 
 
 

Repository files navigation

BERT Token Tagger for Persian Kasr-e Ezafeh

Each persion word in a sentance either has a Kasr-e Eafeh or it does not. Kas-e Ezafeh comes in 3 different forms as shown bellow:

  • @e
  • e
  • ve
  • y
  • ye

Using a BERT token classification model from pretrained HooshvareLab/distilbert-fa-zwnj-base enables us to train the model on this specific task. The dataset could not be published due to copyright but this is a sample of what it looks like:

سلسله	ye
ایرانی	O
در	O
سوادکوه	e
مازندران	O
حکومت	O
می‌کردند	O
.	O
نفوذ	e
برامکه	O
)	O
وزرای	ye
ایرانی	O
(	O
در	O
دربار	e
عباسی	O

Each token of the data has a label indicating its Kasr-e Ezafeh. Tokens without a tag are labeled as O .

Training

The model is trained using the Huggingface library with a pretrained bert model from HooshvareLab/distilbert-fa-zwnj-base .
After preprocessing the data and tokenizing, it is fed into the model. The model is trained in 3 epochs with a batch size of 16.

Results

The results during the training phase after each epoch are shown bellow:

Epoch F1 Accuracy
1 0.930487 0.985994
2 0.939853 0.987362
3 0.940642 0.987561

About

Identify if each of the words in a Persian sentence need a kasr-e-ezafeh tag or not.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published