Indian Language Tagger and Chunker (Hindi, Telugu, Tamil, Marathi, Punjabi, Kannada, Malayalam, Urdu, Bengali)

avineshpvs/indic_tagger

Indic Tagger (Indian Language Tagger)

In this project, we build part-of-speech (POS) taggers, chunkers, and named-entity recognition (NER) taggers for Indian languages.

Languages supported: Telugu (te), Hindi (hi), Tamil (ta), Marathi (mr), Punjabi (pa), Kannada (kn), Malayalam (ml), Urdu (ur), Bengali (bn)

If you reuse this software, please use the following citation:

@inproceedings{PVS:SPSAL2007,
  author    = {P.V.S., Avinesh and Gali, Karthik},
  title     = {Part of Speech Tagging and Chunking using Conditional Random Fields and Transformation Based Learning},
  booktitle = {Proceedings of the Shallow Parsing for South Asian Languages (SPSAL) Workshop, held at IJCAI-07, Hyderabad, India},
  series    = {{SPSAL} Workshop Proceedings},
  month     = {January},
  year      = {2007},
  pages     = {21--24},
}

Training Data Statistics and System Performances (F1 macro)

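| Language | # Words | # Sents | CRF POS | CRF Chunk | BI-LSTM-CRF POS | BI-LSTM-CRF Chunk |
|----------|---------|---------|---------|-----------|-----------------|-------------------|
| te       | 347k    | 30k     | 93%     | 96%       | 92%             | 92%               |
| hi       | 350k    | 16.3k   | 93%     | 97%       | 94%             | 93%               |
| bn       | 298.3k  | 14.6k   | 84%     | 95%       | 85%             | 88%               |
| pa       | 152.5k  | 5.6k    | 92%     | 98%       | 94%             | 96%               |
| mr       | 207.9k  | 8.5k    | 89%     | 95%       | 88%             | 90%               |
| ur       | 158.9k  | 7.6k    | 90%     | 96%       | 92%             | 89%               |
| ta       | 337k    | 14.2k   | 88%     | 92%       | 87%             | 85%               |
| ml       | 192k    | 11.4k   | 96%     | 95%       | 98%             | 98%               |
| kn       | 294.3k  | 16.5k   | 90%     | 98%       | 88%             | 87%               |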
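
For readers unfamiliar with the metric, here is a minimal sketch of how a macro-averaged F1 score can be computed from gold and predicted tag sequences. It is an illustration only, not the evaluation code used to produce the numbers above: the function name `macro_f1` and the example tags are assumptions, and the sketch scores individual tokens, whereas chunking results are often scored over whole spans.

```python
from collections import Counter


def macro_f1(gold, pred):
    """Unweighted mean of per-tag F1 over every tag seen in gold or pred.

    Hypothetical helper for illustration; token-level scoring is assumed.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted tag p was wrong
            fn[g] += 1  # gold tag g was missed

    scores = []
    for tag in set(gold) | set(pred):
        predicted = tp[tag] + fp[tag]
        actual = tp[tag] + fn[tag]
        precision = tp[tag] / predicted if predicted else 0.0
        recall = tp[tag] / actual if actual else 0.0
        total = precision + recall
        scores.append(2 * precision * recall / total if total else 0.0)

    # Every tag counts equally, regardless of how frequent it is.
    return sum(scores) / len(scores) if scores else 0.0
```

Because each tag contributes equally to the average, rare tags influence a macro score as much as frequent ones, which is why it can sit well below plain token accuracy.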

Training Data Statistics and System Performances (F1 macro) for NER

Languages # Words # Sents CRF NER BI-LSTM-CRF NER
te 347k 30k 69% 65%
hi 503k 19k 62% 63%
bn 120k 6k 54% 48%
ur 35k 1.5k 65% 56%
or 93k 1.8k 68% 43%

Install using Anaconda

    # Create and activate the Python environment
    conda create -n tagger3.6 anaconda python=3.6
    source activate tagger3.6
    
    # Install the tokenizer
    cd polyglot-tokenizer
    python setup.py install
    
    # Install requirements
    pip install -r requirements.txt

Run

    python pipeline.py -p predict -l te -t pos -m crf -f txt -e utf -i input_file -o output_file

    -l, --languages       select language (2 letter ISO-639 code)
                          {hi, be, ml, pu, te, ta, ka, mr, ur}
    -t, --tag_type        pos, chunk, parse, ner
    -m, --model_type      crf, hmm, lstm
    -f, --data_format     ssf, txt, conll
    -e, --encoding        utf8, wx   (default: utf8)
    -i, --input_file      <input-file>
    -o, --output_file     <output-file>
    -s, --sent_split      True/False (default: True)

    python pipeline.py --help
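
When the tagger needs to be called from another Python script, the command line above can be assembled programmatically. The following is a minimal sketch under stated assumptions: `build_command` and `run_tagger` are hypothetical helpers, not part of this repository, and they use only the short flags shown above. The default encoding value `utf` mirrors the example command.

```python
import subprocess
import sys


def build_command(pipeline, lang, tag_type, model, data_format,
                  encoding="utf", input_file=None, output_file=None):
    """Assemble the argument list for pipeline.py (hypothetical helper)."""
    cmd = [
        sys.executable, "pipeline.py",
        "-p", pipeline,      # train or predict
        "-l", lang,          # language code, e.g. te
        "-t", tag_type,      # pos, chunk, parse, ner
        "-m", model,         # crf, hmm, lstm
        "-f", data_format,   # ssf, txt, conll
        "-e", encoding,
    ]
    if input_file is not None:
        cmd += ["-i", input_file]
    if output_file is not None:
        cmd += ["-o", output_file]
    return cmd


def run_tagger(cmd, repo_dir="."):
    """Run the command from the repository root; raises if the exit code is non-zero."""
    return subprocess.run(cmd, cwd=repo_dir, check=True)
```

For example, `run_tagger(build_command("predict", "te", "pos", "crf", "txt", input_file="input_file", output_file="output_file"))` would issue the same call as the command at the top of this section. Passing the arguments as a list avoids shell quoting problems with file paths.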

Train the POS tagger:

    # CRF model
    python pipeline.py -p train -o outputs -l te -t pos -m crf -e utf -f ssf
    
    # BI-LSTM-CRF model
    python pipeline.py -p train -t pos -f conll -m lstm -e utf -l te

Predict on text:

    # CRF models 
    python pipeline.py -p predict -l te -t pos -m crf -f txt -e utf -i data/test/te/test.utf.txt
    
    # BI-LSTM-CRF models
    python pipeline.py -p predict -l te -t pos -m lstm -f txt -e utf -i data/test/te/test.utf.txt
    
    # spaCy models
    python spacy_tagger_test.py -l te -t pos

Train the NER tagger:

    # CRF model
    python pipeline.py -p train -o outputs -l te -t ner -m crf -e utf -f conll
    
    # BI-LSTM-CRF model
    python pipeline.py -p train -t ner -f conll -m lstm -e utf -l te
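
The training commands above read data in `conll` format. As a rough illustration of what such data usually looks like, the sketch below parses a generic CoNLL-style layout: one token per line with the tag in the last whitespace-separated column, and a blank line between sentences. This layout is an assumption, since the exact column order expected by `pipeline.py` is not documented here, and `read_conll` and the example tags are hypothetical.

```python
def read_conll(lines):
    """Group CoNLL-style lines into sentences of (token, tag) pairs.

    Assumes the token is the first column and the tag is the last column.
    """
    sentences, current = [], []
    for line in lines:
        line = line.strip()
        if not line:
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        columns = line.split()
        current.append((columns[0], columns[-1]))
    if current:
        # The last sentence may not be followed by a blank line.
        sentences.append(current)
    return sentences
```

A file can be passed directly, for example `read_conll(open(path, encoding="utf-8"))`, because the function only needs an iterable of lines.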

Predict NER on text:

    # CRF model
    python pipeline.py -p predict -l hi -t ner -m crf -f txt -e utf -i data/test/hi/test.utf.txt
    
    # BI-LSTM-CRF model
    python pipeline.py -p predict -l hi -t ner -m lstm -f txt -e utf -i data/test/hi/test.utf.txt

ToDo List

  • Telugu, Hindi trained CRF models
  • Bengali, Punjabi, Marathi, Urdu, Tamil trained CRF models
  • Malayalam, Kannada trained CRF models (Bug: UTF-8 error)
  • Deep learning (BI-LSTM-CRF)
  • Analysis: comparison with other ML algorithms
  • Bug: Punjabi and Urdu training files don't have a "|" or other end-of-sentence marker.
  • NER for Indian Languages
  • Feature addition to BI-LSTM-CRF models
  • Active Learning based sampling strategies