Neural language identification and normalisation in code switching data tailored with a three-step decoding process
pip install -r requirements.txt
python build_viterbi.py build_ext --inplace
Download models from csnli-models.
bunzip2 lm/*
bunzip2 dicts/*
bunzip2 lid_models/*
bunzip2 nmt_models/*
>>> from three_step_decoding import *
>>> tsd = ThreeStepDecoding('lid_models/hinglish', htrans='nmt_models/rom2hin.pt', etrans='nmt_models/eng2eng.pt')
>>> print '\n'.join(['\t'.join(x) for x in tsd.tag_sent(u'i thght mosam dfrnt hoga bs fog h')])
i i en
thght thought en
mosam मौसम hi
dfrnt different en
hoga होगा hi
bs बस hi
fog fog en
h है hi
>>> print '\n'.join(['\t'.join(x) for x in tsd.tag_sent(u'kafi dprsng situation hai yar')])
kafi काफी hi
dprsng depressing en
situation situation en
hai है hi
yar यार hi
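The three columns printed above are the raw token, its normalised form, and its language tag. The following is a minimal sketch of how such output could be post-processed, assuming that `tag_sent` yields 3-item sequences in that order (as the printed rows suggest). The helper names `normalised_sentence` and `language_counts` are hypothetical and not part of this package; the data is copied from the second example rather than produced by a live model, and the sketch targets Python 3, whereas the session above is Python 2.

```python
from collections import Counter

# Rows copied from the example output above:
# (raw token, normalised form, language tag).
# Assumption: tag_sent yields 3-item sequences in this order.
tagged = [
    ("kafi", "काफी", "hi"),
    ("dprsng", "depressing", "en"),
    ("situation", "situation", "en"),
    ("hai", "है", "hi"),
    ("yar", "यार", "hi"),
]


def normalised_sentence(triples):
    """Join the normalised forms into one space-separated string."""
    return " ".join(norm for _raw, norm, _lang in triples)


def language_counts(triples):
    """Count how many tokens were assigned to each language tag."""
    return Counter(lang for _raw, _norm, lang in triples)


print(normalised_sentence(tagged))
print(dict(language_counts(tagged)))
```

With a loaded model, `tagged` would presumably be replaced by the result of `tsd.tag_sent(...)`.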
>>> from lang_tagger import *
>>> lid = LID(model='lid_models/hinglish', etrans='nmt_models/eng2eng.pt', htrans='nmt_models/rom2hin.pt')
>>> lid.tag_sent(u'i thght mosam dfrnt hoga bs fog h'.split())
[(u'i', u'en'), (u'thght', u'en'), (u'mosam', u'hi'), (u'dfrnt', u'en'), (u'hoga', u'hi'), (u'bs', u'hi'), (u'fog', u'en'), (u'h', u'hi')]
>>> lid.tag_sent(u'kafi dprsng situation hai yar'.split())
[(u'kafi', u'hi'), (u'dprsng', u'en'), (u'situation', u'en'), (u'hai', u'hi'), (u'yar', u'hi')]
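The tagger output above is a list of (token, tag) pairs, so it can be grouped into contiguous same-language runs to locate code-switch points. This is a minimal sketch under that assumption; `language_runs` and `count_switch_points` are hypothetical helpers, not functions of this package, and the pairs are copied from the first example instead of calling `lid.tag_sent`. It targets Python 3.

```python
from itertools import groupby

# (token, tag) pairs copied from the first tagger example above.
pairs = [
    ("i", "en"), ("thght", "en"), ("mosam", "hi"), ("dfrnt", "en"),
    ("hoga", "hi"), ("bs", "hi"), ("fog", "en"), ("h", "hi"),
]


def language_runs(tagged_pairs):
    """Group consecutive tokens that share a language tag."""
    return [
        (lang, [token for token, _tag in group])
        for lang, group in groupby(tagged_pairs, key=lambda pair: pair[1])
    ]


def count_switch_points(tagged_pairs):
    """A switch point is a boundary between two adjacent runs."""
    return max(len(language_runs(tagged_pairs)) - 1, 0)


for lang, tokens in language_runs(pairs):
    print(lang, " ".join(tokens))
print("switch points:", count_switch_points(pairs))
```

Grouping by adjacent tags rather than counting tags globally keeps word order, which matters when the runs are passed on to language-specific processing.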
python lang_tagger.py --test test_file --load lid_models/hinglish --etrans nmt_models/eng2eng.pt --htrans nmt_models/rom2hin.pt --out output_file
python three_step_decoding.py --test test_file --lid lid_models/hinglish --etrans nmt_models/eng2eng.pt --htrans nmt_models/rom2hin.pt --out output_file
python lang_tagger.py --help

Language Identification System

optional arguments:
  -h, --help            show this help message and exit
  --dynet-seed SEED
  --train TRAIN         CONLL/TNT Train file
  --dev DEV             CONLL/TNT Dev/Test file
  --test TEST           Raw Test file
  --eng-pretrained-embd EEMBD
                        Pretrained word2vec Embeddings
  --hin-pretrained-embd HEMBD
                        Pretrained word2vec Embeddings
  --elimit ELIMIT       load top-n English word vectors (default=all vectors,
                        recommended=400k)
  --hlimit HLIMIT       load top-n Hindi word vectors (default=all vectors,
                        recommended=200k)
  --trainer TRAINER     Trainer [cysgd|momsgd|adam|adadelta|adagrad|amsgrad]
  --activation-fn ACT_FN
                        Activation function [tanh|relu|sigmoid]
  --iter ITER           No. of Epochs
  --bvec BVEC           1 if binary embedding file else 0
  --etrans ETRANS       OpenNMT English Transliteration Model
  --htrans HTRANS       OpenNMT Hindi Transliteration Model
  --save-model SAVE_MODEL
                        Specify path to save model
  --load-model LOAD_MODEL
                        Load Pretrained Model
  --output-file OFILE   Output File
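The lang_tagger.py command shown earlier passes `--load` and `--out`, while the help text lists `--load-model` and `--output-file`. One plausible explanation, assuming the script uses Python's `argparse` with default settings (the help layout suggests this), is that `argparse` accepts any unambiguous prefix of a long option. The parser below is a hypothetical stand-in that declares only a few of the listed options; it is not the script's actual parser, and the `dest` names are illustrative.

```python
import argparse

# Hypothetical stand-in parser: declares a subset of the options
# from the help text; dest names are assumptions for illustration.
parser = argparse.ArgumentParser(description="Language Identification System")
parser.add_argument("--test", dest="test", help="Raw Test file")
parser.add_argument("--load-model", dest="load_model", help="Load Pretrained Model")
parser.add_argument("--output-file", dest="ofile", help="Output File")
parser.add_argument("--etrans", dest="etrans")
parser.add_argument("--htrans", dest="htrans")

# "--load" and "--out" are unambiguous prefixes, so argparse resolves
# them to "--load-model" and "--output-file" respectively.
args = parser.parse_args([
    "--test", "test_file",
    "--load", "lid_models/hinglish",
    "--etrans", "nmt_models/eng2eng.pt",
    "--htrans", "nmt_models/rom2hin.pt",
    "--out", "output_file",
])

print(args.load_model)
print(args.ofile)
```

Spelling the options out in full (`--load-model`, `--output-file`) should be equivalent under the same assumption and is less likely to break if a new option with the same prefix is added later.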
Any publication reporting the work done using this data should cite the following papers:
@inproceedings{bhat2017joining,
  title={Joining Hands: Exploiting Monolingual Treebanks for Parsing of Code-mixing Data},
  author={Bhat, Irshad and Bhat, Riyaz A and Shrivastava, Manish and Sharma, Dipti},
  booktitle={Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers},
  volume={2},
  pages={324--330},
  year={2017}
}

@inproceedings{bhat2018universal,
  title={Universal Dependency Parsing for Hindi-English Code-Switching},
  author={Bhat, Irshad and Bhat, Riyaz A and Shrivastava, Manish and Sharma, Dipti},
  booktitle={Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)},
  volume={1},
  pages={987--998},
  year={2018}
}
Irshad Ahmad Bhat
MS-CSE IIITH, Hyderabad
[email protected]
[email protected]