This repository contains the first Western Armenian-English parallel corpus, with a total of 52.8k parallel sentences. The data was collected to build the first machine translation system between the two languages. Western Armenian is one of the standardized variants of Modern Armenian and is an endangered language (see: https://unesdoc.unesco.org/ark:/48223/pf0000187026). It is spoken mainly by members of the Armenian Diaspora living in various countries around the world. The corpus is released to help researchers build resources for Western Armenian. Sources from both print media and the internet were considered and added to the collection.
The parallel corpus is also described in the paper "The First Parallel Corpus and Neural Machine Translation Model of Western Armenian and English", which was presented at SIGUL24, a satellite workshop of the LREC-COLING conference that focuses on low-resource languages. Please refer to the paper for more details about the corpus and the neural machine translation model trained on it.
The parallel corpus contains the following datasets:
Dataset Name | Domain | # Examples |
---|---|---|
AALW | Correspondences (Formal & Informal) | 2135 |
Bible | Religion | 30604 |
Hamazkayin | News, Art, Literature, Biographies | 10739 |
Hayern Aysor | News, Official | 5422 |
Wikipedia | Biographies, Pop Culture, History, Science | 3979 |
TOTAL | | 52879 |
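
As a quick sanity check on the table, the per-dataset counts can be summed and turned into shares of the whole corpus. The sketch below is only illustrative: it hard-codes the numbers from the table rather than reading any files from this repository, and the variable names (`counts`, `shares`) are chosen for this example.

```python
# Per-dataset example counts, copied by hand from the table above
# (not read from the repository's data files).
counts = {
    "AALW": 2135,
    "Bible": 30604,
    "Hamazkayin": 10739,
    "Hayern Aysor": 5422,
    "Wikipedia": 3979,
}

# The sum should agree with the TOTAL row of the table.
total = sum(counts.values())
assert total == 52879

# Fraction of the corpus contributed by each dataset, largest first.
shares = {name: n / total for name, n in counts.items()}
for name, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<13}{counts[name]:>6}  {share:6.1%}")
```

The Bible dataset accounts for roughly 58% of all examples, so the religious domain dominates the corpus. Anyone training or evaluating on the full collection may want to keep this imbalance in mind.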
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.