Western Armenian - English Parallel Corpus

This repository contains the first Western Armenian - English parallel corpus, with a total of 52,879 parallel sentences (about 52.8k). The data was collected to build the first machine translation system between the two languages. Western Armenian is an endangered language (see: https://unesdoc.unesco.org/ark:/48223/pf0000187026) and one of the standardized variants of Modern Armenian. It is spoken mainly by members of the Armenian Diaspora living in countries around the world. The corpus is released to help researchers build resources for Western Armenian. Sources from both printed media and the internet were included in the collection.

The parallel corpus is described in the paper "The First Parallel Corpus and Neural Machine Translation Model of Western Armenian and English", which was presented at SIGUL24, a satellite workshop of the LREC-COLING conference that focuses on low-resource languages. Please refer to the paper if you'd like to know more about the corpus and the neural machine translation model trained on it.

The parallel corpus contains the following datasets:

| Dataset Name | Domain | # Examples |
| --- | --- | --- |
| AALW | Correspondences (Formal & Informal) | 2135 |
| Bible | Religion | 30604 |
| Hamazkayin | News, Art, Literature, Biographies | 10739 |
| Hayern Aysor | News, Official | 5422 |
| Wikipedia | Biographies, Pop Culture, History, Science | 3979 |
| TOTAL | | 52879 |
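
As a quick sanity check on the figures above, the following minimal Python sketch recomputes the total and each dataset's share of the corpus. The counts are copied from the table; the `dataset_sizes` dictionary and the `domain_shares` helper are illustrative names and are not part of this repository.

```python
# Sentence-pair counts copied from the table above.
# The variable and function names here are illustrative, not part of the repo.
dataset_sizes = {
    "AALW": 2135,
    "Bible": 30604,
    "Hamazkayin": 10739,
    "Hayern Aysor": 5422,
    "Wikipedia": 3979,
}


def domain_shares(sizes):
    """Return each dataset's share of the corpus as a percentage."""
    corpus_total = sum(sizes.values())
    return {name: 100 * count / corpus_total for name, count in sizes.items()}


total = sum(dataset_sizes.values())
shares = domain_shares(dataset_sizes)

# Print datasets from largest to smallest, followed by the total.
for name, pct in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<14}{dataset_sizes[name]:>7}{pct:>7.1f}%")
print(f"{'TOTAL':<14}{total:>7}")
```

The Bible dataset accounts for more than half of the sentence pairs, which is worth keeping in mind when sampling or evaluating across domains.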

License

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.