Western Armenian - English Parallel Corpus

This repository contains the first Western Armenian - English parallel corpus, with a total of 52,879 parallel sentences. The data was collected to build the first machine translation system between the two languages. Western Armenian is an endangered language (see: https://unesdoc.unesco.org/ark:/48223/pf0000187026) and one of the standardized variants of Modern Armenian. It is spoken mainly by members of the Armenian diaspora living in various countries around the world. The corpus is released to support researchers building resources for Western Armenian. Sources from both print media and the internet were included in the collection.

The parallel corpus is described in the paper "The First Parallel Corpus and Neural Machine Translation Model of Western Armenian and English", which was presented at SIGUL24, a workshop on low-resource languages held as a satellite event of the LREC-COLING conference. Please refer to the paper if you would like to know more about the corpus and the neural machine translation model trained on it.

The parallel corpus contains the following datasets:

Dataset Name	Domain	# Examples
AALW	Correspondences (Formal & Informal)	2135
Bible	Religion	30604
Hamazkayin	News, Art, Literature, Biographies	10739
Hayern Aysor	News, Official	5422
Wikipedia	Biographies, Pop Culture, History, Science	3979
	TOTAL	52879
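
As a quick sanity check on the table above, the following minimal sketch recomputes the total and shows how the sentence pairs are distributed across the datasets. The counts are copied from the table; the `counts` dictionary and the `domain_shares` helper are illustrative names and are not part of the repository.

```python
# Per-dataset sentence-pair counts, copied from the table above.
counts = {
    "AALW": 2135,
    "Bible": 30604,
    "Hamazkayin": 10739,
    "Hayern Aysor": 5422,
    "Wikipedia": 3979,
}


def domain_shares(counts):
    """Return each dataset's fraction of the whole corpus (hypothetical helper)."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


total = sum(counts.values())
shares = domain_shares(counts)

# Print datasets from largest to smallest with their share of the corpus.
for name, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{name:<13}{counts[name]:>6}  {share:6.1%}")
print(f"{'TOTAL':<13}{total:>6}")
```

The Bible dataset alone accounts for more than half of the sentence pairs, so anyone splitting the corpus for training and evaluation may want to sample per dataset rather than uniformly over the whole collection.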

License

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
