This repository contains code for the paper Quality Does Matter: A Detailed Look at the Quality and Utility of Web-Mined Parallel Corpora (2024). If you have any questions, please feel free to create a GitHub issue.
We used the fairseq library to train our vanilla Transformer models.
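A minimal fairseq training invocation for a vanilla Transformer, as a sketch only: the data directory, save directory, and hyperparameter values below are illustrative assumptions, not the exact settings used in the paper.

```shell
# Assumes data has already been binarized with fairseq-preprocess
# into data-bin/ (source/target dictionaries plus train/valid splits).
fairseq-train data-bin/ \
    --arch transformer \
    --optimizer adam --adam-betas '(0.9, 0.98)' \
    --lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --max-tokens 4096 \
    --save-dir checkpoints/
```

Adjust the learning-rate schedule and batch size (`--max-tokens`) to your hardware; the values above are common fairseq defaults for translation, not settings confirmed by the paper.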
Model Name | English -> Sinhala | English -> Tamil |
---|---|---|
NLLB cleaned translators top_25K | link | link |
NLLB cleaned translators complete | link | link |
The translator-cleaned data is now released:
Dataset Name | Language direction | Link |
---|---|---|
NLLB Cleaned | English -> Sinhala | link |
NLLB Cleaned | English -> Tamil | link |
- Upload links to data sets
- Update links to models
- Release code for filtering data using LASER3
- Release notes on how to train models with the fairseq library
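Until the LASER3 filtering code is released, the general idea, keeping only sentence pairs whose multilingual sentence embeddings are similar, can be sketched as below. The function name and threshold are illustrative; the embeddings are assumed to be precomputed (e.g. with a LASER3 encoder), and this is not the repository's actual filtering script.

```python
import numpy as np

def filter_parallel_pairs(src_emb: np.ndarray,
                          tgt_emb: np.ndarray,
                          threshold: float = 0.8) -> np.ndarray:
    """Return indices of sentence pairs to keep.

    src_emb, tgt_emb: (n, d) arrays of precomputed sentence embeddings,
    one row per sentence, aligned by index. A pair is kept if the cosine
    similarity between its two embeddings is at least `threshold`.
    """
    # L2-normalize each row so the dot product equals cosine similarity.
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    sims = np.sum(src * tgt, axis=1)  # row-wise cosine similarity
    return np.nonzero(sims >= threshold)[0]
```

The kept indices can then be used to select lines from the parallel source and target files before binarizing with fairseq.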
```bibtex
@misc{ranathunga2024quality,
    title={Quality Does Matter: A Detailed Look at the Quality and Utility of Web-Mined Parallel Corpora},
    author={Surangika Ranathunga and Nisansa de Silva and Menan Velayuthan and Aloka Fernando and Charitha Rathnayake},
    year={2024},
    eprint={2402.07446},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```