This repository contains code for the paper Quality Does Matter: A Detailed Look at the Quality and Utility of Web-Mined Parallel Corpora (2024). If you have any questions, please feel free to create a GitHub issue.
We used the fairseq library to train our vanilla Transformer models.
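A minimal fairseq training invocation for a vanilla Transformer, as a sketch only: the data directory, save directory, and hyperparameter values below are illustrative assumptions, not the exact settings used in the paper.

```shell
# Assumes data has already been binarized with fairseq-preprocess
# into data-bin/ (source/target dictionaries plus train/valid splits).
fairseq-train data-bin/ \
    --arch transformer \
    --optimizer adam --adam-betas '(0.9, 0.98)' \
    --lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --max-tokens 4096 \
    --save-dir checkpoints/
```

Adjust the learning-rate schedule and batch size (`--max-tokens`) to your hardware; the values above are common fairseq defaults for translation, not settings confirmed by the paper.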
Model Name | English -> Sinhala | English -> Tamil |
---|---|---|
NLLB cleaned translators top_25K | link | link |
NLLB cleaned translators complete | link | link |
The translator-cleaned data is now released:
Dataset Name | Language direction | Link |
---|---|---|
NLLB Cleaned | English -> Sinhala | link |
NLLB Cleaned | English -> Tamil | link |
- Upload links to data sets
- Update links to models
- Release code for filtering data using LASER3
- Release notes on how to train models with the fairseq library
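Until the LASER3 filtering code is released, the general idea, keeping only sentence pairs whose multilingual sentence embeddings are similar, can be sketched as below. The function name and threshold are illustrative; the embeddings are assumed to be precomputed (e.g. with a LASER3 encoder), and this is not the repository's actual filtering script.

```python
import numpy as np

def filter_parallel_pairs(src_emb: np.ndarray,
                          tgt_emb: np.ndarray,
                          threshold: float = 0.8) -> np.ndarray:
    """Return indices of sentence pairs to keep.

    src_emb, tgt_emb: (n, d) arrays of precomputed sentence embeddings,
    one row per sentence, aligned by index. A pair is kept if the cosine
    similarity between its two embeddings is at least `threshold`.
    """
    # L2-normalize each row so the dot product equals cosine similarity.
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    sims = np.sum(src * tgt, axis=1)  # row-wise cosine similarity
    return np.nonzero(sims >= threshold)[0]
```

The kept indices can then be used to select lines from the parallel source and target files before binarizing with fairseq.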
```bibtex
@misc{ranathunga2024quality,
    title={Quality Does Matter: A Detailed Look at the Quality and Utility of Web-Mined Parallel Corpora},
    author={Surangika Ranathunga and Nisansa de Silva and Menan Velayuthan and Aloka Fernando and Charitha Rathnayake},
    year={2024},
    eprint={2402.07446},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```