This repository contains links to Belarusian Natural Language and Speech Processing resources and datasets.
It is inspired by a similar project with Ukrainian Speech Processing resources: egorsmkv/speech-recognition-uk
- add detailed descriptions to each list item
- evaluate models on benchmarks and log their performance
- wav2vec2 trained on Common Voice 8 + kenlm language model trained on Common Voice 8:
- Model: ales/wav2vec2-cv-be
- Demo: ales/wav2vec2-cv-be-lm
- Code: navalnica/wav2vec2-belarusian
- whisper:
- original openai/whisper models
- Whisper models fine-tuned on Belarusian Common Voice 11 dataset:
- Whisper Small:
- Model: ales/whisper-small-belarusian
- test WER on CommonVoice11: 6.79
- Demo: ales/whisper-small-belarusian-demo
- Code: navalnica/whisper-finetuning-be
- Whisper Base:
- Nvidia NeMo models:
- nvidia/stt_be_conformer_ctc_large
- [huggingface self-reported metric] test WER on CommonVoice10: 4.8
- nvidia/stt_be_conformer_transducer_large
- [huggingface self-reported metric] test WER on CommonVoice10: 3.8
- nvidia/stt_be_fastconformer_hybrid_large_pc
- [huggingface self-reported metric] test WER on CommonVoice12: 2.72
- [huggingface self-reported metric] test WER P&C on CommonVoice12: 3.87
- ESPnet:
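The test WER figures quoted for the models above are word error rates: the word-level edit distance between the reference transcript and the model output, divided by the number of reference words (the listed values appear to be percentages). The following is a minimal sketch of that calculation, assuming plain whitespace tokenization and no text normalization; the function name `word_error_rate` is illustrative, and the model cards may preprocess text differently before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,         # reference word missing from hypothesis
                    curr[j - 1] + 1,     # extra word in hypothesis
                    prev[j - 1] + cost,  # match or substitution
                )
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Multiply by 100 to compare with percentage-style figures.
    print(100 * word_error_rate("добры дзень сябры", "добры дзень"))
```

Because the denominator is the reference length, a hypothesis with many inserted words can score above 1.0 (above 100%).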
Model comparisons grouped by dataset. TODO
- Common Voice. Speech recognition dataset
- Dataset from knihi.com. TODO: what is the type of dataset?
- google/fleurs
- ssrlab: TODO. Speech recognition dataset
- CoquiAI implementations
- jhlfrfufyfn/bel-tts. GlowTTS + HifiGan
- Code
- Model
- Demo on HuggingFace
- Demo on a custom web-page. The source code for the demo page: here
- alex73/belarusian-tts. CoquiAI implementation by Yurii Paniv (@robinhad).
The original repo and models were deleted; only a fork is available now
- KoichiYasuoka/roberta-small-belarusian-upos
- stanfordnlp/stanza-be
- poritski/YABC_Tagger. Rule-based POS-tagger and lemmatizer.
Written in Perl. Uses poritski/YABC as a grammar base (?)
- volchek/beltagger.
An improved version of the poritski/YABC_Tagger rule-based POS-tagger and lemmatizer.
Cross-platform, written in C++.
Known issues:
- requires input data to be encoded in Windows-1251; UTF-8 is not supported
- tagset is not fully compatible with BNKorpus's tagset and grammar base
- the grammar base used is not complete enough. Belarus/GrammarDB is a better source of paradigms but is not incorporated yet
- suffix table calculation script is not ported from Perl to C++
- code uses the Boost library
- pkasila/bel-sklony - web page with Belarusian nouns declension. Demo: sklony.pkasila.net
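Since volchek/beltagger is listed above as requiring Windows-1251 input, UTF-8 text has to be re-encoded before it is passed to the tagger. The following is a minimal sketch using only the Python standard library; the helper names `to_cp1251` and `convert_file` and the file names are hypothetical, and how the tagger itself is invoked is not covered here.

```python
from pathlib import Path


def to_cp1251(text: str, errors: str = "strict") -> bytes:
    """Encode text as Windows-1251.

    With errors="strict" a UnicodeEncodeError is raised for characters
    that Windows-1251 cannot represent; errors="replace" substitutes "?".
    """
    return text.encode("cp1251", errors=errors)


def convert_file(src: str, dst: str, errors: str = "strict") -> None:
    """Read a UTF-8 file and write its content as Windows-1251."""
    text = Path(src).read_text(encoding="utf-8")
    Path(dst).write_bytes(to_cp1251(text, errors=errors))


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    Path("input_utf8.txt").write_text("Ўсё добра, сябры", encoding="utf-8")
    convert_file("input_utf8.txt", "input_cp1251.txt")
```

The Belarusian letters Ў, ў, І and і all have code points in Windows-1251, so ordinary Belarusian Cyrillic text converts without loss; strict mode mainly guards against stray symbols from other scripts.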
- oscar
- mc4
- poritski/YABC - Experimental Corpus of the Belarusian Language (Эксперыментальны корпус беларускай мовы, ЭКБМ)
- Belarus/GrammarDB - Grammar Database of the Belarusian language
- tsimafeip/Translator - Dataset with Russian-Belarusian translation pairs
- Universal Dependencies dataset:
- Tatoeba Belarusian sentences
- corpus.by
- ssrlab.by
- bnkorpus.info
- Belarus organization on GitHub
- nlproc.by community on GitHub
- nothing for now