The issue here is that the maximum fluency score threshold was set too high, so none of the lines are retained. Fluency values run much lower for Chinese. I think the fix would be to compute an appropriate fluency threshold in the config generator.
gregtatum changed the title from "dataset-hplt-mono_v1_2-zh failed" to "dataset-hplt-mono_v1_2-zh failed due to a too large fluency score in the config" on Jan 2, 2025
Interesting, thanks for investigating! So, it depends on the language... Maybe we should switch to HPLT 2.0 and see what approach they use there for cleaning.
The fluency scores for Chinese are very low due to unknown characters. As I said in #884:
The problem with Chinese is difficult to fix, though. The fluency scoring uses a character 7-gram KenLM model trained on a few hundred thousand sentences, and for Chinese that data does not reach the character coverage that alphabetic languages get (around 98%). As a result, the unknown characters penalize the fluency score far too heavily. I tinkered with KenLM and SentencePiece with byte fallback, and I think the scores got higher, but the problem was still far from solved. Maybe some sort of byte n-gram model could be a solution, but I'm not sure.
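A toy illustration of why low character coverage tanks the score. This is not the real KenLM scorer; the log-probabilities are made-up constants chosen only to show the mechanism: each out-of-vocabulary character contributes a large penalty, so a script with many unseen characters drags the per-character average far below what an alphabetic language would get.

```python
def char_ngram_score(text, known_chars, known_logprob=-2.0, unk_logprob=-12.0):
    """Toy stand-in for a character-level LM score (avg log-prob per char).

    known_logprob / unk_logprob are illustrative values: characters seen
    in training get a modest cost, unknown characters a very large one.
    """
    logprobs = [known_logprob if c in known_chars else unk_logprob
                for c in text]
    return sum(logprobs) / len(logprobs)

# Full coverage (typical for an alphabet): every character is known.
latin = char_ngram_score("hello world", set("abcdefghijklmnopqrstuvwxyz "))

# Partial coverage (the Chinese situation): half the characters unseen.
cjk = char_ngram_score("的是不了", set("的是"))
```

Here `latin` stays at the known-character cost, while `cjk` is pulled down by the unknown-character penalty even though the text itself is perfectly fluent, which matches the observation that Chinese fluency scores are low for coverage reasons rather than quality reasons.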
https://firefox-ci-tc.services.mozilla.com/tasks/TZWvhasASyWFpk4KnAfNmw/runs/0/logs/public/logs/live.log