The issue here is that the maximum fluency score threshold was set too high, so none of the lines are retained. Fluency values run much lower for Chinese. I think the fix would be to compute an appropriate fluency threshold in the config generator.
gregtatum changed the title from "dataset-hplt-mono_v1_2-zh failed" to "dataset-hplt-mono_v1_2-zh failed due to a too large fluency score in the config" on Jan 2, 2025
Interesting, thanks for investigating! So, it depends on the language... Maybe we should switch to HPLT 2.0 and see what approach they use there for cleaning.
The fluency scores for Chinese are very low due to unknown characters. As I said in #884:
The problem with Chinese is difficult to fix, though. The fluency scoring uses a character 7-gram KenLM model trained on a few hundred thousand sentences, and for Chinese that data does not reach the character coverage that alphabetic languages get (around 98%). As a result, the unknown characters penalize the fluency score far too heavily. I tinkered with KenLM and SentencePiece with byte fallback, and I think the scores got higher, but the problem was still far from solved. Maybe some sort of byte n-gram model could be a solution, but I'm not sure.
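A toy illustration of why low character coverage tanks the score. This is not the real KenLM scorer; the log-probabilities are made-up constants chosen only to show the mechanism: each out-of-vocabulary character contributes a large penalty, so a script with many unseen characters drags the per-character average far below what an alphabetic language would get.

```python
def char_ngram_score(text, known_chars, known_logprob=-2.0, unk_logprob=-12.0):
    """Toy stand-in for a character-level LM score (avg log-prob per char).

    known_logprob / unk_logprob are illustrative values: characters seen
    in training get a modest cost, unknown characters a very large one.
    """
    logprobs = [known_logprob if c in known_chars else unk_logprob
                for c in text]
    return sum(logprobs) / len(logprobs)

# Full coverage (typical for an alphabet): every character is known.
latin = char_ngram_score("hello world", set("abcdefghijklmnopqrstuvwxyz "))

# Partial coverage (the Chinese situation): half the characters unseen.
cjk = char_ngram_score("的是不了", set("的是"))
```

Here `latin` stays at the known-character cost, while `cjk` is pulled down by the unknown-character penalty even though the text itself is perfectly fluent, which matches the observation that Chinese fluency scores are low for coverage reasons rather than quality reasons.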
https://firefox-ci-tc.services.mozilla.com/tasks/TZWvhasASyWFpk4KnAfNmw/runs/0/logs/public/logs/live.log