dataset-hplt-mono_v1_2-zh failed due to a too large fluency score in the config #944

Open
eu9ene opened this issue Nov 26, 2024 · 3 comments

@eu9ene
Collaborator

eu9ene commented Nov 26, 2024

https://firefox-ci-tc.services.mozilla.com/tasks/TZWvhasASyWFpk4KnAfNmw/runs/0/logs/public/logs/live.log

[importers.mono] Wrote 0 out of 200,000,000.
[task 2024-11-26T13:29:04.898Z] [memory] 172.9 MB (+0 B)
[task 2024-11-26T13:29:10.046Z] [importers.mono] Visited 56,070,000,000 lines
[task 2024-11-26T13:29:10.046Z] [importers.mono] Kept 4,656.
[task 2024-11-26T13:29:10.046Z] [importers.mono] Wrote 0 out of 200,000,000.
[task 2024-11-26T13:29:10.046Z] [memory] 172.9 MB (+0 B)
[task 2024-11-26T13:29:12.150Z] [downloads] A download error occurred: ('Connection broken: IncompleteRead(7507370880 bytes read, 1650142606 more expected)', IncompleteRead(7507370880 bytes read, 1650142606 more expected))
[task 2024-11-26T13:29:12.151Z] [downloads] Retrying in 60.0 sec
[task 2024-11-26T13:30:12.212Z] Traceback (most recent call last):
[task 2024-11-26T13:30:12.212Z]   File "/builds/worker/checkouts/vcs/pipeline/data/download-mono.py", line 147, in <module>
[task 2024-11-26T13:30:12.213Z]     main()
[task 2024-11-26T13:30:12.213Z]   File "/builds/worker/checkouts/vcs/pipeline/data/download-mono.py", line 92, in main
[task 2024-11-26T13:30:12.213Z]     download_hplt(
[task 2024-11-26T13:30:12.213Z]   File "/builds/worker/checkouts/vcs/pipeline/data/importers/mono/hplt.py", line 178, in download_hplt
[task 2024-11-26T13:30:12.213Z]     for document_json in document_stream:
[task 2024-11-26T13:30:12.213Z]   File "/builds/worker/checkouts/vcs/pipeline/common/downloads.py", line 353, in iter
[task 2024-11-26T13:30:12.213Z]     yield from lines
[task 2024-11-26T13:30:12.213Z]   File "/builds/worker/checkouts/vcs/pipeline/common/downloads.py", line 244, in read
[task 2024-11-26T13:30:12.213Z]     chunk = next(self.chunk_iter, None)
[task 2024-11-26T13:30:12.213Z]   File "/builds/worker/checkouts/vcs/pipeline/common/downloads.py", line 329, in download_chunks
[task 2024-11-26T13:30:12.213Z]     raise Exception("The download failed.")
[task 2024-11-26T13:30:12.213Z] Exception: The download failed.
[taskcluster 2024-11-26 13:30:12.451Z] === Task Finished ===
[taskcluster 2024-11-26 13:30:12.674Z] Unsuccessful task run with exit code: 1 completed in 66503.834 seconds
@eu9ene eu9ene added the bug Something is broken or not correct label Nov 26, 2024
@gregtatum
Member

The issue here is that the fluency score threshold in the config was set too high, so none of the lines are saved (the log shows "Wrote 0 out of 200,000,000"). Fluency values are much lower for Chinese. I think the fix would be to compute a suitable fluency threshold in the config generator.

url=https://data.hplt-project.org/one/monotext/cleaned/zh/zh_59.jsonl.zst

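curl -s "$url" -o - -L         |
  zstd --decompress --stdout   |
  python max_scores.py

# max_scores.py
# Print the highest value in each document's "scores" list (JSONL on stdin).
import json
import sys

def extract_max_scores(file):
    for line in file:
        try:
            data = json.loads(line)
        except json.JSONDecodeError:
            print("Invalid JSON line", file=sys.stderr)
            continue
        scores = data.get("scores") or []
        if not scores:
            # max() raises ValueError on an empty sequence, so report and skip.
            print("Document has no scores", file=sys.stderr)
            continue
        print(max(scores))

if __name__ == "__main__":
    extract_max_scores(sys.stdin)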

This outputs:

0.764
0.764
0.764
0.764
0.764
0.764
0.764
0.885
0.764
0.764
0.764
0.764
0.764
0.764
0.764
0.764
0.764
0.764
0.799
0.764
0.764
0.764
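
One possible way to "compute a good fluency score" would be to sample per-document maximum scores like the ones listed here and derive a threshold from their distribution. The following is only a minimal sketch, not the pipeline's config generator: the name `suggest_fluency_threshold`, the `keep_fraction` parameter, and the percentile approach itself are assumptions made for illustration.

```python
import math

def suggest_fluency_threshold(max_scores, keep_fraction=0.5):
    """Return a threshold such that roughly `keep_fraction` of the sampled
    documents have a maximum score at or above it (hypothetical helper)."""
    if not max_scores:
        raise ValueError("need at least one sampled score")
    if not 0 < keep_fraction <= 1:
        raise ValueError("keep_fraction must be in (0, 1]")
    ranked = sorted(max_scores, reverse=True)
    # Index of the lowest-ranked document that should still pass the filter.
    cutoff = math.ceil(keep_fraction * len(ranked)) - 1
    return ranked[cutoff]

# The 22 values from the zh_59 sample output in this comment.
sample = [0.764] * 7 + [0.885] + [0.764] * 10 + [0.799] + [0.764] * 3

print(suggest_fluency_threshold(sample, keep_fraction=0.05))
print(suggest_fluency_threshold(sample, keep_fraction=1.0))
```

Because these are per-document maxima, any threshold above a document's value keeps nothing from that document, so the result is better read as an upper bound than as a tuned setting. A real implementation would also need a much larger sample than 22 documents.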

@gregtatum gregtatum changed the title dataset-hplt-mono_v1_2-zh failed dataset-hplt-mono_v1_2-zh failed due to a too large fluency score in the config Jan 2, 2025
@eu9ene
Collaborator Author

eu9ene commented Jan 2, 2025

Interesting, thanks for investigating! So, it depends on the language... Maybe we should switch to HPLT 2.0 and see what approach they use there for cleaning.

@eu9ene eu9ene mentioned this issue Jan 3, 2025
@ZJaume
Collaborator

ZJaume commented Jan 8, 2025

The fluency scores for Chinese are very low due to unknown characters. As I said in #884:

The problem with Chinese is difficult to fix, though. The fluency scoring uses a character 7-gram KenLM model trained on a few hundred thousand sentences, and for Chinese this data does not cover 98% of the characters in the language, as it does for languages that use alphabets. The unknown characters therefore penalize the fluency score too heavily. I tinkered with KenLM and SentencePiece with byte fallback, and I think the scores got higher, but the problem was still far from solved. Maybe some sort of byte n-gram model could be a solution, but I'm not sure.
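
The coverage gap described in this quote can be made concrete by measuring how many characters of new text already appeared in a training sample. The sketch below is illustrative only: `char_coverage` and `byte_ngrams` are hypothetical helpers, the toy strings are made up, and nothing here reproduces the actual KenLM setup. It also shows why a byte-level model has no unknown symbols, since any UTF-8 text maps onto at most 256 byte values.

```python
def char_coverage(train_text, test_text):
    """Fraction of non-space characters in test_text that occur in train_text."""
    known = set(train_text)
    chars = [c for c in test_text if not c.isspace()]
    if not chars:
        return 1.0
    return sum(c in known for c in chars) / len(chars)

def byte_ngrams(text, n=3):
    """UTF-8 byte n-grams of text; the symbol inventory is bounded by 256."""
    data = text.encode("utf-8")
    return [data[i:i + n] for i in range(len(data) - n + 1)]

# Toy data (made up): an alphabetic script reuses a small set of letters...
print(char_coverage("the quick brown fox jumps over the lazy dog", "a lazy fox"))
# ...while Chinese text easily contains characters the sample never saw.
print(char_coverage("我喜欢学习", "他们喜欢旅行"))
# At the byte level, these characters become sequences of known byte values.
print(byte_ngrams("喜欢", n=3))
```

With a character-level model, every character outside the training set is scored as unknown, which is the penalty described above. A byte-level view removes the unknown symbols, but whether it produces usable fluency scores is an open question in this thread.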

@eu9ene eu9ene self-assigned this Jan 15, 2025