CJK corpora fixes (mozilla#937)
* Remove num_mismatch filter for Chinese

It removes many sentence pairs that contain exactly the same numbers.
This seems to happen when the Chinese side has the numbers glued to
the Chinese characters.

* Remove displaystyle in WikiMatrix

* Normalize to full-width punctuation

* Do not omit the character preceding the period

---------

Co-authored-by: Evgeny Pavlov <[email protected]>
ZJaume and eu9ene authored Nov 25, 2024
1 parent 3598c78 commit 9e8641b
Showing 4 changed files with 69 additions and 16 deletions.
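
The commit message attributes the false positives of `num_mismatch` to digits glued to Chinese characters. The filter's implementation is not part of this commit, so the following is only a minimal sketch of one way such a mismatch can arise, assuming a hypothetical number extractor that relies on word boundaries. In Python's `re`, CJK ideographs count as word characters, so `\b` does not fire between an ideograph and a digit. The names `BOUNDED`, `UNBOUNDED`, and `numbers` are illustrative and not taken from the pipeline.

```python
import re

# Hypothetical extractors for illustration only; this is NOT the num_mismatch code.
BOUNDED = re.compile(r"\b\d+\b")  # digit runs delimited by word boundaries
UNBOUNDED = re.compile(r"\d+")    # any digit run, regardless of neighbours


def numbers(pattern, sentence):
    """Return every digit run that the given pattern finds in the sentence."""
    return pattern.findall(sentence)


en = "The law passed in 2024."
zh = "该法律于2024年通过。"  # the digits touch Chinese characters on both sides

print(numbers(BOUNDED, en))    # the English number is delimited by space and period
print(numbers(BOUNDED, zh))    # no word boundary next to the ideographs
print(numbers(UNBOUNDED, zh))  # a boundary-free pattern still sees the number
```

Under this assumption the two sides would be reported as carrying different numbers even though both contain 2024, which is consistent with the behaviour the commit message describes.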
@@ -0,0 +1,21 @@
{
"version": 1,
"files": [
],
"filters": [
{
"filter": "regexp",
"parameters": {
"PATTERN": "s#\\{(?:[^{}]++|(?R))*+\\}##g"
},
"language": "<src>"
},
{
"filter": "regexp",
"parameters": {
"PATTERN": "s#\\{(?:[^{}]++|(?R))*+\\}##g"
},
"language": "<trg>"
}
]
}
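
The `PATTERN` above uses regex recursion (`(?R)`) to delete every balanced `{...}` group, which is how nested `{\displaystyle ...}` fragments get removed in one substitution. Python's built-in `re` module has no recursion support, so the sketch below reproduces the same effect with a depth counter instead. The function names are hypothetical, and the handling of unbalanced braces (left untouched) is a reading of the pattern, not something the commit states.

```python
def _matching_close(text, start):
    """Return the index of the '}' that closes the '{' at `start`, or None."""
    depth = 0
    for j in range(start, len(text)):
        if text[j] == "{":
            depth += 1
        elif text[j] == "}":
            depth -= 1
            if depth == 0:
                return j
    return None


def strip_balanced_braces(text):
    """Delete every balanced {...} group, including nested groups."""
    out = []
    i = 0
    while i < len(text):
        if text[i] == "{":
            end = _matching_close(text, i)
            if end is not None:
                i = end + 1  # skip the whole balanced group
                continue
        out.append(text[i])  # ordinary character or unbalanced brace
        i += 1
    return "".join(out)


print(strip_balanced_braces("area {\\displaystyle x^{2}} squared"))
```

An opening brace with no matching close is kept as-is, and the scan continues from the next character, so a balanced group nested inside an unbalanced one is still removed.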
35 changes: 27 additions & 8 deletions pipeline/clean/opuscleaner/configs/en-zh/default.filters.json
@@ -22,6 +22,33 @@
},
"language": "<src>"
},
{
"_comment": "Normalize to full-width punctuation",
"filter": "opus.RegExpSub",
"parameters": {
"patterns": [
[
"([\\u3400-\\u4dbf\\u4e00-\\u9fff\\uf900-\\ufaff!\uff01\uff1f\\?])\\?",
"\\1\uff1f",
0,
""
],
[
"([\\u3400-\\u4dbf\\u4e00-\\u9fff\\uf900-\\ufaff!\uff01\uff1f\\?])\\!",
"\\1\uff01",
0,
""
],
[
"([\\u3400-\\u4dbf\\u4e00-\\u9fff\\uf900-\\ufaff])\\.\\s*(?!\\s*\\.)",
"\\1\uff61",
0,
""
]
]
},
"language": null
},
{
"filter": "max_length",
"parameters": {
@@ -56,14 +83,6 @@
},
"language": null
},
{
"filter": "num_mismatch",
"parameters": {
"RATIO": 1,
"DEBUG": false
},
"language": null
},
{
"filter": "fasttext_filter",
"parameters": {
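
The three `opus.RegExpSub` patterns added in this file can be tried outside the pipeline. The sketch below applies them with Python's `re.sub`, assuming the four fields of each entry are pattern, replacement, count, and flags (with a count of 0 meaning "replace all"). The `normalize` helper and the constant names are hypothetical. The replacements are copied from the diff as shown, including U+FF61 for the period; Unicode names that code point HALFWIDTH IDEOGRAPHIC FULL STOP, while the ideographic full stop usually seen in Chinese text is U+3002.

```python
import re

# CJK ranges used by all three patterns in the diff.
CJK = r"\u3400-\u4dbf\u4e00-\u9fff\uf900-\ufaff"

# (pattern, replacement) pairs transcribed from the JSON config.
PATTERNS = [
    # '?' after a CJK character or another '!'/'?' mark becomes U+FF1F.
    (r"([" + CJK + r"!\uff01\uff1f\?])\?", "\\1\uff1f"),
    # '!' after a CJK character or another '!'/'?' mark becomes U+FF01.
    (r"([" + CJK + r"!\uff01\uff1f\?])\!", "\\1\uff01"),
    # '.' after a CJK character becomes U+FF61, unless another '.' follows.
    # The captured character is written back via \1 so it is not dropped.
    (r"([" + CJK + r"])\.\s*(?!\s*\.)", "\\1\uff61"),
]


def normalize(text):
    """Apply the three substitutions in order, replacing every match."""
    for pattern, replacement in PATTERNS:
        text = re.sub(pattern, replacement, text, count=0)
    return text


for sample in ["你好?", "真的吗?!", "我来了. 你呢", "等等...", "Hello?"]:
    print(repr(sample), "->", repr(normalize(sample)))
```

Because the character class also contains the punctuation marks themselves, a mixed sequence such as `?!` is converted mark by mark across the two passes. The negative lookahead leaves ASCII ellipses alone, and text with no CJK character before the punctuation is unchanged.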
@@ -0,0 +1,21 @@
{
"version": 1,
"files": [
],
"filters": [
{
"filter": "regexp",
"parameters": {
"PATTERN": "s#\\{(?:[^{}]++|(?R))*+\\}##g"
},
"language": "<src>"
},
{
"filter": "regexp",
"parameters": {
"PATTERN": "s#\\{(?:[^{}]++|(?R))*+\\}##g"
},
"language": "<trg>"
}
]
}
8 changes: 0 additions & 8 deletions pipeline/clean/opuscleaner/configs/zh-en/default.filters.json
@@ -56,14 +56,6 @@
},
"language": null
},
{
"filter": "num_mismatch",
"parameters": {
"RATIO": 1,
"DEBUG": false
},
"language": null
},
{
"filter": "fasttext_filter",
"parameters": {