Add support for parallel data curation #193

shuoyangd · 2024-08-08T21:34:05Z

Description

This PR adds support for parallel data curation. Namely:

A new dataset class ParallelDataset that supports loading and writing parallel data in simple bitext format.
A new ScoreFilter subclass ParallelScoreFilter that allows application of existing monolingual filters on parallel data while maintaining the alignment of sentence/document pairs.
A new ScoreFilter subclass JointScoreFilter that allows implementation of filters that take both fields of the parallel sentence/document pairs.
New heuristic filters: HistogramFilter and LengthRatioFilter.
A new model-based filter built on quality estimation models: QualityEstimationFilter.
Support for two families of quality estimation models: comet and cometoid.
A tutorial for parallel data curation.
Tests accompanying new features.
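To illustrate the alignment point in the ParallelScoreFilter item above, here is a minimal pure-Python sketch. It is not the PR's implementation: `filter_pairs` and `max_words` are hypothetical names, and the real class operates on a ParallelDataset rather than a list of tuples. The idea shown is only that a pair is kept or dropped as a unit, so the source and target sides never drift out of alignment.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]


def filter_pairs(
    pairs: List[Pair],
    keep_src: Callable[[str], bool],
    keep_tgt: Callable[[str], bool],
) -> List[Pair]:
    """Keep a pair only if both sides pass their monolingual filter."""
    return [(src, tgt) for src, tgt in pairs if keep_src(src) and keep_tgt(tgt)]


def max_words(limit: int) -> Callable[[str], bool]:
    """Hypothetical monolingual filter: accept text with at most `limit` words."""
    return lambda text: len(text.split()) <= limit


pairs = [
    ("Hello world", "Hallo Welt"),
    ("A very long source sentence here", "Kurz"),
    ("Short", "Ein ziemlich langer Satz auf der Zielseite"),
]

# The second pair fails on the source side and the third on the target side;
# in both cases the whole pair is dropped.
kept = filter_pairs(pairs, max_words(3), max_words(3))
print(kept)
```

Different filters can be passed for each side, which mirrors the commit that allows ParallelScoreFilter to take different filters for source and target.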

Joint work at MTMA 2024 with @nverma1.

Usage

See tutorials/bitext_cleaning/main.py.
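For readers unfamiliar with the simple bitext format, the sketch below round-trips a tiny corpus using only the standard library. It assumes what the diff hunks in this PR suggest: two plain-text files, one segment per line, aligned by line number, with the language code as the file suffix. The helper names `write_simple_bitext` and `read_simple_bitext` are hypothetical and are not the functions added by this PR.

```python
import os
import tempfile
from typing import List, Tuple

Pair = Tuple[str, str]


def write_simple_bitext(
    pairs: List[Pair], prefix: str, src_lang: str, tgt_lang: str
) -> Tuple[str, str]:
    """Write one file per language; line i of each file forms pair i."""
    src_path = f"{prefix}.{src_lang}"
    tgt_path = f"{prefix}.{tgt_lang}"
    with open(src_path, "w", encoding="utf-8") as src_out, open(
        tgt_path, "w", encoding="utf-8"
    ) as tgt_out:
        for src, tgt in pairs:
            src_out.write(src + "\n")
            tgt_out.write(tgt + "\n")
    return src_path, tgt_path


def read_simple_bitext(src_path: str, tgt_path: str) -> List[Pair]:
    """Read two line-aligned files back into a list of (src, tgt) pairs."""
    with open(src_path, encoding="utf-8") as src_in, open(
        tgt_path, encoding="utf-8"
    ) as tgt_in:
        src_lines = [line.rstrip("\n") for line in src_in]
        tgt_lines = [line.rstrip("\n") for line in tgt_in]
    if len(src_lines) != len(tgt_lines):
        raise ValueError("source and target files must have the same number of lines")
    return list(zip(src_lines, tgt_lines))


with tempfile.TemporaryDirectory() as tmp:
    original = [("Hello", "Hallo"), ("Thank you", "Danke")]
    src_path, tgt_path = write_simple_bitext(
        original, os.path.join(tmp, "corpus"), "en", "de"
    )
    round_trip = read_simple_bitext(src_path, tgt_path)
```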

Checklist

I am familiar with the Contributing Guide.
New or Existing tests cover these changes.
The documentation is up to date with these changes.

Signed-off-by: Shuoyang Ding <[email protected]>

ryantwolf

Such good work! I have a few additional requests around documentation. Can you create one or more documentation pages about curating parallel datasets in our docs/user-guide/? You can see how it is currently rendered here. Also, please add all the classes and functions you expect a user to use to our API reference.

nemo_curator/datasets/__init__.py

nemo_curator/datasets/doc_dataset.py

nemo_curator/utils/distributed_utils.py

nemo_curator/utils/file_utils.py

ryantwolf · 2024-08-16T22:31:02Z

tests/test_read_simple_bitext.py

+        )
+
+        for idx, (src_line, tgt_line) in enumerate(zip(open(src_file), open(tgt_file))):
+            assert ds.df["src"].compute()[idx] == src_line.rstrip("\n")


Nit: instead of calling compute multiple times, call it once and then do all the assert statements.

shuoyangd · 2024-10-01

Interestingly, I get errors when I call it only once:

TypeError: Trying to convert dd.Scalar<series-..., dtype=bool> to a boolean value. Because Dask objects are lazily evaluated, they cannot be converted to a boolean value or used in boolean conditions like if statements. Try calling .compute() to force computation prior to converting to a boolean value or using in a conditional statement.

tutorials/bitext_cleaning/README.md

tutorials/bitext_cleaning/main.py

ryantwolf · 2024-08-16T22:32:27Z

tutorials/bitext_cleaning/README.md

Ahhh I love this! Thanks for adding a tutorial!

shuoyangd

Thanks for the very thorough review! I left a few questions and comments, mainly about the refactoring you proposed. Meanwhile, I'll proceed with the lower-level changes.

nemo_curator/filters/heuristic_filter.py

nemo_curator/modules/filter.py

nemo_curator/utils/distributed_utils.py

ryantwolf

Cool, just a couple of nits. I'll set up a dedicated call to walk through some of the model stuff, though; I think that might need a bit more work.

ryantwolf · 2024-10-21T15:29:26Z

nemo_curator/download/__init__.py

@@ -47,17 +47,7 @@
    "import_downloader",
    "import_extractor",
    "import_iterator",
-    "download_common_crawl",


I assume this was modified by mistake, so probably revert it unless there's something I'm missing.

ryantwolf · 2024-10-21T15:48:25Z

nemo_curator/utils/distributed_utils.py

+def _single_partition_write_to_simple_bitext(out_df, output_file_path):
+    src_output_file_path = output_file_path + f".{out_df['src_lang'].iloc[0]}"
+    tgt_output_file_path = output_file_path + f".{out_df['tgt_lang'].iloc[0]}"
+    with open(src_output_file_path, "w+") as src_out, open(


Just checking, has this been tested with multiple partitions? I just want to ensure no race condition happens when you have multiple workers writing to the same file.
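A common way to rule out this kind of race is to give every partition its own pair of output files, so no two workers ever open the same path. The sketch below shows that idea with threads and the standard library only. It makes no claim about how the PR derives `output_file_path`; the function `write_partition` and the `prefix.<partition>.<lang>` naming are assumptions made for illustration.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple

Pair = Tuple[str, str]


def write_partition(
    partition_id: int, pairs: List[Pair], prefix: str, src_lang: str, tgt_lang: str
) -> str:
    """Write one partition to files that are unique to that partition."""
    base = f"{prefix}.{partition_id}"
    with open(f"{base}.{src_lang}", "w", encoding="utf-8") as src_out, open(
        f"{base}.{tgt_lang}", "w", encoding="utf-8"
    ) as tgt_out:
        for src, tgt in pairs:
            src_out.write(src + "\n")
            tgt_out.write(tgt + "\n")
    return base


partitions = [
    [("Hello", "Hallo")],
    [("Thank you", "Danke"), ("Goodbye", "Auf Wiedersehen")],
]

with tempfile.TemporaryDirectory() as tmp:
    prefix = os.path.join(tmp, "corpus")
    # Two workers write concurrently, but never to the same file.
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = [
            pool.submit(write_partition, i, part, prefix, "en", "de")
            for i, part in enumerate(partitions)
        ]
        bases = [future.result() for future in futures]
    written = sorted(os.listdir(tmp))
```

The trade-off is that the output is a set of shards rather than a single file pair; concatenating the shards in partition order afterwards restores a single aligned corpus.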

ryantwolf · 2024-10-21T15:50:01Z

nemo_curator/filters/bitext_filter.py

+    def __call__(
+        self,
+        dataset: ParallelDataset,
+        metadata_field_name_mapping: Dict[str, str] = {},


Nit: Can metadata_field_name_mapping be moved to the __init__ method? I want to have the __call__ method be exclusively for the dataset so it's super easy to chain the filters together with nemo_curator.Sequential.
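A rough sketch of the shape being requested, using toy classes rather than the PR's code. `CharLengthFilter` is hypothetical, the local `Sequential` is only a stand-in for nemo_curator.Sequential, and the meaning given to the mapping here (the name the filter expects mapped to the column name in the data) is an assumption. Taking the mapping in `__init__` with a `None` default also avoids the shared mutable default that `= {}` creates in a function signature.

```python
from typing import Callable, Dict, List, Optional

Row = Dict[str, str]


class CharLengthFilter:
    """Hypothetical filter; all configuration lives in __init__."""

    def __init__(
        self,
        min_chars: int,
        max_chars: int,
        metadata_field_name_mapping: Optional[Dict[str, str]] = None,
    ):
        self.min_chars = min_chars
        self.max_chars = max_chars
        # Default field names, overridden by the caller's mapping if given.
        self.field_names = {"src": "src", "tgt": "tgt"}
        self.field_names.update(metadata_field_name_mapping or {})

    def __call__(self, dataset: List[Row]) -> List[Row]:
        src_key = self.field_names["src"]
        tgt_key = self.field_names["tgt"]
        return [
            row
            for row in dataset
            if self.min_chars <= len(row[src_key]) <= self.max_chars
            and self.min_chars <= len(row[tgt_key]) <= self.max_chars
        ]


class Sequential:
    """Stand-in for a pipeline: each module is called with the dataset only."""

    def __init__(self, modules: List[Callable[[List[Row]], List[Row]]]):
        self.modules = modules

    def __call__(self, dataset: List[Row]) -> List[Row]:
        for module in self.modules:
            dataset = module(dataset)
        return dataset


mapping = {"src": "en", "tgt": "de"}
pipeline = Sequential(
    [CharLengthFilter(3, 100, mapping), CharLengthFilter(0, 10, mapping)]
)

dataset = [
    {"en": "Hello", "de": "Hallo"},
    {"en": "Hi", "de": "Hallo"},
    {"en": "Good morning everyone", "de": "Guten Morgen zusammen"},
]
result = pipeline(dataset)
```

Because every module has the same one-argument call signature, the pipeline does not need to know which modules take a mapping and which do not.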

ryantwolf · 2024-10-21T15:51:17Z

nemo_curator/filters/bitext_filter.py

+        return self._score_bitext(**kwargs)
+
+    @abstractmethod
+    def _score_bitext(self, src, tgt, **kwargs):


Nit: To keep with the convention of DocumentFilter, rename these methods to score_bitext and keep_bitext.
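A minimal sketch of what the suggested naming could look like, with public score_bitext and keep_bitext methods on an abstract base class. The class names `ToyBitextFilter` and `ToyLengthRatioFilter` are hypothetical, and the word-count ratio used for scoring is an assumption for illustration, not necessarily how the PR's LengthRatioFilter computes its score.

```python
from abc import ABC, abstractmethod


class ToyBitextFilter(ABC):
    """Hypothetical base class: score a pair, then decide from the score."""

    @abstractmethod
    def score_bitext(self, src: str, tgt: str) -> float:
        """Return a score for one source/target pair."""

    @abstractmethod
    def keep_bitext(self, score: float) -> bool:
        """Return True if a pair with this score should be kept."""


class ToyLengthRatioFilter(ToyBitextFilter):
    """Drop pairs whose word counts differ by too large a factor."""

    def __init__(self, max_ratio: float = 3.0):
        self.max_ratio = max_ratio

    def score_bitext(self, src: str, tgt: str) -> float:
        # Clamp to 1 so an empty side cannot cause a division by zero.
        src_len = max(len(src.split()), 1)
        tgt_len = max(len(tgt.split()), 1)
        return max(src_len / tgt_len, tgt_len / src_len)

    def keep_bitext(self, score: float) -> bool:
        return score < self.max_ratio


length_filter = ToyLengthRatioFilter(max_ratio=2.0)
balanced = length_filter.score_bitext("a b", "x y z")
lopsided = length_filter.score_bitext("a b c d", "x")
```

Keeping scoring and the keep decision as separate methods lets a caller store the score as metadata without filtering, or filter later with a different threshold.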

shuoyangd and others added 30 commits August 8, 2024 14:28

add data interface to read simple bitext

c7a6423

Signed-off-by: Shuoyang Ding <[email protected]>

adding ParallelScoreFilter

4b3dc97

Signed-off-by: Shuoyang Ding <[email protected]>

add test for ParallelScoreFilter, small style change for ParallelDataset test, fix a few data and import bugs

114716e

Signed-off-by: Shuoyang Ding <[email protected]>

allow ParallelScoreFilter to take different filters for source and target

cbab143

Signed-off-by: Shuoyang Ding <[email protected]>

add JointScoreFilter and LengthRatioFilter

82f5486

Signed-off-by: Shuoyang Ding <[email protected]>

[WIP] add heuristic filter w/o test

f9a0535

Signed-off-by: Shuoyang Ding <[email protected]>

merge with main

8f25988

Signed-off-by: Shuoyang Ding <[email protected]>

add test for histogram filter, fix a few bugs

612249c

Signed-off-by: Shuoyang Ding <[email protected]>

length ratio, joint score filter testing

2fe4973

Signed-off-by: Shuoyang Ding <[email protected]>

fix typing in joint test

b61d7f1

Signed-off-by: Shuoyang Ding <[email protected]>

add a fake comet qe filter as an initial step

f63a1f9

Signed-off-by: Shuoyang Ding <[email protected]>

[WIP] adding bitext cleaning tutorial

76bced7

Signed-off-by: Shuoyang Ding <[email protected]>

[WIP] fixing example

1a2bb1e

Signed-off-by: Shuoyang Ding <[email protected]>

fix slow histogram filter, fix faulty bitext loading

74698d5

Signed-off-by: Shuoyang Ding <[email protected]>

tutorial running

bf2e6ac

Signed-off-by: Shuoyang Ding <[email protected]>

[WIP] documentation of bitext tutorial

62d1242

Signed-off-by: Shuoyang Ding <[email protected]>

add tested version of comet-qe filter

c413ea2

Signed-off-by: Shuoyang Ding <[email protected]>

fix ParallelDataset bug where single file name is not accepted, and dataset is sometimes turned into its parent class by mistake, add write to simple bitext functionality, update bitext tutorial

5a90038

Signed-off-by: Shuoyang Ding <[email protected]>

add docstring to explain simple bitext format, fix a bug where file extensions are removed twice before writing

f8046dd

Signed-off-by: Shuoyang Ding <[email protected]>

remove print line for debug

6c7aea4

Signed-off-by: Shuoyang Ding <[email protected]>

add comet filter to tutorial

a457995

Signed-off-by: Shuoyang Ding <[email protected]>

refactor COMET QE filter to decouple model from filter, make sure JointScoreFilter can take more than one field for source and target

c5a6f1c

Signed-off-by: Shuoyang Ding <[email protected]>

use refactored qe filter

61713e4

Signed-off-by: Shuoyang Ding <[email protected]>

wrap_qe_input should be a static method

a4d2bb3

Signed-off-by: Shuoyang Ding <[email protected]>

use conditional import for comet, formatting changes

0674400

Signed-off-by: Shuoyang Ding <[email protected]>

[WIP] add cometoid

6936f9a

Signed-off-by: Shuoyang Ding <[email protected]>

[WIP] attempt to resolve device conflict but is failing

da96d29

Signed-off-by: Shuoyang Ding <[email protected]>

[WIP] playing with cometoid arguments

14b7d70

Signed-off-by: Shuoyang Ding <[email protected]>

[WIP] -d 0 doesn't look necessary

b02b56d

Signed-off-by: Shuoyang Ding <[email protected]>

tested arguments for Cometoid

6c1e719

Signed-off-by: Shuoyang Ding <[email protected]>

shuoyangd added 7 commits August 8, 2024 14:28

use proper safe import, make sure test doesn't crash sans comet/pymarian

70a7fe8

Signed-off-by: Shuoyang Ding <[email protected]>

falling back to comet for tutorial since that's easier to set up, update README

c66d7f9

Signed-off-by: Shuoyang Ding <[email protected]>

give credit to original fairseq implementation of histogram filtering, run black formatter

861bd4d

Signed-off-by: Shuoyang Ding <[email protected]>

fix pre-commit complaint

52ba08e

Signed-off-by: Shuoyang Ding <[email protected]>

fix small bug

62c254b

Signed-off-by: Shuoyang Ding <[email protected]>

fix another occurrence of the same bug

91ea9fa

Signed-off-by: Shuoyang Ding <[email protected]>

introduce shard limit to a single PyMarian API call to avoid memory leakage

12783ec

Signed-off-by: Shuoyang Ding <[email protected]>

ryantwolf self-requested a review August 15, 2024 18:11

shuoyangd added 2 commits August 16, 2024 11:10

repartition after reading simple bitext data

a65588a

Signed-off-by: Shuoyang Ding <[email protected]>

-d 0 is actually needed for pymarian

3f1d09b

Signed-off-by: Shuoyang Ding <[email protected]>

ryantwolf requested changes Aug 16, 2024

shuoyangd commented Aug 27, 2024

nemo_curator/filters/heuristic_filter.py (outdated, resolved)

nemo_curator/modules/filter.py (resolved)

nemo_curator/utils/distributed_utils.py (outdated, resolved)

remove duplicate LengthRatioFilter definition

102429a

Signed-off-by: Shuoyang Ding <[email protected]>

ryantwolf assigned shuoyangd Sep 16, 2024

refactor repeated code segment in file writing, change classifier to accommodate custom field names, pause doc repartition since it causes problems

8a367dd

Signed-off-by: Shuoyang Ding <[email protected]>

shuoyangd force-pushed the main branch from 8cef914 to 09273c9 Compare September 20, 2024 22:58

[WIP] addressed comments in NVIDIA#193 apart from resolving .iloc pattern, test currently failing

396d7ba

Signed-off-by: Shuoyang Ding <[email protected]>

shuoyangd force-pushed the main branch from 09273c9 to 396d7ba Compare September 20, 2024 23:00

shuoyangd added 6 commits October 1, 2024 12:20

refactor to resolve .loc pattern, test passing

eb4f4df

Signed-off-by: Shuoyang Ding <[email protected]>

add missing file

3addf44

Signed-off-by: Shuoyang Ding <[email protected]>

revert changes in setup.py

a14a78a

Signed-off-by: Shuoyang Ding <[email protected]>

fix a small bug in parallel dataset, explain why repartition is disabled, fix tutorial

6b8dfa0

Signed-off-by: Shuoyang Ding <[email protected]>

add api guide, small change on bitext/parallel score filter docstring

bb4f148

Signed-off-by: Shuoyang Ding <[email protected]>

fix read_simple_bitext test issues

d309744

Signed-off-by: Shuoyang Ding <[email protected]>

shuoyangd requested a review from ryantwolf October 1, 2024 22:50

Merge branch 'main' into main

21676bd

Signed-off-by: Shuoyang Ding <[email protected]>

shuoyangd force-pushed the main branch from e86e9df to 21676bd Compare October 2, 2024 04:42

reinstate dependencies lost during merging

7797925

Signed-off-by: Shuoyang Ding <[email protected]>

ryantwolf requested changes Oct 21, 2024
