Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for parallel data curation #193

Open
wants to merge 50 commits into
base: main
Choose a base branch
from

Conversation

shuoyangd
Copy link

Description

This PR adds support for parallel data curation. Namely:

  • A new dataset class ParallelDataset that supports loading and writing parallel data in simple bitext format.
  • A new ScoreFilter subclass ParallelScoreFilter that allows application of existing monolingual filters on parallel data while maintaining the alignment of sentence/document pairs.
  • A new ScoreFilter subclass JointScoreFilter that allows implementation of filters that takes both fields of the parallel sentence/document pairs.
  • New heuristic filters: HistogramFilter and LengthRatioFilter.
  • Adding model-based filters with quality estimation models: QualityEstimationFilter.
  • Support for two families of quality estimation models: comet and cometoid.
  • A tutorial for parallel data curation.
  • Tests accompanying new features.

Joint work at MTMA 2024 with @nverma1.

Usage

See tutorials/bitext_cleaning/main.py.

Checklist

  • I am familiar with the Contributing Guide.
  • New or Existing tests cover these changes.
  • The documentation is up to date with these changes.

shuoyangd and others added 30 commits August 8, 2024 14:28
Signed-off-by: Shuoyang Ding <[email protected]>
…set test, fix a few data and import bugs

Signed-off-by: Shuoyang Ding <[email protected]>
Signed-off-by: Shuoyang Ding <[email protected]>
Signed-off-by: Shuoyang Ding <[email protected]>
Signed-off-by: Shuoyang Ding <[email protected]>
Signed-off-by: Shuoyang Ding <[email protected]>
…ataset is sometimes turned into its parent class by mistake, add write to simple bitext functionality, update bitext tutorial

Signed-off-by: Shuoyang Ding <[email protected]>
…xtensions are removed twice before writing

Signed-off-by: Shuoyang Ding <[email protected]>
Signed-off-by: Shuoyang Ding <[email protected]>
Signed-off-by: Shuoyang Ding <[email protected]>
…ntScoreFilter can take more than one fields for source and target

Signed-off-by: Shuoyang Ding <[email protected]>
Signed-off-by: Shuoyang Ding <[email protected]>
Signed-off-by: Shuoyang Ding <[email protected]>
Signed-off-by: Shuoyang Ding <[email protected]>
@ryantwolf ryantwolf self-requested a review August 15, 2024 18:11
Copy link
Collaborator

@ryantwolf ryantwolf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Such good work! I have a few additional requests around documentation. Can you create one or more documentation pages about curating parallel datasets in our docs/user-guide/? You can see how it currently is rendered here. Also, please add all the classes and functions you expect a user to use in our API reference

nemo_curator/datasets/__init__.py Outdated Show resolved Hide resolved
nemo_curator/datasets/doc_dataset.py Outdated Show resolved Hide resolved
nemo_curator/datasets/doc_dataset.py Outdated Show resolved Hide resolved
nemo_curator/datasets/doc_dataset.py Outdated Show resolved Hide resolved
nemo_curator/utils/distributed_utils.py Outdated Show resolved Hide resolved
nemo_curator/utils/file_utils.py Outdated Show resolved Hide resolved
)

for idx, (src_line, tgt_line) in enumerate(zip(open(src_file), open(tgt_file))):
assert ds.df["src"].compute()[idx] == src_line.rstrip("\n")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: instead of calling compute multiple times, call it once and then do all the assert statements.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interestingly, I get errors when I call it only once:

TypeError: Trying to convert dd.Scalar<series-..., dtype=bool> to a boolean value. Because Dask objects are lazily evaluated, they cannot be converted to a boolean value or used in boolean conditions like if statements. Try calling .compute() to force computation prior to converting to a boolean value or using in a conditional statement.

tutorials/bitext_cleaning/README.md Outdated Show resolved Hide resolved
tutorials/bitext_cleaning/main.py Outdated Show resolved Hide resolved
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahhh I love this! Thanks for adding a tutorial!

Copy link
Author

@shuoyangd shuoyangd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the very thorough review! I left a few questions and comments. mainly concerned about the refactoring you proposed. Meanwhile I'll proceed with the lower-level changes.

nemo_curator/filters/heuristic_filter.py Outdated Show resolved Hide resolved
nemo_curator/modules/filter.py Show resolved Hide resolved
nemo_curator/utils/distributed_utils.py Outdated Show resolved Hide resolved
…accomodate custom field names, pause doc repartition since it causes problems

Signed-off-by: Shuoyang Ding <[email protected]>
…tern, test currently failing

Signed-off-by: Shuoyang Ding <[email protected]>
Signed-off-by: Shuoyang Ding <[email protected]>
Copy link
Collaborator

@ryantwolf ryantwolf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool, just a couple of nits. I'll set up a dedicated call to walk through some of the model stuff though, I think that might need a bit more work.

@@ -47,17 +47,7 @@
"import_downloader",
"import_extractor",
"import_iterator",
"download_common_crawl",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume this was modified by mistake, so probably revert it unless there's something I'm missing.

def _single_partition_write_to_simple_bitext(out_df, output_file_path):
src_output_file_path = output_file_path + f".{out_df['src_lang'].iloc[0]}"
tgt_output_file_path = output_file_path + f".{out_df['tgt_lang'].iloc[0]}"
with open(src_output_file_path, "w+") as src_out, open(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just checking, has this been tested with multiple partitions? I just want to ensure no race condition happens when you have multiple workers writing to the same file.

def __call__(
self,
dataset: ParallelDataset,
metadata_field_name_mapping: Dict[str, str] = {},
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Canmetadata_field_name_mapping be moved to the __init__ method? I want to have the __call__ method be exclusively for the dataset so it's super easy to chain the filters together with nemo_curator.Sequential.

return self._score_bitext(**kwargs)

@abstractmethod
def _score_bitext(self, src, tgt, **kwargs):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: To keep with the convention of DocumentFilter, have rename these methods to score_bitext and keep_bitext.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants