Add support for parallel data curation #193
base: main
Conversation
Signed-off-by: Shuoyang Ding <[email protected]>
…set test, fix a few data and import bugs Signed-off-by: Shuoyang Ding <[email protected]>
…rget Signed-off-by: Shuoyang Ding <[email protected]>
…ataset is sometimes turned into its parent class by mistake, add write to simple bitext functionality, update bitext tutorial Signed-off-by: Shuoyang Ding <[email protected]>
…xtensions are removed twice before writing Signed-off-by: Shuoyang Ding <[email protected]>
…ntScoreFilter can take more than one fields for source and target Signed-off-by: Shuoyang Ding <[email protected]>
…date README Signed-off-by: Shuoyang Ding <[email protected]>
…, run black formatter Signed-off-by: Shuoyang Ding <[email protected]>
…eakage Signed-off-by: Shuoyang Ding <[email protected]>
Such good work! I have a few additional requests around documentation. Can you create one or more documentation pages about curating parallel datasets in our docs/user-guide/? You can see how it is currently rendered here. Also, please add all the classes and functions you expect a user to use to our API reference.
for idx, (src_line, tgt_line) in enumerate(zip(open(src_file), open(tgt_file))):
    assert ds.df["src"].compute()[idx] == src_line.rstrip("\n")
Nit: instead of calling compute multiple times, call it once and then do all the assert statements.
Interestingly, I get errors when I call it only once:
TypeError: Trying to convert dd.Scalar<series-..., dtype=bool> to a boolean value. Because Dask objects are lazily evaluated, they cannot be converted to a boolean value or used in boolean conditions like if statements. Try calling .compute() to force computation prior to converting to a boolean value or using in a conditional statement.
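The compute-once pattern the reviewer suggests can be sketched as below. This is an illustrative stand-in, not the PR's code: a plain pandas Series plays the role of the result of `ds.df["src"].compute()`, and the point is to materialize the Dask column a single time and then run every assertion against the in-memory result.

```python
import pandas as pd


def lines_match_column(lines, computed: pd.Series) -> bool:
    # In real code, materialize the Dask column once:
    #     computed = ds.df["src"].compute()
    # then run all checks against the in-memory pandas Series instead of
    # calling .compute() inside the loop on every iteration.
    for idx, line in enumerate(lines):
        if computed.iloc[idx] != line.rstrip("\n"):
            return False
    return True


assert lines_match_column(["hello\n", "world\n"], pd.Series(["hello", "world"]))
```

Note the TypeError above arises from using a lazy Dask expression in a boolean context; materializing once up front sidesteps it entirely.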
Ahhh I love this! Thanks for adding a tutorial!
Thanks for the very thorough review! I left a few questions and comments; I'm mainly concerned about the refactoring you proposed. Meanwhile, I'll proceed with the lower-level changes.
Signed-off-by: Shuoyang Ding <[email protected]>
…accommodate custom field names, pause doc repartition since it causes problems Signed-off-by: Shuoyang Ding <[email protected]>
…tern, test currently failing Signed-off-by: Shuoyang Ding <[email protected]>
Signed-off-by: Shuoyang Ding <[email protected]>
Signed-off-by: Shuoyang Ding <[email protected]>
Signed-off-by: Shuoyang Ding <[email protected]>
…led, fix tutorial Signed-off-by: Shuoyang Ding <[email protected]>
Signed-off-by: Shuoyang Ding <[email protected]>
Signed-off-by: Shuoyang Ding <[email protected]>
Signed-off-by: Shuoyang Ding <[email protected]>
Signed-off-by: Shuoyang Ding <[email protected]>
Cool, just a couple of nits. I'll set up a dedicated call to walk through some of the model stuff though, I think that might need a bit more work.
@@ -47,17 +47,7 @@
    "import_downloader",
    "import_extractor",
    "import_iterator",
    "download_common_crawl",
I assume this was modified by mistake, so probably revert it unless there's something I'm missing.
def _single_partition_write_to_simple_bitext(out_df, output_file_path):
    src_output_file_path = output_file_path + f".{out_df['src_lang'].iloc[0]}"
    tgt_output_file_path = output_file_path + f".{out_df['tgt_lang'].iloc[0]}"
    with open(src_output_file_path, "w+") as src_out, open(
Just checking, has this been tested with multiple partitions? I just want to ensure no race condition happens when you have multiple workers writing to the same file.
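One common way to rule out such collisions is to make the output path unique per partition, e.g. by suffixing it with the partition index (in Dask this index can be obtained via `map_partitions(..., partition_info=...)`). The sketch below is a hypothetical simplification using plain pandas frames, not the PR's implementation:

```python
import os
import tempfile

import pandas as pd


def write_partition_to_bitext(part_df: pd.DataFrame, output_prefix: str,
                              partition_index: int):
    # Hypothetical sketch: suffix each output file with the partition index
    # so that two workers never open the same path concurrently.
    src_path = f"{output_prefix}.{part_df['src_lang'].iloc[0]}.{partition_index}"
    tgt_path = f"{output_prefix}.{part_df['tgt_lang'].iloc[0]}.{partition_index}"
    with open(src_path, "w") as src_out, open(tgt_path, "w") as tgt_out:
        for _, row in part_df.iterrows():
            src_out.write(row["src"] + "\n")
            tgt_out.write(row["tgt"] + "\n")
    return src_path, tgt_path


tmp = tempfile.mkdtemp()
part = pd.DataFrame({"src": ["hello"], "tgt": ["bonjour"],
                     "src_lang": ["en"], "tgt_lang": ["fr"]})
paths = [write_partition_to_bitext(part, os.path.join(tmp, "out"), i)
         for i in range(2)]
# every partition wrote to distinct files, so no two writers share a path
assert len({p for pair in paths for p in pair}) == 4
```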
def __call__(
    self,
    dataset: ParallelDataset,
    metadata_field_name_mapping: Dict[str, str] = {},
Nit: Can metadata_field_name_mapping be moved to the __init__ method? I want the __call__ method to be exclusively for the dataset, so it's super easy to chain the filters together with nemo_curator.Sequential.
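The motivation behind this request can be sketched with a toy pipeline (the names here are illustrative, not the actual nemo_curator API): when all configuration lives in `__init__` and `__call__` takes only the dataset, filter instances compose directly.

```python
class LengthRatioFilterSketch:
    # Illustrative filter: all configuration is captured at construction time...
    def __init__(self, max_ratio: float = 2.0):
        self.max_ratio = max_ratio

    # ...so __call__ takes only the dataset (here, a list of (src, tgt) pairs).
    def __call__(self, pairs):
        return [
            (s, t) for s, t in pairs
            if max(len(s), len(t)) / max(1, min(len(s), len(t))) <= self.max_ratio
        ]


class SequentialSketch:
    # Minimal stand-in for nemo_curator.Sequential: applies modules in order.
    def __init__(self, modules):
        self.modules = modules

    def __call__(self, dataset):
        for module in self.modules:
            dataset = module(dataset)
        return dataset


pipeline = SequentialSketch([LengthRatioFilterSketch(max_ratio=2.0)])
kept = pipeline([("hello", "bonjour"), ("hi", "a very long translation")])
assert kept == [("hello", "bonjour")]
```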
        return self._score_bitext(**kwargs)

    @abstractmethod
    def _score_bitext(self, src, tgt, **kwargs):
Nit: To keep with the convention of DocumentFilter, rename these methods to score_bitext and keep_bitext.
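The suggested naming convention could look roughly like the sketch below. This is a hypothetical illustration (BitextFilterSketch and LengthDiffFilter are made-up names), mirroring how DocumentFilter pairs a scoring method with a keep decision:

```python
from abc import ABC, abstractmethod


class BitextFilterSketch(ABC):
    # Hypothetical base class using the suggested public names
    # score_bitext / keep_bitext, by analogy with DocumentFilter.
    @abstractmethod
    def score_bitext(self, src: str, tgt: str) -> float:
        ...

    @abstractmethod
    def keep_bitext(self, score: float) -> bool:
        ...


class LengthDiffFilter(BitextFilterSketch):
    # Toy concrete filter: score is the absolute length difference,
    # and pairs are kept when the difference is small enough.
    def __init__(self, max_diff: int = 5):
        self.max_diff = max_diff

    def score_bitext(self, src: str, tgt: str) -> float:
        return abs(len(src) - len(tgt))

    def keep_bitext(self, score: float) -> bool:
        return score <= self.max_diff


f = LengthDiffFilter(max_diff=5)
assert f.keep_bitext(f.score_bitext("hello", "salut")) is True
```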
Description

This PR adds support for parallel data curation. Namely:

- A ParallelDataset class that supports loading and writing parallel data in simple bitext format.
- A ScoreFilter subclass ParallelScoreFilter that allows application of existing monolingual filters on parallel data while maintaining the alignment of sentence/document pairs.
- A ScoreFilter subclass JointScoreFilter that allows implementation of filters that take both fields of the parallel sentence/document pairs.
- New filters HistogramFilter and LengthRatioFilter.
- A quality estimation filter QualityEstimationFilter that supports comet and cometoid.

Joint work at MTMA 2024 with @nverma1.
Usage
See tutorials/bitext_cleaning/main.py.

Checklist