Add support for parallel data curation #193
Open: shuoyangd wants to merge 50 commits into NVIDIA:main from shuoyangd:main
Commits:

- c7a6423 add data interface to read simple bitext (shuoyangd)
- 4b3dc97 adding ParallelScoreFilter (nverma1)
- 114716e add test for ParallelScoreFilter, small style change for ParallelData… (shuoyangd)
- cbab143 allow ParallelScoreFilter to take different filters for source and ta… (shuoyangd)
- 82f5486 add JointScoreFilter and LengthRatioFilter (nverma1)
- f9a0535 [WIP] add heuristic filter w/o test (shuoyangd)
- 8f25988 merge with main (shuoyangd)
- 612249c add test for histogram filter, fix a few bugs (shuoyangd)
- 2fe4973 length ratio, joint score filter testing (nverma1)
- b61d7f1 fix typing in joint test (nverma1)
- f63a1f9 add a fake comet qe filter as an initial step (shuoyangd)
- 76bced7 [WIP] adding bitext cleaning tutorial (nverma1)
- 1a2bb1e [WIP] fixing example (nverma1)
- 74698d5 fix slow histogram filter, fix faulty bitext loading (shuoyangd)
- bf2e6ac tutorial running (nverma1)
- 62d1242 [WIP] documentation of bitext tutorial (nverma1)
- c413ea2 add tested version of comet-qe filter (shuoyangd)
- 5a90038 fix ParallelDataset bug where single file name is not accepted, and d… (shuoyangd)
- f8046dd add docstring to explain simple bitext format, fix a bug where file e… (shuoyangd)
- 6c7aea4 remove print line for debug (shuoyangd)
- a457995 add comet filter to tutorial (shuoyangd)
- c5a6f1c refactor COMET QE filter to decouple model from filter, make sure Joi… (shuoyangd)
- 61713e4 use refactored qe filter (shuoyangd)
- a4d2bb3 wrap_qe_input should be a static method (shuoyangd)
- 0674400 use conditional import for comet, formatting changes (shuoyangd)
- 6936f9a [WIP] add cometoid (shuoyangd)
- da96d29 [WIP] attempt to resolve device conflict but is failing (shuoyangd)
- 14b7d70 [WIP] playing with cometoid arguments (shuoyangd)
- b02b56d [WIP] -d 0 doesn't look necessary (shuoyangd)
- 6c1e719 tested arguments for Cometoid (shuoyangd)
- 70a7fe8 use proper safe import, make sure test doesn't crash sans comet/pymarian (shuoyangd)
- c66d7f9 falling back to comet for tutorial since that's easier to set up, upp… (shuoyangd)
- 861bd4d give credit to original fairseq implementation of histogram filtering… (shuoyangd)
- 52ba08e fix pre-commit complaint (shuoyangd)
- 62c254b fix small bug (shuoyangd)
- 91ea9fa fix another occurrence of the same bug (shuoyangd)
- 12783ec introduce shard limit to a single PyMarian API call to avoid memory l… (shuoyangd)
- a65588a repartition after reading simple bitext data (shuoyangd)
- 3f1d09b -d 0 is actually needed for pymarian (shuoyangd)
- 102429a remove duplicate LengthRatioFilter definition (shuoyangd)
- 8a367dd refactor repeated code segment in file writing, change classifier to … (shuoyangd)
- 396d7ba [WIP] addressed comments in #193 apart from resolving .iloc pattern, … (shuoyangd)
- eb4f4df refactor to resolve .loc pattern, test passing (shuoyangd)
- 3addf44 add missing file (shuoyangd)
- a14a78a revert changes in setup.py (shuoyangd)
- 6b8dfa0 fix a small bug in parallel dataset, explain why repartition is disab… (shuoyangd)
- bb4f148 add api guide, small change on bitext/parallel score filter docstring (shuoyangd)
- d309744 fix read_simple_bitext test issues (shuoyangd)
- 21676bd Merge branch 'main' into main (shuoyangd)
- 7797925 reinstate dependencies lost during merging (shuoyangd)
import csv
from typing import List, Optional, Tuple, Union

import dask.dataframe as dd
import pandas as pd

from nemo_curator.datasets.doc_dataset import DocumentDataset
from nemo_curator.utils.distributed_utils import write_to_disk
from nemo_curator.utils.file_utils import remove_path_extension
from nemo_curator.utils.import_utils import gpu_only_import

cudf = gpu_only_import("cudf")
dask_cudf = gpu_only_import("dask_cudf")


class ParallelDataset(DocumentDataset):
    """
    An extension of the standard `DocumentDataset` with a special method that loads simple bitext.

    For data with more complicated metadata, please convert your data into jsonl/parquet/pickle format
    and use the interfaces defined in `DocumentDataset`.
    """

    def persist(self):
        return ParallelDataset(self.df.persist())

    @classmethod
    def read_simple_bitext(
        cls,
        src_input_files: Union[str, List[str]],
        tgt_input_files: Union[str, List[str]],
        src_lang: str,
        tgt_lang: str,
        backend: str = "pandas",
        add_filename: bool = False,
        partition_size: Optional[Union[int, str]] = "100MB",
    ):
        """See the `read_single_simple_bitext_file_pair` docstring for what "simple bitext" means and for the usage of the other parameters.

        Args:
            src_input_files (Union[str, List[str]]): One or several input files, in the source language.
            tgt_input_files (Union[str, List[str]]): One or several input files, in the target language.

        Raises:
            TypeError: If the types of `src_input_files` and `tgt_input_files` don't agree.

        Returns:
            ParallelDataset: A `ParallelDataset` object with `self.df` holding the ingested simple bitext.
        """
        if isinstance(src_input_files, str) and isinstance(tgt_input_files, str):
            src_input_files = [src_input_files]
            tgt_input_files = [tgt_input_files]
        elif not isinstance(src_input_files, list) or not isinstance(
            tgt_input_files, list
        ):
            raise TypeError("Both file inputs must be strings or lists.")

        # TODO: use the default doc id for now,
        # but it might be useful to allow customizing the doc id by passing a prefix
        df = dd.from_map(
            ParallelDataset.read_single_simple_bitext_file_pair,
            list(zip(src_input_files, tgt_input_files)),
            src_lang=src_lang,
            tgt_lang=tgt_lang,
            backend=backend,
            add_filename=add_filename,
        )

        # TODO: Currently a pair of simple bitext files is loaded into a single partition,
        # which means filtering won't be parallelized.
        # Presumably, the solution is to repartition the dataset after loading,
        # but this introduces problems when running with Slurm, so we table this for now.
        # if partition_size:
        #     df = df.repartition(partition_size=partition_size)
        return cls(df)

    def to_bitext(
        self,
        output_file_dir,
        write_to_filename=False,
    ):
        """See the `nemo_curator.utils.distributed_utils.write_to_disk` docstring for parameter usage."""
        write_to_disk(
            df=self.df,
            output_file_dir=output_file_dir,
            write_to_filename=write_to_filename,
            output_type="bitext",
        )

    @staticmethod
    def read_single_simple_bitext_file_pair(
        input_file_pair: Tuple[str, str],
        src_lang: str,
        tgt_lang: str,
        doc_id: Optional[str] = None,
        backend: str = "cudf",
        add_filename: bool = False,
    ) -> Union[dd.DataFrame, dask_cudf.DataFrame]:
        """Reads a pair of "simple bitext" files into a DataFrame.
        Simple bitext is a common data format in machine translation.
        It consists of two plain text files with the same number of lines, where each line pair is a translation of the other. For example:

        data.de:

        ```
        Wir besitzen keine Reisetaschen aus Leder.
        Die Firma produziert Computer für den deutschen Markt.
        ...
        ```

        data.en:

        ```
        We don't own duffel bags made of leather.
        The company produces computers for the German market.
        ...
        ```

        For simplicity, we also assume that the names of the two text files share the same prefix and differ only in the language code used as the file extension.

        Args:
            input_file_pair (Tuple[str, str]): A pair of file paths pointing to the input files.
            src_lang (str): Source language, in ISO 639-1 (two-character) format (e.g. 'de').
            tgt_lang (str): Target language, in ISO 639-1 (two-character) format (e.g. 'en').
            doc_id (Optional[str]): A string document id to assign to every segment in the file. Defaults to None.
            backend (str): Backend of the data frame. Defaults to "cudf".
            add_filename (bool): Add the filename as an extra field to every segment in the file. Defaults to False.

        Returns:
            Union[dd.DataFrame, dask_cudf.DataFrame]
        """
        src_input_file, tgt_input_file = input_file_pair
        assert remove_path_extension(src_input_file) == remove_path_extension(
            tgt_input_file
        ), f"Assuming source and target filenames share a common prefix before the language code, but got {src_input_file} and {tgt_input_file}."

        if not doc_id:
            doc_id = "▁".join([src_input_file, tgt_input_file])

        # TODO: it seems like cudf.read_table can only take one file at most,
        # so maybe we shouldn't pass more than one
        if backend == "cudf":
            df = cudf
        else:
            df = pd

        df_src = df.read_csv(
            src_input_file, names=["src"], sep="\t", quoting=csv.QUOTE_NONE
        )
        df_tgt = df.read_csv(
            tgt_input_file, names=["tgt"], sep="\t", quoting=csv.QUOTE_NONE
        )
        assert len(df_src) == len(
            df_tgt
        ), f"We assume the source and target files have the same number of lines, but got {len(df_src)} and {len(df_tgt)}."
        df_combined = df.concat([df_src, df_tgt], axis=1)
        df_combined["doc_id"] = doc_id
        df_combined["src_lang"] = src_lang
        df_combined["tgt_lang"] = tgt_lang

        if add_filename:
            df_combined["filename"] = remove_path_extension(src_input_file)

        return df_combined
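The core of `read_single_simple_bitext_file_pair` can be sketched with plain pandas. This is a minimal illustration, not the NeMo Curator API: the `data.de`/`data.en` file names and sentences mirror the docstring example, and `os.path.splitext` stands in for the package's `remove_path_extension` helper.

```python
import csv
import os
import tempfile

import pandas as pd

# Hypothetical simple-bitext pair: same prefix, language code as extension.
tmpdir = tempfile.mkdtemp()
src_path = os.path.join(tmpdir, "data.de")
tgt_path = os.path.join(tmpdir, "data.en")

with open(src_path, "w", encoding="utf-8") as f:
    f.write("Wir besitzen keine Reisetaschen aus Leder.\n"
            "Die Firma produziert Computer für den deutschen Markt.\n")
with open(tgt_path, "w", encoding="utf-8") as f:
    f.write("We don't own duffel bags made of leather.\n"
            "The company produces computers for the German market.\n")

# Same common-prefix check as the assert in the method above,
# with os.path.splitext standing in for remove_path_extension.
assert os.path.splitext(src_path)[0] == os.path.splitext(tgt_path)[0]

# Read each side line by line; QUOTE_NONE keeps stray quote characters
# in the text from confusing the CSV parser.
df_src = pd.read_csv(src_path, names=["src"], sep="\t", quoting=csv.QUOTE_NONE)
df_tgt = pd.read_csv(tgt_path, names=["tgt"], sep="\t", quoting=csv.QUOTE_NONE)
assert len(df_src) == len(df_tgt), "line-aligned files must match in length"

# Join the two sides column-wise and attach the metadata fields.
df = pd.concat([df_src, df_tgt], axis=1)
df["doc_id"] = "▁".join([src_path, tgt_path])
df["src_lang"] = "de"
df["tgt_lang"] = "en"
print(df.shape)  # (2, 5)
```

The resulting frame has one row per sentence pair with `src`, `tgt`, `doc_id`, `src_lang`, and `tgt_lang` columns, which is the per-partition shape that `dd.from_map` then assembles into the full Dask DataFrame.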
Review comment: I assume this was modified by mistake, so probably revert it unless there's something I'm missing.