Add support for parallel data curation #193

Open. Wants to merge 50 commits into base `main`.

Commits (50):
c7a6423
add data interface to read simple bitext
shuoyangd Jul 30, 2024
4b3dc97
adding ParallelScoreFilter
nverma1 Jul 30, 2024
114716e
add test for ParallelScoreFilter, small style change for ParallelData…
shuoyangd Jul 31, 2024
cbab143
allow ParallelScoreFilter to take different filters for source and ta…
shuoyangd Jul 31, 2024
82f5486
add JointScoreFilter and LengthRatioFilter
nverma1 Jul 31, 2024
f9a0535
[WIP] add heuristic filter w/o test
shuoyangd Jul 31, 2024
8f25988
merge with main
shuoyangd Jul 31, 2024
612249c
add test for histogram filter, fix a few bugs
shuoyangd Jul 31, 2024
2fe4973
length ratio, joint score filter testing
nverma1 Jul 31, 2024
b61d7f1
fix typing in joint test
nverma1 Jul 31, 2024
f63a1f9
add a fake comet qe filter as an initial step
shuoyangd Aug 1, 2024
76bced7
[WIP] adding bitext cleaning tutorial
nverma1 Aug 1, 2024
1a2bb1e
[WIP] fixing example
nverma1 Aug 2, 2024
74698d5
fix slow histogram filter, fix faulty bitext loading
shuoyangd Aug 2, 2024
bf2e6ac
tutorial running
nverma1 Aug 2, 2024
62d1242
[WIP] documentation of bitext tutorial
nverma1 Aug 2, 2024
c413ea2
add tested version of comet-qe filter
shuoyangd Aug 2, 2024
5a90038
fix ParallelDataset bug where single file name is not accepted, and d…
shuoyangd Aug 5, 2024
f8046dd
add docstring to explain simple bitext format, fix a bug where file e…
shuoyangd Aug 5, 2024
6c7aea4
remove print line for debug
shuoyangd Aug 5, 2024
a457995
add comet filter to tutorial
shuoyangd Aug 5, 2024
c5a6f1c
refactor COMET QE filter to decouple model from filter, make sure Joi…
shuoyangd Aug 5, 2024
61713e4
use refactored qe filter
shuoyangd Aug 5, 2024
a4d2bb3
wrap_qe_input should be a static method
shuoyangd Aug 5, 2024
0674400
use conditional import for comet, formatting changes
shuoyangd Aug 6, 2024
6936f9a
[WIP] add cometoid
shuoyangd Aug 6, 2024
da96d29
[WIP] attempt to resolve device conflict but is failing
shuoyangd Aug 7, 2024
14b7d70
[WIP] playing with cometoid arguments
shuoyangd Aug 7, 2024
b02b56d
[WIP] -d 0 doesn't look necessary
shuoyangd Aug 7, 2024
6c1e719
tested arguments for Cometoid
shuoyangd Aug 8, 2024
70a7fe8
use proper safe import, make sure test doesn't crash sans comet/pymarian
shuoyangd Aug 8, 2024
c66d7f9
falling back to comet for tutorial since that's easier to set up, upp…
shuoyangd Aug 8, 2024
861bd4d
give credit to original fairseq implementation of histogram filtering…
shuoyangd Aug 8, 2024
52ba08e
fix pre-commit complaint
shuoyangd Aug 8, 2024
62c254b
fix small bug
shuoyangd Aug 11, 2024
91ea9fa
fix another occurrence of the same bug
shuoyangd Aug 13, 2024
12783ec
introduce shard limit to a single PyMarian API call to avoid memory l…
shuoyangd Aug 13, 2024
a65588a
repartition after reading simple bitext data
shuoyangd Aug 16, 2024
3f1d09b
-d 0 is actually needed for pymarian
shuoyangd Aug 16, 2024
102429a
remove duplicate LengthRatioFilter definition
shuoyangd Sep 5, 2024
8a367dd
refactor repeated code segment in file writing, change classifier to …
shuoyangd Sep 20, 2024
396d7ba
[WIP] addressed comments in #193 apart from resolving .iloc pattern, …
shuoyangd Sep 20, 2024
eb4f4df
refactor to resolve .loc pattern, test passing
shuoyangd Oct 1, 2024
3addf44
add missing file
shuoyangd Oct 1, 2024
a14a78a
revert changes in setup.py
shuoyangd Oct 1, 2024
6b8dfa0
fix a small bug in parallel dataset, explain why repartition is disab…
shuoyangd Oct 1, 2024
bb4f148
add api guide, small change on bitext/parallel score filter docstring
shuoyangd Oct 1, 2024
d309744
fix read_simple_bitext test issues
shuoyangd Oct 1, 2024
21676bd
Merge branch 'main' into main
shuoyangd Oct 1, 2024
7797925
reinstate dependencies lost during merging
shuoyangd Oct 2, 2024
5 changes: 4 additions & 1 deletion docs/user-guide/api/datasets.rst
@@ -7,4 +7,7 @@ DocumentDataset
-------------------

.. autoclass:: nemo_curator.datasets.DocumentDataset
:members:
:members:

.. autoclass:: nemo_curator.datasets.ParallelDataset
:members:
20 changes: 20 additions & 0 deletions docs/user-guide/api/filters.rst
@@ -10,6 +10,10 @@ Base Class
:members:
:member-order: bysource

.. autoclass:: nemo_curator.filters.BitextFilter
:members:
:member-order: bysource

.. autofunction:: nemo_curator.filters.import_filter

------------------------------
@@ -40,6 +44,14 @@ FastText Filters
:members:
:member-order: bysource

------------------------------
Quality Estimation Filters
------------------------------

.. autoclass:: nemo_curator.filters.QualityEstimationFilter
:members:
:member-order: bysource

------------------------------
Heuristic Filters
------------------------------
@@ -132,6 +144,14 @@ Heuristic Filters
:members:
:member-order: bysource

.. autoclass:: nemo_curator.filters.HistogramFilter
:members:
:member-order: bysource

.. autoclass:: nemo_curator.filters.LengthRatioFilter
:members:
:member-order: bysource

------------------------------
Code Filters
------------------------------
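The `LengthRatioFilter` registered above implements a standard bitext-cleaning heuristic: drop segment pairs whose source and target lengths are wildly mismatched, since those are rarely true translations. A self-contained sketch of the underlying idea (illustrative names and threshold, not the library's API):

```python
# Illustrative length-ratio check for bitext cleaning (hypothetical helper,
# not NeMo Curator's LengthRatioFilter implementation).
def length_ratio_ok(src: str, tgt: str, max_ratio: float = 3.0) -> bool:
    """Return True if the longer side has at most max_ratio times
    as many whitespace tokens as the shorter side."""
    src_len = max(len(src.split()), 1)
    tgt_len = max(len(tgt.split()), 1)
    ratio = max(src_len, tgt_len) / min(src_len, tgt_len)
    return ratio <= max_ratio

pairs = [
    ("Das Haus ist groß.", "The house is big."),           # balanced: kept
    ("Ja.", "Yes, absolutely, without any doubt at all."), # lopsided: dropped
]
kept = [p for p in pairs if length_ratio_ok(*p)]
```

In practice the threshold is tuned per language pair; token-based ratios are a coarse but cheap first pass before model-based quality estimation.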
3 changes: 2 additions & 1 deletion nemo_curator/datasets/__init__.py
@@ -15,9 +15,10 @@
from nemo_curator.utils.import_utils import image_only_import_from

from .doc_dataset import DocumentDataset
from .parallel_dataset import ParallelDataset

ImageTextPairDataset = image_only_import_from(
"nemo_curator.datasets.image_text_pair_dataset", "ImageTextPairDataset"
)

__all__ = ["DocumentDataset", "ImageTextPairDataset"]
__all__ = ["DocumentDataset", "ImageTextPairDataset", "ParallelDataset"]
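The `image_only_import_from` call above (and the `gpu_only_import` used in `parallel_dataset.py`) defers optional heavy dependencies. A minimal stdlib sketch of this optional-import pattern (an illustrative reimplementation, not NeMo Curator's actual helper):

```python
# Optional-import pattern: return the real module when available, otherwise a
# placeholder that raises ImportError only when the feature is actually used.
import importlib


def optional_import(module_name: str):
    try:
        return importlib.import_module(module_name)
    except ImportError:
        class _MissingModule:
            def __getattr__(self, name):
                raise ImportError(
                    f"{module_name} is required for this feature "
                    "but is not installed"
                )
        return _MissingModule()


# On a CPU-only install this binds a placeholder instead of failing at import time.
cudf = optional_import("cudf")
```

The benefit is that importing the package never fails outright; the error surfaces lazily, with a clear message, at the first attribute access on the missing module.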
165 changes: 165 additions & 0 deletions nemo_curator/datasets/parallel_dataset.py
@@ -0,0 +1,165 @@
import csv
from typing import List, Optional, Tuple, Union

import dask.dataframe as dd
import pandas as pd

from nemo_curator.datasets.doc_dataset import DocumentDataset
from nemo_curator.utils.distributed_utils import write_to_disk
from nemo_curator.utils.file_utils import remove_path_extension
from nemo_curator.utils.import_utils import gpu_only_import

cudf = gpu_only_import("cudf")
dask_cudf = gpu_only_import("dask_cudf")


class ParallelDataset(DocumentDataset):
"""
An extension of the standard `DocumentDataset` with a special method that loads simple bitext.

For data with more complicated metadata, please convert your data into jsonl/parquet/pickle format
and use interfaces defined in `DocumentDataset`.
"""

def persist(self):
return ParallelDataset(self.df.persist())

@classmethod
def read_simple_bitext(
cls,
src_input_files: Union[str, List[str]],
tgt_input_files: Union[str, List[str]],
src_lang: str,
tgt_lang: str,
backend: str = "pandas",
add_filename: bool = False,
partition_size: Optional[Union[int, str]] = "100MB",
):
"""See `read_single_simple_bitext_file_pair` docstring for what "simple_bitext" means and usage of other parameters.

Args:
src_input_files (Union[str, List[str]]): one or several input files, in source language
tgt_input_files (Union[str, List[str]]): one or several input files, in target language

Raises:
TypeError: If the types of `src_input_files` and `tgt_input_files` do not agree.

Returns:
ParallelDataset: A `ParallelDataset` object with `self.df` holding the ingested simple bitext.
"""

if isinstance(src_input_files, str) and isinstance(tgt_input_files, str):
src_input_files = [src_input_files]
tgt_input_files = [tgt_input_files]
elif not isinstance(src_input_files, list) or not isinstance(
tgt_input_files, list
):
raise TypeError("Both file inputs must be strings or lists.")

# TODO: use default doc id for now
# but it might be useful to allow customizing doc id by passing a prefix
df = dd.from_map(
ParallelDataset.read_single_simple_bitext_file_pair,
list(zip(src_input_files, tgt_input_files)),
src_lang=src_lang,
tgt_lang=tgt_lang,
backend=backend,
add_filename=add_filename,
)

# TODO: Currently a pair of simple bitext files will be loaded into a single partition,
# which means filtering won't be parallelized.
# Presumably, the solution is to repartition the dataset after loading,
# but this introduces problems when running with slurm, so we table this for now.
# if partition_size:
# df = df.repartition(partition_size=partition_size)
return cls(df)

def to_bitext(
self,
output_file_dir,
write_to_filename=False,
):
"""See `nemo_curator.utils.distributed_utils.write_to_disk` docstring for parameter usage."""
write_to_disk(
df=self.df,
output_file_dir=output_file_dir,
write_to_filename=write_to_filename,
output_type="bitext",
)

@staticmethod
def read_single_simple_bitext_file_pair(
input_file_pair: Tuple[str],
src_lang: str,
tgt_lang: str,
doc_id: str = None,
backend: str = "cudf",
add_filename: bool = False,
) -> Union[dd.DataFrame, dask_cudf.DataFrame]:
"""This function reads a pair of "simple bitext" files into a pandas DataFrame.
A simple bitext is a commonly data format in machine translation.
It consists of two plain text files with the same number of lines, each line pair being translations of each other. For example:

data.de:

```
Wir besitzen keine Reisetaschen aus Leder.
Die Firma produziert Computer für den deutschen Markt.
...
```

data.en:

```
We don't own duffel bags made of leather.
The company produces computers for the German market.
...
```

For simplicity, we also assume that the names of the two text files share the same prefix and differ only in the language code used as the file extension.

Args:
input_file_pair (Tuple[str]): A pair of file paths pointing to the input files
src_lang (str): Source language, in ISO-639-1 (two character) format (e.g. 'en')
tgt_lang (str): Target language, in ISO-639-1 (two character) format (e.g. 'de')
doc_id (str, optional): A string document id to assign to every segment in the file. Defaults to None.
backend (str, optional): Backend of the data frame. Defaults to "cudf".
add_filename (bool, optional): Add filename as an extra field to every segment in the file. Defaults to False.

Returns:
Union[dd.DataFrame, dask_cudf.DataFrame]
"""
src_input_file, tgt_input_file = input_file_pair
assert remove_path_extension(src_input_file) == remove_path_extension(
tgt_input_file
), f"Assuming source and target filenames would have common prefix before language code, but got {src_input_file} and {tgt_input_file}."

if not doc_id:
doc_id = "▁".join([src_input_file, tgt_input_file])

# TODO: it seems like cudf.read_table can only take one file max
# so maybe we shouldn't pass more than one
if backend == "cudf":
df = cudf
else:
df = pd

df_src = df.read_csv(
src_input_file, names=["src"], sep="\t", quoting=csv.QUOTE_NONE
)
df_tgt = df.read_csv(
tgt_input_file, names=["tgt"], sep="\t", quoting=csv.QUOTE_NONE
)
assert len(df_src) == len(
df_tgt
), f"We assume the source and target file would have the same number of lines, but got {len(df_src)} and {len(df_tgt)}."
df_combined = df.concat([df_src, df_tgt], axis=1)
df_combined["doc_id"] = doc_id
df_combined["src_lang"] = src_lang
df_combined["tgt_lang"] = tgt_lang

if add_filename:
df_combined["filename"] = remove_path_extension(src_input_file)

return df_combined
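To make the simple-bitext format concrete, here is a stdlib-only sketch of the pairing logic that `read_single_simple_bitext_file_pair` implements (hypothetical helper name and return type; the real method returns a pandas/cuDF DataFrame rather than a list of dicts):

```python
# Stdlib-only sketch of simple-bitext pairing: two line-aligned plain-text
# files are zipped into per-segment records. Illustrative only.
import os
import tempfile


def read_simple_bitext_pair(src_path, tgt_path, src_lang, tgt_lang):
    """Pair up line-aligned source/target files into row dicts."""
    with open(src_path, encoding="utf-8") as f:
        src_lines = f.read().splitlines()
    with open(tgt_path, encoding="utf-8") as f:
        tgt_lines = f.read().splitlines()
    # Simple bitext assumes the two files have the same number of lines.
    assert len(src_lines) == len(tgt_lines), "source/target line counts differ"
    # Default doc id mirrors the diff: both paths joined with U+2581.
    doc_id = "▁".join([src_path, tgt_path])
    return [
        {"src": s, "tgt": t, "doc_id": doc_id,
         "src_lang": src_lang, "tgt_lang": tgt_lang}
        for s, t in zip(src_lines, tgt_lines)
    ]


# Tiny demo using the German/English example from the docstring above.
tmp = tempfile.mkdtemp()
src_file = os.path.join(tmp, "data.de")
tgt_file = os.path.join(tmp, "data.en")
with open(src_file, "w", encoding="utf-8") as f:
    f.write("Wir besitzen keine Reisetaschen aus Leder.\n")
with open(tgt_file, "w", encoding="utf-8") as f:
    f.write("We don't own duffel bags made of leather.\n")

rows = read_simple_bitext_pair(src_file, tgt_file, "de", "en")
```

Each record carries the segment pair plus `doc_id`, `src_lang`, and `tgt_lang`, matching the columns the diff adds to `df_combined`.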
10 changes: 0 additions & 10 deletions nemo_curator/download/__init__.py
@@ -47,17 +47,7 @@
"import_downloader",
"import_extractor",
"import_iterator",
"download_common_crawl",
[Collaborator review comment: I assume this was modified by mistake, so probably revert it unless there's something I'm missing.]
"CommonCrawlWARCDownloader",
"CommonCrawlWARCExtractor",
"CommonCrawlWARCIterator",
"CommonCrawlWARCDownloaderExtractOnly",
"JusTextExtractor",
"ResiliparseExtractor",
"download_wikipedia",
"WikipediaDownloader",
"WikipediaIterator",
"WikipediaExtractor",
"batch_download",
"download_arxiv",
"ArxivDownloader",
13 changes: 12 additions & 1 deletion nemo_curator/filters/__init__.py
@@ -12,7 +12,12 @@
# See the License for the specific language governing permissions and
# limitations under the License.

from .classifier_filter import FastTextLangId, FastTextQualityFilter
from .bitext_filter import BitextFilter
from .classifier_filter import (
FastTextLangId,
FastTextQualityFilter,
QualityEstimationFilter,
)
from .code import (
AlphaFilter,
GeneralCommentToCodeFilter,
@@ -29,6 +34,8 @@
BulletsFilter,
CommonEnglishWordsFilter,
EllipsisFilter,
HistogramFilter,
LengthRatioFilter,
LongWordFilter,
MeanWordLengthFilter,
NonAlphaNumericFilter,
@@ -50,6 +57,7 @@
)

__all__ = [
"BitextFilter",
"DocumentFilter",
"import_filter",
"FastTextLangId",
@@ -84,4 +92,7 @@
"AlphaFilter",
"HTMLBoilerplateFilter",
"PerExtensionFilter",
"LengthRatioFilter",
"HistogramFilter",
"QualityEstimationFilter",
]