Skip reading files with incorrect extension #318

Open · wants to merge 13 commits into main
Conversation

sarahyurick (Collaborator):
Closes #214.

ayushdg (Collaborator) commented Oct 22, 2024:
We might need to expand the list of extensions, since some files are formatted like .json.gz.
I wonder if an alternative could be to expand get_all_files_paths_under to also filter on an extension. That way, users can specify which extension they want to filter on.
I'm hoping that #50 will make things easier in this regard.
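
For context on why compound extensions are tricky: Python's os.path.splitext only strips the final suffix, so a .json.gz file is reported as .gz. A quick illustration:

import os

# os.path.splitext splits off only the last suffix, so compound
# extensions like ".json.gz" are reported as ".gz".
print(os.path.splitext("data/file.json.gz"))  # ('data/file.json', '.gz')
print(os.path.splitext("data/file.jsonl"))    # ('data/file', '.jsonl')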

sarahyurick (Collaborator, Author) commented Oct 23, 2024:

Thanks @ayushdg! I like your idea of having it in get_all_files_paths_under, so I changed it to use that instead.

Also, I agree with you about .json.gz. I think it is outside the scope of this PR, but I have added it to #50 for now.


input_extensions = {os.path.splitext(f)[-1] for f in input_files}
if len(input_extensions) != 1:
    raise RuntimeError(...)
sarahyurick (Collaborator, Author):
An example of when we would expect this RuntimeError:

doc = DocumentDataset.read_json(in_files)

where in_files is a string path to a directory containing multiple JSONL files and a CRC file. Since the CRC file is not explicitly filtered out, we raise the error.
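
A minimal sketch of that failure mode, using hypothetical file names for illustration:

import os

# Hypothetical directory listing: JSONL shards plus a stray CRC checksum file.
input_files = ["data/part-0.jsonl", "data/part-1.jsonl", "data/.part-0.jsonl.crc"]

input_extensions = {os.path.splitext(f)[-1] for f in input_files}
print(input_extensions)  # {'.jsonl', '.crc'}

# len(input_extensions) != 1, so the check above raises the RuntimeError.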

Collaborator:

We can leave this as is for now.
In theory, there might be cases where a user filters by [.json, .jsonl] using the file filter, which would then raise an error here. In practice, I expect that to be unlikely, so we can wait and see if there is any user feedback around this.

def get_all_files_paths_under(
    root: str,
    recurse_subdirectories: bool = True,
    followlinks: bool = False,
    filter_by: Optional[Union[str, List[str]]] = None,
sarahyurick (Collaborator, Author):
All of these examples work:
(1)

input_files = get_all_files_paths_under(in_files, filter_by="jsonl")
input_dataset = DocumentDataset.read_json(input_files)

(2)

input_files = get_all_files_paths_under(in_files, filter_by=["jsonl"])
input_dataset = DocumentDataset.read_json(input_files)

(3)

# Returns a list containing only .jsonl, .parquet, and .csv files
input_files = get_all_files_paths_under(in_files, filter_by=["jsonl", "parquet", "csv"])
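
For reference, a minimal sketch of how a filter_by argument could be applied inside such a helper. The actual implementation in nemo_curator/utils/file_utils.py may differ; the _sketch suffix marks this function as illustrative, not the library's API:

import os
from typing import List, Optional, Union

def get_all_files_paths_under_sketch(
    root: str,
    recurse_subdirectories: bool = True,
    followlinks: bool = False,
    filter_by: Optional[Union[str, List[str]]] = None,
) -> List[str]:
    # Collect file paths, optionally recursing into subdirectories.
    if recurse_subdirectories:
        paths = [
            os.path.join(dirpath, name)
            for dirpath, _, filenames in os.walk(root, followlinks=followlinks)
            for name in filenames
        ]
    else:
        paths = [
            os.path.join(root, name)
            for name in os.listdir(root)
            if os.path.isfile(os.path.join(root, name))
        ]

    if filter_by is not None:
        # Accept a single extension or a list, with or without a leading dot.
        extensions = [filter_by] if isinstance(filter_by, str) else filter_by
        normalized = tuple(
            ext if ext.startswith(".") else f".{ext}" for ext in extensions
        )
        paths = [p for p in paths if p.endswith(normalized)]

    return sorted(paths)

# Example: get_all_files_paths_under_sketch("data/", filter_by=["jsonl", "parquet"])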

ayushdg (Collaborator) left a comment:

Thanks for the changes. Overall the changes LGTM! Minor nits/comments.

As a follow-up, it might make sense to track updating the tutorials/notebooks to use this newer filter argument in the API, but that's not required for this PR.

Resolved review threads (outdated): nemo_curator/datasets/doc_dataset.py, nemo_curator/utils/file_utils.py
if file.endswith(tuple(file_extensions)):
    filtered_files.append(file)
else:
    warnings.warn(f"Skipping read for file: {file}")
Collaborator:
I wonder if this might get too noisy in some cases. I'm leaning towards warning once if we have to skip, but not for every file we skip.
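
One way to do that, sketched against the snippet above (assuming the surrounding loop iterates over a files list and that file_extensions is already defined):

import warnings

filtered_files, skipped_files = [], []
for file in files:
    if file.endswith(tuple(file_extensions)):
        filtered_files.append(file)
    else:
        skipped_files.append(file)

# Warn once with a summary instead of once per skipped file.
if skipped_files:
    warnings.warn(
        f"Skipped reading {len(skipped_files)} file(s) with unexpected extensions: {skipped_files}"
    )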


sarahyurick (Collaborator, Author):

Thanks @ayushdg! Updated.

sarahyurick (Collaborator, Author):

Thank you @praateekmahajan! I have addressed all your comments.

praateekmahajan (Collaborator) left a comment:

LGTM! Thanks for adding the type hints as well. (Left two small nits on a comment/type hint.)

Successfully merging this pull request may close: DocumentDataset read errors when other files are present in directory (#214).