Skip reading files with incorrect extension #318

Open · wants to merge 13 commits into main
Conversation

sarahyurick (Collaborator):
Closes #214.

ayushdg (Collaborator) commented Oct 22, 2024:
We might need to expand the list of extensions, since some files are formatted like .json.gz.
I wonder if an alternative could be to expand get_all_files_paths_under to also filter on an extension. That way, users can specify which extension they want to filter on.
I'm hoping that #50 will make things easier in this regard.
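
For context on why compound extensions are tricky: Python's os.path.splitext only strips the final suffix, so a .json.gz file is reported as .gz. A quick illustration:

import os

# os.path.splitext splits off only the last suffix, so compound
# extensions like ".json.gz" are reported as ".gz".
print(os.path.splitext("data/file.json.gz"))  # ('data/file.json', '.gz')
print(os.path.splitext("data/file.jsonl"))    # ('data/file', '.jsonl')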

sarahyurick (Collaborator, Author) commented Oct 23, 2024:

Thanks @ayushdg! I like your idea of having it in get_all_files_paths_under, so I changed it to use that instead.

Also, I agree with you about .json.gz. I think it is outside the scope of this PR, but I have added it to #50 for now.


input_extensions = {os.path.splitext(f)[-1] for f in input_files}
if len(input_extensions) != 1:
    raise RuntimeError(...)
sarahyurick (Collaborator, Author):
An example of when we would expect this RuntimeError:

doc = DocumentDataset.read_json(in_files)

where in_files is a string path to a directory containing multiple JSONL files and a CRC file. Since the CRC file is not explicitly filtered out, we raise the error.
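
A minimal sketch of that failure mode, using hypothetical file names for illustration:

import os

# Hypothetical directory listing: JSONL shards plus a stray CRC checksum file.
input_files = ["data/part-0.jsonl", "data/part-1.jsonl", "data/.part-0.jsonl.crc"]

input_extensions = {os.path.splitext(f)[-1] for f in input_files}
print(input_extensions)  # {'.jsonl', '.crc'}

# len(input_extensions) != 1, so the check above raises the RuntimeError.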

Collaborator:

We can leave this as is for now.
In theory, there might be cases where a user filters by [.json, .jsonl] using the file filter, which would then raise an error here. In practice, I expect that to be unlikely, so we can wait and see if there is any user feedback around this.

def get_all_files_paths_under(
    root: str,
    recurse_subdirectories: bool = True,
    followlinks: bool = False,
    filter_by: Optional[Union[str, List[str]]] = None,
sarahyurick (Collaborator, Author):
All of these examples work:
(1)

input_files = get_all_files_paths_under(in_files, filter_by="jsonl")
input_dataset = DocumentDataset.read_json(input_files)

(2)

input_files = get_all_files_paths_under(in_files, filter_by=["jsonl"])
input_dataset = DocumentDataset.read_json(input_files)

(3)

# Returns a list containing only .jsonl, .parquet, and .csv files
input_files = get_all_files_paths_under(in_files, filter_by=["jsonl", "parquet", "csv"])
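
For reference, a minimal sketch of how a filter_by argument could be applied inside such a helper. The actual implementation in nemo_curator/utils/file_utils.py may differ; the _sketch suffix marks this function as illustrative, not the library's API:

import os
from typing import List, Optional, Union

def get_all_files_paths_under_sketch(
    root: str,
    recurse_subdirectories: bool = True,
    followlinks: bool = False,
    filter_by: Optional[Union[str, List[str]]] = None,
) -> List[str]:
    # Collect file paths, optionally recursing into subdirectories.
    if recurse_subdirectories:
        paths = [
            os.path.join(dirpath, name)
            for dirpath, _, filenames in os.walk(root, followlinks=followlinks)
            for name in filenames
        ]
    else:
        paths = [
            os.path.join(root, name)
            for name in os.listdir(root)
            if os.path.isfile(os.path.join(root, name))
        ]

    if filter_by is not None:
        # Accept a single extension or a list, with or without a leading dot.
        extensions = [filter_by] if isinstance(filter_by, str) else filter_by
        normalized = tuple(
            ext if ext.startswith(".") else f".{ext}" for ext in extensions
        )
        paths = [p for p in paths if p.endswith(normalized)]

    return sorted(paths)

# Example: get_all_files_paths_under_sketch("data/", filter_by=["jsonl", "parquet"])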

ayushdg (Collaborator) left a comment:

Thanks for the changes. Overall the changes LGTM! Minor nits/comments.

As a follow-up, it might make sense to track updating the tutorials/notebooks to use this newer filter argument in the API, but that's not required for this PR.

Resolved review threads (outdated): nemo_curator/datasets/doc_dataset.py, nemo_curator/utils/file_utils.py
if file.endswith(tuple(file_extensions)):
    filtered_files.append(file)
else:
    warnings.warn(f"Skipping read for file: {file}")
Collaborator:
I wonder if this might get too noisy in some cases. I'm leaning towards warning once if we have to skip, but not for every file we skip.
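
One way to do that, sketched against the snippet above (assuming the surrounding loop iterates over a files list and that file_extensions is already defined):

import warnings

filtered_files, skipped_files = [], []
for file in files:
    if file.endswith(tuple(file_extensions)):
        filtered_files.append(file)
    else:
        skipped_files.append(file)

# Warn once with a summary instead of once per skipped file.
if skipped_files:
    warnings.warn(
        f"Skipped reading {len(skipped_files)} file(s) with unexpected extensions: {skipped_files}"
    )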


sarahyurick (Collaborator, Author):

Thanks @ayushdg! Updated.

sarahyurick (Collaborator, Author):

Thank you @praateekmahajan! I have addressed all your comments.

praateekmahajan (Collaborator) left a comment:

LGTM! Thanks for adding the type hints as well. (Left two small nits on a comment/type hint.)

Successfully merging this pull request may close: DocumentDataset read errors when other files are present in directory (#214).