Skip reading files with incorrect extension #318
Signed-off-by: Sarah Yurick <[email protected]>
We might need to expand the list of extensions since some files are format like
input_extensions = {os.path.splitext(f)[-1] for f in input_files}
if len(input_extensions) != 1:
    raise RuntimeError(
An example of when we would expect this RuntimeError is:
doc = DocumentDataset.read_json(in_files)
where in_files is a string path to a directory with multiple JSONL files and a CRC file. Since the CRC file is not explicitly being filtered out, we raise the error.
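To make the failure mode concrete, here is a minimal standalone sketch of the extension check from the diff. The helper name check_single_extension and the sample file names are hypothetical, and the error message is a placeholder because the real one is cut off in the diff; this is not the actual NeMo Curator code.

```python
import os


def check_single_extension(input_files):
    """Return the single shared extension, or raise if extensions are mixed.

    Hypothetical helper: it mirrors the set-comprehension check in the diff,
    not the actual NeMo Curator implementation.
    """
    input_extensions = {os.path.splitext(f)[-1] for f in input_files}
    if len(input_extensions) != 1:
        # Placeholder message; the real message is truncated in the diff.
        raise RuntimeError(
            f"Expected one file extension, found: {sorted(input_extensions)}"
        )
    return input_extensions.pop()


# Hypothetical directory listing: two JSONL files plus a CRC file.
files = ["data/part-0.jsonl", "data/part-1.jsonl", "data/part-0.jsonl.crc"]
try:
    check_single_extension(files)
except RuntimeError as err:
    print(err)
```

Because the CRC file contributes a second extension, the set has two elements and the check raises.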
We can leave this as is for now.
In theory there might be cases where a user filters by [.json, .jsonl] using the file filter, which would then raise an error here. In practice I expect that to be unlikely, so we can wait and see if there is any user feedback around this.
nemo_curator/utils/file_utils.py
Outdated
root: str,
recurse_subdirectories: bool = True,
followlinks: bool = False,
filter_by: Optional[Union[str, List[str]]] = None,
All of these examples work:
(1)
input_files = get_all_files_paths_under(in_files, filter_by="jsonl")
input_dataset = DocumentDataset.read_json(input_files)
(2)
input_files = get_all_files_paths_under(in_files, filter_by=["jsonl"])
input_dataset = DocumentDataset.read_json(input_files)
(3)
# Returns a list containing only .jsonl, .parquet, and .csv files
input_files = get_all_files_paths_under(in_files, filter_by=["jsonl", "parquet", "csv"])
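For readers who want to see how a filter_by argument could accept either a string or a list, here is a rough sketch. The function filter_paths_by_extension is hypothetical and works on an in-memory list of paths; it is an assumption about the general approach, not the body of get_all_files_paths_under.

```python
from typing import List, Optional, Union


def filter_paths_by_extension(
    paths: List[str],
    filter_by: Optional[Union[str, List[str]]] = None,
) -> List[str]:
    """Keep only paths whose extension matches filter_by (hypothetical helper)."""
    if filter_by is None:
        # No filter requested: return every path unchanged.
        return list(paths)
    # Accept either a single extension or a list of them.
    extensions = [filter_by] if isinstance(filter_by, str) else list(filter_by)
    # Normalize "jsonl" and ".jsonl" to the same suffix form.
    suffixes = tuple("." + ext.lstrip(".") for ext in extensions)
    return [p for p in paths if p.endswith(suffixes)]


paths = ["a.jsonl", "b.parquet", "c.csv", "d.crc"]
print(filter_paths_by_extension(paths, filter_by="jsonl"))
print(filter_paths_by_extension(paths, filter_by=["jsonl", "parquet", "csv"]))
```

Normalizing the leading dot is a design choice in this sketch so that "jsonl" and ".jsonl" behave the same; whether the PR does this is not shown in the thread.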
Thanks for the changes. Overall the changes LGTM! Minor nits/comments.
As a follow-up, it might make sense to track updating the tutorials/notebooks to use this newer filter arg in the API, but that is not required for this PR.
nemo_curator/utils/file_utils.py
Outdated
if file.endswith(tuple(file_extensions)):
    filtered_files.append(file)
else:
    warnings.warn(f"Skipping read for file: {file}")
I wonder if this might get too noisy in some cases. I'm leaning towards warning once if we have to skip, but not for every file we skip.
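One possible shape for the "warn once" idea is sketched below. The function name keep_matching_files and the warning text are hypothetical; this only illustrates aggregating the skipped files into a single warning and is not necessarily what the PR ended up doing.

```python
import warnings
from typing import List


def keep_matching_files(files: List[str], file_extensions: List[str]) -> List[str]:
    """Filter files by extension and warn once about skipped ones (hypothetical)."""
    suffixes = tuple(file_extensions)
    filtered_files = [f for f in files if f.endswith(suffixes)]
    skipped = len(files) - len(filtered_files)
    if skipped > 0:
        # One aggregated warning instead of one warning per skipped file.
        warnings.warn(f"Skipped {skipped} file(s) with unexpected extensions")
    return filtered_files


print(keep_matching_files(["a.jsonl", "b.crc", "c.crc"], [".jsonl"]))
```

Reporting a count keeps the output short for large directories, at the cost of not naming each skipped file.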
Thanks @ayushdg! Updated.
Thank you @praateekmahajan! I have addressed all your comments.
LGTM! Thanks for adding type hint as well. (left two small nits on comment / typehint)
Closes #214.