Fix in src/autotrain/trainers/clm/utils.py in function #836

Open
wants to merge 1 commit into
base: main

Pragyan02

Refactor process_data_with_chat_template to Improve Validation Data Handling

Description:

This pull request refactors the process_data_with_chat_template function to improve clarity and handling of the optional valid_data argument. Specifically, it ensures that valid_data can be passed as None without requiring manual initialization within the function.

Key Changes:

Optional Argument Handling:

  • Updated the valid_data argument to be optional (valid_data=None) in the function definition.
  • Removed the redundant initialization of valid_data to None within the function body.

Improved Readability:

  • Simplified the function signature by directly reflecting the optional nature of valid_data.
  • This improves the function's usability and makes the intent clearer when valid_data is not provided.

Unchanged Logic:

  • The core logic of the function remains intact. The chat template is applied to both train_data and valid_data if specified in the configuration.
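The signature change described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `TinyDataset`, `SketchConfig`, and `apply_template` are hypothetical stand-ins, and the parameter list is simplified to the arguments mentioned in the description.

```python
from dataclasses import dataclass
from typing import Optional


class TinyDataset:
    """Hypothetical stand-in for a dataset object that exposes .map()."""

    def __init__(self, rows):
        self.rows = list(rows)

    def map(self, fn):
        return TinyDataset(fn(row) for row in self.rows)


@dataclass
class SketchConfig:
    """Hypothetical config carrying only the two fields used in this sketch."""

    chat_template: Optional[str] = None
    valid_split: Optional[str] = None


def apply_template(row):
    # Placeholder for the real chat-template rendering (assumption).
    return {**row, "text": "<chat>" + row["text"] + "</chat>"}


def process_data_with_chat_template(config, train_data, valid_data=None):
    # valid_data defaults to None in the signature, so the body no longer
    # needs to reassign it.
    if config.chat_template is not None:
        train_data = train_data.map(apply_template)
        if config.valid_split is not None:
            valid_data = valid_data.map(apply_template)
    return train_data, valid_data


# Calling without valid_data works when no validation split is configured.
train = TinyDataset([{"text": "hi"}])
out_train, out_valid = process_data_with_chat_template(
    SketchConfig(chat_template="chatml"), train
)
```

With the default in place, callers that have no validation set can omit the argument entirely and get `None` back for the validation output.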

@Pragyan02 Pragyan02 closed this Jan 3, 2025
@Pragyan02 Pragyan02 reopened this Jan 3, 2025
@Pragyan02
Author

Hi @abhishekkrthakur,

This is my first-ever open source contribution, and I’m really excited to be part of this project! 🎉

I’d love to hear your feedback or suggestions for improvement on this PR. Please let me know if there’s anything I can refine or do differently. Looking forward to learning from your insights!

Thank you!

@Ruhaan838 Ruhaan838 left a comment

I think your idea works.
But the function no longer strictly requires the valid_data argument.
Is that actually needed?

@Pragyan02
Author

Hi @Ruhaan838 ,

Thank you for your feedback!

The reason for this fix is:

If config.valid_split is not None, but valid_data is None, the current implementation would raise an error when attempting to call .map on valid_data. Making valid_data optional ensures the function gracefully handles such cases without breaking.
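The failure mode described here can be reproduced in isolation. The sketch below uses hypothetical helper names (`map_unguarded`, `map_guarded`) and plain arguments instead of the project's config object. It shows that calling `.map` on `None` raises `AttributeError`, and that an explicit `is not None` check is what avoids it. Whether the PR adds such a check is not shown in this conversation, so treat the guarded variant as an assumption.

```python
def map_unguarded(valid_split, valid_data, fn):
    # Mirrors the described behavior: only the split setting is checked.
    if valid_split is not None:
        valid_data = valid_data.map(fn)
    return valid_data


def map_guarded(valid_split, valid_data, fn):
    # Hypothetical variant: also skip mapping when no data was passed.
    if valid_split is not None and valid_data is not None:
        valid_data = valid_data.map(fn)
    return valid_data


# A split is configured, but no validation data was provided.
try:
    map_unguarded("validation", None, str.upper)
    error_name = None
except AttributeError as exc:
    error_name = type(exc).__name__

guarded_result = map_guarded("validation", None, str.upper)
```

A default value of `None` in the signature only makes the argument optional for callers; the extra check is what keeps the `.map` call from running on `None`.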

@Pragyan02 Pragyan02 requested a review from Ruhaan838 January 20, 2025 13:14
@Ruhaan838 Ruhaan838 left a comment

#836 is ready to merge.
I think he is right.
Also, it allows valid_data to be skipped, since the argument is now optional.

@abhishekkrthakur
Member

Sorry for my late response on this. As mentioned in several past issues, we don't want to have an option for validation in LLM fine-tuning. And if we do, this PR doesn't cover it. To proceed with the PR, I suggest you take a look at the codebase and provide changes for all the places where valid data can be used. You should also provide example test runs with different LLM tasks :)

@Ruhaan838

Oh, I understand. You're saying the change needs to cover all the existing functions and classes where valid data is used, so that the codebase stays consistent and maintainable. In that case, please make the valid_data argument optional everywhere.

3 participants