Only has 0.44 accuracy on GSM8K after running the provided codes #13

TrueNobility303 opened this issue Jun 10, 2024 · 10 comments

@TrueNobility303 commented Jun 10, 2024

Dear authors,

I trained the CLLM model on GSM8K with Abel-7B-001 as the teacher model, using the cleaned_gsm8k_jacobi dataset you provided on Hugging Face. I ran train_cllm.sh and set "use_gt_labels" in train_cllm_global.py to False, following the previous issue.

The trained model only reaches an accuracy of 0.44 when evaluated with bash eval/gsm8k/acc.sh, which is much lower than the result of the checkpoint you provided.

Could you tell me what is going wrong? What are the exact hyperparameters needed to reproduce the results?

I would greatly appreciate it if you could help me.

Best regards.

@TrueNobility303 changed the title from "Cannot reproduce the result" to "Only has 0.44 accuracy on GSM8K after running the provided codes" on Jun 10, 2024
@karrykkk (Collaborator)

Hi~ Thanks for your interest in our work! ☺️
Please set use_gt_labels to True so that the stricter ground-truth AR labels are used (teacher_output_ids are closer to the original distribution but may not be accurate), and train for a whole epoch on this dataset (max_new_seq_len=256 in the collected Jacobi trajectories). Model collapse may happen during training, but the AR loss on the ground-truth labels will pull the model back toward the normal distribution with further training.
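To illustrate what the flag is meant to switch between, here is a minimal sketch; the field names labels_ids and teacher_output_ids come from the trajectory dataset, but the exact logic in train_cllm_global.py may differ in detail:

```python
# Hypothetical sketch of the use_gt_labels switch; not copied from the
# actual training script.
def select_ar_labels(sample, use_gt_labels: bool):
    if use_gt_labels:
        # Stricter supervision: the ground-truth (question + answer) token ids.
        return sample["labels_ids"]
    # Otherwise fall back to the teacher's own generations, which are closer
    # to the teacher's output distribution but may contain errors.
    return sample["teacher_output_ids"]
```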
Feel free to reach out if there are still problems.

@TrueNobility303 (Author)

I use exactly the same dataset and train for a whole epoch, but I cannot get the desired result regardless of whether use_gt_labels is set to True or False.

@karrykkk (Collaborator)

Hi~ Sorry for the confusion. This may result from a bug in the training script (i.e., forgetting to add detach() to the loss, which leads to extra backpropagation). We have fixed this in the latest version of the code:

loss = loss_ar.detach() + loss_global.detach()
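For context, .detach() removes a term from the autograd graph, so a detached term is treated as a constant and contributes no gradient. A minimal PyTorch sketch of this effect (not taken from the training script):

```python
import torch

w = torch.tensor(2.0, requires_grad=True)
loss_a = (w - 1.0) ** 2
loss_b = (w - 3.0) ** 2

# Detaching a term removes it from the autograd graph, so only the
# non-detached term contributes to w.grad.
loss = loss_a + loss_b.detach()
loss.backward()
print(w.grad)  # gradient of loss_a only: 2 * (2.0 - 1.0) = 2.0
```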

You can try running the updated code and the results should be normal now. Let me know if you have any other questions!

@TrueNobility303 (Author) commented Jun 18, 2024

But after this modification, the accuracy drops to 0.0. It seems that the modification is not correct.

@karrykkk (Collaborator) commented Jun 20, 2024

My bad😥... While we do use .detach() during CLLM training, we found that the key bug causing this result is that the labels_ids field in the dataset we provided earlier is defective. As shown in the screenshot below, labels_ids only includes the question, and the ground-truth answer is missing.
[screenshot: a dataset sample whose labels_ids contains only the question tokens, with the ground-truth answer missing]
We have launched a fixed generation run and will update the dataset on Hugging Face as soon as it finishes. You can either wait for the generation to complete or generate your own dataset by running the provided script. Sorry again, and thanks for your patience and understanding.
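In the meantime, a quick way to check whether a sample is affected is to decode its labels_ids and look for the answer text. A rough sketch, where the dataset path and model path are placeholders and the field layout may differ from the actual release:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Placeholder identifiers: point these at the actual Jacobi-trajectory dataset
# and the tokenizer that was used to collect it.
ds = load_dataset("path/to/cleaned_gsm8k_jacobi", split="train")
tok = AutoTokenizer.from_pretrained("path/to/Abel-7B-001")

sample = ds[0]
# Drop any ignore-index entries (e.g. -100) if they are present in the labels.
ids = [t for t in sample["labels_ids"] if t >= 0]
decoded = tok.decode(ids, skip_special_tokens=True)
print(decoded)  # should contain both the question and the ground-truth answer
```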

@TrueNobility303
Copy link
Author

But it is strange that setting use_gt_labels=False still does not solve this problem.

@snyhlxde1 (Collaborator) commented Aug 5, 2024

Hi @TrueNobility303. Thanks for your patience! We have identified the problems in the training script:

  1. Instead of complete_teacher_output_ids, the teacher_output_ids from the trajectory dataset should be used.
  2. During training, the model should be loaded in bf16, consistent with the setting used when generating the trajectories (see the sketch below).

Please pull the code again. After applying these patches, training should give you much better performance, in line with what we reported in the paper.
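For item 2, a minimal sketch of loading the model in bf16 with transformers; the model path is a placeholder, and the actual loading is done inside the training script:

```python
import torch
from transformers import AutoModelForCausalLM

# Load the target model in bfloat16 so the training precision matches the
# precision used when the Jacobi trajectories were generated.
model = AutoModelForCausalLM.from_pretrained(
    "path/to/target_model",  # placeholder path
    torch_dtype=torch.bfloat16,
)
```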

@Songwxuan

Hi~ Thank you for your reply. So after the modification, should I set use_gt_labels to True or not?

@karrykkk (Collaborator) commented Jan 8, 2025

Sorry for the earlier confusion. We have checked that the current version, with use_gt_labels set to False, can reproduce the results reported in the paper.

@Songwxuan

Thanks a lot!
