finetune problem #22
While I haven't encountered this exact issue before, I have noted similar problems in other projects: issue#1, issue#2, issue#3. It appears that this could be due either to the deepspeed optimizer settings or to the deepspeed config causing the bf16 to overflow, or it might be related to the transformers and deepspeed versions you are using. Can you provide your specific deepspeed config settings, as well as the transformers and deepspeed versions you're currently using? Based on issue#1, setting transformers to 4.28.0 and deepspeed to 0.8.3 might rectify the issue. Also, it might be beneficial to remove the fp16 and lr scheduler configurations from the deepspeed config and switch the optimizer to AdamW.
deepspeed config:
}
Perhaps consider configuring the parameters in the optimizer and scheduler to "auto". As far as I know, the HuggingFace trainer sets those parameters based on your training_args, which might otherwise cause some complications. If that does not work, maybe try changing the version of deepspeed. I am also trying to reproduce this problem in my own environment.
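For concreteness, here is a minimal sketch of what a deepspeed config with "auto" placeholders could look like. The ZeRO stage, the output path, and the decision to omit a "scheduler" block (so the Trainer's own --lr_scheduler_type cosine schedule is used, as suggested above) are illustrative assumptions, not the repository's actual config.

```python
# Hypothetical DeepSpeed config with "auto" placeholders, written to the path
# that the training script passes via --deepspeed. The ZeRO stage and the
# omission of a "scheduler" section are assumptions for illustration only.
import json

ds_config = {
    "bf16": {"enabled": "auto"},          # filled in from --bf16
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",                 # filled in from --learning_rate
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto",       # filled in from --weight_decay
        },
    },
    "zero_optimization": {"stage": 2},
    "gradient_accumulation_steps": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_clipping": "auto",
}

with open("config/deepspeed_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```

With every value left as "auto", the HuggingFace Trainer fills them in from the training arguments at launch, so the config file cannot silently conflict with the command-line settings.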
Thank you for your response. I'm not sure if it's because my task is too challenging. If it's convenient, could I share my data with you?
Absolutely, feel free to share your data with me. You can get in touch with me via email. I will do my best to respond promptly. P.S. Regarding your problem, I'm curious to know the size of your dataset. Is it possible that the learning rate is declining, but due to data precision, it's not discernible? For example, the learning rate could be 0.000095.
Here is my training dataset.
The task takes pictures of clothing together with their descriptions as input, and generates suitable descriptions of backgrounds/models/shoes for displaying the clothing.
Cloud attachment (sent via NetEase Mail Master): train.zip (189 MB, download link expires 2023-11-22 17:18).
It appears that I am not receiving the email. Perhaps you could contact me directly using the email in my profile.
OK, I sent an email to [email protected].
Questions:
I used my own data for training, and I found that the learning rate is always 0:
train script:
export EXPERIMENT_NAME=instruct_BLIP_deepSpeed_t5xxl_unfreeze_Projection_LLM_QV_weight_without_instruct_qformer_fshi
export DATASET_NAME=flickr
export CUDA_VISIBLE_DEVICES=0,1,2,3,4
export MODEL_DIR=models/
model_name_or_path=/data2/tutu/model/MMICL-Instructblip-T5-xxl
processor_path=/data2/tutu/model/instructblip-flan-t5-xxl
bs=3 #3
eval_bs=4
lr=1e-4
dropout=0.1
epoch=10
seed=1111
do_train=True
do_test=False
do_valid=False
master_port=29504
model_type=instructblip
deepspeed --master_port $master_port run.py \
--experiment_name ${EXPERIMENT_NAME} \
--dataset_name ${DATASET_NAME} \
--dataset_config_name None \
--load_datatype json \
--max_seq_length 512 \
--overwrite_cache True \
--pad_to_max_length True \
--train_file /data2/tutu/MIC/Data/fushi_data/train \
--validation_file /data2/tutu/MIC/Data/fushi_data/test \
--test_file /data2/tutu/MIC/Data/fushi_data/test \
--do_train $do_train \
--do_eval $do_valid \
--do_predict $do_test \
--per_device_train_batch_size ${bs} \
--bf16 \
--model_type $model_type \
--save_total_limit 3 \
--per_device_eval_batch_size ${eval_bs} \
--gradient_accumulation_steps 6 \
--num_train_epochs ${epoch} \
--output_dir checkpoints/${EXPERIMENT_NAME} \
--overwrite_output_dir \
--learning_rate ${lr} \
--weight_decay 0.0005 \
--seed ${seed} \
--warmup_ratio 0 \
--evaluation_strategy steps \
--eval_steps 50 \
--remove_unused_columns False \
--model_name_or_path $model_name_or_path \
--use_fast_tokenizer True \
--processor_path $processor_path \
--model_type 'instructblip' \
--model_revision main \
--eval_type val \
--generation_max_length 64 \
--done_preprocess True \
--max_eval_samples 200 \
--max_predict_samples 200 \
--run_name ${EXPERIMENT_NAME} \
--using_instruct_qformer False \
--deepspeed config/deepspeed_config.json \
--save_steps 50 \
--load_best_model_at_end False \
--logging_steps 10 \
--plot_loss True \
--lr_scheduler_type cosine \
--multiple_choice True
Best wishes.
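As a side note on the precision point raised earlier in the thread, one way to rule out a pure display issue is to print the logged learning rate at full precision. Below is a minimal sketch using a standard transformers TrainerCallback; the callback class name is made up for illustration, and where it gets registered inside run.py is an assumption.

```python
from transformers import TrainerCallback

class LrDebugCallback(TrainerCallback):
    """Print the logged learning rate at full precision at every logging step."""

    def on_log(self, args, state, control, logs=None, **kwargs):
        if logs is not None and "learning_rate" in logs:
            print(f"step {state.global_step}: learning_rate = {logs['learning_rate']:.12g}")

# Assumed usage somewhere after the Trainer is constructed:
# trainer.add_callback(LrDebugCallback())
```

If the printed value is a small non-zero number, the lr of 0 in the progress output is only rounding; if it really is 0.0, a conflict between the Trainer's scheduler and the deepspeed config is the more likely culprit.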