finetune problem #22

Open
tumanshu opened this issue Nov 6, 2023 · 8 comments
Comments

@tumanshu commented Nov 6, 2023

Questions:
I trained with my own data and found that the lr is always 0:
[screenshot: training log showing lr = 0]
Train script:

export EXPERIMENT_NAME=instruct_BLIP_deepSpeed_t5xxl_unfreeze_Projection_LLM_QV_weight_without_instruct_qformer_fshi
export DATASET_NAME=flickr
export CUDA_VISIBLE_DEVICES=0,1,2,3,4
export MODEL_DIR=models/
model_name_or_path=/data2/tutu/model/MMICL-Instructblip-T5-xxl
processor_path=/data2/tutu/model/instructblip-flan-t5-xxl

bs=3 #3
eval_bs=4
lr=1e-4
dropout=0.1
epoch=10
seed=1111
do_train=True
do_test=False
do_valid=False
master_port=29504
model_type=instructblip
deepspeed --master_port $master_port run.py \
--experiment_name ${EXPERIMENT_NAME} \
--dataset_name ${DATASET_NAME} \
--dataset_config_name None \
--load_datatype json \
--max_seq_length 512 \
--overwrite_cache True \
--pad_to_max_length True \
--train_file /data2/tutu/MIC/Data/fushi_data/train \
--validation_file /data2/tutu/MIC/Data/fushi_data/test \
--test_file /data2/tutu/MIC/Data/fushi_data/test \
--do_train $do_train \
--do_eval $do_valid \
--do_predict $do_test \
--per_device_train_batch_size ${bs} \
--bf16 \
--model_type $model_type \
--save_total_limit 3 \
--per_device_eval_batch_size ${eval_bs} \
--gradient_accumulation_steps 6 \
--num_train_epochs ${epoch} \
--output_dir checkpoints/${EXPERIMENT_NAME} \
--overwrite_output_dir \
--learning_rate ${lr} \
--weight_decay 0.0005 \
--seed ${seed} \
--warmup_ratio 0 \
--evaluation_strategy steps \
--eval_steps 50 \
--remove_unused_columns False \
--model_name_or_path $model_name_or_path \
--use_fast_tokenizer True \
--processor_path $processor_path \
--model_type 'instructblip' \
--model_revision main \
--eval_type val \
--generation_max_length 64 \
--done_preprocess True \
--max_eval_samples 200 \
--max_predict_samples 200 \
--run_name ${EXPERIMENT_NAME} \
--using_instruct_qformer False \
--deepspeed config/deepspeed_config.json \
--save_steps 50 \
--load_best_model_at_end False \
--logging_steps 10 \
--plot_loss True \
--lr_scheduler_type cosine \
--multiple_choice True

Best wishes.

@tumanshu changed the title from "finetune promblem" to "finetune problem" on Nov 6, 2023
@HaozheZhao (Owner)

While I haven't encountered this exact issue before, I have seen similar problems in other projects: issue#1, issue#2, issue#3.

It appears that this could be due to the deepspeed optimizer settings, to the deepspeed config causing bf16 to overflow, or to the transformers and deepspeed versions you are using.

Could you share your specific deepspeed config, and let me know which transformers and deepspeed versions you're currently using?

Taking issue#1 into consideration, pinning transformers to 4.28.0 and deepspeed to 0.8.3 might rectify the issue. It might also help to remove the fp16 and lr scheduler sections from the deepspeed config and switch the optimizer to AdamW (see the sketch below).
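
A minimal sketch of what such a trimmed ZeRO-2 config could look like, assuming AdamW and no fp16/scheduler blocks (the values below are illustrative, not copied from this repo's config/deepspeed_config.json):

{
"bf16": {
    "enabled": "auto"
},
"optimizer": {
    "type": "AdamW",
    "params": {
        "lr": "auto",
        "betas": "auto",
        "eps": "auto",
        "weight_decay": "auto"
    }
},
"zero_optimization": {
    "stage": 2,
    "offload_optimizer": {
        "device": "cpu",
        "pin_memory": true
    }
},
"gradient_accumulation_steps": "auto",
"gradient_clipping": "auto",
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto"
}

Without a scheduler block, the HuggingFace Trainer builds the LR schedule itself from --lr_scheduler_type and --warmup_ratio.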

@tumanshu (Author) commented Nov 7, 2023

deepspeed config:
{
"bf16": {
    "enabled": "auto",
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
},

"optimizer": {
    "type": "AdamW",
    "params": {
        "lr": 1e-4,
        "betas": "auto",
        "eps": "auto",
        "weight_decay": 0.0005
    }
},

"scheduler": {
    "type": "WarmupLR", 
    "params": {
        "warmup_min_lr": 0, 
        "warmup_max_lr": 0.0001, 
        "warmup_num_steps": 0
    }
}, 

"zero_optimization": {
    "stage": 2,
    "offload_param": {
        "device": "cpu",
        "pin_memory": true
      },
    "offload_optimizer": {
        "device": "cpu",
        "pin_memory": true
    },
    "allgather_partitions": true,
    "allgather_bucket_size": 6e7,
    "overlap_comm": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 6e7,
    "contiguous_gradients": true
},

"gradient_accumulation_steps": "auto",
"gradient_clipping": "auto",
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto"

}
deepspeed version is 0.9.3

@HaozheZhao (Owner)

Perhaps consider setting the parameters in the optimizer and scheduler blocks to "auto" (a rough sketch follows below). As far as I know, the HuggingFace Trainer fills those values in from your training_args, and hard-coding them can otherwise cause complications. If that does not work, maybe try changing the deepspeed version. I am also trying to reproduce this problem in my own environment.
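
A rough sketch of those two blocks with "auto" values, mirroring the config you posted above (an illustration, not the repo's shipped config):

"optimizer": {
    "type": "AdamW",
    "params": {
        "lr": "auto",
        "betas": "auto",
        "eps": "auto",
        "weight_decay": "auto"
    }
},

"scheduler": {
    "type": "WarmupLR",
    "params": {
        "warmup_min_lr": "auto",
        "warmup_max_lr": "auto",
        "warmup_num_steps": "auto"
    }
},

With these set to "auto", the values are taken from --learning_rate, --weight_decay, and --warmup_ratio in the training arguments instead of being hard-coded in the JSON.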

@tumanshu (Author) commented Nov 7, 2023

Thank you for your response.
In fact, after I removed the optimizer and scheduler from the deepspeed configuration, the LR is no longer zero.
However, I noticed that the loss isn't decreasing on my data.
[screenshot: training log showing the loss not decreasing]

I'm not sure if it's because my task is too challenging. If it's convenient, could I share my data with you?

@HaozheZhao (Owner)

Absolutely, feel free to share your data with me. You can get in touch with me via email. I will do my best to respond promptly.

P.S. Regarding your problem, I'm curious about the size of your dataset. Is it possible that the learning rate is declining, but the logged precision makes that hard to see? For example, the learning rate could be 0.000095.

@tumanshu (Author) commented Nov 7, 2023 via email

@HaozheZhao (Owner)

It appears that I did not receive the email. Perhaps you could contact me directly using the email address in my profile.

@tumanshu (Author) commented Nov 8, 2023

OK, I sent an email to [email protected].
Looking forward to your response.
