Finetune issue #142

Open
white2018 opened this issue Sep 20, 2024 · 3 comments

Comments

@white2018

@Yuliang-Liu Nice work! I ran into a finetune issue, shown below:
[screenshot of the finetune error]

Two NVIDIA A800 80G GPUs are used for training. The finetune script looks like:
CUDA_VISIBLE_DEVICES=$gpu torchrun \
  --nnodes=1 \
  --node_rank=0 \
  --master_addr=0.0.0.0 \
  --nproc_per_node=${GPUS} \
  --master_port=${MASTER_PORT} \
  internvl/train/internvl_chat_finetune.py \
  --model_name_or_path "models/OpenGVLab/InternVL2-2B" \
  --conv_style "internlm2-chat" \
  --output_dir ${OUTPUT_DIR} \
  --meta_path "shell/data/train-finetune.json" \
  --overwrite_output_dir True \
  --force_image_size 448 \
  --max_dynamic_patch 6 \
  --down_sample_ratio 0.5 \
  --drop_path_rate 0.0 \
  --freeze_llm True \
  --freeze_mlp True \
  --freeze_backbone True \
  --use_llm_lora 16 \
  --vision_select_layer -1 \
  --dataloader_num_workers 4 \
  --bf16 True \
  --num_train_epochs 1 \
  --per_device_train_batch_size ${PER_DEVICE_BATCH_SIZE} \
  --gradient_accumulation_steps ${GRADIENT_ACC} \
  --evaluation_strategy "no" \
  --save_strategy "steps" \
  --save_steps 200 \
  --save_total_limit 1 \
  --learning_rate 4e-6 \
  --weight_decay 0.01 \
  --warmup_ratio 0.03 \
  --lr_scheduler_type "cosine" \
  --logging_steps 1 \
  --max_seq_length 4096 \
  --do_train True \
  --grad_checkpoint True \
  --group_by_length True \
  --dynamic_image_size True \
  --use_thumbnail True \
  --ps_version 'v2' \
  --deepspeed "zero_stage1_config.json" \
  --report_to "tensorboard"

Could you please give me some clues to fix it? Thanks a lot.

@mxin262
Collaborator

mxin262 commented Sep 21, 2024

Hi~ The default is to run with eight GPUs. If you use two GPUs, you need to set --nproc_per_node to 2.
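
For reference, a minimal sketch of a two-GPU launch reusing the variables from the script above; the key change is --nproc_per_node=2, while the batch-size and accumulation values here are only examples to adjust for memory:

GPUS=2                        # one torchrun worker per visible GPU
PER_DEVICE_BATCH_SIZE=2       # example value; tune to fit 80G memory
GRADIENT_ACC=4                # effective batch = GPUS * PER_DEVICE_BATCH_SIZE * GRADIENT_ACC
MASTER_PORT=20135
CUDA_VISIBLE_DEVICES=0,1 torchrun \
  --nnodes=1 --node_rank=0 --master_addr=0.0.0.0 \
  --nproc_per_node=${GPUS} \
  --master_port=${MASTER_PORT} \
  internvl/train/internvl_chat_finetune.py \
  --per_device_train_batch_size ${PER_DEVICE_BATCH_SIZE} \
  --gradient_accumulation_steps ${GRADIENT_ACC}
  # ...followed by the remaining flags from the script above, unchanged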

@white2018
Author

white2018 commented Oct 15, 2024

> Hi~ The default is to run with eight GPUs. If you use two GPUs, you need to set --nproc_per_node to 2.

The training command looks like:
++ CUDA_VISIBLE_DEVICES=0,1
++ torchrun --nnodes=1 --node_rank=0 --master_addr=0.0.0.0 --nproc_per_node=2 --master_port=20135 internvl/train/internvl_chat_finetune.py --model_name_or_path models/OpenGVLab/InternVL2-2B --conv_style internlm2-chat --output_dir minimonkey_chat_lora --meta_path shell/data/train-finetune.json --overwrite_output_dir True --force_image_size 448 --max_dynamic_patch 6 --down_sample_ratio 0.5 --drop_path_rate 0.0 --freeze_llm True --freeze_mlp True --freeze_backbone True --use_llm_lora 16 --vision_select_layer -1 --dataloader_num_workers 4 --bf16 True --num_train_epochs 1 --per_device_train_batch_size 2 --gradient_accumulation_steps 4 --evaluation_strategy no --save_strategy steps --save_steps 200 --save_total_limit 1 --learning_rate 4e-6 --weight_decay 0.01 --warmup_ratio 0.03 --lr_scheduler_type cosine --logging_steps 1 --max_seq_length 4096 --do_train True --grad_checkpoint True --group_by_length True --dynamic_image_size True --use_thumbnail True --ps_version v2 --deepspeed zero_stage1_config.json --report_to tensorboard

which leads to a core dump; the snapshot is as follows:
[screenshot of the crash]

@mxin262
Collaborator

mxin262 commented Oct 20, 2024

Do you encounter this issue when using zero_stage3_config.json?
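
For example (a sketch that keeps every other flag from the two-GPU command above and only swaps the DeepSpeed config):

CUDA_VISIBLE_DEVICES=0,1 torchrun \
  --nnodes=1 --node_rank=0 --master_addr=0.0.0.0 \
  --nproc_per_node=2 --master_port=20135 \
  internvl/train/internvl_chat_finetune.py \
  --deepspeed "zero_stage3_config.json" \
  --report_to "tensorboard"
  # ...all other training flags identical to the command above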
