Finetune issue #142

Open
white2018 opened this issue Sep 20, 2024 · 3 comments

Comments

@white2018

@Yuliang-Liu Nice work! I ran into a finetune issue, shown below:
[screenshot of the finetune error]

Two NVIDIA A800 80G GPUs are used for training. The finetune script looks like:
CUDA_VISIBLE_DEVICES=$gpu torchrun \
  --nnodes=1 \
  --node_rank=0 \
  --master_addr=0.0.0.0 \
  --nproc_per_node=${GPUS} \
  --master_port=${MASTER_PORT} \
  internvl/train/internvl_chat_finetune.py \
  --model_name_or_path "models/OpenGVLab/InternVL2-2B" \
  --conv_style "internlm2-chat" \
  --output_dir ${OUTPUT_DIR} \
  --meta_path "shell/data/train-finetune.json" \
  --overwrite_output_dir True \
  --force_image_size 448 \
  --max_dynamic_patch 6 \
  --down_sample_ratio 0.5 \
  --drop_path_rate 0.0 \
  --freeze_llm True \
  --freeze_mlp True \
  --freeze_backbone True \
  --use_llm_lora 16 \
  --vision_select_layer -1 \
  --dataloader_num_workers 4 \
  --bf16 True \
  --num_train_epochs 1 \
  --per_device_train_batch_size ${PER_DEVICE_BATCH_SIZE} \
  --gradient_accumulation_steps ${GRADIENT_ACC} \
  --evaluation_strategy "no" \
  --save_strategy "steps" \
  --save_steps 200 \
  --save_total_limit 1 \
  --learning_rate 4e-6 \
  --weight_decay 0.01 \
  --warmup_ratio 0.03 \
  --lr_scheduler_type "cosine" \
  --logging_steps 1 \
  --max_seq_length 4096 \
  --do_train True \
  --grad_checkpoint True \
  --group_by_length True \
  --dynamic_image_size True \
  --use_thumbnail True \
  --ps_version 'v2' \
  --deepspeed "zero_stage1_config.json" \
  --report_to "tensorboard"

Could you please give me some clues to fix it? Thanks a lot.

@mxin262
Collaborator

mxin262 commented Sep 21, 2024

Hi~ The default is to run with eight GPUs. If you use two GPUs, you need to set --nproc_per_node to 2.
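
For reference, a minimal sketch of a two-GPU launch reusing the variables from the script above; the key change is --nproc_per_node=2, while the batch-size and accumulation values here are only examples to adjust for memory:

GPUS=2                        # one torchrun worker per visible GPU
PER_DEVICE_BATCH_SIZE=2       # example value; tune to fit 80G memory
GRADIENT_ACC=4                # effective batch = GPUS * PER_DEVICE_BATCH_SIZE * GRADIENT_ACC
MASTER_PORT=20135
CUDA_VISIBLE_DEVICES=0,1 torchrun \
  --nnodes=1 --node_rank=0 --master_addr=0.0.0.0 \
  --nproc_per_node=${GPUS} \
  --master_port=${MASTER_PORT} \
  internvl/train/internvl_chat_finetune.py \
  --per_device_train_batch_size ${PER_DEVICE_BATCH_SIZE} \
  --gradient_accumulation_steps ${GRADIENT_ACC}
  # ...followed by the remaining flags from the script above, unchanged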

@white2018
Author

white2018 commented Oct 15, 2024

> Hi~ The default is to run with eight GPUs. If you use two GPUs, you need to set --nproc_per_node to 2.

The training command looks like:
++ CUDA_VISIBLE_DEVICES=0,1
++ torchrun --nnodes=1 --node_rank=0 --master_addr=0.0.0.0 --nproc_per_node=2 --master_port=20135 internvl/train/internvl_chat_finetune.py --model_name_or_path models/OpenGVLab/InternVL2-2B --conv_style internlm2-chat --output_dir minimonkey_chat_lora --meta_path shell/data/train-finetune.json --overwrite_output_dir True --force_image_size 448 --max_dynamic_patch 6 --down_sample_ratio 0.5 --drop_path_rate 0.0 --freeze_llm True --freeze_mlp True --freeze_backbone True --use_llm_lora 16 --vision_select_layer -1 --dataloader_num_workers 4 --bf16 True --num_train_epochs 1 --per_device_train_batch_size 2 --gradient_accumulation_steps 4 --evaluation_strategy no --save_strategy steps --save_steps 200 --save_total_limit 1 --learning_rate 4e-6 --weight_decay 0.01 --warmup_ratio 0.03 --lr_scheduler_type cosine --logging_steps 1 --max_seq_length 4096 --do_train True --grad_checkpoint True --group_by_length True --dynamic_image_size True --use_thumbnail True --ps_version v2 --deepspeed zero_stage1_config.json --report_to tensorboard

which leads to a core dump; the snapshot is as follows:
[screenshot of the crash]

@mxin262
Collaborator

mxin262 commented Oct 20, 2024

Do you encounter this issue when using zero_stage3_config.json?
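
For example (a sketch that keeps every other flag from the two-GPU command above and only swaps the DeepSpeed config):

CUDA_VISIBLE_DEVICES=0,1 torchrun \
  --nnodes=1 --node_rank=0 --master_addr=0.0.0.0 \
  --nproc_per_node=2 --master_port=20135 \
  internvl/train/internvl_chat_finetune.py \
  --deepspeed "zero_stage3_config.json" \
  --report_to "tensorboard"
  # ...all other training flags identical to the command above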
