using device: cuda
calculating total steps
100%|████████████████████████████████████████████████████████████████████████████| 100/100 [00:01<00:00, 92.82it/s]
total steps = 3914
Let's use 4 GPUs!
starting training
epoch 1
time: 2023-01-13 11:48:51.538218
/u01/zourui/anaconda3/envs/GPT/lib/python3.8/site-packages/torch/nn/parallel/functions.py:65: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
warnings.warn('Was asked to gather along dimension 0, but all '
/u01/zourui/anaconda3/envs/GPT/lib/python3.8/site-packages/transformers/optimization.py:166: UserWarning: This overload of add is deprecated:
add_(Number alpha, Tensor other)
Consider using one of the following signatures instead:
add_(Tensor other, *, Number alpha) (Triggered internally at /pytorch/torch/csrc/utils/python_arg_parser.cpp:1005.)
exp_avg.mul_(beta1).add_(1.0 - beta1, grad)
now time: 11:49. Step 1 of piece 0 of epoch 1, loss 9.667740821838379
now time: 11:49. Step 2 of piece 0 of epoch 1, loss 9.682665824890137
now time: 11:49. Step 3 of piece 0 of epoch 1, loss 9.685418128967285
now time: 11:49. Step 4 of piece 0 of epoch 1, loss 9.6702299118042
now time: 11:49. Step 5 of piece 0 of epoch 1, loss 9.668827056884766
now time: 11:49. Step 6 of piece 0 of epoch 1, loss 9.66973876953125
now time: 11:49. Step 7 of piece 0 of epoch 1, loss 9.65914535522461
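I hit this error when training on three GPUs. After searching around, I found a report from someone training on four GPUs who got RuntimeError: Input tensor at index 3 has invalid shape [2, 2, 16, 128, 64] but expected [2, 4, 16, 128, 64]. So I switched back to training on four GPUs, and oddly enough it ran without problems. I still don't know why, though.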
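One possible explanation, which nobody in this thread has confirmed: nn.DataParallel splits each batch along dimension 0 into near-equal chunks, one per GPU, and then gathers the outputs back along dimension 0. With "output_past": true the model also returns cached key/value tensors whose batch size sits in dimension 1, so the gather only works if every GPU received the same number of samples. The args in this report use batch_size=8, which divides evenly by 4 but not by 3. The sketch below imitates the ceil-sized chunking in plain Python; `chunk_sizes` and `gather_is_safe` are hypothetical helper names, not part of PyTorch or of this repo.

```python
import math


def chunk_sizes(batch_size, num_devices):
    """Per-device batch sizes, assuming ceil-sized chunks where the
    last chunk may be smaller (the behaviour of torch.chunk)."""
    step = math.ceil(batch_size / num_devices)
    sizes = []
    remaining = batch_size
    while remaining > 0:
        take = min(step, remaining)
        sizes.append(take)
        remaining -= take
    return sizes


def gather_is_safe(batch_size, num_devices):
    """True when every device gets the same number of samples, so
    outputs that keep the batch size outside dim 0 still line up."""
    return len(set(chunk_sizes(batch_size, num_devices))) == 1


if __name__ == "__main__":
    for n in (3, 4):
        print(n, chunk_sizes(8, n), gather_is_safe(8, n))
```

If this is indeed the cause, choosing a batch_size that is a multiple of the GPU count (for example 9 or 12 on three GPUs), or setting "output_past" to false during training, should avoid the shape mismatch. Note that a final, smaller batch at the end of a piece could still split unevenly.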
args:
Namespace(batch_size=8, device='5,6,1,4', epochs=5, fp16=False, fp16_opt_level='O1', gradient_accumulation=1, log_step=1, lr=0.00015, max_grad_norm=1.0, model_config='config/model_config_small.json', num_pieces=100, output_dir='model/', pretrained_model='', raw=False, raw_data_path='data/data/doupo/train.json', segment=False, stride=768, tokenized_data_path='data/tokenized/', tokenizer_path='cache/vocab_small.txt', warmup_steps=2000)
config:
{
"attn_pdrop": 0.1,
"embd_pdrop": 0.1,
"finetuning_task": null,
"initializer_range": 0.02,
"layer_norm_epsilon": 1e-05,
"n_ctx": 1024,
"n_embd": 768,
"n_head": 12,
"n_layer": 10,
"n_positions": 1024,
"num_labels": 1,
"output_attentions": false,
"output_hidden_states": false,
"output_past": true,
"pruned_heads": {},
"resid_pdrop": 0.1,
"summary_activation": null,
"summary_first_dropout": 0.1,
"summary_proj_to_labels": true,
"summary_type": "cls_index",
"summary_use_proj": true,
"torchscript": false,
"use_bfloat16": false,
"vocab_size": 13317
}