This repository has been archived by the owner on May 14, 2024. It is now read-only.

Training LoRA always stops at epoch 4 of 10 #316

Open
PsypmP opened this issue Nov 19, 2023 · 0 comments
Comments

PsypmP commented Nov 19, 2023

Hardware: RTX 4080 Laptop GPU, 12 GB VRAM. Training always stops at epoch 4 of 10 with this error:

RuntimeError: CUDA error: CUBLAS_STATUS_INTERNAL_ERROR when calling cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16BF, lda, b, CUDA_R_16BF, ldb, &fbeta, c, CUDA_R_16BF, ldc, CUDA_R_32F, CUBLAS_GEMM_DFALT_TENSOR_OP)

(rest of the traceback omitted)

raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['F:\!NeuroNet\LORA\Kohya\venv\Scripts\python.exe', './sdxl_train_network.py', '--enable_bucket', '--min_bucket_reso=256', '--max_bucket_reso=2048', '--pretrained_model_name_or_path=stabilityai/stable-diffusion-xl-base-1.0', '--train_data_dir=F:/!NeuroNet/LORA/TRAIN3\img', '--reg_data_dir=F:/!NeuroNet/LORA/TRAIN3\reg', '--resolution=1024,1024', '--output_dir=F:/!NeuroNet/LORA/TRAIN3\model', '--logging_dir=F:/!NeuroNet/LORA/TRAIN3\log', '--network_alpha=1', '--save_model_as=safetensors', '--network_module=networks.lora', '--text_encoder_lr=0.0003', '--unet_lr=0.0003', '--network_dim=128', '--output_name=AlexeyF', '--lr_scheduler_num_cycles=10', '--no_half_vae', '--learning_rate=0.0003', '--lr_scheduler=constant', '--train_batch_size=1', '--max_train_steps=8000', '--save_every_n_epochs=1', '--mixed_precision=bf16', '--save_precision=bf16', '--caption_extension=.txt', '--cache_latents', '--cache_latents_to_disk', '--optimizer_type=Adafactor', '--optimizer_args', 'scale_parameter=False', 'relative_step=False', 'warmup_init=False', '--max_data_loader_n_workers=0', '--bucket_reso_steps=64', '--gradient_checkpointing', '--xformers', '--bucket_no_upscale', '--noise_offset=0.0']' returned non-zero exit status 1.

Config
accelerate launch --num_cpu_threads_per_process=2 "./sdxl_train_network.py" --enable_bucket
--min_bucket_reso=256 --max_bucket_reso=2048
--pretrained_model_name_or_path="stabilityai/stable-diffusion-xl-base-1.0"
--train_data_dir="F:/!NeuroNet/LORA/TRAIN3\img" --reg_data_dir="F:/!NeuroNet/LORA/TRAIN3\reg"
--resolution="1024,1024" --output_dir="F:/!NeuroNet/LORA/TRAIN3\model"
--logging_dir="F:/!NeuroNet/LORA/TRAIN3\log" --network_alpha="1" --save_model_as=safetensors
--network_module=networks.lora --text_encoder_lr=0.0003 --unet_lr=0.0003 --network_dim=128
--output_name="AlexeyF" --lr_scheduler_num_cycles="10" --no_half_vae --learning_rate="0.0003"
--lr_scheduler="constant" --train_batch_size="1" --max_train_steps="8000"
--save_every_n_epochs="1" --mixed_precision="bf16" --save_precision="bf16"
--caption_extension=".txt" --cache_latents --cache_latents_to_disk --optimizer_type="Adafactor"
--optimizer_args scale_parameter=False relative_step=False warmup_init=False
--max_data_loader_n_workers="0" --bucket_reso_steps=64 --gradient_checkpointing --xformers
--bucket_no_upscale --noise_offset=0.0
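
A possible way to narrow this down, offered as a sketch rather than a confirmed diagnosis: the failing call is a bf16 `cublasGemmEx`, so a standalone bf16 matrix multiplication in the same venv can show whether the GPU, driver, and PyTorch build handle that operation at all outside the training script. The function name `bf16_gemm_smoke_test` and the matrix size are made up for illustration. It only uses standard PyTorch calls and reports a status string instead of raising.

```python
def bf16_gemm_smoke_test(size=1024, repeats=10):
    """Run a few bf16 matrix multiplications on the GPU and report the outcome.

    Hypothetical helper for isolating the error; not part of the training scripts.
    """
    try:
        import torch
    except ImportError:
        return "torch-missing"

    if not torch.cuda.is_available():
        return "no-cuda"
    if not torch.cuda.is_bf16_supported():
        return "no-bf16"

    # Print free and total VRAM, since memory pressure is one thing worth ruling out.
    free_bytes, total_bytes = torch.cuda.mem_get_info()
    print(f"free VRAM: {free_bytes / 2**30:.2f} GiB of {total_bytes / 2**30:.2f} GiB")

    try:
        a = torch.randn(size, size, device="cuda", dtype=torch.bfloat16)
        b = torch.randn(size, size, device="cuda", dtype=torch.bfloat16)
        for _ in range(repeats):
            c = a @ b  # bf16 matrix multiplication on the GPU
        # CUDA kernels run asynchronously; synchronize so errors surface here.
        torch.cuda.synchronize()
    except RuntimeError as err:
        print(f"bf16 GEMM failed: {err}")
        return "gemm-failed"
    return "ok"


if __name__ == "__main__":
    print(bf16_gemm_smoke_test())
```

If this returns "ok" while training still fails at the same epoch, the problem is more likely tied to the training run itself (for example memory use at that point, or a particular bucket or image) than to bf16 support in general. That reading is an assumption, not something the log above proves.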
