This repo contains the code for a simple ControlNet module for the CogVideoX model.
ComfyUI-CogVideoXWrapper supports the ControlNet pipeline. See the example file.
Supported models for 5B:
- Canny (HF Model Link)
- Hed (HF Model Link)
Supported models for 2B:
- Canny (HF Model Link)
- Hed (HF Model Link)
Clone repo
git clone https://github.com/TheDenk/cogvideox-controlnet.git
cd cogvideox-controlnet
Create venv
python -m venv venv
source venv/bin/activate
Install requirements
pip install -r requirements.txt
python -m inference.cli_demo \
--video_path "resources/car.mp4" \
--prompt "The camera follows behind red car. Car is surrounded by a panoramic view of the vast, azure ocean. Seagulls soar overhead, and in the distance, a lighthouse stands sentinel, its beam cutting through the twilight. The scene captures a perfect blend of adventure and serenity, with the car symbolizing freedom on the open sea." \
--controlnet_type "canny" \
--base_model_path THUDM/CogVideoX-5b \
--controlnet_model_path TheDenk/cogvideox-5b-controlnet-canny-v1
python -m inference.gradio_web_demo \
--controlnet_type "canny" \
--base_model_path THUDM/CogVideoX-5b \
--controlnet_model_path TheDenk/cogvideox-5b-controlnet-canny-v1
CUDA_VISIBLE_DEVICES=0 python -m inference.cli_demo \
--video_path "resources/car.mp4" \
--prompt "The camera follows behind red car. Car is surrounded by a panoramic view of the vast, azure ocean. Seagulls soar overhead, and in the distance, a lighthouse stands sentinel, its beam cutting through the twilight. The scene captures a perfect blend of adventure and serenity, with the car symbolizing freedom on the open sea." \
--controlnet_type "canny" \
--base_model_path THUDM/CogVideoX-5b \
--controlnet_model_path TheDenk/cogvideox-5b-controlnet-canny-v1 \
--num_inference_steps 50 \
--guidance_scale 6.0 \
--controlnet_weights 1.0 \
--controlnet_guidance_start 0.0 \
--controlnet_guidance_end 0.5 \
--output_path "./output.mp4" \
--seed 42
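As a rough illustration of how --controlnet_guidance_start and --controlnet_guidance_end might interact with --num_inference_steps, the sketch below assumes that both values are fractions of the denoising schedule and that the ControlNet is applied only on steps whose progress falls inside that window. This interpretation, the boundary handling, and the helper name active_controlnet_steps are assumptions made for illustration; they are not taken from the repository's code.

```python
def active_controlnet_steps(num_inference_steps, guidance_start, guidance_end):
    """Return the step indices on which the ControlNet would be applied.

    Assumption: progress for step i is i / num_inference_steps, and the
    ControlNet is active while guidance_start <= progress < guidance_end.
    """
    if not 0.0 <= guidance_start <= guidance_end <= 1.0:
        raise ValueError("expected 0.0 <= guidance_start <= guidance_end <= 1.0")
    return [
        i
        for i in range(num_inference_steps)
        if guidance_start <= i / num_inference_steps < guidance_end
    ]


# Values from the command above: 50 steps, window 0.0 to 0.5.
steps = active_controlnet_steps(50, 0.0, 0.5)
print(f"ControlNet active on {len(steps)} of 50 steps: {steps[0]}..{steps[-1]}")
```

Under these assumptions, the settings above would apply the control signal during the first half of sampling and leave the second half to the base model alone.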
The 2B model requires 48 GB of VRAM (for example, an A6000), and the 5B model requires 80 GB. The exact amount depends on the number of transformer blocks, which defaults to 8 (the controlnet_transformer_num_layers parameter in the config).
The OpenVid-1M dataset was used as the base variant. CSV files for the dataset can be found here.
To start training, fill in the config files accelerate_config_machine_single.yaml and finetune_single_rank.sh.
In accelerate_config_machine_single.yaml, change the parameter num_processes: 1 to match your GPU count.
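Because num_processes should match the number of GPUs, and the CUDA_VISIBLE_DEVICES variable controls which GPUs a process can see, one value can be derived from the other. The sketch below only parses the environment variable and does not query the driver; the function name visible_gpu_count is hypothetical.

```python
import os


def visible_gpu_count(env=None):
    """Count the device entries listed in CUDA_VISIBLE_DEVICES.

    Returns None when the variable is unset, because then every GPU on the
    machine is visible and the count has to come from elsewhere.
    """
    env = os.environ if env is None else env
    value = env.get("CUDA_VISIBLE_DEVICES")
    if value is None:
        return None
    # An empty string hides all GPUs, so it correctly yields 0.
    return len([d for d in value.split(",") if d.strip()])


print(visible_gpu_count({"CUDA_VISIBLE_DEVICES": "0,1"}))
```

If the function returns a number, that is a reasonable candidate for num_processes; if it returns None, count the GPUs on the machine by other means.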
In finetune_single_rank.sh:
- Set MODEL_PATH to the base CogVideoX model. Default is THUDM/CogVideoX-2b.
- Set CUDA_VISIBLE_DEVICES (default is 0).
- (For the OpenVid dataset) Set video_root_dir to the directory with the video files, and set csv_path.
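Before launching a long run, it can help to confirm that video_root_dir and csv_path agree with each other. The following is a minimal sketch, assuming the CSV has a header row and a column of file names relative to video_root_dir. The column name "video" and the helper missing_videos are assumptions for illustration, so adjust them to match your CSV.

```python
import csv
import tempfile
from pathlib import Path


def missing_videos(video_root_dir, csv_path, column="video"):
    """Return CSV entries whose video file is absent from video_root_dir.

    Assumption: the CSV has a header row and `column` holds file names
    relative to video_root_dir.
    """
    root = Path(video_root_dir)
    if not root.is_dir():
        raise NotADirectoryError(f"video_root_dir not found: {root}")
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        if column not in (reader.fieldnames or []):
            raise KeyError(f"column {column!r} not in {csv_path}")
        return [row[column] for row in reader if not (root / row[column]).is_file()]


# Demo on a throwaway directory: one listed video exists, one does not.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "videos"
    root.mkdir()
    (root / "a.mp4").touch()
    csv_file = Path(tmp) / "data.csv"
    csv_file.write_text("video,caption\na.mp4,first\nb.mp4,second\n", encoding="utf-8")
    missing = missing_videos(root, csv_file)

print(missing)
```

An empty result means every listed file was found; any names returned point to videos that are listed in the CSV but not present on disk.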
Run training
cd training
bash finetune_single_rank.sh
Original code and models: CogVideoX.
Issues should be raised directly in the repository. For professional support and recommendations please [email protected].