Allen Z. Ren1, Justin Lidard1, Lars L. Ankile2,3, Anthony Simeonov3
Pulkit Agrawal3, Anirudha Majumdar1, Benjamin Burchfiel4, Hongkai Dai4, Max Simchowitz3,5
1Princeton University, 2Harvard University, 3Massachusetts Institute of Technology
4Toyota Research Institute, 5Carnegie Mellon University
DPPO is an algorithmic framework and set of best practices for fine-tuning diffusion-based policies in continuous control and robot learning tasks.
- Clone the repository
git clone git@github.com:irom-lab/dppo.git
cd dppo
- Install core dependencies in a conda environment on a Linux machine with an NVIDIA GPU (if you do not plan to use Furniture-Bench, a higher Python version such as 3.10 can be used instead).
conda create -n dppo python=3.8 -y
conda activate dppo
pip install -e .
- Install specific environment dependencies (Gym / Robomimic / D3IL / Furniture-Bench) or all dependencies
pip install -e .[gym] # or [robomimic], [d3il], [furniture]
pip install -e .[all]
- Install MuJoCo for Gym and/or Robomimic. Install D3IL. Install IsaacGym and Furniture-Bench.
- Set environment variables for the data and logging directories (defaults are data/ and log/), and set the WandB entity (username or team name)
source script/set_path.sh
Note: You may skip pre-training if you would like to use the default checkpoint (available for download) for fine-tuning.
Pre-training data for all tasks are pre-processed and can be found here. The pre-training script will download the data (including normalization statistics) automatically to the data directory.
All the configs can be found under cfg/<env>/pretrain/. A new WandB project may be created based on wandb.project in the config file; set wandb=null in the command line to test without WandB logging.
# Gym - hopper/walker2d/halfcheetah
python script/train.py --config-name=pre_diffusion_mlp \
--config-dir=cfg/gym/pretrain/hopper-medium-v2
# Robomimic - lift/can/square/transport
python script/train.py --config-name=pre_diffusion_mlp \
--config-dir=cfg/robomimic/pretrain/can
# D3IL - avoid_m1/m2/m3
python script/train.py --config-name=pre_diffusion_mlp \
--config-dir=cfg/d3il/pretrain/avoid_m1
# Furniture-Bench - one_leg/lamp/round_table_low/med
python script/train.py --config-name=pre_diffusion_mlp \
--config-dir=cfg/furniture/pretrain/one_leg_low
See here for details of the experiments in the paper.
Pre-trained policies used in the paper can be found here. The fine-tuning script will download the default checkpoint automatically to the logging directory.
All the configs can be found under cfg/<env>/finetune/. A new WandB project may be created based on wandb.project in the config file; set wandb=null in the command line to test without WandB logging.
# Gym - hopper/walker2d/halfcheetah
python script/train.py --config-name=ft_ppo_diffusion_mlp \
--config-dir=cfg/gym/finetune/hopper-v2
# Robomimic - lift/can/square/transport
python script/train.py --config-name=ft_ppo_diffusion_mlp \
--config-dir=cfg/robomimic/finetune/can
# D3IL - avoid_m1/m2/m3
python script/train.py --config-name=ft_ppo_diffusion_mlp \
--config-dir=cfg/d3il/finetune/avoid_m1
# Furniture-Bench - one_leg/lamp/round_table_low/med
python script/train.py --config-name=ft_ppo_diffusion_mlp \
--config-dir=cfg/furniture/finetune/one_leg_low
Note: In Gym, Robomimic, and D3IL tasks, we run 40, 50, and 50 parallelized MuJoCo environments on CPU, respectively. If you would like to use fewer environments (given limited CPU threads, or GPU memory for rendering), you can reduce env.n_envs and increase train.n_steps so that the total number of environment steps collected in each iteration (n_envs x n_steps x act_steps) remains roughly the same. Try to set train.n_steps to a multiple of env.max_episode_steps / act_steps, and be aware that only episodes finished within an iteration are counted for eval. Furniture-Bench tasks run IsaacGym on a single GPU.
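The rescaling described in the note above can be sketched as a small helper. This is an illustrative calculation only, not part of the repository: the function name rescale_n_steps and the example numbers are hypothetical, and it assumes env.max_episode_steps is divisible by act_steps.

```python
def rescale_n_steps(n_envs_old, n_steps_old, n_envs_new, max_episode_steps, act_steps):
    """Suggest train.n_steps for a smaller env.n_envs (hypothetical helper).

    Keeps n_envs x n_steps x act_steps at least as large as before, and rounds
    n_steps up to a multiple of max_episode_steps / act_steps.
    """
    # Environment steps collected per iteration in the original configuration.
    total_env_steps = n_envs_old * n_steps_old * act_steps
    # Policy steps needed to finish one full episode (assumes exact division).
    steps_per_episode = max_episode_steps // act_steps
    # Environment steps collected when every new env runs one full episode.
    env_steps_per_block = n_envs_new * act_steps * steps_per_episode
    # Integer ceiling division: number of full-episode blocks needed.
    n_blocks = -(-total_env_steps // env_steps_per_block)
    return n_blocks * steps_per_episode


# Hypothetical numbers: 40 envs x 500 steps x 4 action steps, reduced to 10 envs.
suggested = rescale_n_steps(40, 500, 10, max_episode_steps=1000, act_steps=4)
print(suggested)
```

Rounding up rather than down keeps the per-iteration data budget from shrinking, at the cost of slightly longer iterations when the numbers do not divide evenly.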
To fine-tune your own pre-trained policy instead, override base_policy_path to your own checkpoint, which is saved under checkpoint/ of the pre-training directory. You can set base_policy_path=<path> in the command line when launching fine-tuning.
See here for details of the experiments in the paper.
- Furniture-Bench tasks can be visualized in the GUI by specifying env.specific.headless=False and env.n_envs=1 in fine-tuning configs.
- D3IL environments can be visualized in the GUI by specifying +env.render=True, env.n_envs=1, and train.render.num=1. There is a basic script at script/test_d3il_render.py.
- Videos of trials in Robomimic tasks can be recorded by specifying env.save_video=True, train.render.freq=<iterations>, and train.render.num=<num_video> in fine-tuning configs.
Our diffusion implementation is mostly based on Diffuser and is located at model/diffusion/diffusion.py and model/diffusion/diffusion_vpg.py. PPO specifics are implemented in model/diffusion/diffusion_ppo.py. The main training script is at agent/finetune/train_ppo_diffusion_agent.py and follows CleanRL.
- denoising_steps: number of denoising steps (should always be the same for pre-training and fine-tuning regardless of the fine-tuning scheme)
- ft_denoising_steps: number of fine-tuned denoising steps
- horizon_steps: predicted action chunk size (should be the same as act_steps, the executed action chunk size, with MLP; can be different with UNet, e.g., horizon_steps=16 and act_steps=8)
- model.gamma_denoising: denoising discount factor
- model.min_sampling_denoising_std: minimum amount of noise when sampling at a denoising step
- model.min_logprob_denoising_std: minimum standard deviation when evaluating likelihood at a denoising step
- model.clip_ploss_coef: PPO clipping ratio
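To make the roles of these hyperparameters more concrete, here is a scalar sketch of how they might enter the computation. It is not the code in model/diffusion/diffusion_ppo.py: the function names are hypothetical, the clipped surrogate is the standard PPO form, and the direction of the denoising discount (the final, least noisy step gets weight 1) is an assumption.

```python
import math


def denoising_discounts(ft_denoising_steps, gamma_denoising):
    # Assumed convention: step k (0 = noisiest fine-tuned step) is weighted by
    # gamma_denoising raised to the number of denoising steps still remaining.
    return [gamma_denoising ** (ft_denoising_steps - 1 - k)
            for k in range(ft_denoising_steps)]


def gaussian_logprob(x, mean, std, min_logprob_denoising_std):
    # Floor the standard deviation before evaluating the likelihood, so that
    # very small noise levels do not produce extreme log-probabilities.
    std = max(std, min_logprob_denoising_std)
    return (-0.5 * ((x - mean) / std) ** 2
            - math.log(std) - 0.5 * math.log(2.0 * math.pi))


def clipped_ppo_loss(logprob_new, logprob_old, advantage, clip_ploss_coef):
    # Standard PPO clipped surrogate for a single sample.
    ratio = math.exp(logprob_new - logprob_old)
    clipped = min(max(ratio, 1.0 - clip_ploss_coef), 1.0 + clip_ploss_coef)
    return -min(ratio * advantage, clipped * advantage)


weights = denoising_discounts(ft_denoising_steps=3, gamma_denoising=0.5)
loss = clipped_ppo_loss(math.log(2.0), 0.0, advantage=1.0, clip_ploss_coef=0.2)
```

In this sketch a per-step advantage would be multiplied by the matching entry of weights before being passed to clipped_ppo_loss; min_sampling_denoising_std would play the analogous flooring role when drawing samples rather than evaluating likelihoods.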
To use DDIM fine-tuning, set denoising_steps=100 in pre-training and set model.use_ddim=True, model.ddim_steps to the desired number of total DDIM steps, and ft_denoising_steps to the desired number of fine-tuned DDIM steps. In our Furniture-Bench experiments we use denoising_steps=100, model.ddim_steps=5, and ft_denoising_steps=5.
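As a rough illustration of the DDIM settings above, the following hypothetical helper (ddim_overrides is not part of the repository) checks that the three step counts are mutually consistent and builds the key=value overrides that could be appended to the fine-tuning command. The ordering constraint ft_denoising_steps <= model.ddim_steps <= denoising_steps is an assumption about how the settings relate.

```python
def ddim_overrides(denoising_steps, ddim_steps, ft_denoising_steps):
    """Build command-line overrides for DDIM fine-tuning (hypothetical helper)."""
    # Assumed constraint: fine-tuned steps are a subset of the DDIM steps,
    # which in turn subsample the pre-training denoising steps.
    if not (0 < ft_denoising_steps <= ddim_steps <= denoising_steps):
        raise ValueError(
            "expected 0 < ft_denoising_steps <= ddim_steps <= denoising_steps"
        )
    return [
        "model.use_ddim=True",
        f"model.ddim_steps={ddim_steps}",
        f"ft_denoising_steps={ft_denoising_steps}",
    ]


# Values reported above for the Furniture-Bench experiments.
overrides = ddim_overrides(denoising_steps=100, ddim_steps=5, ft_denoising_steps=5)
print(" ".join(overrides))
```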
The pre-training script is at agent/pretrain/train_diffusion_agent.py. The pre-training dataset loader assumes an npz file containing numpy arrays states, actions, images (if using pixels), and traj_length, where states and actions have shape num_total_steps x obs_dim/act_dim, images has shape num_total_steps x C (concatenated if multiple images) x H x W, and traj_length is a 1-D array for indexing across num_total_steps.
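The layout described above can be illustrated with a toy state-only dataset. This is a sketch under assumptions: the sizes are made up, the key names follow the description in this section (check the dataset loader for the exact spelling it expects), and the file is written to an in-memory buffer rather than to the data directory.

```python
import io

import numpy as np

# Hypothetical toy sizes.
obs_dim, act_dim = 11, 3
traj_length = np.array([4, 6, 5])          # number of steps in each trajectory
num_total_steps = int(traj_length.sum())   # all trajectories are concatenated

rng = np.random.default_rng(0)
states = rng.standard_normal((num_total_steps, obs_dim)).astype(np.float32)
actions = rng.standard_normal((num_total_steps, act_dim)).astype(np.float32)

# Write the arrays as one npz archive (a file path would work the same way).
buffer = io.BytesIO()
np.savez(buffer, states=states, actions=actions, traj_length=traj_length)
buffer.seek(0)

# Read it back and recover the row range of each trajectory.
data = np.load(buffer)
ends = np.cumsum(data["traj_length"])
starts = ends - data["traj_length"]
second_traj_states = data["states"][starts[1]:ends[1]]
print(second_traj_states.shape)
```

Trajectory i occupies rows starts[i] to ends[i] (exclusive) of both states and actions, which is how a 1-D length array can index into the flattened step dimension.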
Note: The current implementation does not support loading history observations (only using observation at the current timestep). If needed, you can modify here.
We follow the Gym format for interacting with the environments. The vectorized environments are initialized at make_async (called in the parent fine-tuning agent class here). The current implementation is not the cleanest, as we tried to make it compatible with Gym, Robomimic, Furniture-Bench, and D3IL environments, but it should be easy to modify to support other environments. We use the multi_step wrapper for history observations (not used currently) and multi-environment-step action execution. We also use environment-specific wrappers such as robomimic_lowdim and furniture for observation/action normalization, etc. You can implement a new environment wrapper if needed.
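For readers adding a new environment, the idea of multi-environment-step action execution can be sketched as follows. This is not the repository's multi_step wrapper: ToyEnv and ActionChunkWrapper are hypothetical names, and the sketch assumes the older Gym convention in which step returns (obs, reward, done, info).

```python
class ToyEnv:
    """Hypothetical stand-in environment with Gym-style reset/step."""

    def __init__(self, max_episode_steps=5):
        self.max_episode_steps = max_episode_steps
        self.t = 0

    def reset(self):
        self.t = 0
        return float(self.t)

    def step(self, action):
        self.t += 1
        obs = float(self.t)
        reward = float(action)  # toy reward: just echo the action
        done = self.t >= self.max_episode_steps
        return obs, reward, done, {}


class ActionChunkWrapper:
    """Execute the first act_steps actions of a chunk, summing the rewards."""

    def __init__(self, env, act_steps):
        self.env = env
        self.act_steps = act_steps

    def reset(self):
        return self.env.reset()

    def step(self, action_chunk):
        total_reward = 0.0
        obs, done, info = None, False, {}
        for action in action_chunk[: self.act_steps]:
            obs, reward, done, info = self.env.step(action)
            total_reward += reward
            if done:  # stop early if the episode ends mid-chunk
                break
        return obs, total_reward, done, info


env = ActionChunkWrapper(ToyEnv(max_episode_steps=5), act_steps=4)
env.reset()
first = env.step([1, 1, 1, 1])   # executes four environment steps
second = env.step([1, 1, 1, 1])  # episode ends after one more step
```

From the policy's point of view, one call to step consumes a whole action chunk, which is why the per-iteration environment step count scales with act_steps.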
- IsaacGym simulation can become unstable at times and lead to NaN observations in Furniture-Bench. The current env wrapper does not handle NaN observations.
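One possible guard against the issue above, which the current wrapper does not implement, is to check each observation for NaN before it is used. The helper below is a hypothetical pure-Python sketch for nested dicts and lists of floats; with array or tensor observations, the equivalent check would use the array library's own NaN test.

```python
import math


def has_nan(observation):
    """Return True if any float inside a nested dict/list observation is NaN."""
    if isinstance(observation, dict):
        return any(has_nan(value) for value in observation.values())
    if isinstance(observation, (list, tuple)):
        return any(has_nan(value) for value in observation)
    return isinstance(observation, float) and math.isnan(observation)


# Hypothetical observation layout, for illustration only.
bad_obs = {"robot_state": [0.1, float("nan")], "parts_poses": [0.0, 1.0]}
good_obs = {"robot_state": [0.1, 0.2], "parts_poses": [0.0, 1.0]}
print(has_nan(bad_obs), has_nan(good_obs))
```

A caller could, for example, reset the affected environment or discard the episode when the check fires; which response is appropriate depends on the training setup.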
This repository is released under the MIT license. See LICENSE.
- Diffuser, Janner et al.: general code base and DDPM implementation
- Diffusion Policy, Chi et al.: general code base especially the env wrappers
- CleanRL, Huang et al.: PPO implementation
- IBRL, Hu et al.: ViT implementation
- D3IL, Jia et al.: D3IL benchmark
- Robomimic, Mandlekar et al.: Robomimic benchmark
- Furniture-Bench, Heo et al.: Furniture-Bench benchmark
- AWR, Peng et al.: DAWR baseline (modified from AWR)
- DIPO, Yang et al.: DIPO baseline
- IDQL, Hansen-Estruch et al.: IDQL baseline
- DQL, Wang et al.: DQL baseline
- QSM, Psenka et al.: QSM baseline
- Score SDE, Song et al.: diffusion exact likelihood