A Multiplicative Value Function for Safe and Efficient Reinforcement Learning
Nick Bührer, Zhejun Zhang, Alexander Liniger, Fisher Yu and Luc Van Gool
We propose a safe model-free RL algorithm with a novel multiplicative value function consisting of a safety critic and a reward critic. The safety critic predicts the probability of constraint violation and discounts the reward critic, which estimates only constraint-free returns. By splitting responsibilities between the two critics, we simplify the learning task and thereby increase sample efficiency.
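As a rough illustration of this idea (the function and critic names below are hypothetical and not taken from this codebase), the multiplicative combination can be sketched as:

```python
import torch

def multiplicative_value(reward_critic, safety_critic, obs):
    """Sketch: combine a reward critic and a safety critic multiplicatively.

    Assumptions (not taken from this repo): reward_critic(obs) returns an
    estimate of the constraint-free return V_r(s); safety_critic(obs) returns
    a logit for the probability of a constraint violation in state s.
    """
    v_reward = reward_critic(obs)                    # constraint-free return estimate
    p_violation = torch.sigmoid(safety_critic(obs))  # predicted probability of constraint violation
    return (1.0 - p_violation) * v_reward            # return discounted by the safety probability
```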
@inproceedings{buehrer2023saferl,
title = {A Multiplicative Value Function for Safe and Efficient Reinforcement Learning},
author = {B{\"u}hrer, Nick and Zhang, Zhejun and Liniger, Alexander and Yu, Fisher and Van Gool, Luc},
booktitle = {International Conference on Intelligent Robots and Systems (IROS)},
year = {2023}
}
Create the conda environment by running
conda env create -f conda_env.yaml
Note that our implementation is built on stable-baselines3 1.2.0 (as pinned in the YAML file) and might not work with newer versions.
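For reference, a conda environment file of this kind typically looks roughly like the following; this is only a sketch, and apart from the stable-baselines3 pin mentioned above, the names and versions here are assumptions. The repo's conda_env.yaml is authoritative.

```yaml
# Hypothetical sketch only; the actual conda_env.yaml in this repo is authoritative.
name: safe-rl            # assumed environment name
dependencies:
  - python=3.8           # assumed Python version
  - pip
  - pip:
      - stable-baselines3==1.2.0   # version pinned as noted above
      - hydra-core                  # used for experiment configuration
```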
All experiments can be launched from main.py. For experiment configuration, we use Hydra. The following environments are currently supported:
- Lunar Lander Safe
- Car Racing Safe
- Point Robot Navigation
To run PPO Mult V1 in Lunar Lander Safe, simply execute:
python main.py +lunar_lander=ppo_mult_v1
To run the Lagrangian baseline PPO Lagrange in Car Racing Safe, execute:
python main.py +car_racing=ppo_lagrange
All experiment configs can be found under the experiments folder. For the Lunar Lander example, the config is located at experiments/lunar_lander/ppo_mult_v1.yaml.
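Since Hydra is used for configuration, individual values can also be overridden from the command line. For example (the seed key is only an illustration and may not exist under that name in these configs):

```
python main.py +lunar_lander=ppo_mult_v1 seed=123
```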