GROOVE is the official implementation of the following publications:
- Discovering General Reinforcement Learning Algorithms with Adversarial Environment Design, NeurIPS 2023 [ArXiv | NeurIPS | Twitter]
- Learned Policy Gradient (LPG),
- Prioritized Level Replay (PLR),
- General RL Algorithms Obtained Via Environment Design (GROOVE),
- Grid-World environment from the LPG paper.
- Discovering Temporally-Aware Reinforcement Learning Algorithms, ICLR 2024 [ArXiv]
- Temporally-Aware LPG (TA-LPG),
- Evolutionary Strategies (ES) with antithetic task sampling.
All scripts are JIT-compiled end-to-end and make extensive use of JAX-based parallelization, enabling meta-training in under 3 hours on a single GPU!
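As a rough illustration of that pattern (not the repo's actual training loop), the inner agent updates are vectorised with `jax.vmap` and the whole meta-step is wrapped in a single `jax.jit`; all names below are illustrative stand-ins:

```python
import jax
import jax.numpy as jnp

def inner_update(agent_params, rng, meta_params):
    # Illustrative stand-in for one agent update computed with the meta-learned rule;
    # the real inner loop rolls the agent out in its environment.
    pseudo_grad = jax.random.normal(rng, agent_params.shape)
    return agent_params - jnp.mean(meta_params) * pseudo_grad

@jax.jit  # the entire meta-step compiles to one XLA program
def meta_step(meta_params, agent_batch, rngs):
    # vmap runs every agent's update in parallel on the accelerator
    return jax.vmap(inner_update, in_axes=(0, 0, None))(agent_batch, rngs, meta_params)

# Example: 512 toy agents, each with 64 parameters, updated in one vectorised call
agents = jnp.zeros((512, 64))
rngs = jax.random.split(jax.random.PRNGKey(0), 512)
new_agents = meta_step(jnp.ones(8), agents, rngs)
```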
Update (April 2023): Misreported LPG ES hyperparameters in repo + paper, specifically initial learning rate (`1e-4` -> `1e-2`) and sigma (`3e-3` -> `1e-1`). Now updated.
Setup | Running experiments | Citation
All requirements are found in `setup/`, with `requirements-base.txt` containing the majority of packages, `requirements-cpu.txt` containing CPU packages, and `requirements-gpu.txt` containing GPU packages.
Some key packages include:
- RL Environments: `gymnax`
- Neural Networks: `flax`
- Optimization: `optax`, `evosax`
- Logging: `wandb`
- Install requirements:
  `pip install $(cat setup/requirements-base.txt setup/requirements-cpu.txt)`
- Build the Docker image:
  `cd setup/docker && ./build_gpu.sh && cd ../..`
- (To enable WandB logging) Add your account key to `setup/wandb_key`:
  `echo [KEY] > setup/wandb_key`
Meta-training is executed with `python3.8 train.py`, with all arguments found in `experiments/parse_args.py`.
| Argument | Description |
|---|---|
| `--env_mode [env_mode]` | Sets the environment mode (see the table below). |
| `--num_agents [agents]` | Sets the meta-training batch size. |
| `--num_mini_batches [mini_batches]` | Computes each update in sequential mini-batches, in order to execute large batches with little memory (sketched below). RECOMMENDED: lower this to the smallest value that fits in memory. |
| `--debug` | Disables JIT compilation. |
| `--log --wandb_entity [entity] --wandb_project [project]` | Enables logging to WandB. |
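For intuition, the mini-batching behind `--num_mini_batches` can be sketched as follows: the agent batch is split into chunks, each chunk's meta-gradient is computed sequentially, and the results are averaged, so peak memory scales with the chunk size rather than the full batch. This is a minimal sketch assuming a generic `meta_loss_fn`, not the repo's implementation:

```python
import jax
import jax.numpy as jnp

def meta_grad_accumulate(meta_params, agent_batch, num_mini_batches, meta_loss_fn):
    """Average meta-gradients over sequential mini-batches to bound peak memory.

    `meta_loss_fn(meta_params, agents)` is an illustrative stand-in for the
    scalar meta-objective evaluated on a chunk of agents.
    """
    # Split the leading (agent) axis into `num_mini_batches` sequential chunks.
    chunks = jax.tree_util.tree_map(
        lambda x: x.reshape(num_mini_batches, -1, *x.shape[1:]), agent_batch
    )

    def step(summed_grads, chunk):
        grads = jax.grad(meta_loss_fn)(meta_params, chunk)
        return jax.tree_util.tree_map(jnp.add, summed_grads, grads), None

    zero = jax.tree_util.tree_map(jnp.zeros_like, meta_params)
    summed, _ = jax.lax.scan(step, zero, chunks)  # chunks processed one at a time
    return jax.tree_util.tree_map(lambda g: g / num_mini_batches, summed)
```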
| Environment mode | Description | Lifetime (# of updates) |
|---|---|---|
| `tabular` | Five tabular levels from LPG | Variable |
| `mazes` | Maze levels from MiniMax | 2500 |
| `all_shortlife` | Uniformly sampled levels | 250 |
| `all_vrandlife` | Uniformly sampled levels | 10-250 (log-sampled) |
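For reference, the log-sampled lifetime of `all_vrandlife` corresponds to sampling uniformly in log-space over [10, 250] and exponentiating; a minimal sketch (exact bounds and rounding in the repo may differ):

```python
import jax
import jax.numpy as jnp

def sample_lifetime(rng, low=10, high=250):
    # Log-uniform sampling: uniform in log-space, then exponentiated,
    # so short lifetimes are drawn as often per octave as long ones.
    u = jax.random.uniform(rng, minval=jnp.log(low), maxval=jnp.log(high))
    return jnp.round(jnp.exp(u)).astype(jnp.int32)

# Example: one lifetime per agent in the meta-training batch
lifetimes = jax.vmap(sample_lifetime)(jax.random.split(jax.random.PRNGKey(0), 512))
```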
| Experiment | Command | Example run (WandB) |
|---|---|---|
| LPG (meta-gradient) | `python3.8 train.py --num_agents 512 --num_mini_batches 16 --train_steps 5000 --log --wandb_entity [entity] --wandb_project [project]` | Link |
| GROOVE | LPG with `--score_function alg_regret` (algorithmic regret is computed every step due to end-to-end compilation, so currently very inefficient) | TBC |
| TA-LPG | LPG with `--num_mini_batches 8 --train_steps 2500 --use_es --lifetime_conditioning --lpg_learning_rate 0.01 --env_mode all_vrandlife` | TBC |
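For context on the GROOVE row: algorithmic regret scores a level by the gap between a strong reference agent's return and the LPG-trained agent's return on that level, and high-regret levels are prioritised for meta-training. The sketch below illustrates the idea only; the softmax prioritisation, function names, and normalisation are assumptions, not the repo's API:

```python
import jax
import jax.numpy as jnp

def algorithmic_regret(oracle_returns, lpg_returns):
    # Per-level gap between a strong reference agent and the LPG-trained agent.
    return oracle_returns - lpg_returns

def sample_prioritised_level(rng, scores, temperature=1.0):
    # UED-style prioritisation: levels with higher regret are replayed more often.
    probs = jax.nn.softmax(scores / temperature)
    return jax.random.choice(rng, scores.shape[0], p=probs)

# Example: prioritise among three levels by their regret scores
rng = jax.random.PRNGKey(0)
scores = algorithmic_regret(jnp.array([0.9, 0.4, 0.7]), jnp.array([0.2, 0.35, 0.1]))
level_idx = sample_prioritised_level(rng, scores)
```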
To execute CPU or GPU Docker containers, run the relevant script (with the GPU index as the first argument for the GPU script):

`./run_gpu.sh [GPU id] python3.8 train.py [args]`
If you use this implementation in your work, please cite us with the following:
@inproceedings{jackson2023discovering,
author={Jackson, Matthew Thomas and Jiang, Minqi and Parker-Holder, Jack and Vuorio, Risto and Lu, Chris and Farquhar, Gregory and Whiteson, Shimon and Foerster, Jakob Nicolaus},
booktitle = {Advances in Neural Information Processing Systems},
title = {Discovering General Reinforcement Learning Algorithms with Adversarial Environment Design},
volume = {36},
year = {2023}
}
@inproceedings{jackson2024discovering,
author={Jackson, Matthew Thomas and Lu, Chris and Kirsch, Louis and Lange, Robert Tjarko and Whiteson, Shimon and Foerster, Jakob Nicolaus},
booktitle = {International Conference on Learning Representations},
title = {Discovering Temporally-Aware Reinforcement Learning Algorithms},
volume = {12},
year = {2024}
}
- Speed up GROOVE by removing recomputation of algorithmic regret every step.
- Meta-testing script for checkpointed models.
- Alternative UED metrics (PVL, MaxMC).