Scaling RLlib for generic simulation environments on Theta at the Argonne Leadership Computing Facility
Suraj Pawar, Sahil Bhola, & Romit Maulik
- TensorFlow 1.14.0
- Gym 0.17.1
- Ray 0.7.6, installed with `pip install ray[rllib]==0.7.6`
- NumPy 1.16.1
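
For reference, a `requirements.txt` consistent with these pins would look roughly as follows (the repository ships its own file; this is only a sketch):

```
tensorflow==1.14.0
gym==0.17.1
ray[rllib]==0.7.6
numpy==1.16.1
```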
The user can set up a separate environment for RLlib and all of its requirements as follows:
- Before creating your environment, the user needs to load some modules. Add the following lines to your `.bashrc` file:
  ```
  module load miniconda-3.6/conda-4.5.12
  module load intelpython36
  ```
  and then execute `source ~/.bashrc`.
- In the terminal, enter the following, where `yourenvname` is the name you want to give your environment and `x.x` is the Python version you wish to use (we have tested with Python 3.6.8):
  ```
  conda create -n yourenvname python=x.x
  ```
- Install all the requirements with
  ```
  pip install -r requirements.txt
  ```
- To use RLlib on Theta, the user needs to install mpi4py using the script provided in the `install_mpi` folder. Go inside the `install_mpi` folder and execute
  ```
  bash install_mpi4py.sh
  ```
  (a rough sketch of what this build step does is shown after this list).
- To activate or switch into your virtual environment, simply type the following, where `yourenvname` is the name you gave to your environment at creation:
  ```
  source activate yourenvname
  ```
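
For context, this mpi4py build boils down to compiling the bindings from source against the Cray MPI stack through the Cray compiler wrapper `cc`. The provided `install_mpi4py.sh` is the tested route; a hand-rolled equivalent would look roughly like the following sketch (the exact flags used by the provided script may differ):

```bash
# Sketch only: build mpi4py from source against Cray MPI via the cc wrapper.
# Prefer the provided install_mpi4py.sh for the tested procedure.
source activate yourenvname
env MPICC=cc pip install --no-cache-dir --no-binary=mpi4py mpi4py
```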
The idea is as follows:
- Start Ray on the head node and retrieve the `head_redis_address` by running
  ```
  ray start --head --num-cpus 1 --redis-port=10100
  ```
  Note that the `--num-cpus` argument may vary from machine to machine.
- Start workers on the other nodes by running
  ```
  ray start --num-cpus 1 --address={head_redis_address}
  ```
  where `head_redis_address` is obtained after the head node process is started. If this executes successfully, you are ready to run RLlib in a distributed fashion. Change the `--num-cpus` argument to use more ranks on one compute node.
- Within your Python script that executes RLlib you must have the statements
  ```python
  import argparse

  import ray

  # Read the address of the running Ray head node from the command line
  # and attach this driver process to the existing cluster.
  parser = argparse.ArgumentParser()
  parser.add_argument("--ray-address")
  args = parser.parse_args()
  ray.init(address=args.ray_address)
  ```
  which let you call the script (in our following MWE this will be `train_ppo.py`) as `python train_ppo.py --ray-address head_redis_address`. An important point is that this must be called on the head node alone; the RL workers will be distributed automatically (the beauty of Ray/RLlib). A sketch of a complete `train_ppo.py` is shown after the log below.
- All of this can be packaged quite effectively using `start_ray.py`, which uses `subprocess` to call `train_ppo.py` (a sketch of this pattern is also shown after the log below). For an example see here.
- This distributed RL runs without any trouble on my laptop with 4 workers and can be called by running
  ```
  mpirun -np 4 python start_ray.py
  ```
- Execute the same procedure on Theta by starting an interactive job with
  ```
  qsub -A datascience -t 60 -q debug-cache-quad -n 2 -I
  ```
  and launching with `aprun` as `aprun -n 4 -N 2 python start_ray.py`. The logs for starting Ray (in `start_ray.log`) show success:
```
05/01/2020 07:36:14 PM | Waiting for broadcast...
05/01/2020 07:36:14 PM | Waiting for broadcast...
05/01/2020 07:36:14 PM | Waiting for broadcast...
05/01/2020 07:36:14 PM | Ready to run ray head
05/01/2020 07:36:25 PM | Head started at: 10.128.15.16:6379
05/01/2020 07:36:25 PM | Ready to broadcast head_redis_address: 10.128.15.16:6379
05/01/2020 07:36:25 PM | Broadcast done... received head_redis_address= 10.128.15.16:6379
05/01/2020 07:36:25 PM | Broadcast done... received head_redis_address= 10.128.15.16:6379
05/01/2020 07:36:25 PM | Broadcast done...
05/01/2020 07:36:25 PM | Broadcast done... received head_redis_address= 10.128.15.16:6379
05/01/2020 07:36:25 PM | Waiting for workers to start...
05/01/2020 07:36:25 PM | Worker on rank 3 with ip 10.128.15.16 will connect to head-redis-address=10.128.15.16:6379
05/01/2020 07:36:25 PM | Worker on rank 2 with ip 10.128.15.16 will connect to head-redis-address=10.128.15.16:6379
05/01/2020 07:36:25 PM | Worker on rank 1 with ip 10.128.15.16 will connect to head-redis-address=10.128.15.16:6379
05/01/2020 07:36:32 PM | Worker on rank 3 with ip 10.128.15.16 is connected!
05/01/2020 07:36:32 PM | Worker on rank 2 with ip 10.128.15.16 is connected!
05/01/2020 07:36:33 PM | Worker on rank 1 with ip 10.128.15.16 is connected!
05/01/2020 07:36:33 PM | Workers are all running!
05/01/2020 07:36:33 PM | Ready to start driver!
```
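
For concreteness, a minimal `train_ppo.py` driver along the lines described above could look like the sketch below. The environment name (`CartPole-v0`), the `num_workers` setting, and the stopping criterion are illustrative assumptions rather than the repository's actual configuration.

```python
# train_ppo.py -- minimal sketch of a driver that attaches to the already
# running Ray cluster and trains PPO with RLlib; config values are illustrative.
import argparse

import ray
from ray import tune

parser = argparse.ArgumentParser()
parser.add_argument("--ray-address")
args = parser.parse_args()

# Attach to the cluster started with `ray start`, rather than launching a new one.
ray.init(address=args.ray_address)

# Tune schedules the PPO trainer; RLlib spreads its rollout workers over the cluster.
tune.run(
    "PPO",
    stop={"training_iteration": 10},
    config={
        "env": "CartPole-v0",  # illustrative environment
        "num_workers": 3,      # roughly one rollout worker per worker rank
    },
)
```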
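
The broadcast pattern visible in the log can be reproduced with a few lines of mpi4py plus `subprocess`. The sketch below is an assumption about how such a launcher can be structured, not the repository's actual `start_ray.py`; in particular, how the head node's IP address is determined may need to be adapted to Theta's network interfaces.

```python
# start_ray.py -- sketch of an MPI-driven launcher: rank 0 starts the Ray head,
# broadcasts its address, the other ranks join as workers, and rank 0 then runs
# the driver. Not the repository's actual script.
import socket
import subprocess
import sys

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    # Rank 0 becomes the Ray head node; 6379 is Ray's default Redis port.
    ip = socket.gethostbyname(socket.gethostname())  # may need the high-speed interface on Theta
    head_redis_address = "{}:6379".format(ip)
    subprocess.run("ray start --head --num-cpus 1 --redis-port=6379",
                   shell=True, check=True)
else:
    head_redis_address = None

# Every rank learns where the head node's Redis server lives.
head_redis_address = comm.bcast(head_redis_address, root=0)

if rank != 0:
    # The remaining ranks join the cluster as Ray worker nodes.
    subprocess.run("ray start --num-cpus 1 --address={}".format(head_redis_address),
                   shell=True, check=True)

# Wait until all workers have connected before starting the driver.
comm.barrier()

if rank == 0:
    # Only the head node runs the driver; RLlib fans the work out to the workers.
    subprocess.run([sys.executable, "train_ppo.py",
                    "--ray-address", head_redis_address], check=True)
```

With this structure, `mpirun -np 4 python start_ray.py` (or `aprun -n 4 -N 2 python start_ray.py` on Theta) yields one head node, three workers, and a single driver, which is what the log above reflects.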