[REQUESTING OVERVIEW OF DISTRIBUTED HANDYRL] #211
Hi @adypd97, thank you for your interest in HandyRL! First of all, after the training server is launched, you need to run the workers on the worker VMs. We illustrated the overview of the distributed architecture before, in the Google Research Football competition. I hope this helps you. Thanks
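To make the order concrete: the server must be up before the workers try to connect. Roughly like this (a sketch; please double-check the exact commands against the large-scale training document in the repo):

```bash
# On the learner VM: start the training server first.
python main.py --train-server

# On each worker VM, after putting the learner's address in config.yaml:
python main.py --worker
```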
Hi @ikki407! Thanks for the link to the documentation! Very helpful! To the main issue: yes, I ran 2 worker VMs following the steps you mention (also, I entered the public IP of the server VM (the learner) for both workers in the `config.yaml`). As further evidence, I added a simple print statement to the trainer's `run` method:
```python
def run(self):
    print('waiting training')
    while not self.shutdown_flag:
        if len(self.episodes) < self.args['minimum_episodes']:
            print('here')  # <-- the debug print I added
            time.sleep(1)
            continue
        if self.steps == 0:
            self.batcher.run()
            print('started training')
        model = self.train()
        self.report_update(model, self.steps)
    print('finished training')
```

And in the output I get the following:
I hope you find this helpful in assisting me. In any case, thanks once again!
From your outputs, it seems that the server is not connecting to the workers. Next steps to debug...
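As a first generic check (not a HandyRL tool, just a plain TCP probe), you can verify from a worker VM that the learner's address and port are reachable at all. If this fails, the problem is likely a GCP firewall rule or the IP you entered, rather than HandyRL itself. The address and port below are placeholders; use whatever your server actually listens on:

```python
# connectivity_check.py -- run on a worker VM (a hypothetical helper, not part of HandyRL)
import socket
import sys

SERVER_IP = '203.0.113.10'  # placeholder: the learner VM's external IP
SERVER_PORT = 9998          # placeholder: the port the training server listens on

try:
    # Try to open a plain TCP connection with a short timeout.
    with socket.create_connection((SERVER_IP, SERVER_PORT), timeout=5):
        print(f'OK: reached {SERVER_IP}:{SERVER_PORT}')
except OSError as e:
    print(f'FAILED: could not reach {SERVER_IP}:{SERVER_PORT} ({e})')
    sys.exit(1)
```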
What does the worker process/VM output look like? If the workers are still running without any errors, there may be a problem I haven't seen before.
Hello HandyRL Team!
First off, thanks for making such a useful repository for RL! I love it!
I am trying to understand how the distributed architecture of HandyRL works, but due to the lack of documentation so far, it's been difficult to understand how it's implemented.
I'll give an example (following the Large Scale Training document in the repo):
I have 3 VMs running on GCP (1 as the server (the learner) and 2 others as workers). In the `config.yaml` file I entered the external IP of the learner (the document says it's valid to enter the external IP too) in the worker args parameter for both workers (as per the instructions in the document) and tried to run it. However, I don't see anything happen. In the following output the server appears to continue to sleep and does nothing.

OUTPUT:
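For concreteness, the relevant part of my `config.yaml` looks roughly like this (IP replaced by a placeholder; the key names are as I understood them from the document, so treat them as approximate):

```yaml
worker_args:
    server_address: '203.0.113.10'  # placeholder: the learner VM's external IP
    num_parallel: 6                 # number of worker processes; value is illustrative
```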
I was hoping you could provide some guidance on how I can proceed. In any case, documentation or a brief but complete overview of the distributed architecture would also be appreciated, so I could debug the problem on my own.
Thank you!