couple a3c questions / recommendations for generalizing beyond Atari #76
Fixed the NaN issue for the most part. I had to change the second-to-last activation function from relu to tanh; apparently feeding a relu straight into a softmax can produce NaNs in large networks. I can also confirm the findings of the paper: using a different exploration policy for each agent significantly helps stability and learning. My algorithm was getting stuck on a local solution before, but now one of the agents follows a pure uniform random policy, another is purely greedy, and the rest are greedy X% of the time (X varies by agent) and otherwise use the probabilities spit out by the network to pick actions. This has really helped break through the plateau it was hitting before. I'm also building a couple of additional custom policies that follow a set of rules, and a couple more that focus on specific subsets of my larger action space (I have 12 threads to play with). I'm hoping that will drive things along a bit more quickly as well. Your implementation makes this all really easy to do within the get_action function, to which I can pass environment variables from the loop that calls it.
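A minimal sketch of how that per-agent split could be wired into get_action; the `agent_index` and `greedy_prob` attributes here are illustrative placeholders, not names from the actual code:

```python
import numpy as np

def get_action(self, state):
    # actor network's softmax output over the action space
    policy = self.actor.predict(np.reshape(state, [1, self.state_size]))[0]

    if self.agent_index == 0:
        # one agent: pure uniform random exploration
        return np.random.choice(self.action_size)
    if self.agent_index == 1:
        # one agent: purely greedy with respect to the network
        return int(np.argmax(policy))
    # remaining agents: greedy X% of the time, otherwise sample from the
    # network's probabilities (X varies per agent via greedy_prob)
    if np.random.rand() < self.greedy_prob:
        return int(np.argmax(policy))
    return np.random.choice(self.action_size, 1, p=policy)[0]
```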
Thanks for the feedback! This really helps a lot. I'll summon the other guys to answer your question :)
Thanks. I've been struggling with my own project for months now... this has definitely helped a ton, but I'm not quite there yet. As I tweak things, I'm still running into stability issues with my network occasionally returning NaNs. This can happen 5 hours into training, which is extremely aggravating. I'm tweaking your code to break out of training when that happens and to save a copy of the weight files every N episodes so I don't lose too much progress (I'm expecting this will take 1-2 weeks to train in the end). One additional question while I'm on the subject is about the actor and critic optimizers. The paper says the optimizer (they used RMSProp) worked best when the "elementwise squared gradients" were shared across threads. It still works without doing that, just not nearly as well in large networks, if I'm understanding the results correctly. I wonder if there's a way to do that here? Also, I can't find any good articles or discussion about how the learning rates of the actor and the critic should compare, or, if I want to use gradient clipping, how I might find a good threshold for each. This process of iterative tweaking is taking far too long in my case because it takes 10 minutes for my first set of episodes to complete.
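As an aside, the NaN break-out and periodic weight backup described above might look roughly like this; the `after_update` hook name, `save_every`, the backup paths, and the actor/critic attributes are placeholders for illustration, not part of the original code:

```python
import numpy as np

def after_update(self, episode):
    # bail out as soon as the actor weights go NaN instead of training blind
    if any(np.isnan(w).any() for w in self.actor.get_weights()):
        raise RuntimeError("actor weights became NaN at episode %d" % episode)

    # keep a rolling backup every N episodes so a late NaN doesn't cost days
    if episode % self.save_every == 0:
        self.actor.save_weights("./backup/actor_%d.h5" % episode)
        self.critic.save_weights("./backup/critic_%d.h5" % episode)
```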
First, thanks for making this. It's very easy to get started with and has really helped me move things forward on a personal project of mine I've been struggling with for months. This is really awesome work. Thanks again.
In my efforts to tweak the code from your A3C CartPole implementation to work with my own custom OpenAI Gym environment, I've discovered a few things that I think could help it generalize a bit more. For example, I changed get_action to accept an action filter:
```python
def get_action(self, state, actionfilter):
    # actor network outputs a probability distribution over the action space
    policy = self.actor.predict(np.reshape(state, [1, self.state_size]))[0]
    # zero out actions the environment says are currently unavailable
    policy = np.multiply(policy, actionfilter)
    # renormalize so the remaining probabilities sum to 1
    probs = policy / np.sum(policy)
    action = np.random.choice(self.action_size, 1, p=probs)[0]
    return action
```
where actionfilter is provided by a custom function that comes with the environment. It would be easy enough to apply it only when it's passed, or to default it to a vector of ones the same size as the action space, e.g. as sketched below.
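Just a sketch of that default-to-ones idea, not tested against the repo:

```python
import numpy as np

def get_action(self, state, actionfilter=None):
    # default to a vector of ones so every action stays available
    if actionfilter is None:
        actionfilter = np.ones(self.action_size)
    policy = self.actor.predict(np.reshape(state, [1, self.state_size]))[0]
    # mask out unavailable actions and renormalize
    policy = np.multiply(policy, actionfilter)
    probs = policy / np.sum(policy)
    return np.random.choice(self.action_size, 1, p=probs)[0]
```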
Anything higher than that can cause an actor to return NaNs, and then the whole thing falls apart; anything lower and it just inches along at a glacial pace. I wish I could take it up a bit, but the NaNs are killing me. I tried gradient clipping, but it's really hard to find a good threshold to use. Anyway, implementing different exploration policies should be pretty easy to do... it might be worth checking out. I suppose it would also be possible to randomly pick a more abstract exploration type during initialization: have one agent that's purely greedy, another that's epsilon-greedy with some random epsilon, and maybe a couple of other policy types thrown in for kicks. I'm going to test this out this week to see if it has any effect. I can report back if you're interested.
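On the gradient clipping experiments, one possible starting point (assuming standard Keras optimizers, which accept clipnorm / clipvalue directly) is to give the actor and critic separate learning rates and clip thresholds; the numbers below are arbitrary placeholders, not recommendations:

```python
from keras.optimizers import RMSprop

# placeholder values -- suitable learning rates and clip thresholds depend
# on the network and have to be found empirically
actor_optimizer = RMSprop(lr=2.5e-4, rho=0.99, epsilon=0.01, clipnorm=1.0)
critic_optimizer = RMSprop(lr=1e-3, rho=0.99, epsilon=0.01, clipnorm=1.0)
```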