Training with DQN('MlpPolicy') on Intersection Environment #586
Indeed, this reward plot is not very good; it's surprising that the reward does not improve at all throughout training. It is true that the MlpPolicy is not best suited for this task and I had better results with a Transformer model, but in our paper we could still see some progress when training with an MlpPolicy and KinematicsObservation (see Figure 4 of the paper, where the total reward increases from 2.1 to 3.8). So I'm not sure what is going on exactly.

Maybe it would be worth investigating with simpler domains and progressively increasing the difficulty: e.g. remove all other vehicles at first; does the vehicle learn to always drive at maximum speed? Then add a single vehicle (always with the same initial position and velocity); does the vehicle learn to avoid it? If everything is fine so far, and learning only fails when scaling to the full scene with random vehicles at random positions/speeds, then it's probably a problem of representation / policy architecture. But if the algorithm struggles even in these simpler scenarios, there is probably something wrong in the environment definition or the learning algorithm.
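For concreteness, here is a rough sketch of that progressive-difficulty check, assuming SB3's DQN and the intersection env's "initial_vehicle_count" / "spawn_probability" config keys (names may vary across highway-env versions):

```python
# Rough sketch of the progressive-difficulty check, not the paper's setup.
# "initial_vehicle_count" and "spawn_probability" are assumed config keys of
# intersection-v0; fixing the single vehicle's exact position/speed would
# still require editing the environment code.
import gymnasium as gym
import highway_env  # noqa: F401  (registers intersection-v0)
from stable_baselines3 import DQN
from stable_baselines3.common.evaluation import evaluate_policy

stages = [
    {"initial_vehicle_count": 0, "spawn_probability": 0.0},  # empty intersection
    {"initial_vehicle_count": 1, "spawn_probability": 0.0},  # one other vehicle
    {},                                                      # full random scene
]

for stage in stages:
    env = gym.make("intersection-v0")
    env.unwrapped.configure(stage)
    model = DQN("MlpPolicy", env, verbose=0)
    model.learn(total_timesteps=10_000)
    mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=20)
    print(stage, round(mean_reward, 2), round(std_reward, 2))
```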
The config that I used is this one. So yes, I did include heading angles. I think they are relevant because they help to understand whether a vehicle is starting to turn, and in turn whether or not its path is going to cross yours. See Figure 7 in the paper, which shows a high sensitivity of the trained policy to the heading angle. Of course, part of this information is already included in the vx/vy velocity (except when it is close to 0), but it doesn't hurt to include it additionally. I also used absolute coordinates for the intersection env (but not for highway); is that your case too?
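For reference, roughly what such an observation config looks like (a sketch; the exact values are in the linked config file, and cos_h/sin_h encode the heading angle):

```python
# Sketch of a Kinematics observation including heading and absolute coordinates;
# values are illustrative, see the linked config for the ones actually used.
import gymnasium as gym
import highway_env  # registers intersection-v0

env = gym.make("intersection-v0")
env.unwrapped.configure({
    "observation": {
        "type": "Kinematics",
        "vehicles_count": 15,
        "features": ["presence", "x", "y", "vx", "vy", "cos_h", "sin_h"],
        "absolute": True,   # absolute coordinates for the intersection env
        "normalize": True,
    },
})
obs, info = env.reset()
print(obs.shape)  # (vehicles_count, number of features)
```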
You can take inspiration from this script, where I implemented the custom transformer policy to be used with PPO and the highway env; it can be ported to DQN and the intersection env. Alternatively, my original implementation of DQN + Transformer/MLP (the one used in the paper) is available in this colab.
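If you want to stay within stable-baselines3, a minimal and simplified sketch of an attention-based features extractor plugged into DQN could look like this (the layer sizes and the single ego-attention block are assumptions, not the paper's architecture):

```python
# Simplified sketch of an ego-attention features extractor for SB3 DQN;
# not the paper's exact architecture, just the general idea.
import gymnasium as gym
import highway_env  # registers intersection-v0
import torch
import torch.nn as nn
from stable_baselines3 import DQN
from stable_baselines3.common.torch_layers import BaseFeaturesExtractor


class AttentionExtractor(BaseFeaturesExtractor):
    """Embed each observed vehicle, then let the ego row attend to all vehicles."""

    def __init__(self, observation_space: gym.spaces.Box, embed_dim: int = 64):
        super().__init__(observation_space, features_dim=embed_dim)
        n_features = observation_space.shape[-1]  # e.g. presence, x, y, vx, vy, ...
        self.embed = nn.Sequential(nn.Linear(n_features, embed_dim), nn.ReLU())
        self.attention = nn.MultiheadAttention(embed_dim, num_heads=2, batch_first=True)

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        x = self.embed(obs)               # (batch, n_vehicles, embed_dim)
        ego = x[:, 0:1, :]                # the ego vehicle is the first row
        out, _ = self.attention(ego, x, x)
        return out.squeeze(1)             # (batch, embed_dim)


env = gym.make("intersection-v0")
model = DQN("MlpPolicy", env,
            policy_kwargs=dict(features_extractor_class=AttentionExtractor),
            verbose=1)
model.learn(total_timesteps=20_000)
```

The main benefit over a flat MLP is that the attention block is permutation-invariant with respect to the other vehicles, which is the property that mattered most in the paper.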
Thanks for your comprehensive and kind response, Edouard!
One problem is that I can't remove the other vehicles: even when I change "initial_vehicles_count" to 1 or 0, there are still other cars! One trick was to set their speed to zero, but in the end I need to control their number. Is there any other part of the code that I should change? I think it may also be related to "spawn_probability".
Yes, I did that too.
And I think there is a mistake in the testing part of this notebook: instead of evaluation.train(), it should be evaluation.test(), and in "evaluation = Evaluation(env, agent, num_episodes=20, training=False, recover=True, display_agent=False)" we should set "recover=True" to load the latest model.
Setting spawn_probability to 0 should help, yes, but if it's not enough you should just edit intersection_env.py and comment out the content of
You're right! I wonder how I missed that... I'll fix it, thanks for the feedback.
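The corrected testing cell would then look roughly like this (a sketch assuming the rl-agents Evaluation API and import path as quoted above; adapt it to the notebook's setup):

```python
# Sketch of the corrected testing cell; the import path and argument names are
# assumed from rl-agents as quoted in the comment above.
from rl_agents.trainer.evaluation import Evaluation

evaluation = Evaluation(env, agent, num_episodes=20,
                        training=False,      # evaluation mode, no training updates
                        recover=True,        # reload the latest saved model
                        display_agent=False)
evaluation.test()                            # was evaluation.train() in the notebook
```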
@eleurent Hi Edouard,
I've trained my model in both the Highway and Intersection environments, with identical hyperparameters, using DQN (MlpPolicy) for both. The problem is that in Highway the agent learns to avoid collisions after 2000-3000 steps, but in Intersection it does not learn anything useful even after 8000 steps.
The reward function for Highway is { r_collision = -1, r_speed = 0.4 }, and since I'm considering only longitudinal actions, it is not rewarded for lane changes.
The reward function for Intersection is { r_collision = -1, r_speed = 0.4, r_arrived = 0.8}.
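For concreteness, this shaping amounts to roughly the following (an illustrative sketch, not highway-env's exact normalization or weighting):

```python
# Illustrative sketch of the reward shaping described above; highway-env's
# actual implementation normalizes and combines these terms differently.
def intersection_reward(crashed: bool, speed: float, max_speed: float, arrived: bool) -> float:
    r_collision, r_speed, r_arrived = -1.0, 0.4, 0.8
    reward = r_speed * (speed / max_speed)  # reward for driving fast
    if crashed:
        reward += r_collision               # collision penalty
    if arrived:
        reward += r_arrived                 # bonus for crossing the intersection
    return reward
```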
Observations and rewards are normalized in both of them, and "features": ['presence', 'x', 'y', 'vx', 'vy'] is used for both. Do you think adding the heading angle could be more important for the intersection?
As you can see, the Highway env converges, but Intersection does not. I guess one potential problem could be using MlpPolicy in Intersection, but do you have any recommendation? In your paper you used transformers, but I don't know how to implement that. Is there a simpler solution?
And also, do you have any better suggestion for shaping my reward function in Intersection?
The purple plot is for Intersection.