
Own POMDP model #1

friedsela opened this issue Dec 12, 2018 · 10 comments

@friedsela

friedsela commented Dec 12, 2018

Very nice project. Thanks!
I wanted to try it on my own POMDP model, but I get errors. It seems that just changing the file name doesn't work. I've been spending some time trying to find the problem...

@namoshizun
Owner

Thanks @friedsela !
Could you give a few more details about the model you created and the error messages you are getting? I should be able to look into the issue later this weekend.

@friedsela
Author

friedsela commented Dec 14, 2018

For example, taking the file 1d.POMDP from http://www.pomdp.org/examples/ gives the error

'mab_bv1': UtilityFunction.mab_bv1(min(self.model.costs), C),

TypeError: 'NoneType' object is not iterable

If I add the line
costs: 0 0
to the file
then it gives the error

File "mtrand.pyx", line 1146, in mtrand.RandomState.choice
ValueError: probabilities do not sum to 1
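
For what it's worth, that second error is numpy complaining that a probability vector it is asked to sample from is not normalized; a tiny standalone example (nothing to do with your code) produces the same message:

```python
import numpy as np

# np.random.choice validates `p` and raises
# "ValueError: probabilities do not sum to 1" when it is unnormalized.
states = ['s0', 's1']
probs = [0.5, 0.3]                   # sums to 0.8
np.random.choice(states, p=probs)    # raises ValueError
```

So I guess the parser ends up with an unnormalized (or partly empty) probability table for this file.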

@namoshizun
Owner

namoshizun commented Dec 18, 2018

Hi @friedsela, I've just pushed a batch of minor fixes addressing some issues in the POMDP parser and the utility-function creation. But that didn't really solve the issue you encountered with the 1d.POMDP file.

Basically, the current implementation does not yet allow querying the action reward given (s_i, s_j, o); see line 89 in model.py. The only supported form is (s_i), i.e., reward conditioned only on the current state, which was the only case studied during the development of this package.
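
To illustrate (a rough sketch with made-up names, not the actual code in model.py):

```python
# Supported today: reward indexed by (action, current state) only, i.e.
#   R: <action> : <current-state> : * : *  <value>
rewards = {('listen', 'tiger-left'): -1.0,
           ('open-left', 'tiger-left'): -100.0}

def get_reward(action, s_i):
    return rewards.get((action, s_i), 0.0)

# Not supported yet: the full spec
#   R: <action> : <s_i> : <s_j> : <o>  <value>
# which would need a lookup like get_reward(action, s_i, s_j, o)
# plus matching support in the solvers.
```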

@friedsela
Author

So, if I understand correctly, if in my POMDP the reward depends only on s_i and a_i, then it should work?
Also, as far as I understand, this is only a problem of the parser, right? For the solvers it doesn't matter?

@namoshizun
Owner

> So, if I understand correctly, if in my POMDP the reward depends only on s_i and a_i, then it should work?
> Also, as far as I understand, this is only a problem of the parser, right? For the solvers it doesn't matter?

Yes, you are correct. The only supported POMDP reward specification is of the form:
R: my-action : current-state : * : * 100

I can't remember exactly whether the parser already supports the full (s_i, s_j, o) spec... maybe it already does and I just chose not to handle it in the solver...
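
If your file does use the full spec, one possible (untested) workaround is to pre-average the reward over next states and observations before handing it to a solver, since the expected immediate reward then depends only on the action and the current state:

```python
# Untested sketch with made-up names: collapse R(a, s, s', o) into the
# expected immediate reward R(a, s) using the model's own probabilities.
#   T[a][s][s2] = P(s2 | s, a)
#   Z[a][s2][o] = P(o | s2, a)
#   R_full[a][s][s2][o] = reward for that full combination
def expected_reward(R_full, T, Z, a, s):
    return sum(
        T[a][s][s2] * sum(Z[a][s2][o] * R_full[a][s][s2][o]
                          for o in Z[a][s2])
        for s2 in T[a][s]
    )
```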

@friedsela
Author

friedsela commented Dec 20, 2018

Ok, great! I actually have my own implementation of POMCP, but as I need to solve many models, I wanted to see whether there is a faster implementation than my own. I also wanted to see whether point-based value iteration is faster than POMCP. In your experience, which one is faster?

@namoshizun
Owner

namoshizun commented Dec 20, 2018

Hmmm, I don't really think they are directly comparable, as they solve different kinds of POMDP problems. POMCP is designed for online planning in POMDPs, whereas PBVI is for offline solving. It also depends on the model complexity: POMCP was designed for approximately solving very large POMDPs, so the model complexity should be made explicit if a comparison is a must... I haven't tried comparing their speed.

@friedsela
Author

friedsela commented Dec 22, 2018

Thanks for your reply.

Fortunately, in my model the rewards depend only on the action and the current state, so I'm now able to run your code. Could you please explain what is actually happening and what the output means?
For example, I don't understand:

Why do you call 'pomdp.solve(T)' in each iteration? Don't you solve the POMDP once and then use its policy?

Why are the simulation numbers jumping? What do they mean?

What does '30 games played. Toal reward = 10.0' mean? That my POMDP was solved, 30 games were played, and the mean is 10? If the horizon is 20 and 30 games were played, what happened then?

What I am looking for is to run a solver and then get a policy for which I can calculate the mean reward (by simulating many games). Is this possible with your code?

@namoshizun
Owner

namoshizun commented Dec 24, 2018

> Thanks for your reply.
>
> Fortunately, in my model the rewards depend only on the action and the current state, so I'm now able to run your code. Could you please explain what is actually happening and what the output means?
> For example, I don't understand:

> Why do you call 'pomdp.solve(T)' in each iteration? Don't you solve the POMDP once and then use its policy?

For an offline POMDP problem, you only need to solve the POMDP once to get a (maybe) optimal policy, then use that policy to decide all actions. But in an online setting (which is POMCP's case of application), you need to plan as you go, because the agent doesn't have access to the full POMDP specification --- kinda like walking in a maze where you can only see what's nearby...
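
Roughly, the two loops look like this (pseudo-Python with made-up names, not the exact code in this repo):

```python
# Offline (e.g. PBVI): solve once, then just execute the policy.
policy = solver.solve(model)                 # expensive, done once
for t in range(horizon):
    action = policy.best_action(belief)
    observation, reward = env.step(action)
    belief = model.update_belief(belief, action, observation)

# Online (e.g. POMCP): re-plan from the current belief at every step.
for t in range(horizon):
    action = solver.plan(belief, budget)     # search repeated each step
    observation, reward = env.step(action)
    belief = model.update_belief(belief, action, observation)
```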

> Why are the simulation numbers jumping? What do they mean?

They don't have to jump around. You could use a fixed number of MCTS simulations per step, but I just decided to use a fixed total simulation time instead, lol.
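
i.e. conceptually the difference between these two (illustrative names only):

```python
import time

# Option A: fixed number of MCTS simulations per planning step.
for _ in range(num_simulations):
    simulate_once(root_belief)

# Option B (what I went with): fixed wall-clock budget per planning step,
# so the printed simulation count varies from step to step.
deadline = time.time() + budget_seconds
while time.time() < deadline:
    simulate_once(root_belief)
```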

> What does '30 games played. Toal reward = 10.0' mean? That my POMDP was solved, 30 games were played, and the mean is 10? If the horizon is 20 and 30 games were played, what happened then?

That means it has completed the POMDP cycle 30 times: planning => action => observation => belief update.

> What I am looking for is to run a solver and then get a policy for which I can calculate the mean reward (by simulating many games). Is this possible with your code?

I reckon that is currently only possible with PBVI...
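
Something along these lines would do it once you have a policy from PBVI (just a sketch; `policy`, `env` and `model` here are placeholders, not objects this repo currently exposes):

```python
def mean_total_reward(policy, env, model, n_games=1000, horizon=20):
    """Monte Carlo estimate of a fixed policy's mean total reward."""
    totals = []
    for _ in range(n_games):
        belief = model.initial_belief()
        env.reset()
        total = 0.0
        for _ in range(horizon):
            action = policy.best_action(belief)
            observation, reward = env.step(action)
            total += reward
            belief = model.update_belief(belief, action, observation)
        totals.append(total)
    return sum(totals) / len(totals)
```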

@friedsela
Author

friedsela commented Dec 24, 2018

Thanks for the reply. As far as I understand, every online solver can be made offline by letting it run on a simulator until it reaches a (maybe) optimal policy, and this is how I use POMCP. I guess that's not how it was meant to be applied.

So suppose I want to use your PBVI solver.
If I set the horizon to, say, 20, and let max_play be 30, I don't see it resetting to begin a new game. I find it confusing how you use max_play. If we have a certain horizon H, then one play should be H steps. I think it is not the number of plays (in the sense of steps) that is interesting to count, as you do, but rather full games of H steps each.
Last question (for now): I set the horizon manually before 'for i in range(params.max_play):' because I don't see where you set it as a parameter. Shouldn't it be one of the 'parser.add_argument' calls in main?
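
Something like this is what I have in mind (assuming main already builds its arguments with argparse; the flag name is just my suggestion):

```python
# Hypothetical addition to the existing argument parser in main:
parser.add_argument('--horizon', type=int, default=20,
                    help='number of steps per game/episode')
```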

I hope it's fine that I'm asking so many questions and suggesting things; it's just that I think your project is very good and the community lacks good Python implementations of POMDP solvers...
