
Own POMDP model #1

friedsela opened this issue Dec 12, 2018 · 10 comments

@friedsela

friedsela commented Dec 12, 2018

Very nice project. Thanks!
I wanted to try it on my own POMDP model, but I get errors. It seems that just changing the file name doesn't work. I've been spending some time trying to find the problem...

@namoshizun
Owner

Thanks @friedsela !
Could you give a few more details about the model you created and the error messages you are getting? I should be able to look into the issue later this weekend.

@friedsela
Author

friedsela commented Dec 14, 2018

For example, taking the file 1d.POMDP from http://www.pomdp.org/examples/ gives the error

'mab_bv1': UtilityFunction.mab_bv1(min(self.model.costs), C),

TypeError: 'NoneType' object is not iterable

If I add the line
costs: 0 0
to the file
then it gives the error

File "mtrand.pyx", line 1146, in mtrand.RandomState.choice
ValueError: probabilities do not sum to 1
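
For what it's worth, that second error is numpy complaining that a probability vector it is asked to sample from is not normalized; a tiny standalone example (nothing to do with your code) produces the same message:

```python
import numpy as np

# np.random.choice validates `p` and raises
# "ValueError: probabilities do not sum to 1" when it is unnormalized.
states = ['s0', 's1']
probs = [0.5, 0.3]                   # sums to 0.8
np.random.choice(states, p=probs)    # raises ValueError
```

So I guess the parser ends up with an unnormalized (or partly empty) probability table for this file.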

@namoshizun
Owner

namoshizun commented Dec 18, 2018

Hi @friedsela, I've just pushed a batch of minor fixes addressing some issues in the POMDP parser and the utility-function creation. But that didn't really solve the issue you encountered with the 1d.POMDP file.

Basically, the current implementation does not yet allow querying the action reward given (s_i, s_j, o); see line 89 in model.py. The only supported form is (s_i), i.e., reward conditioned only on the current state, which was the only case studied during the development of this package.
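
To illustrate (a rough sketch with made-up names, not the actual code in model.py):

```python
# Supported today: reward indexed by (action, current state) only, i.e.
#   R: <action> : <current-state> : * : *  <value>
rewards = {('listen', 'tiger-left'): -1.0,
           ('open-left', 'tiger-left'): -100.0}

def get_reward(action, s_i):
    return rewards.get((action, s_i), 0.0)

# Not supported yet: the full spec
#   R: <action> : <s_i> : <s_j> : <o>  <value>
# which would need a lookup like get_reward(action, s_i, s_j, o)
# plus matching support in the solvers.
```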

@friedsela
Author

So, if I understand correctly, if in my POMDP the reward depends only on s_i and a_i, then it should work?
Also, as far as I understand, this is only a problem of the parser, right? For the solvers it doesn't matter?

@namoshizun
Owner

> So, if I understand correctly, if in my POMDP the reward depends only on s_i and a_i, then it should work?
> Also, as far as I understand, this is only a problem of the parser, right? For the solvers it doesn't matter?

Yes, you are correct. The only supported POMDP reward specification is of the form:
R: my-action : current-state : * : * 100

I can't remember exactly whether the parser already supports the full (s_i, s_j, o) spec... maybe it already does and I just chose not to handle it in the solver...
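
If your file does use the full spec, one possible (untested) workaround is to pre-average the reward over next states and observations before handing it to a solver, since the expected immediate reward then depends only on the action and the current state:

```python
# Untested sketch with made-up names: collapse R(a, s, s', o) into the
# expected immediate reward R(a, s) using the model's own probabilities.
#   T[a][s][s2] = P(s2 | s, a)
#   Z[a][s2][o] = P(o | s2, a)
#   R_full[a][s][s2][o] = reward for that full combination
def expected_reward(R_full, T, Z, a, s):
    return sum(
        T[a][s][s2] * sum(Z[a][s2][o] * R_full[a][s][s2][o]
                          for o in Z[a][s2])
        for s2 in T[a][s]
    )
```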

@friedsela
Author

friedsela commented Dec 20, 2018

Ok, great! I actually have my own implementation of POMCP, but as I need to solve many models, I wanted to see whether there is a faster implementation than my own. I also wanted to see whether point-based value iteration is faster than POMCP. In your experience, which one is faster?

@namoshizun
Owner

namoshizun commented Dec 20, 2018

Hmmm, I don't really think they are directly comparable, as they solve different kinds of POMDP problems. POMCP is designed for online planning in POMDPs, whereas PBVI is for offline solving. It also depends on the model complexity: POMCP was designed for approximately solving very large POMDPs, so the model complexity should be made explicit if a comparison is a must... I haven't tried comparing their speed.

@friedsela
Author

friedsela commented Dec 22, 2018

Thanks for your reply.

Fortunately, in my model the rewards depend only on the action and the current state, so I'm now able to run your code. Could you please explain what is actually happening and what the output means?
For example, I don't understand:

Why do you call 'pomdp.solve(T)' in each iteration? Don't you solve the POMDP once and then use its policy?

Why are the simulation numbers jumping? What do they mean?

What does '30 games played. Toal reward = 10.0' mean? That my POMDP was solved, 30 games were played, and the mean is 10? If the horizon is 20 and 30 games were played, what happened then?

What I am looking for is to run a solver and then get a policy for which I can calculate the mean reward (by simulating many games). Is this possible with your code?

@namoshizun
Owner

namoshizun commented Dec 24, 2018

> Thanks for your reply.
>
> Fortunately, in my model the rewards depend only on the action and the current state, so I'm now able to run your code. Could you please explain what is actually happening and what the output means?
> For example, I don't understand:

> Why do you call 'pomdp.solve(T)' in each iteration? Don't you solve the POMDP once and then use its policy?

For an offline POMDP problem, you only need to solve the POMDP once to get a (maybe) optimal policy, then use that policy to decide all actions. But in an online setting (which is POMCP's case of application), you need to plan as you go, because the agent doesn't have access to the full POMDP specification --- kinda like walking in a maze where you can only see what's nearby...
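
Roughly, the two loops look like this (pseudo-Python with made-up names, not the exact code in this repo):

```python
# Offline (e.g. PBVI): solve once, then just execute the policy.
policy = solver.solve(model)                 # expensive, done once
for t in range(horizon):
    action = policy.best_action(belief)
    observation, reward = env.step(action)
    belief = model.update_belief(belief, action, observation)

# Online (e.g. POMCP): re-plan from the current belief at every step.
for t in range(horizon):
    action = solver.plan(belief, budget)     # search repeated each step
    observation, reward = env.step(action)
    belief = model.update_belief(belief, action, observation)
```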

> Why are the simulation numbers jumping? What do they mean?

They don't have to jump around. You could use a fixed number of MCTS simulations per step, but I just decided to use a fixed total simulation time instead, lol.
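
i.e. conceptually the difference between these two (illustrative names only):

```python
import time

# Option A: fixed number of MCTS simulations per planning step.
for _ in range(num_simulations):
    simulate_once(root_belief)

# Option B (what I went with): fixed wall-clock budget per planning step,
# so the printed simulation count varies from step to step.
deadline = time.time() + budget_seconds
while time.time() < deadline:
    simulate_once(root_belief)
```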

> What does '30 games played. Toal reward = 10.0' mean? That my POMDP was solved, 30 games were played, and the mean is 10? If the horizon is 20 and 30 games were played, what happened then?

That means it has completed the POMDP cycle 30 times: planning => action => observation => belief update.

> What I am looking for is to run a solver and then get a policy for which I can calculate the mean reward (by simulating many games). Is this possible with your code?

I reckon that is currently only possible with PBVI...
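
Something along these lines would do it once you have a policy from PBVI (just a sketch; `policy`, `env` and `model` here are placeholders, not objects this repo currently exposes):

```python
def mean_total_reward(policy, env, model, n_games=1000, horizon=20):
    """Monte Carlo estimate of a fixed policy's mean total reward."""
    totals = []
    for _ in range(n_games):
        belief = model.initial_belief()
        env.reset()
        total = 0.0
        for _ in range(horizon):
            action = policy.best_action(belief)
            observation, reward = env.step(action)
            total += reward
            belief = model.update_belief(belief, action, observation)
        totals.append(total)
    return sum(totals) / len(totals)
```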

@friedsela
Author

friedsela commented Dec 24, 2018

Thanks for the reply. As far as I understand, every online solver can be made offline by letting it run on a simulator until it reaches a (maybe) optimal policy, and this is how I use POMCP. I guess that's not how it was meant to be applied.

So suppose I want to use your PBVI solver.
If I set the horizon to, say, 20, and let max_play be 30, I don't see it resetting to begin a new game. I find it confusing how you use max_play. If we have a certain horizon H, then one play should be H steps. I think it is not the number of plays (in the sense of steps) that is interesting to count, as you do, but rather full games of H steps each.
Last question (for now): I set the horizon manually before 'for i in range(params.max_play):' because I don't see where you set it as a parameter. Shouldn't it be one of the 'parser.add_argument' calls in main?
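
Something like this is what I have in mind (assuming main already builds its arguments with argparse; the flag name is just my suggestion):

```python
# Hypothetical addition to the existing argument parser in main:
parser.add_argument('--horizon', type=int, default=20,
                    help='number of steps per game/episode')
```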

I hope it's fine that I'm asking so many questions and suggesting things; it's just that I think your project is very good and the community lacks good Python implementations of POMDP solvers...
