function POLICY-ITERATION(mdp) returns a policy
  inputs: mdp, an MDP with states S, actions A(s), transition model P(s′ | s, a)
  local variables: U, a vector of utilities for states in S, initially zero
                   π, a policy vector indexed by state, initially random
  repeat
      U ← POLICY-EVALUATION(π, U, mdp)
      unchanged? ← true
      for each state s in S do
          if max_{a ∈ A(s)} Σ_{s′} P(s′ | s, a) U[s′] > Σ_{s′} P(s′ | s, π[s]) U[s′] then do
              π[s] ← argmax_{a ∈ A(s)} Σ_{s′} P(s′ | s, a) U[s′]
              unchanged? ← false
  until unchanged?
  return π
Figure ?? The policy iteration algorithm for calculating an optimal policy.
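
The pseudocode above can be sketched in Python roughly as follows. This is a minimal illustration under several assumptions that the figure does not state: the MDP is passed as a nested dictionary `P[s][a][s2]` of transition probabilities plus a state-reward dictionary `R` and a discount factor `gamma`, and POLICY-EVALUATION is approximated by a fixed number of simplified Bellman sweeps rather than an exact linear solve. The names `expected_utility`, `policy_evaluation`, `sweeps`, `seed`, and the two-state example MDP are all hypothetical.

```python
import random


def expected_utility(P, U, s, a):
    """Return sum over s2 of P(s2 | s, a) * U[s2]."""
    return sum(p * U[s2] for s2, p in P[s][a].items())


def policy_evaluation(pi, U, P, R, gamma, sweeps=50):
    """Approximate the utilities of the fixed policy pi.

    Assumption: a fixed number of simplified Bellman updates stands in
    for POLICY-EVALUATION; an exact linear solve would also work.
    """
    for _ in range(sweeps):
        U = {s: R[s] + gamma * expected_utility(P, U, s, pi[s]) for s in P}
    return U


def policy_iteration(P, R, gamma=0.9, seed=0):
    """Alternate evaluation and greedy improvement until pi is stable."""
    rng = random.Random(seed)
    U = {s: 0.0 for s in P}                          # utilities, initially zero
    pi = {s: rng.choice(sorted(P[s])) for s in P}    # policy, initially random
    while True:
        U = policy_evaluation(pi, U, P, R, gamma)
        unchanged = True
        for s in P:
            best = max(sorted(P[s]), key=lambda a: expected_utility(P, U, s, a))
            # Small tolerance so floating-point noise does not count as improvement.
            if expected_utility(P, U, s, best) > expected_utility(P, U, s, pi[s]) + 1e-12:
                pi[s] = best
                unchanged = False
        if unchanged:
            return pi


# Hypothetical two-state MDP: being in "high" pays 1 per step, "low" pays 0.
P = {
    "low": {"stay": {"low": 1.0}, "move": {"high": 0.8, "low": 0.2}},
    "high": {"stay": {"high": 1.0}, "move": {"low": 1.0}},
}
R = {"low": 0.0, "high": 1.0}

print(policy_iteration(P, R))
```

On this toy MDP the sketch should settle on moving out of `low` and staying in `high`, whatever the random initial policy. Because rewards here depend only on the state, comparing the expected utilities of the successor states is enough for the improvement test, which matches the comparison in the pseudocode.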