function Passive-ADP-Agent(percept) returns an action
   inputs: percept, a percept indicating the current state s' and reward signal r'
   persistent: π, a fixed policy
               mdp, an MDP with model P, rewards R, discount γ
               U, a table of utilities, initially empty
               Nsa, a table of frequencies for state-action pairs, initially zero
               Ns'|sa, a table of outcome frequencies given state-action pairs, initially zero
               s, a, the previous state and action, initially null
   if s' is new then U[s'] ← r'; R[s'] ← r'
   if s is not null then
       increment Nsa[s, a] and Ns'|sa[s', s, a]
       for each t such that Ns'|sa[t, s, a] is nonzero do
           P(t | s, a) ← Ns'|sa[t, s, a] / Nsa[s, a]
   U ← Policy-Evaluation(π, U, mdp)
   if s'.Terminal? then s, a ← null else s, a ← s', π[s']
   return a
Figure ?? A passive reinforcement learning agent based on adaptive dynamic programming. The Policy-Evaluation function solves the fixed-policy Bellman equations, as described on page ??.
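For concreteness, here is a minimal Python sketch of the agent above. It assumes a percept format of (state, reward) pairs; the class name, the terminals set, and the simplified policy_evaluation method are illustrative choices, not a canonical implementation. Policy evaluation here just iterates the fixed-policy Bellman update U(s) ← R(s) + γ Σt P(t | s, π(s)) U(t) a fixed number of times rather than solving the equations exactly.

from collections import defaultdict

class PassiveADPAgent:
    """Passive ADP agent: learns a transition model from observed
    state-action-outcome counts and evaluates a fixed policy pi."""

    def __init__(self, pi, gamma=0.9):
        self.pi = pi                     # fixed policy: state -> action
        self.gamma = gamma               # discount factor
        self.U = {}                      # utility estimates U[s]
        self.R = {}                      # observed rewards R[s]
        self.P = defaultdict(dict)       # learned model: P[(s, a)][t]
        self.N_sa = defaultdict(int)     # visit counts for (s, a)
        self.N_tsa = defaultdict(int)    # outcome counts for (t, s, a)
        self.s = None                    # previous state
        self.a = None                    # previous action
        self.terminals = set()           # states known to be terminal

    def __call__(self, percept):
        s1, r1 = percept                 # percept = (current state s', reward r')
        if s1 not in self.U:             # if s' is new: U[s'] <- r'; R[s'] <- r'
            self.U[s1] = r1
            self.R[s1] = r1
        if self.s is not None:
            # Update counts, then re-estimate the outgoing transition
            # probabilities for the previous (s, a) pair.
            self.N_sa[(self.s, self.a)] += 1
            self.N_tsa[(s1, self.s, self.a)] += 1
            n = self.N_sa[(self.s, self.a)]
            for (t, s, a), count in self.N_tsa.items():
                if (s, a) == (self.s, self.a):
                    self.P[(s, a)][t] = count / n
        self.policy_evaluation()
        if s1 in self.terminals:
            self.s = self.a = None       # episode over: forget previous step
        else:
            self.s, self.a = s1, self.pi[s1]
        return self.a

    def policy_evaluation(self, iterations=20):
        # Simplified stand-in for Policy-Evaluation: iterate the
        # fixed-policy Bellman update a fixed number of times.
        for _ in range(iterations):
            for s in self.U:
                if s in self.terminals:
                    self.U[s] = self.R[s]
                    continue
                trans = self.P.get((s, self.pi.get(s)), {})
                self.U[s] = self.R[s] + self.gamma * sum(
                    p * self.U.get(t, 0.0) for t, p in trans.items())

if __name__ == "__main__":
    # Hypothetical toy problem: state 'A' leads to terminal 'B' (+1 reward).
    pi = {'A': 'go', 'B': None}
    agent = PassiveADPAgent(pi, gamma=0.9)
    agent.terminals.add('B')
    for _ in range(20):                  # repeated trials of the same episode
        agent(('A', -0.04))
        agent(('B', 1.0))
    print(agent.U)                       # U['A'] -> -0.04 + 0.9 * 1.0 = 0.86

On this toy chain the learned model fixes P(B | A, go) = 1 after the first transition, so the utility estimate for 'A' converges to R(A) + γ·U(B) = 0.86 with γ = 0.9.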