public class PolicyIteration<S,A extends Action>
extends java.lang.Object

Type Parameters:
S - the state type.
A - the action type.
function POLICY-ITERATION(mdp) returns a policy
  inputs: mdp, an MDP with states S, actions A(s), transition model P(s' | s, a)
  local variables: U, a vector of utilities for states in S, initially zero
                   π, a policy vector indexed by state, initially random

  repeat
      U <- POLICY-EVALUATION(π, U, mdp)
      unchanged? <- true
      for each state s in S do
          if max_{a ∈ A(s)} Σ_{s'} P(s'|s,a) U[s'] > Σ_{s'} P(s'|s,π[s]) U[s'] then do
              π[s] <- argmax_{a ∈ A(s)} Σ_{s'} P(s'|s,a) U[s']
              unchanged? <- false
  until unchanged?
  return π
Figure 17.7 The policy iteration algorithm for calculating an optimal policy.
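The figure's pseudocode maps to Java fairly directly. Below is a minimal sketch of the loop, written against a hypothetical SimpleMdp interface rather than the aima-java MarkovDecisionProcess API documented on this page; the evaluation step is passed in as a function, mirroring how this class delegates it to a pluggable PolicyEvaluation.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.Set;
import java.util.function.BiFunction;

/** Minimal MDP view used by the sketch (hypothetical; not the aima-java API). */
interface SimpleMdp<S, A> {
    Set<S> states();
    Set<A> actions(S s);
    /** Transition probability P(s' | s, a). */
    double transitionProbability(S sPrime, S s, A a);
}

class PolicyIterationSketch {

    /** The outer loop of Figure 17.7; POLICY-EVALUATION is supplied by the caller. */
    static <S, A> Map<S, A> policyIteration(SimpleMdp<S, A> mdp,
            BiFunction<Map<S, A>, Map<S, Double>, Map<S, Double>> policyEvaluation) {
        Map<S, Double> U = new HashMap<>();
        for (S s : mdp.states()) U.put(s, 0.0);   // utilities initially zero
        Map<S, A> pi = initialPolicyVector(mdp);  // policy initially random
        boolean unchanged;
        do {
            U = policyEvaluation.apply(pi, U);    // U <- POLICY-EVALUATION(π, U, mdp)
            unchanged = true;
            for (S s : mdp.states()) {
                A best = pi.get(s);
                if (best == null) continue;       // terminal state: no actions
                double bestValue = expectedUtility(mdp, U, s, best);
                for (A a : mdp.actions(s)) {
                    double value = expectedUtility(mdp, U, s, a);
                    if (value > bestValue) {      // strictly better action found
                        bestValue = value;
                        best = a;
                        unchanged = false;
                    }
                }
                pi.put(s, best);                  // π[s] <- argmax_{a ∈ A(s)} ...
            }
        } while (!unchanged);
        return pi;
    }

    /** Σ_{s'} P(s'|s,a) U[s'] */
    static <S, A> double expectedUtility(SimpleMdp<S, A> mdp, Map<S, Double> U, S s, A a) {
        double sum = 0.0;
        for (S sPrime : mdp.states())
            sum += mdp.transitionProbability(sPrime, s, a) * U.get(sPrime);
        return sum;
    }

    /** A policy vector indexed by state, initially random. */
    static <S, A> Map<S, A> initialPolicyVector(SimpleMdp<S, A> mdp) {
        Random random = new Random();
        Map<S, A> pi = new HashMap<>();
        for (S s : mdp.states()) {
            List<A> actions = new ArrayList<>(mdp.actions(s));
            if (!actions.isEmpty())
                pi.put(s, actions.get(random.nextInt(actions.size())));
        }
        return pi;
    }
}
```

Note that the strict > comparison follows the pseudocode: when the current action ties with the best alternative, π[s] is left unchanged, which keeps the loop from oscillating between equally good policies.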
| Constructor and Description |
|---|
| PolicyIteration(PolicyEvaluation<S,A> policyEvaluation) Constructor. |
| Modifier and Type | Method and Description |
|---|---|
| static <S,A extends Action> java.util.Map<S,A> | initialPolicyVector(MarkovDecisionProcess<S,A> mdp) Create a policy vector indexed by state, initially random. |
| Policy<S,A> | policyIteration(MarkovDecisionProcess<S,A> mdp) The policy iteration algorithm for calculating an optimal policy. |
public PolicyIteration(PolicyEvaluation<S,A> policyEvaluation)
Constructor.
Parameters:
policyEvaluation - the policy evaluation function to use.

public Policy<S,A> policyIteration(MarkovDecisionProcess<S,A> mdp)
The policy iteration algorithm for calculating an optimal policy.
Parameters:
mdp - an MDP with states S, actions A(s), transition model P(s'|s,a)

public static <S,A extends Action> java.util.Map<S,A> initialPolicyVector(MarkovDecisionProcess<S,A> mdp)
Create a policy vector indexed by state, initially random.
Parameters:
mdp - an MDP with states S, actions A(s), transition model P(s'|s,a)
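For orientation, here is a hedged wiring sketch using only the members documented on this page. The package imports and the PolicyEvaluation constructor arguments (an evaluation sweep count and a discount factor) are assumptions about the surrounding aima-java library; verify them against the PolicyEvaluation page before use.

```java
import aima.core.agent.Action;
import aima.core.probability.mdp.MarkovDecisionProcess;
import aima.core.probability.mdp.Policy;
import aima.core.probability.mdp.search.PolicyEvaluation;
import aima.core.probability.mdp.search.PolicyIteration;

import java.util.Map;

class PolicyIterationUsage {
    // mdp is assumed to be a fully constructed problem instance.
    static <S, A extends Action> Policy<S, A> solve(MarkovDecisionProcess<S, A> mdp) {
        // Assumed constructor arguments: evaluation sweeps per round, discount factor.
        PolicyEvaluation<S, A> evaluation = new PolicyEvaluation<>(50, 1.0);
        PolicyIteration<S, A> pi = new PolicyIteration<>(evaluation);

        // Optional: inspect the random starting policy the algorithm begins from.
        Map<S, A> initial = PolicyIteration.initialPolicyVector(mdp);
        System.out.println("initial random policy covers " + initial.size() + " states");

        return pi.policyIteration(mdp); // documented to calculate an optimal policy
    }
}
```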