Type Parameters:
S - the state type.
A - the action type.

public class QLearningAgent<S,A extends Action> extends ReinforcementAgent<S,A>
function Q-LEARNING-AGENT(percept) returns an action
  inputs: percept, a percept indicating the current state s' and reward signal r'
  persistent: Q, a table of action values indexed by state and action, initially zero
              Nsa, a table of frequencies for state-action pairs, initially zero
              s, a, r, the previous state, action, and reward, initially null

  if TERMINAL?(s) then Q[s,None] <- r'
  if s is not null then
    increment Nsa[s,a]
    Q[s,a] <- Q[s,a] + α(Nsa[s,a])(r + γ max_a' Q[s',a'] - Q[s,a])
  s,a,r <- s', argmax_a' f(Q[s',a'], Nsa[s',a']), r'
  return a

Figure 21.8 An exploratory Q-learning agent. It is an active learner that learns the value Q(s,a) of each action in each situation. It uses the same exploration function f as the exploratory ADP agent, but avoids having to learn the transition model because the Q-value of a state can be related directly to those of its neighbors.
Note: the last assignment in the figure,

  s,a,r <- s', argmax_a' f(Q[s',a'], Nsa[s',a']), r'

should arguably be

  if s'.TERMINAL? then s,a,r <- null
  else s,a,r <- s', argmax_a' f(Q[s',a'], Nsa[s',a']), r'

Otherwise, at the beginning of a consecutive trial, s will be the prior terminal state and is what will be updated in Q[s,a], which appears not to be correct: no action was performed in the terminal state, and the initial state is not reachable from the prior terminal state. Comments welcome.
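To make the update and the suggested fix concrete, here is a minimal, self-contained Java sketch. Everything in it is hypothetical (the class name, the simple 1/n learning rate, the way action selection is delegated); it is not the library's implementation, only an illustration of the algorithm in Figure 21.8 with the terminal-state fix applied.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of the Figure 21.8 update, with the terminal-state fix. */
class QUpdateSketch<S, A> {
    private final Map<S, Map<A, Double>> Q = new HashMap<>();
    private final Map<S, Map<A, Integer>> Nsa = new HashMap<>();
    private final double gamma;                  // discount factor
    private S s; private A a; private Double r;  // previous state, action, reward

    QUpdateSketch(double gamma) { this.gamma = gamma; }

    /**
     * One learning step. chosenAction stands in for
     * argmax_a' f(Q[s',a'], Nsa[s',a']), computed elsewhere by the agent.
     */
    A step(S sPrime, double rPrime, boolean sPrimeIsTerminal, A chosenAction) {
        if (s != null) {
            int n = Nsa.computeIfAbsent(s, k -> new HashMap<>())
                       .merge(a, 1, Integer::sum);      // increment Nsa[s,a]
            Map<A, Double> qs = Q.computeIfAbsent(s, k -> new HashMap<>());
            double oldQ = qs.getOrDefault(a, 0.0);
            // When s' is terminal its value is just r' (the book stores this
            // as Q[s',None]); otherwise back up max_a' Q[s',a'].
            double future = sPrimeIsTerminal ? rPrime : maxQ(sPrime);
            qs.put(a, oldQ + alpha(n) * (r + gamma * future - oldQ));
        }
        if (sPrimeIsTerminal) {            // the fix from the note above:
            s = null; a = null; r = null;  // start the next trial fresh
            return null;
        }
        s = sPrime; a = chosenAction; r = rPrime;
        return a;
    }

    private double maxQ(S state) {
        return Q.getOrDefault(state, Map.of()).values().stream()
                .mapToDouble(Double::doubleValue).max().orElse(0.0);
    }

    private double alpha(int n) { return 1.0 / n; }  // a simple decaying rate
}
```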
| Constructor and Description |
|---|
| QLearningAgent(ActionsFunction<S,A> actionsFunction, A noneAction, double alpha, double gamma, int Ne, double Rplus) Constructor. |
| Modifier and Type | Method and Description |
|---|---|
| protected double | alpha(FrequencyCounter<Pair<S,A>> Nsa, S s, A a) AIMA3e pg. |
| A | execute(PerceptStateReward<S> percept) An exploratory Q-learning agent. |
| protected double | f(java.lang.Double u, int n) AIMA3e pg. |
| java.util.Map<S,java.lang.Double> | getUtility() Get a map of the currently calculated utilities for states of type S in the world. |
| void | reset() Reset the agent back to its initial state, before it has learned anything about its environment. |
Methods inherited from class ReinforcementAgent: execute
Methods inherited from the agent base class: isAlive, setAlive
public QLearningAgent(ActionsFunction<S,A> actionsFunction, A noneAction, double alpha, double gamma, int Ne, double Rplus)

Parameters:
actionsFunction - a function that lists the legal actions from a state.
noneAction - an action representing None, i.e. a NoOp.
alpha - a fixed learning rate.
gamma - the discount factor to be used.
Ne - a fixed parameter for use in the method f(u, n).
Rplus - R+, an optimistic estimate of the best possible reward obtainable in any state, which is used in the method f(u, n).
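A usage sketch for the constructor, assuming hypothetical application types MyState and MyAction (with MyAction extending Action) and an actionsFn implementing ActionsFunction<MyState,MyAction>; the parameter values are illustrative only.

```java
// Hypothetical: MyState, MyAction, and actionsFn are application-defined.
QLearningAgent<MyState, MyAction> agent = new QLearningAgent<>(
        actionsFn,        // legal actions from each state
        MyAction.NONE,    // the None/NoOp action
        0.2,              // alpha: fixed learning rate
        1.0,              // gamma: discount factor
        5,                // Ne: try each state-action pair at least Ne times
        2.0);             // Rplus: optimistic estimate of best possible reward
```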
public A execute(PerceptStateReward<S> percept)

Overrides:
execute in class ReinforcementAgent<S,A extends Action>

Parameters:
percept - a percept indicating the current state s' and reward signal r'.
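A sketch of how execute might drive trials; env and its methods (isTrialOver, currentPercept, executeAction) are hypothetical stand-ins for the application's environment, whose percepts are assumed to implement PerceptStateReward<MyState>.

```java
// Hypothetical training loop; env and its methods are stand-ins.
for (int trial = 0; trial < 1000; trial++) {
    env.reset();                                     // start a new trial
    while (!env.isTrialOver()) {
        PerceptStateReward<MyState> percept = env.currentPercept();
        MyAction action = agent.execute(percept);    // learn, then choose
        if (action != null) {
            env.executeAction(action);
        }
    }
}
System.out.println(agent.getUtility());              // inspect what was learned
```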
public void reset()

Description copied from class: ReinforcementAgent
Reset the agent back to its initial state, before it has learned anything about its environment.

Overrides:
reset in class ReinforcementAgent<S,A extends Action>
public java.util.Map<S,java.lang.Double> getUtility()

Description copied from class: ReinforcementAgent
Get a map of the currently calculated utilities for states of type S in the world.

Overrides:
getUtility in class ReinforcementAgent<S,A extends Action>
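A small inspection sketch, assuming the hypothetical MyState from above; the utilities are presumably derived from the learned Q-values via the standard relation U(s) = max_a Q(s,a).

```java
java.util.Map<MyState, Double> U = agent.getUtility();
U.forEach((state, u) -> System.out.printf("U(%s) = %.3f%n", state, u));
```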
protected double alpha(FrequencyCounter<Pair<S,A>> Nsa, S s, A a)

Parameters:
Nsa - a frequency counter of observed state-action pairs.
s - the current state.
a - the current action.
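Because alpha is a protected hook, a subclass can replace the fixed learning rate with a frequency-sensitive one. A hypothetical sketch, assuming FrequencyCounter exposes a getCount method as its use here suggests; the schedule 60/(59 + n) is the decaying rate AIMA3e uses for TD learning.

```java
// Hypothetical subclass: the learning rate decays as the state-action
// pair is visited more often, instead of staying fixed.
class DecayingAlphaQLearningAgent<S, A extends Action> extends QLearningAgent<S, A> {
    DecayingAlphaQLearningAgent(ActionsFunction<S, A> actionsFunction, A noneAction,
                                double gamma, int ne, double rPlus) {
        super(actionsFunction, noneAction, 1.0, gamma, ne, rPlus); // fixed rate unused
    }

    @Override
    protected double alpha(FrequencyCounter<Pair<S, A>> Nsa, S s, A a) {
        // 60/(59+n) decays toward 0 while its sum diverges, the usual
        // condition for convergence of the Q-value estimates.
        return 60.0 / (59.0 + Nsa.getCount(new Pair<>(s, a)));
    }
}
```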
protected double f(java.lang.Double u, int n)

Parameters:
u - the currently estimated utility.
n - the number of times this situation has been encountered.
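Given the documented roles of Ne and Rplus, f is presumably the exploration function of AIMA3e Section 21.3: stay optimistic until a state-action pair has been tried Ne times, then trust the estimate. A sketch of that rule (the library's actual body, e.g. its null handling, may differ):

```java
// Sketch of the exploration rule; Ne and Rplus are the constructor
// parameters of the same names (assumed stored by the agent).
protected double f(Double u, int n) {
    if (u == null || n < Ne) {
        return Rplus;  // optimism under uncertainty: assume the best reward
    }
    return u;          // enough experience: trust the learned estimate
}
```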