Type Parameters:
S - the state type.
A - the action type.

public class QLearningAgent<S,A extends Action> extends ReinforcementAgent<S,A>
function Q-LEARNING-AGENT(percept) returns an action
  inputs: percept, a percept indicating the current state s' and reward signal r'
  persistent: Q, a table of action values indexed by state and action, initially zero
              Nsa, a table of frequencies for state-action pairs, initially zero
              s, a, r, the previous state, action, and reward, initially null

  if TERMINAL?(s) then Q[s,None] <- r'
  if s is not null then
    increment Nsa[s,a]
    Q[s,a] <- Q[s,a] + α(Nsa[s,a])(r + γ max_a' Q[s',a'] - Q[s,a])
  s,a,r <- s', argmax_a' f(Q[s',a'], Nsa[s',a']), r'
  return a

Figure 21.8 An exploratory Q-learning agent. It is an active learner that learns the value Q(s,a) of each action in each situation. It uses the same exploration function f as the exploratory ADP agent, but avoids having to learn the transition model because the Q-value of a state can be related directly to those of its neighbors.
Note: the last assignment in the figure,

  s,a,r <- s', argmax_a' f(Q[s',a'], Nsa[s',a']), r'

should arguably be

  if s'.TERMINAL? then s,a,r <- null
  else s,a,r <- s', argmax_a' f(Q[s',a'], Nsa[s',a']), r'

Otherwise, at the beginning of a consecutive trial, s will be the prior terminal state and is what will be updated in Q[s,a], which appears not to be correct: no action was performed in the terminal state, and the initial state is not reachable from the prior terminal state. Comments welcome.
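To make the update and the suggested fix concrete, here is a minimal, self-contained Java sketch. Everything in it is hypothetical (the class name, the simple 1/n learning rate, the way action selection is delegated); it is not the library's implementation, only an illustration of the algorithm in Figure 21.8 with the terminal-state fix applied.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of the Figure 21.8 update, with the terminal-state fix. */
class QUpdateSketch<S, A> {
    private final Map<S, Map<A, Double>> Q = new HashMap<>();
    private final Map<S, Map<A, Integer>> Nsa = new HashMap<>();
    private final double gamma;                  // discount factor
    private S s; private A a; private Double r;  // previous state, action, reward

    QUpdateSketch(double gamma) { this.gamma = gamma; }

    /**
     * One learning step. chosenAction stands in for
     * argmax_a' f(Q[s',a'], Nsa[s',a']), computed elsewhere by the agent.
     */
    A step(S sPrime, double rPrime, boolean sPrimeIsTerminal, A chosenAction) {
        if (s != null) {
            int n = Nsa.computeIfAbsent(s, k -> new HashMap<>())
                       .merge(a, 1, Integer::sum);      // increment Nsa[s,a]
            Map<A, Double> qs = Q.computeIfAbsent(s, k -> new HashMap<>());
            double oldQ = qs.getOrDefault(a, 0.0);
            // When s' is terminal its value is just r' (the book stores this
            // as Q[s',None]); otherwise back up max_a' Q[s',a'].
            double future = sPrimeIsTerminal ? rPrime : maxQ(sPrime);
            qs.put(a, oldQ + alpha(n) * (r + gamma * future - oldQ));
        }
        if (sPrimeIsTerminal) {            // the fix from the note above:
            s = null; a = null; r = null;  // start the next trial fresh
            return null;
        }
        s = sPrime; a = chosenAction; r = rPrime;
        return a;
    }

    private double maxQ(S state) {
        return Q.getOrDefault(state, Map.of()).values().stream()
                .mapToDouble(Double::doubleValue).max().orElse(0.0);
    }

    private double alpha(int n) { return 1.0 / n; }  // a simple decaying rate
}
```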
| Constructor and Description |
|---|
| QLearningAgent(ActionsFunction<S,A> actionsFunction, A noneAction, double alpha, double gamma, int Ne, double Rplus) Constructor. |
| Modifier and Type | Method and Description |
|---|---|
| protected double | alpha(FrequencyCounter<Pair<S,A>> Nsa, S s, A a) AIMA3e pg. |
| A | execute(PerceptStateReward<S> percept) An exploratory Q-learning agent. |
| protected double | f(java.lang.Double u, int n) AIMA3e pg. |
| java.util.Map<S,java.lang.Double> | getUtility() Get a map of the currently calculated utilities for states of type S in the world. |
| void | reset() Reset the agent back to its initial state, before it has learned anything about its environment. |
Methods inherited from class ReinforcementAgent: execute
Methods inherited from the agent base class: isAlive, setAlive
public QLearningAgent(ActionsFunction<S,A> actionsFunction, A noneAction, double alpha, double gamma, int Ne, double Rplus)

Parameters:
actionsFunction - a function that lists the legal actions from a state.
noneAction - an action representing None, i.e. a NoOp.
alpha - a fixed learning rate.
gamma - the discount factor to be used.
Ne - a fixed parameter for use in the method f(u, n).
Rplus - R+, an optimistic estimate of the best possible reward obtainable in any state, which is used in the method f(u, n).
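A usage sketch for the constructor, assuming hypothetical application types MyState and MyAction (with MyAction extending Action) and an actionsFn implementing ActionsFunction<MyState,MyAction>; the parameter values are illustrative only.

```java
// Hypothetical: MyState, MyAction, and actionsFn are application-defined.
QLearningAgent<MyState, MyAction> agent = new QLearningAgent<>(
        actionsFn,        // legal actions from each state
        MyAction.NONE,    // the None/NoOp action
        0.2,              // alpha: fixed learning rate
        1.0,              // gamma: discount factor
        5,                // Ne: try each state-action pair at least Ne times
        2.0);             // Rplus: optimistic estimate of best possible reward
```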
public A execute(PerceptStateReward<S> percept)

Overrides:
execute in class ReinforcementAgent<S,A extends Action>

Parameters:
percept - a percept indicating the current state s' and reward signal r'.
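A sketch of how execute might drive trials; env and its methods (isTrialOver, currentPercept, executeAction) are hypothetical stand-ins for the application's environment, whose percepts are assumed to implement PerceptStateReward<MyState>.

```java
// Hypothetical training loop; env and its methods are stand-ins.
for (int trial = 0; trial < 1000; trial++) {
    env.reset();                                     // start a new trial
    while (!env.isTrialOver()) {
        PerceptStateReward<MyState> percept = env.currentPercept();
        MyAction action = agent.execute(percept);    // learn, then choose
        if (action != null) {
            env.executeAction(action);
        }
    }
}
System.out.println(agent.getUtility());              // inspect what was learned
```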
public void reset()

Description copied from class: ReinforcementAgent
Reset the agent back to its initial state, before it has learned anything about its environment.

Overrides:
reset in class ReinforcementAgent<S,A extends Action>
public java.util.Map<S,java.lang.Double> getUtility()

Description copied from class: ReinforcementAgent
Get a map of the currently calculated utilities for states of type S in the world.

Overrides:
getUtility in class ReinforcementAgent<S,A extends Action>
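A small inspection sketch, assuming the hypothetical MyState from above; the utilities are presumably derived from the learned Q-values via the standard relation U(s) = max_a Q(s,a).

```java
java.util.Map<MyState, Double> U = agent.getUtility();
U.forEach((state, u) -> System.out.printf("U(%s) = %.3f%n", state, u));
```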
protected double alpha(FrequencyCounter<Pair<S,A>> Nsa, S s, A a)

Parameters:
Nsa - a frequency counter of observed state-action pairs.
s - the current state.
a - the current action.
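Because alpha is a protected hook, a subclass can replace the fixed learning rate with a frequency-sensitive one. A hypothetical sketch, assuming FrequencyCounter exposes a getCount method as its use here suggests; the schedule 60/(59 + n) is the decaying rate AIMA3e uses for TD learning.

```java
// Hypothetical subclass: the learning rate decays as the state-action
// pair is visited more often, instead of staying fixed.
class DecayingAlphaQLearningAgent<S, A extends Action> extends QLearningAgent<S, A> {
    DecayingAlphaQLearningAgent(ActionsFunction<S, A> actionsFunction, A noneAction,
                                double gamma, int ne, double rPlus) {
        super(actionsFunction, noneAction, 1.0, gamma, ne, rPlus); // fixed rate unused
    }

    @Override
    protected double alpha(FrequencyCounter<Pair<S, A>> Nsa, S s, A a) {
        // 60/(59+n) decays toward 0 while its sum diverges, the usual
        // condition for convergence of the Q-value estimates.
        return 60.0 / (59.0 + Nsa.getCount(new Pair<>(s, a)));
    }
}
```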
protected double f(java.lang.Double u, int n)

Parameters:
u - the currently estimated utility.
n - the number of times this situation has been encountered.
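Given the documented roles of Ne and Rplus, f is presumably the exploration function of AIMA3e Section 21.3: stay optimistic until a state-action pair has been tried Ne times, then trust the estimate. A sketch of that rule (the library's actual body, e.g. its null handling, may differ):

```java
// Sketch of the exploration rule; Ne and Rplus are the constructor
// parameters of the same names (assumed stored by the agent).
protected double f(Double u, int n) {
    if (u == null || n < Ne) {
        return Rplus;  // optimism under uncertainty: assume the best reward
    }
    return u;          // enough experience: trust the learned estimate
}
```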