gptkbp:instanceOf
|
reinforcement learning algorithm
|
gptkbp:actionVariable
|
a
|
gptkbp:appliesTo
|
gptkb:game_AI
autonomous systems
robotics
|
gptkbp:canBe
|
tabular
function approximation
|
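A minimal sketch of the tabular form, assuming a small discrete state and action space; the defaultdict-based table and its zero initialization are illustrative choices, not prescribed by this entry:

```python
from collections import defaultdict

# Tabular SARSA keeps one estimate per (state, action) pair.
# Unvisited pairs default to 0.0 here; other initializations are possible.
Q = defaultdict(float)

def q_value(state, action):
    """Look up the current estimate Q(s, a)."""
    return Q[(state, action)]
```

With large or continuous state spaces the table is replaced by function approximation (e.g. a linear model or neural network that maps (s, a) to an estimated value).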
gptkbp:category
|
on-policy algorithm
|
gptkbp:citation
|
Rummery, G.A. & Niranjan, M. (1994). On-line Q-learning using connectionist systems. Technical Report CUED/F-INFENG/TR 166, Cambridge University Engineering Department.
|
gptkbp:convergesTo
|
optimal policy (under GLIE exploration and standard step-size conditions)
|
gptkbp:differential
|
SARSA is on-policy (it updates toward the action a' actually taken by the behaviour policy), whereas Q-learning is off-policy (it updates toward the greedy action)
|
gptkbp:distinctFrom
|
gptkb:Q-learning
|
gptkbp:explores
|
environment
|
gptkbp:firstDescribed
|
gptkb:Rummery_and_Niranjan
1994
|
gptkbp:fullName
|
gptkb:State-Action-Reward-State-Action
|
https://www.w3.org/2000/01/rdf-schema#label
|
SARSA
|
gptkbp:learns
|
action-value function
|
gptkbp:parameter
|
policy
discount factor
learning rate
exploration rate
|
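A minimal sketch of the numeric hyperparameters under their usual symbols; the default values shown are illustrative assumptions, not prescribed by this entry:

```python
from dataclasses import dataclass

@dataclass
class SarsaParams:
    alpha: float = 0.1     # learning rate
    gamma: float = 0.99    # discount factor
    epsilon: float = 0.1   # exploration rate of the epsilon-greedy policy
```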
gptkbp:policy
|
ε-greedy
|
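A minimal sketch of ε-greedy action selection over a tabular Q-table, assuming discrete actions; the function name and random tie-breaking among greedy actions are illustrative choices:

```python
import random

def epsilon_greedy(Q, state, actions, epsilon):
    """With probability epsilon explore uniformly; otherwise act greedily w.r.t. Q."""
    if random.random() < epsilon:
        return random.choice(actions)
    # Greedy case: pick an action with the highest estimated Q(s, a), ties broken randomly.
    best_value = max(Q[(state, a)] for a in actions)
    best_actions = [a for a in actions if Q[(state, a)] == best_value]
    return random.choice(best_actions)
```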
gptkbp:relatedTo
|
gptkb:Q-learning
|
gptkbp:rewardSignal
|
r
|
gptkbp:stateVariable
|
s
|
gptkbp:successor
|
s'
|
gptkbp:successorAction
|
a'
|
gptkbp:updated
|
Q-values
|
gptkbp:updateRule
|
Q(s,a) ← Q(s,a) + α [r + γ Q(s',a') − Q(s,a)]
|
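A minimal sketch of this update rule inside the on-policy control loop. The environment interface (`env.actions`, `env.reset() -> s`, `env.step(a) -> (s', r, done)`) and the hyperparameter defaults are assumptions for illustration only:

```python
import random
from collections import defaultdict

def sarsa(env, episodes, alpha=0.1, gamma=0.99, epsilon=0.1):
    """On-policy SARSA: Q(s,a) <- Q(s,a) + alpha * (r + gamma * Q(s',a') - Q(s,a)).

    Assumes (illustratively) that env exposes a list env.actions,
    env.reset() -> s, and env.step(a) -> (s_next, r, done).
    """
    Q = defaultdict(float)

    def epsilon_greedy(s):
        # Behaviour policy: explore with probability epsilon, otherwise act greedily.
        if random.random() < epsilon:
            return random.choice(env.actions)
        best = max(Q[(s, a)] for a in env.actions)
        return random.choice([a for a in env.actions if Q[(s, a)] == best])

    for _ in range(episodes):
        s = env.reset()
        a = epsilon_greedy(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = epsilon_greedy(s_next)   # a' is chosen by the same policy that acts
            target = r if done else r + gamma * Q[(s_next, a_next)]
            Q[(s, a)] += alpha * (target - Q[(s, a)])   # the SARSA update rule
            s, a = s_next, a_next
    return Q
```

The on-policy character shows up in the last two lines of the inner loop: the successor action a' used in the update target is the action that is actually executed next, rather than a greedy maximum as in Q-learning.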
gptkbp:usedFor
|
gptkb:Markov_Decision_Processes
|
gptkbp:usedIn
|
gptkb:artificial_intelligence
gptkb:machine_learning
|
gptkbp:bfsParent
|
gptkb:Q-learning
gptkb:reinforcement_learning
|
gptkbp:bfsLayer
|
5
|