Statements (27)
Predicate | Object |
---|---|
gptkbp:instanceOf | reinforcement learning algorithm |
gptkbp:application | control systems |
gptkbp:application | resource management |
gptkbp:application | game playing |
gptkbp:application | autonomous navigation |
gptkbp:citation | gptkb:Watkins,_C.J.C.H._(1989)._Learning_from_Delayed_Rewards._PhD_thesis,_University_of_Cambridge. |
gptkbp:combines | deep learning |
gptkbp:developedBy | gptkb:Christopher_Watkins |
gptkbp:goal | find optimal action-selection policy |
https://www.w3.org/2000/01/rdf-schema#label | Q-learning algorithm |
gptkbp:introducedIn | 1989 |
gptkbp:learns | Q-values |
gptkbp:parameter | discount factor |
gptkbp:parameter | learning rate |
gptkbp:parameter | exploration rate |
gptkbp:Q-valuesRepresent | expected utility of actions |
gptkbp:relatedTo | gptkb:SARSA |
gptkbp:relatedTo | temporal difference learning |
gptkbp:type | off-policy |
gptkbp:type | model-free |
gptkbp:updateRule | Q(s,a) ← Q(s,a) + α [r + γ max_a' Q(s',a') − Q(s,a)] |
gptkbp:usedIn | gptkb:artificial_intelligence |
gptkbp:usedIn | gptkb:machine_learning |
gptkbp:usedIn | robotics |
gptkbp:variant | gptkb:Deep_Q-Network_(DQN) |
gptkbp:bfsParent | gptkb:Christopher_J.C.H._Watkins |
gptkbp:bfsLayer | 7 |
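The `gptkbp:updateRule` and `gptkbp:parameter` statements above can be made concrete with a short sketch. Below is a minimal tabular Q-learning loop in Python that ties together the three listed parameters (learning rate α as `alpha`, discount factor γ as `gamma`, exploration rate ε as `epsilon`) and the off-policy, model-free update rule. The `env` object, with its `reset()`, `step()`, and `actions` members, is an assumed interface introduced here for illustration only; it is not part of the statements above.

```python
import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning sketch.

    Learns Q(s, a), the expected utility of taking action a in state s,
    via the update:
        Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
    Assumes a hypothetical env with reset(), step(action) -> (state, reward, done),
    and a finite env.actions list.
    """
    q = defaultdict(float)  # Q-values keyed by (state, action), default 0.0
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # Epsilon-greedy action selection (the "exploration rate" parameter).
            if random.random() < epsilon:
                action = random.choice(env.actions)
            else:
                action = max(env.actions, key=lambda a: q[(state, a)])
            next_state, reward, done = env.step(action)
            # Greedy bootstrap target: max over next actions, regardless of
            # the behavior policy -- this is what makes Q-learning off-policy.
            best_next = 0.0 if done else max(q[(next_state, a)] for a in env.actions)
            # Temporal-difference update with learning rate alpha and discount gamma.
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            state = next_state
    return q
```

The `max` over next-state actions in the target is the design choice behind the `gptkbp:type | off-policy` statement: the update bootstraps from the greedy action even though behavior is ε-greedy, which is the key difference from the related on-policy SARSA algorithm.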