| gptkbp:instanceOf | gptkb:Reinforcement_learning_algorithm |
| gptkbp:application | gptkb:robot, Game playing, Autonomous control |
| gptkbp:category | Temporal difference learning |
| gptkbp:compatibleWith | Model of environment |
| gptkbp:convergesTo | Optimal policy (under certain conditions) |
| gptkbp:explorationStrategy | Epsilon-greedy |
| gptkbp:field | gptkb:artificial_intelligence, Machine learning |
| gptkbp:form | gptkb:Markov_chain |
| gptkbp:goal | Learn optimal action-selection policy |
| gptkbp:influenced | Deep Q-Learning |
| gptkbp:input | gptkb:action, gptkb:state_order |
| gptkbp:introduced | gptkb:Christopher_Watkins |
| gptkbp:introducedIn | 1989 |
| gptkbp:output | Q-value |
| gptkbp:relatedTo | gptkb:Deep_Q-Network, gptkb:SARSA |
| gptkbp:rewardSignal | Reinforcement signal |
| gptkbp:type | Model-free algorithm, Off-policy algorithm |
| gptkbp:updateParameter | Discount factor, Learning rate |
| gptkbp:updateRule | gptkb:Bellman_equation |
| gptkbp:usedIn | Resource management, Autonomous navigation, Atari game agents |
| gptkbp:uses | Q-values |
| gptkbp:bfsParent | gptkb:Temporal_Difference_Learning |
| gptkbp:bfsLayer | 7 |
| https://www.w3.org/2000/01/rdf-schema#label | Q-Learning |
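
The update rule, update parameters, and exploration strategy listed above can be illustrated with a minimal tabular sketch. It is not taken from the record itself: the function names `q_update`, `epsilon_greedy`, and `train_chain`, the four-state corridor task, and the default values for the learning rate (`alpha`), discount factor (`gamma`), and `epsilon` are all assumptions chosen for illustration.

```python
import random


def q_update(q, state, action, reward, next_state, alpha=0.5, gamma=0.9, done=False):
    """One Q-learning step: move Q(s, a) toward r + gamma * max_a' Q(s', a')."""
    # Off-policy target: uses the best next action, not the one actually taken.
    best_next = 0.0 if done else max(q[next_state])
    td_target = reward + gamma * best_next
    q[state][action] += alpha * (td_target - q[state][action])
    return q[state][action]


def epsilon_greedy(q, state, epsilon, rng):
    """Pick a random action with probability epsilon, else a greedy one."""
    row = q[state]
    if rng.random() < epsilon:
        return rng.randrange(len(row))
    best = max(row)
    # Break ties at random so an all-zero table does not favour action 0.
    return rng.choice([a for a, value in enumerate(row) if value == best])


def train_chain(n_states=4, episodes=200, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Hypothetical toy task: a corridor where action 1 moves right and
    action 0 moves left; reaching the last state pays reward 1 and ends
    the episode."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(n_states)]
    for _ in range(episodes):
        state = 0
        for _ in range(100):  # step cap so an episode always terminates
            action = epsilon_greedy(q, state, epsilon, rng)
            if action == 1:
                next_state = min(state + 1, n_states - 1)
            else:
                next_state = max(state - 1, 0)
            done = next_state == n_states - 1
            reward = 1.0 if done else 0.0
            q_update(q, state, action, reward, next_state, alpha, gamma, done)
            state = next_state
            if done:
                break
    return q


if __name__ == "__main__":
    for state, row in enumerate(train_chain()):
        print(state, [round(value, 3) for value in row])
```

The sketch is model-free in the sense of the `gptkbp:type` row: `q_update` only sees sampled transitions and never queries the corridor's dynamics directly. Keeping the Q-values in a plain list of lists works only for small, discrete state spaces; the Deep Q-Learning variants named in the table replace that list with a function approximator.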