gptkbp:instanceOf
|
reinforcement learning algorithm
|
gptkbp:actionSpace
|
continuous
|
gptkbp:appliesTo
|
continuous action spaces
|
gptkbp:basedOn
|
actor-critic architecture
|
gptkbp:citation
|
over 5000
|
gptkbp:developedBy
|
gptkb:Pieter_Abbeel
gptkb:Sergey_Levine
Aurick Zhou
Tuomas Haarnoja
|
gptkbp:exploration
|
encouraged by entropy maximization
|
gptkbp:hasVariant
|
Discrete SAC
Multi-Agent SAC
|
gptkbp:category
|
off-policy
model-free
deep reinforcement learning algorithm
|
https://www.w3.org/2000/01/rdf-schema#label
|
Soft Actor-Critic (SAC)
|
gptkbp:hyperparameter
|
learning rate
discount factor (gamma)
target smoothing coefficient (tau)
temperature parameter (alpha)
|
gptkbp:improves
|
gptkb:Proximal_Policy_Optimization_(PPO)
gptkb:Trust_Region_Policy_Optimization_(TRPO)
Deep Deterministic Policy Gradient (DDPG)
Twin Delayed DDPG (TD3)
|
gptkbp:introducedIn
|
2018
|
gptkbp:openSource
|
gptkb:RLlib
gptkb:Stable_Baselines3
gptkb:OpenAI_Spinning_Up
|
gptkbp:optimizedFor
|
expected reward
policy entropy
|
gptkbp:policy
|
stochastic
|
gptkbp:publishedIn
|
arXiv:1801.01290
|
gptkbp:relatedTo
|
gptkb:Deep_Q-Network_(DQN)
Policy Gradient Methods
Maximum Entropy RL
|
gptkbp:rewardFunction
|
augmented with entropy term
|
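The entropy-augmented reward above is the core of SAC's soft Bellman backup. A minimal sketch of that target computation, assuming the clipped double-Q form used in SAC (the function name, argument names, and the illustrative numbers below are hypothetical, not from this record):

```python
def soft_td_target(reward, next_q1, next_q2, next_log_prob,
                   gamma=0.99, alpha=0.2, done=False):
    """Soft Bellman target: the reward is augmented with an entropy
    bonus -alpha * log pi, weighted by the temperature alpha.
    Illustrative sketch only; uses the min over twin target Q-values
    (clipped double-Q) as in SAC."""
    # Entropy-augmented soft value of the next state-action pair
    soft_value = min(next_q1, next_q2) - alpha * next_log_prob
    # Standard discounted backup, zeroed at episode termination
    return reward + gamma * (1.0 - float(done)) * soft_value

# Hypothetical numbers: more entropy (more negative log-prob) raises the target
y = soft_td_target(reward=1.0, next_q1=5.0, next_q2=4.8,
                   next_log_prob=-1.2, gamma=0.99, alpha=0.2)
```

Setting `alpha=0` recovers the ordinary (non-entropy-regularized) TD target, which is one way to see how the temperature parameter listed under gptkbp:hyperparameter trades off expected reward against policy entropy.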
gptkbp:robustness
|
high
|
gptkbp:sampleEfficiency
|
high
|
gptkbp:stability
|
high
|
gptkbp:title
|
gptkb:Soft_Actor-Critic:_Off-Policy_Maximum_Entropy_Deep_Reinforcement_Learning_with_a_Stochastic_Actor
|
gptkbp:usedIn
|
robotics
autonomous driving
simulated control tasks
|
gptkbp:uses
|
value function
Q-function
replay buffer
target networks
maximum entropy reinforcement learning
stochastic policy
|
gptkbp:bfsParent
|
gptkb:Trust_Region_Policy_Optimization_(TRPO)
|
gptkbp:bfsLayer
|
7
|
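Two of the hyperparameters listed above, the target smoothing coefficient (tau) and the target networks under gptkbp:uses, work together: SAC updates its target Q-networks by Polyak averaging rather than hard copying. A minimal sketch, assuming parameters are represented as flat lists of floats (the function name and values are hypothetical):

```python
def polyak_update(target_params, online_params, tau=0.005):
    """Soft target-network update used in SAC: each target parameter
    moves a small fraction tau toward the online parameter.
    Illustrative sketch with scalar parameters."""
    return [tau * p + (1.0 - tau) * tp
            for tp, p in zip(target_params, online_params)]

# Hypothetical example: with tau=0.5 each target moves halfway
updated = polyak_update([0.0, 2.0], [1.0, 2.0], tau=0.5)
```

Small tau (e.g. 0.005) keeps the target networks slowly moving, which is one reason the record lists high stability for the algorithm.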