Statements (19)
| Predicate | Object |
|---|---|
| gptkbp:instanceOf | gptkb:algorithm |
| gptkbp:advantage | improves sample efficiency, reduces reward model overfitting |
| gptkbp:application | large language models |
| gptkbp:contrastsWith | gptkb:DPO, PPO |
| gptkbp:field | gptkb:machine_learning, gptkb:reinforcement_learning |
| gptkbp:fullName | Direct Preference Optimization with Policy Optimization |
| https://www.w3.org/2000/01/rdf-schema#label | DDPO |
| gptkbp:introducedIn | 2023 |
| gptkbp:method | directly optimizes policy using preference data |
| gptkbp:notablePublication | Direct Preference Optimization: Your Language Model is Secretly a Reward Model |
| gptkbp:proposedBy | gptkb:Microsoft_Research |
| gptkbp:purpose | aligning language models with human preferences |
| gptkbp:relatedTo | gptkb:RLHF, preference optimization |
| gptkbp:bfsParent | gptkb:Defense_Dissemination_Program_Office |
| gptkbp:bfsLayer | 8 |
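
The `gptkbp:method` statement ("directly optimizes policy using preference data") describes a DPO-style objective: rather than fitting an explicit reward model and then running PPO, the policy is trained directly on preferred/dispreferred response pairs. The sketch below is a minimal, illustrative PyTorch version of such a loss; the function name, the `beta` temperature, and the per-sequence log-probability arguments are assumptions made for illustration and are not part of this KB entry.

```python
import torch
import torch.nn.functional as F

def dpo_preference_loss(policy_chosen_logps: torch.Tensor,
                        policy_rejected_logps: torch.Tensor,
                        ref_chosen_logps: torch.Tensor,
                        ref_rejected_logps: torch.Tensor,
                        beta: float = 0.1) -> torch.Tensor:
    """DPO-style loss: push the policy's implicit reward margin between
    the preferred and dispreferred responses above that of a frozen
    reference model.

    Each argument is the summed log-probability of one response sequence
    under the policy or the reference model (illustrative interface).
    """
    # Implicit rewards are log-ratios against the reference model,
    # scaled by the temperature beta.
    chosen_reward = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_reward = beta * (policy_rejected_logps - ref_rejected_logps)
    # Bradley-Terry preference likelihood: minimize -log sigma(margin).
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```

Because no separate reward model is trained in this setup, it is the usual basis for advantage claims like those in the `gptkbp:advantage` row (sample efficiency, reduced reward-model overfitting).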