Statements (19)
| Predicate | Object |
|---|---|
| gptkbp:instanceOf | gptkb:algorithm |
| gptkbp:advantage | improves sample efficiency, reduces reward model overfitting |
| gptkbp:application | large language models |
| gptkbp:contrastsWith | gptkb:DPO, PPO |
| gptkbp:field | gptkb:machine_learning, gptkb:reinforcement_learning |
| gptkbp:fullName | Direct Preference Optimization with Policy Optimization |
| gptkbp:introducedIn | 2023 |
| gptkbp:method | directly optimizes policy using preference data |
| gptkbp:notablePublication | Direct Preference Optimization: Your Language Model is Secretly a Reward Model |
| gptkbp:proposedBy | gptkb:Microsoft_Research |
| gptkbp:purpose | aligning language models with human preferences |
| gptkbp:relatedTo | gptkb:RLHF, preference optimization |
| gptkbp:bfsParent | gptkb:Defense_Dissemination_Program_Office |
| gptkbp:bfsLayer | 8 |
| https://www.w3.org/2000/01/rdf-schema#label | DDPO |
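
The gptkbp:method statement ("directly optimizes policy using preference data") refers to the DPO-style objective from the cited publication, where the policy is trained on preference pairs without an explicit reward model. The sketch below is a minimal, illustrative implementation of that preference loss, not of DDPO's exact formulation; the function name `dpo_loss`, the toy log-probabilities, and the `beta=0.1` value are assumptions for the example.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO-style preference loss: push the policy to widen its margin on the
    preferred response relative to a frozen reference model."""
    # Implicit rewards are beta-scaled log-ratios of policy to reference.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Binary preference objective: negative log-sigmoid of the reward margin.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage with per-example summed log-probabilities (batch of 4).
policy_chosen = torch.tensor([-12.3, -9.8, -15.1, -11.0])
policy_rejected = torch.tensor([-13.0, -10.5, -14.9, -12.2])
ref_chosen = torch.tensor([-12.5, -10.0, -15.0, -11.3])
ref_rejected = torch.tensor([-12.9, -10.4, -15.2, -12.0])
print(dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected))
```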