Statements (73)
Predicate | Object |
---|---|
gptkbp:instanceOf | research |
gptkbp:focusesOn | decision theory, robustness, AI alignment, corrigibility, embedded agency, value learning |
gptkbp:goal | ensure safe AI systems, formalize agent reasoning, understand agent behavior |
https://www.w3.org/2000/01/rdf-schema#label | Agent Foundations |
gptkbp:includes | gptkb:football, gptkb:AI_control_problem, gptkb:AI_safety_via_debate, gptkb:Goodhart's_law, game theory, self-reference, multi-agent systems, transparency, bounded rationality, cooperation, inverse reinforcement learning, interpretability, utility functions, logical uncertainty, instrumental convergence, AI boxing, goal specification, AI corrigibility, AI goal stability, AI incentives, AI motivation, AI takeoff dynamics, Vingean reflection, acausal trade, alignment tax, causal decision theory, corrigibility problems, counterfactual reasoning, decision theory paradoxes, embedded agency problems, evidential decision theory, functional decision theory, impact regularization, inner alignment, instrumental goals, logical induction, logical omniscience, mesa-optimization, naturalized induction, ontological crises, outer alignment, preference learning, recursive self-improvement, reflective oracles, reward hacking, reward modeling, reward specification, robust delegation, self-modification, shutdown problem, subagent problems, tripwire mechanisms, updateless decision theory, value extrapolation, wireheading |
gptkbp:relatedTo | gptkb:artificial_intelligence, gptkb:machine_learning |
gptkbp:studiedBy | gptkb:MIRI, academic researchers, AI safety community |
gptkbp:bfsParent | gptkb:MIRI |
gptkbp:bfsLayer | 5 |