- MIRI researcher Evan Hubinger discusses learned optimization, interpretability, and homogeneity in takeoff speeds on the Inside View podcast.
- Scott Garrabrant releases part three of "Finite Factored Sets", on conditional orthogonality.
- UC Berkeley's Daniel Filan provides two examples of conditional orthogonality in finite factored sets.
- Abram Demski proposes factoring the alignment problem into "outer alignment" / "on-distribution alignment", "inner robustness" / "capability robustness", and "objective robustness" / "inner alignment".
- MIRI senior researcher Eliezer Yudkowsky summarizes "the real core of the argument for 'AGI risk' (AGI ruin)" as "appreciating the power of intelligence enough to realize that getting superhuman intelligence wrong, on the first try, will kill you on that first try, not let you learn and try again".
News and links
- From DeepMind: "generally capable agents emerge from open-ended play".
- DeepMind’s safety team summarizes their work to date on causal influence diagrams.
- "Another (outer) alignment failure story" is similar to Paul Christiano's best guess at how AI might cause human extinction.
- Christiano discusses a "special case of alignment: solve alignment when decisions are 'low stakes'".
- Andrew Critch argues that power dynamics are "a blind spot or blurry spot" in the collective world-modeling of the effective altruism and rationality communities, "especially around AI".