Updates
- We ran a very successful MIRI Summer Fellows Program, which included a day where participants publicly wrote up their thoughts on various AI safety topics. See the first post in Ben Pace's series of roundups.
- A few highlights from the writing day: Adele Lopez's Optimization Provenance; Daniel Kokotajlo's Soft Takeoff Can Still Lead to Decisive Strategic Advantage and The "Commitment Races" Problem; Evan Hubinger's Towards a Mechanistic Understanding of Corrigibility; and John Wentworth's Markets are Universal for Logical Induction and Embedded Agency via Abstraction.
- New posts from MIRI staff and interns: Abram Demski's Troll Bridge; Matthew Graves' View on Factored Cognition; Daniel Filan's Verification and Transparency; and Scott Garrabrant's Intentional Bucket Errors and Does Agent-like Behavior Imply Agent-like Architecture?
- See also a forum discussion on "proof-level guarantees" in AI safety.
News and links
- From Ben Cottier and Rohin Shah: Clarifying Some Key Hypotheses in AI Alignment.
- Classifying Specification Problems as Variants of Goodhart's Law: Victoria Krakovna and Ramana Kumar relate DeepMind's SRA (specification, robustness, assurance) taxonomy to mesa-optimizers, selection and control, and Scott Garrabrant's Goodhart taxonomy. Also new from DeepMind: Ramana, Tom Everitt, and Marcus Hutter's Designing Agent Incentives to Avoid Reward Tampering.
- From OpenAI: Testing Robustness Against Unforeseen Adversaries. 80,000 Hours also recently interviewed OpenAI's Paul Christiano, with some additional material on decision theory.
- From AI Impacts: Evidence Against Current Methods Leading to Human-Level AI and Ernie Davis on the Landscape of AI Risks.
- From Wei Dai: Problems in AI Alignment That Philosophers Could Potentially Contribute To.
- Richard Möhn has put together a calendar of upcoming AI alignment events.
- The Berkeley Existential Risk Initiative is seeking an Operations Manager.