MIRI updates
- Three questions from MIRI's Abram Demski: “What does it mean to apply decision theory?”, “How ‘honest’ is GPT-3?”, and “How should AI debate be judged?”
- A transcript from MIRI researcher Scott Garrabrant: “What Would I Do? Self-Prediction in Simple Algorithms”.
- MIRI researcher Buck Shlegeris reviews the debate on what the history of nuclear weapons implies about humanity's ability to coordinate.
- From MIRI's Evan Hubinger: “Learning the Prior and Generalization” and “Alignment Proposals and Complexity Classes”.
- Rafael Harth's “Inner Alignment: Explain Like I'm 12 Edition” summarizes the concepts and takeaways from “Risks from Learned Optimization”.
- Issa Rice reviews the discussion to date on MIRI's research focus, on the question “To what extent is it possible to have a precise theory of rationality?”, and on the relationship between deconfusion research and safety outcomes. (Plus a short reply.)
- “Pitfalls of Learning a Reward Function Online” (IJCAI paper, LW summary): FHI researcher and MIRI research associate Stuart Armstrong, with DeepMind's Jan Leike, Laurent Orseau, and Shane Legg, explores ways to discourage agents from manipulating their reward signal to make it easier to optimize.
News and links
- From Paul Christiano: “Learning the Prior” and “Better Priors as a Safety Problem”.
- From Victoria Krakovna: “Tradeoff Between Desirable Properties for Baseline Choices in Impact Measures”.
- Ben Pace summarizes Christiano's “What Failure Looks Like” post and the ensuing discussion.
- Kaj Sotala collects recent accounts of people's experiences working with GPT-3.