MIRI senior researcher Scott Garrabrant has a major new result, “Finite Factored Sets,” that he’ll be unveiling in an online talk this Sunday at noon Pacific time. (Zoom link.) For context on the result, see Scott’s new post “Saving Time.”
In other big news, MIRI has just received its two largest individual donations of all time! Ethereum inventor Vitalik Buterin has donated ~$4.3 million worth of ETH to our research program, while an anonymous long-time supporter has donated MKR tokens that we liquidated for an astounding ~$15.6 million. The latter donation is restricted so that we can spend a maximum of $2.5 million of it per year until 2025, making it function like a multi-year grant.
Both donors have our massive thanks for these incredible gifts to support our work!
Other MIRI updates
- Mark Xu and Evan Hubinger use “Cartesian world models” to distinguish “consequential agents” (which assign utility to environment states, internal states, observations, and/or actions), “structural agents” (which optimize “over the set of possible decision functions instead of the set of possible actions”), and “conditional agents” (which map e.g. environmental states to utility functions, rather than mapping them to utilities). A rough sketch of these three signatures appears after this list.
- In Gradations of Inner Alignment Obstacles, Abram Demski makes three “contentious claims”:
- The most useful definition of “mesa-optimizer” doesn’t require mesa-optimizers to perform explicit search, contrary to the current standard definition.
- Success at aligning narrowly superhuman models might be bad news.
- Some versions of the lottery ticket hypothesis seem to imply that randomly initialized networks already contain deceptive agents.
- Eliezer Yudkowsky comments on the relationship between early AGI systems’ alignability and capabilities.
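For readers who find type signatures easier to parse than prose, here is a minimal, hedged sketch of the three agent signatures as described above. The names and Python types below are illustrative assumptions on our part, not definitions from Xu and Hubinger’s post.

```python
# Illustrative-only sketch of three agent "type signatures" (our own framing,
# not code from the post). Types are simplified to strings for readability.
from typing import Callable, NamedTuple


class World(NamedTuple):
    env_state: str       # state of the environment
    internal_state: str  # state of the agent's internals
    observation: str     # what the agent observes
    action: str          # the action the agent takes


# Consequential agent: assigns utility directly to (combinations of)
# environment states, internal states, observations, and/or actions.
ConsequentialUtility = Callable[[World], float]

# Structural agent: optimizes over the set of possible decision functions
# (observation -> action policies) rather than over individual actions.
DecisionFunction = Callable[[str], str]              # observation -> action
StructuralUtility = Callable[[DecisionFunction], float]

# Conditional agent: maps e.g. an environment state to a utility function,
# rather than mapping it directly to a utility value.
ConditionalUtility = Callable[[str], ConsequentialUtility]
```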
News and links
- John Wentworth announces a project to test the natural abstraction hypothesis, which asserts that “most high-level abstract concepts used by humans are ‘natural’” and therefore “a wide range of architectures will reliably learn similar high-level concepts”.
- Open Philanthropy’s Joe Carlsmith asks “Is Power-Seeking AI an Existential Risk?”, and Luke Muehlhauser asks for examples of treacherous turns in the wild (also on LessWrong).
- From DeepMind’s safety researchers: What Mechanisms Drive Agent Behavior?, Alignment of Language Agents, and An EPIC Way to Evaluate Reward Functions. Also, Rohin Shah provides his advice on entering the field.
- Owen Shen and Peter Hase summarize 70 recent papers on model transparency, interpretability, and explainability.
- Eli Tyre asks: How do we prepare for final crunch time? (I would add some caveats: In some roles and scenarios, you’ll have less impact on the eve of AGI and far more impact today. For some people, “final crunch time” may be now, with marginal efforts mattering less later. And some forms of “preparing for crunch time” will fail if there aren’t clear warning shots or fire alarms.)
- Paul Christiano launches a new organization that will be his focus going forward: the Alignment Research Center. Learn more about Christiano’s research approach in My Research Methodology and in his recent AMA.