Abram Demski and Scott Garrabrant have made a major update to "Embedded Agency", with new discussions of ε-exploration, Newcomblike problems, reflective oracles, logical uncertainty, Goodhart's law, and predicting rare catastrophes, among other topics.
Abram has also written an overview of what good reasoning looks like in the absence of Bayesian updating: Radical Probabilism. One recurring theme:
[I]n general (i.e., without any special prior which does guarantee convergence for restricted observation models), a Bayesian relies on a realizability (aka grain-of-truth) assumption for convergence, as it does for some other nice properties. Radical probabilism demands these properties without such an assumption.
[… C]onvergence points at a notion of "objectivity" for the radical probabilist. Although the individual updates a radical probabilist makes can go all over the place, the beliefs must eventually settle down to something. The goal of reasoning is to settle down to that answer as quickly as possible.
Meanwhile, Infra-Bayesianism is a new formal framework for thinking about optimal reasoning without requiring a reasoner's true environment to be in its hypothesis space. Abram comments: "Alex Appel and Vanessa Kosoy have been working hard at 'Infra-Bayesianism', a new approach to RL which aims to make it easier (ie, possible) to prove safety-relevant theorems (and, also, a new approach to Bayesianism more generally)."
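The formal machinery here is well beyond a newsletter blurb, but the basic flavor of the decision rule can be caricatured in a few lines of Python: instead of maximizing expected utility under a single prior (which can fail badly when the true environment lies outside the hypothesis space), the agent maximizes its worst-case expected utility over a whole set of candidate environments. The sketch below is only a toy illustration of that maximin idea, with invented environments, actions, and payoffs; the real framework works with convex sets of "infradistributions" rather than finite hypothesis lists.

```python
# Toy contrast between Bayes-optimal and worst-case ("maximin") action choice.
# Everything here is invented for illustration: real infra-Bayesianism works
# with convex sets of "infradistributions", not a finite list of hypotheses.

# Expected payoff of each action under each hypothesized environment.
payoffs = {
    "env_A": {"cautious": 1.0, "aggressive": 3.0},
    "env_B": {"cautious": 1.0, "aggressive": -5.0},
}
prior = {"env_A": 0.9, "env_B": 0.1}  # a single prior over the hypotheses
actions = ["cautious", "aggressive"]

# Bayesian choice: maximize prior-weighted expected payoff. This is only as
# good as the prior, and the true environment may not even be in the list.
bayes_action = max(
    actions,
    key=lambda a: sum(prior[e] * payoffs[e][a] for e in payoffs),
)

# Maximin choice: maximize the worst-case payoff over the whole hypothesis
# set, a guarantee that does not depend on any one hypothesis being true.
maximin_action = max(
    actions,
    key=lambda a: min(payoffs[e][a] for e in payoffs),
)

print(bayes_action)    # "aggressive": expected payoff 2.2 under the prior
print(maximin_action)  # "cautious": worst-case payoff 1.0 rather than -5.0
```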
Other MIRI updates
- Abram Demski writes a parable on the differences between logical inductors and Bayesians: The Bayesian Tyrant.
- Building on the selection vs. control distinction, Abram contrasts "mesa-search" and "mesa-control".
News and links
- From OpenAI's Stiennon et al.: Learning to Summarize with Human Feedback. MIRI researcher Eliezer Yudkowsky comments:
A very rare bit of research that is directly, straight-up relevant to real alignment problems! They trained a reward function on human preferences and then measured how hard you could optimize against the trained function before the results got actually worse.
[… Y]ou can ask for results as good as the best 99th percentile of rated stuff in the training data (a la Jessica Taylor's quantilization idea). Ask for things the trained reward function rates as "better" than that, and it starts to find "loopholes" as seen from outside the system; places where the trained reward function poorly matches your real preferences, instead of places where your real preferences would rate high reward.
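(For a toy numerical illustration of this optimize-versus-quantilize contrast, see the sketch at the end of this list.)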
- Chi Nguyen has written an introduction to Paul Christiano's iterated amplification research agenda, intended to be the first such resource that is "both easy to understand and [gives] a complete picture". The post includes inline comments by Christiano.
- Forecasters share visualizations of their AI timelines on LessWrong.
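As a rough illustration of the contrast in Yudkowsky's comment above, here is a minimal sketch, not the Stiennon et al. setup: the reward functions, numbers, and names below are all invented. A proxy reward model imperfectly tracks a true reward; taking the proxy's argmax exploits the mismatch, while quantilizing (sampling from roughly the 99th-percentile slice of candidates as ranked by the proxy) mostly still returns genuinely good outputs, because loophole-exploiting outputs are rare in the base distribution.

```python
import random

random.seed(0)

# Invented "true" quality score and a learned proxy that imperfectly tracks it:
# the proxy hugely over-rewards a rare spurious feature (x very close to 1.0),
# standing in for the "loopholes" a trained reward model can have.
def true_reward(x):
    return -(x - 0.5) ** 2                 # genuinely best outputs sit near x = 0.5

def proxy_reward(x):
    bonus = 10.0 if x > 0.999 else 0.0     # rare loophole the proxy loves
    return -(x - 0.5) ** 2 + bonus

# A base distribution of candidate outputs (here just uniform samples).
candidates = [random.random() for _ in range(100_000)]

# Hard optimization against the proxy: take its argmax. That lands in the
# loophole region, where true quality is poor.
argmax_choice = max(candidates, key=proxy_reward)

# Quantilization (a la Jessica Taylor): instead of the argmax, sample uniformly
# from the top 1% of candidates as ranked by the proxy, i.e. ask only for
# roughly 99th-percentile quality. Loophole outputs are rare in the base
# distribution, so most of that top slice is still genuinely good.
ranked = sorted(candidates, key=proxy_reward, reverse=True)
top_slice = ranked[: len(ranked) // 100]
quantilizer_choice = random.choice(top_slice)

print(f"argmax        true reward: {true_reward(argmax_choice):+.3f}")
print(f"quantilizer   true reward: {true_reward(quantilizer_choice):+.3f}")
print(f"top-slice avg true reward: {sum(map(true_reward, top_slice)) / len(top_slice):+.3f}")
```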