January 2020 Newsletter
Updates
- Our 2019 fundraiser ended Dec. 31. We'll have more to say in a few weeks in our fundraiser retrospective, but for now, a big thank you to the ~240 donors who together donated more than $526,000, including $67,484 in the first 20 seconds of Giving Tuesday (not counting matching dollars, which have yet to be announced).
- Jan. 15 is the final day of CFAR's annual fundraiser. CFAR also recently ran an AMA and has posted its workshop participant handbook online.
- Understanding “Deep Double Descent”: MIRI researcher Evan Hubinger describes a fascinating phenomenon in ML, and an interesting case study in ML research aimed at deepening our understanding rather than just advancing capabilities. In a follow-up post, Evan also considers possible implications for alignment research. (A toy illustration of double descent follows this list.)
- Safe Exploration and Corrigibility: Evan notes an important (and alignment-relevant) way that notions of exploration in deep RL have shifted.
- “Learning Human Objectives by Evaluating Hypothetical Behavior”: UC Berkeley and DeepMind researchers “present a method for training reinforcement learning agents from human feedback in the presence of unknown unsafe states”.
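Evan's double descent posts discuss models whose test error gets worse and then better again as model size (or training time) grows. The sketch below is my own minimal illustration, not code from the posts: it fits random-feature regressions of increasing width with minimum-norm least squares, a setting where test error often spikes near the interpolation threshold (width roughly equal to the number of training points) and then falls again. All names, constants, and the data-generating process are illustrative assumptions.

```python
# Minimal sketch (not from the post) of "double descent" with random-feature
# regression: test error often peaks near the interpolation threshold
# (number of features ~ number of training points), then falls again as the
# model becomes heavily over-parameterized.
import numpy as np

rng = np.random.default_rng(0)

def make_data(n, noise=0.1):
    x = rng.uniform(-1.0, 1.0, size=(n, 1))
    y = np.sin(3.0 * x[:, 0]) + noise * rng.standard_normal(n)
    return x, y

def random_relu_features(x, n_features, seed=0):
    # Fixed random first layer; only the output weights get trained.
    feat_rng = np.random.default_rng(seed)
    w = feat_rng.standard_normal((x.shape[1], n_features))
    b = feat_rng.standard_normal(n_features)
    return np.maximum(x @ w + b, 0.0)

x_train, y_train = make_data(30)
x_test, y_test = make_data(1000)

for n_features in [2, 5, 10, 20, 30, 40, 60, 100, 300, 1000]:
    phi_train = random_relu_features(x_train, n_features)
    phi_test = random_relu_features(x_test, n_features)
    # Minimum-norm least squares: in the over-parameterized regime this
    # interpolates the training data, yet test error can improve again.
    w, *_ = np.linalg.lstsq(phi_train, y_train, rcond=None)
    test_mse = np.mean((phi_test @ w - y_test) ** 2)
    print(f"features={n_features:5d}  test MSE={test_mse:.3f}")
```

With this setup, the printed test error typically rises sharply around 30 features (the size of the training set) before dropping again at larger widths; that bump is the "double descent" peak.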
Links from the research team
This continues my experiment from last month: having MIRI researchers anonymously pick out AI Alignment Forum posts to highlight and comment on.
- Re (When) is Truth-telling Favored in AI debate? — “A paper by Vojtěch Kovařík and Ryan Carey; it's good to see some progress on the debate model!”
- Re Recent Progress in the Theory of Neural Networks — “Noah MacAulay provides another interesting example of research attempting to explain what's going on with NNs.”
- Re When Goodharting is optimal — “I like Stuart Armstrong's post for the systematic examination of why we might be afraid of Goodharting. The example at the beginning is an interesting one, because it seems (to me at least) like the robot really should go back and forth (staying a long time at each side to minimize lost utility). But Stuart is right that this answer is, at least, quite difficult to justify.”
- Re Seeking Power is Instrumentally Convergent in MDPs and Clarifying Power-Seeking and Instrumental Convergence — “It's nice to finally have a formal model of this, thanks to Alex Turner and Logan Smith. Instrumental convergence has been an informal part of the discussion for a long time.” (A toy calculation in the spirit of this formalism follows this list.)
- Re Critiquing “What failure looks like” — “I thought that Grue Slinky's post was a good critical analysis of Paul Christiano's ‘going out with a whimper’ scenario, highlighting some of the problems it seems to have as a concrete AI risk scenario. In particular, I found the analogy given to the simplex algorithm persuasive in terms of showcasing how, despite the fact that many of our current most powerful tools already have massive differentials in how well they work on different problems, those values which are not served well by those tools don't seem to have lost out massively as a result. I still feel like there may be a real risk along the lines of ‘going out with a whimper’, but I think this post presents a real challenge to that scenario as it has been described so far.”
- Re Counterfactual Induction — “A proposal for logical counterfactuals by Alex Appel. This could use some more careful thought and critique; it's not yet clear exactly how much or little it accomplishes.”
- Re A dilemma for prosaic AI alignment — “Daniel Kokotajlo outlines key challenges for prosaic alignment: ‘[…] Now I think the problem is substantially harder than that: To be competitive prosaic AI safety schemes must deliberately create misaligned mesa-optimizers and then (hopefully) figure out how to align them so that they can be used in the scheme.’”
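The power-seeking posts formalize the intuition that states which keep more options open are valuable for a wide range of goals. Below is a toy calculation of my own, not code from the posts and only loosely in the spirit of Turner and Smith's definitions: it averages optimal value over randomly sampled reward functions in a four-state MDP, and the “hub” state with more reachable successors comes out on top. The MDP, discount factor, and sampling distribution are all illustrative assumptions.

```python
# Toy calculation (my own construction): treat a state's "power" as roughly
# the optimal value achievable from it, averaged over many randomly drawn
# reward functions.
import numpy as np

rng = np.random.default_rng(0)
GAMMA = 0.9

# Deterministic transitions: next_state[s][a]. State 0 is a "hub" that can
# reach any of three dead ends; states 1-3 only loop back to themselves.
next_state = [
    [1, 2, 3],
    [1, 1, 1],
    [2, 2, 2],
    [3, 3, 3],
]

def optimal_values(reward):
    """Value iteration for state rewards r(s) with deterministic transitions."""
    v = np.zeros(len(next_state))
    for _ in range(200):
        v = np.array([
            reward[s] + GAMMA * max(v[s2] for s2 in next_state[s])
            for s in range(len(next_state))
        ])
    return v

# Average optimal value over many sampled reward functions.
n_samples = 2000
avg_v = np.zeros(len(next_state))
for _ in range(n_samples):
    reward = rng.uniform(0.0, 1.0, size=len(next_state))
    avg_v += optimal_values(reward)
avg_v /= n_samples

for s, v in enumerate(avg_v):
    print(f"state {s}: average optimal value ~ {v:.2f}")
# The hub (state 0), which keeps more futures open, scores highest: a crude
# picture of why "keeping options open" is instrumentally convergent.
```

The hub's advantage here is just that it can pick whichever dead end happens to be best under the sampled reward function; the actual posts make this precise and prove when such option-preserving states are favored.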