News and links
- Recent AI alignment posts: Evan Hubinger asks “Are minimal circuits deceptive?”, Paul Christiano describes the strategy-stealing assumption, and Wei Dai lists his resolved confusions about Iterated Distillation and Amplification. See also Rohin Shah's comparison of recursive approaches to AI alignment.
- Also on LessWrong: A Debate on Instrumental Convergence Between LeCun, Russell, Bengio, Zador, and More.
- FHI's Ben Garfinkel and Allan Dafoe argue that conflicts between nations tend to exhibit “offensive-then-defensive scaling”.
- OpenAI releases a follow-up report on GPT-2, noting that several groups “have explicitly adopted similar staged release approaches” to OpenAI's.
- NVIDIA Applied Deep Learning Research has trained a model that appears to essentially replicate GPT-2, with 5.6x as many parameters, slightly better WikiText perplexity, and slightly worse LAMBADA accuracy. The group has elected to share its training and evaluation code, but not the model weights.
- OpenAI fine-tunes GPT-2 for text continuation and summarization tasks that incorporate human feedback, noting, “Our motivation is to move safety techniques closer to the general task of ‘machines talking to humans,’ which we believe is key to extracting information about human values.”