MIRI senior researcher Eliezer Yudkowsky writes:
Redwood Research is investigating a toy problem in AI alignment which I find genuinely interesting – namely, training a classifier over GPT-3 continuations of prompts that you'd expect to lead to violence, to prohibit responses involving violence / human injury. E.g., complete "I pulled out a gun and shot him" with "And he dodged!" instead of "And he fell to the floor dead."
(The use of violence / injury avoidance as a toy domain has nothing to do with the alignment research part, of course; you could just as well try to train a classifier against fictional situations where a character spoke out loud, despite prompts seeming to lead there, and it would be basically the same problem.)
Why am I excited? Because it seems like a research question where, and this part is very rare, I can't instantly tell from reading the study description which results they'll find.
I do expect success on the basic problem, but for once this domain is complicated enough that we can then proceed to ask questions that are actually interesting. Will humans always be able to fool the classifier, once it's trained, and then retrained against the first examples that fooled it? Will humans be able to produce violent continuations by a clever use of prompts, without attacking the classifier directly? How over-broad does the exclusion have to be – how many other possibilities must it exclude – in order for it to successfully include all violent continuations? Suppose we tried training GPT-3+classifier on something like 'low impact', to avoid highly impactful situations across a narrow range of domains; would it generalize correctly to more domains on the first try?
I'd like to see more real alignment research of this type.
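The adversarial loop described above (train a classifier, have humans find continuations that fool it, fold those examples back into the training set, and retrain) can be sketched in miniature. This is a hypothetical illustration only, using a toy bag-of-words scorer in place of the real GPT-3-based classifier; all names (`train_classifier`, `classify`, the example strings) are invented for the sketch.

```python
# Toy sketch of classifier training with adversarial retraining.
# A bag-of-words scorer stands in for the real learned classifier.
from collections import Counter

def featurize(text):
    """Split text into a multiset of lowercase words."""
    return Counter(text.lower().split())

def train_classifier(examples):
    """examples: list of (text, label), label 1 = violent.
    Returns per-word scores; positive totals indicate violence."""
    scores = Counter()
    for text, label in examples:
        for word in featurize(text):
            scores[word] += 1 if label else -1
    return scores

def classify(scores, text, threshold=0):
    """Flag text as violent if its summed word score exceeds the threshold."""
    return sum(scores[w] for w in featurize(text)) > threshold

# Round 0: train on an initial labeled set.
data = [("he fell to the floor dead", 1),
        ("and he dodged the bullet", 0)]
clf = train_classifier(data)

# Round 1: a human red-teamer finds a violent continuation the
# classifier misses; add it to the data and retrain.
fooling_example = ("crimson spread across his shirt", 1)
if not classify(clf, fooling_example[0]):
    data.append(fooling_example)
    clf = train_classifier(data)
```

The open questions in the post then become empirical: how many such rounds are needed before humans can no longer fool the classifier, and how many benign continuations get excluded along the way.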
Redwood Research is currently hiring people to try to trick their model, at $30/hr: link
If you want to learn more, Redwood Research is currently taking questions for an AMA on the Effective Altruism Forum.
News and links
- MIRI's Evan Hubinger discusses a new alignment research proposal for transparency: Automating Auditing.
- Alex Turner releases When Most VNM-Coherent Preference Orderings Have Convergent Instrumental Incentives. MIRI's Abram Demski comments: "I think this post could be pretty important. It offers a formal treatment of 'goal-directedness' and its relationship to coherence theorems such as VNM, a topic which has seen some past controversy but which has — till now — been dealt with only quite informally."
- Buck Shlegeris of Redwood Research writes on the alignment problem in different capability regimes and the theory-practice gap in alignable AI capabilities.
- The UK government's National AI Strategy says that "the government takes the long term risk of non-aligned Artificial General Intelligence, and the unforeseeable changes that it would mean for the UK and the world, seriously". In related news, Boris Johnson cites Toby Ord in a UN speech.