Shah and Yudkowsky on alignment failures

March 2, 2022 | Rob Bensinger | Analysis, Conversations

This is the final discussion log in the Late 2021 MIRI Conversations sequence, featuring Rohin Shah and Eliezer Yudkowsky, with additional comments from Rob Bensinger, Nate Soares, Richard Ngo, and Jaan Tallinn.

The discussion begins with summaries and comments on Richard and Eliezer’s debate. Rohin’s summary has since been revised and published in the Alignment Newsletter.

After this log, we’ll be concluding this sequence with an AMA, where we invite you to comment with questions about AI alignment, cognition, forecasting, etc. Eliezer, Richard, Paul Christiano, Nate, and Rohin will all be participating.

Color key:

Chat by Rohin and Eliezer

Other chat

Emails

Follow-ups

19. Follow-ups to the Ngo/Yudkowsky conversation

19.1. Quotes from the public discussion

[Bensinger][9:22]

Interesting extracts from the public discussion of Ngo and Yudkowsky on AI capability gains:

Eliezer:

I think some of your confusion may be that you’re putting “probability theory” and “Newtonian gravity” into the same bucket. You’ve been raised to believe that powerful theories ought to meet certain standards, like successful bold advance experimental predictions, such as Newtonian gravity made about the existence of Neptune (quite a while after the theory was first put forth, though). “Probability theory” also sounds like a powerful theory, and the people around you believe it, so you think you ought to be able to produce a powerful advance prediction it made; but it is for some reason hard to come up with an example like the discovery of Neptune, so you cast about a bit and think of the central limit theorem. That theorem is widely used and praised, so it’s “powerful”, and it wasn’t invented before probability theory, so it’s “advance”, right? So we can go on putting probability theory in the same bucket as Newtonian gravity?

They’re actually just very different kinds of ideas, ontologically speaking, and the standards to which we hold them are properly different ones. It seems like the sort of thing that would take a subsequence I don’t have time to write, expanding beyond the underlying obvious ontological difference between validities and empirical-truths, to cover the way in which “How do we trust this, when” differs between “I have the following new empirical theory about the underlying model of gravity” and “I think that the logical notion of ‘arithmetic’ is a good tool to use to organize our current understanding of this little-observed phenomenon, and it appears within making the following empirical predictions…” But at least step one could be saying, “Wait, do these two kinds of ideas actually go into the same bucket at all?”

In particular it seems to me that you want properly to be asking “How do we know this empirical thing ends up looking like it’s close to the abstraction?” and not “Can you show me that this abstraction is a very powerful one?” Like, imagine that instead of asking Newton about planetary movements and how we know that the particular bits of calculus he used were empirically true about the planets in particular, you instead started asking Newton for proof that calculus is a very powerful piece of mathematics worthy to predict the planets themselves – but in a way where you wanted to see some highly valuable material object that calculus had produced, like earlier praiseworthy achievements in alchemy. I think this would reflect confusion and a wrongly directed inquiry; you would have lost sight of the particular reasoning steps that made ontological sense, in the course of trying to figure out whether calculus was praiseworthy under the standards of praiseworthiness that you’d been previously raised to believe in as universal standards about all ideas.

Richard:

I agree that “powerful” is probably not the best term here, so I’ll stop using it going forward (note, though, that I didn’t use it in my previous comment, which I endorse more than my claims in the original debate).

But before I ask “How do we know this empirical thing ends up looking like it’s close to the abstraction?”, I need to ask “Does the abstraction even make sense?” Because you have the abstraction in your head, and I don’t, and so whenever you tell me that X is a (non-advance) prediction of your theory of consequentialism, I end up in a pretty similar epistemic state as if George Soros tells me that X is a prediction of the theory of reflexivity, or if a complexity theorist tells me that X is a prediction of the theory of self-organisation. The problem in those two cases is less that the abstraction is a bad fit for this specific domain, and more that the abstraction is not sufficiently well-defined (outside very special cases) to even be the type of thing that can robustly make predictions.

Perhaps another way of saying it is that they’re not crisp/robust/coherent concepts (although I’m open to other terms, I don’t think these ones are particularly good). And it would be useful for me to have evidence that the abstraction of consequentialism you’re using is a crisper concept than Soros’ theory of reflexivity or the theory of self-organisation. If you could explain the full abstraction to me, that’d be the most reliable way – but given the difficulties of doing so, my backup plan was to ask for impressive advance predictions, which are the type of evidence that I don’t think Soros could come up with.

I also think that, when you talk about me being raised to hold certain standards of praiseworthiness, you’re still ascribing too much modesty epistemology to me. I mainly care about novel predictions or applications insofar as they help me distinguish crisp abstractions from evocative metaphors. To me it’s the same type of rationality technique as asking people to make bets, to help distinguish post-hoc confabulations from actual predictions.

Of course there’s a social component to both, but that’s not what I’m primarily interested in. And of course there’s a strand of naive science-worship which thinks you have to follow the Rules in order to get anywhere, but I’d thank you to assume I’m at least making a more interesting error than that.

Lastly, on probability theory and Newtonian mechanics: I agree that you shouldn’t question how much sense it makes to use calculus in the way that you described, but that’s because the application of calculus to mechanics is so clearly-defined that it’d be very hard for the type of confusion I talked about above to sneak in. I’d put evolutionary theory halfway between them: it’s partly a novel abstraction, and partly a novel empirical truth. And in this case I do think you have to be very careful in applying the core abstraction of evolution to things like cultural evolution, because it’s easy to do so in a confused way.

19.2. Rohin Shah’s summary and thoughts

[Shah][7:06] (Nov. 6 email)

Newsletter summaries attached, would appreciate it if Eliezer and Richard checked that I wasn’t misrepresenting them. (Conversation is a lot harder to accurately summarize than blog posts or papers.)

Best,

Rohin

Planned summary for the Alignment Newsletter:

Eliezer is known for being pessimistic about our chances of averting AI catastrophe. His main argument is roughly as follows:

Ngo and Yudkowsky on scientific reasoning and pivotal acts

March 1, 2022 | Rob Bensinger | Analysis, Conversations

This is a transcript of a conversation between Richard Ngo and Eliezer Yudkowsky, facilitated by Nate Soares (and with some comments from Carl Shulman). This transcript continues the Late 2021 MIRI Conversations sequence, following Ngo’s view on alignment difficulty.

Color key:

Chat by Richard and Eliezer

Other chat

14. October 4 conversation

14.1. Predictable updates, threshold functions, and the human cognitive range

[Ngo][15:05]

Two questions which I’d like to ask Eliezer:

1. How strongly does he think that the “shallow pattern-memorisation” abilities of GPT-3 are evidence for Paul’s view over his view (if at all)

2. How does he suggest we proceed, given that he thinks directly explaining his model of the chimp-human difference would be the wrong move?

[Yudkowsky][15:07]

1 – I’d say that it’s some evidence for the Dario viewpoint which seems close to the Paul viewpoint. I say it’s some evidence for the Dario viewpoint because Dario seems to be the person who made something like an advance prediction about it. It’s not enough to make me believe that you can straightforwardly extend the GPT architecture to 3e14 parameters and train it on 1e13 samples and get human-equivalent performance.

[Ngo][15:09]

Did you make any advance predictions, around the 2008-2015 period, of what capabilities we’d have before AGI?

[Yudkowsky][15:10]

not especially that come to mind? on my model of the future this is not particularly something I am supposed to know unless there is a rare flash of predictability.

Christiano and Yudkowsky on AI predictions and human intelligence

March 1, 2022 | Rob Bensinger | Analysis, Conversations

This is a transcript of a conversation between Paul Christiano and Eliezer Yudkowsky, with comments by Rohin Shah, Beth Barnes, Richard Ngo, and Holden Karnofsky, continuing the Late 2021 MIRI Conversations.

Color key:

Chat by Paul and Eliezer

Other chat

15. October 19 comment

[Yudkowsky][11:01]

thing that struck me as an iota of evidence for Paul over Eliezer:

https://twitter.com/tamaybes/status/1450514423823560706?s=20

February 2022 Newsletter

March 1, 2022 | Rob Bensinger | Newsletters

January 2022 Newsletter

January 31, 2022 | Rob Bensinger | Newsletters

December 2021 Newsletter

December 31, 2021 | Rob Bensinger | Newsletters

Ngo’s view on alignment difficulty

December 14, 2021 | Rob Bensinger | Analysis, Conversations

This post features a write-up by Richard Ngo on his views, with inline comments.

Color key:

Chat

Google Doc content

Inline comments

13. Follow-ups to the Ngo/Yudkowsky conversation

13.1. Alignment difficulty debate: Richard Ngo’s case

[Ngo][9:31] (Sep. 25)

As promised, here’s a write-up of some thoughts from my end. In particular, since I’ve spent a lot of the debate poking Eliezer about his views, I’ve tried here to put forward more positive beliefs of my own in this doc (along with some more specific claims): [GDocs link]

[Soares: ✨]

[Ngo] (Sep. 25 Google Doc)

We take as a starting observation that a number of “grand challenges” in AI have been solved by AIs that are very far from the level of generality which people expected would be needed. Chess, once considered to be the pinnacle of human reasoning, was solved by an algorithm that’s essentially useless for real-world tasks. Go required more flexible learning algorithms, but policies which beat human performance are still nowhere near generalising to anything else; the same for StarCraft, DOTA, and the protein folding problem. Now it seems very plausible that AIs will even be able to pass (many versions of) the Turing Test while still being a long way from AGI.

[Yudkowsky][11:26] (Sep. 25 comment)

Now it seems very plausible that AIs will even be able to pass (many versions of) the Turing Test while still being a long way from AGI.

I remark: Restricted versions of the Turing Test. Unrestricted passing of the Turing Test happens after the world ends. Consider how smart you’d have to be to pose as an AGI to an AGI; you’d need all the cognitive powers of an AGI as well as all of your human powers.

[Ngo][11:24] (Sep. 29 comment)

Perhaps we can quantify the Turing test by asking something like:

What percentile of competence is the judge?
What percentile of competence are the humans who the AI is meant to pass as?
How much effort does the judge put in (measured in, say, hours of strategic preparation)?

Does this framing seem reasonable to you? And if so, what are the highest numbers for each of these metrics that correspond to a Turing test which an AI could plausibly pass before the world ends?

[Ngo] (Sep. 25 Google Doc)

I expect this trend to continue until after we have AIs which are superhuman at mathematical theorem-proving, programming, many other white-collar jobs, and many types of scientific research. It seems like Eliezer doesn’t. I’ll highlight two specific disagreements which seem to play into this.

[Yudkowsky][11:28] (Sep. 25 comment)

doesn’t

Eh? I’m pretty fine with something proving the Riemann Hypothesis before the world ends. It came up during my recent debate with Paul, in fact.

Not so fine with something designing nanomachinery that can be built by factories built by proteins. They’re legitimately different orders of problem, and it’s no coincidence that the second one has a path to pivotal impact, and the first does not.

Conversation on technology forecasting and gradualism

December 9, 2021 | Rob Bensinger | Analysis, Conversations

This post is a transcript of a multi-day discussion between Paul Christiano, Richard Ngo, Eliezer Yudkowsky, Rob Bensinger, Holden Karnofsky, Rohin Shah, Carl Shulman, Nate Soares, and Jaan Tallinn, following up on the Yudkowsky/Christiano debate in 1, 2, 3, and 4.

Color key:

Chat by Paul, Richard, and Eliezer

Other chat

12. Follow-ups to the Christiano/Yudkowsky conversation

12.1. Bensinger and Shah on prototypes and technological forecasting

[Bensinger][16:22]

Quoth Paul:

seems like you have to make the wright flyer much better before it’s important, and that it becomes more like an industry as that happens, and that this is intimately related to why so few people were working on it

Is this basically saying ‘the Wright brothers didn’t personally capture much value by inventing heavier-than-air flying machines, and this was foreseeable, which is why there wasn’t a huge industry effort already underway to try to build such machines as fast as possible.’ ?

My maybe-wrong model of Eliezer says here ‘the Wright brothers knew a (Thielian) secret’, while my maybe-wrong model of Paul instead says:

They didn’t know a secret — it was obvious to tons of people that you could do something sorta like what the Wright brothers did and thereby invent airplanes; the Wright brothers just had unusual non-monetary goals that made them passionate to do a thing most people didn’t care about.
Or maybe it’s better to say: they knew some specific secrets about physics/engineering, but only because other people correctly saw ‘there are secrets to be found here, but they’re stamp-collecting secrets of little economic value to me, so I won’t bother to learn the secrets’. ~Everyone knows where the treasure is located, and ~everyone knows the treasure won’t make you rich.

[Yudkowsky][17:24]

My model of Paul says there could be a secret, but only because the industry was tiny and the invention was nearly worthless directly.

[Cotra: ➕]

[Christiano][17:53]

I mean, I think they knew a bit of stuff, but it generally takes a lot of stuff to make something valuable, and the more people have been looking around in an area the more confident you can be that it’s going to take a lot of stuff to do much better, and it starts to look like an extremely strong regularity for big industries like ML or semiconductors

it’s pretty rare to find small ideas that don’t take a bunch of work to have big impacts

I don’t know exactly what a thielian secret is (haven’t read the reference and just have a vibe)

straightening it out a bit, I have 2 beliefs that combine disjunctively: (i) generally it takes a lot of work to do stuff, as a strong empirical fact about technology, (ii) generally if the returns are bigger there are more people working on it, as a slightly-less-strong fact about sociology

[Bensinger][18:09]

secrets = important undiscovered information (or information that’s been discovered but isn’t widely known), that you can use to get an edge in something. https://www.lesswrong.com/posts/ReB7yoF22GuerNfhH/thiel-on-secrets-and-indefiniteness

There seems to be a Paul/Eliezer disagreement about how common these are in general. And maybe a disagreement about how much more efficiently humanity discovers and propagates secrets as you scale up the secret’s value?

[Yudkowsky][18:35]

Many times it has taken much work to do stuff; there’s further key assertions here about “It takes $100 billion” and “Multiple parties will invest $10B first” and “$10B gets you a lot of benefit first because scaling is smooth and without really large thresholds”.

Eliezer is like “ah, yes, sometimes it takes 20 or even 200 people to do stuff, but core researchers often don’t scale well past 50, and there aren’t always predecessors that could do a bunch of the same stuff” even though Eliezer agrees with “it often takes a lot of work to do stuff”. More premises are needed for the conclusion, that one alone does not distinguish Eliezer and Paul by enough.

[Bensinger][20:03]

My guess is that everyone agrees with claims 1, 2, and 3 here (please let me know if I’m wrong!):

1. The history of humanity looks less like Long Series of Cheat Codes World, and more like Well-Designed Game World.

In Long Series of Cheat Codes World, human history looks like this, over and over: Some guy found a cheat code that totally outclasses everyone else and makes him God or Emperor, until everyone else starts using the cheat code too (if the Emperor allows it). After which things are maybe normal for another 50 years, until a new Cheat Code arises that makes its first adopters invincible gods relative to the previous tech generation, and then the cycle repeats.

In Well-Designed Game World, you can sometimes eke out a small advantage, and the balance isn’t perfect, but it’s pretty good and the leveling-up tends to be gradual. A level 100 character totally outclasses a level 1 character, and some level transitions are a bigger deal than others, but there’s no level that makes you a god relative to the people one level below you.

2. General intelligence took over the world once. Someone who updated on that fact but otherwise hasn’t thought much about the topic should not consider it ‘bonkers’ that machine general intelligence could take over the world too, even though they should still consider it ‘bonkers’ that eg a coffee startup could take over the world.

(Because beverages have never taken over the world before, whereas general intelligence has; and because our inside-view models of coffee and of general intelligence make it a lot harder to imagine plausible mechanisms by which coffee could make someone emperor, kill all humans, etc., compared to general intelligence.)

(In the game analogy, the situation is a bit like ‘I’ve never found a crazy cheat code or exploit in this game, but I haven’t ruled out that there is one, and I heard of a character once who did a lot of crazy stuff that’s at least suggestive that she might have had a cheat code.’)

3. AGI is arising in a world where agents with science and civilization already exist, whereas humans didn’t arise in such a world. This is one reason to think AGI might not take over the world, but it’s not a strong enough consideration on its own to make the scenario ‘bonkers’ (because AGIs are likely to differ from humans in many respects, and it wouldn’t obviously be bonkers if the first AGIs turned out to be qualitatively way smarter, cheaper to run, etc.).

—

If folks agree with the above, then I’m confused about how one updates from the above epistemic state to ‘bonkers’.

It was to a large extent physics facts that determined how easy it was to understand the feasibility of nukes without (say) decades of very niche specialized study. Likewise, it was physics facts that determined you need rare materials, many scientists, and a large engineering+infrastructure project to build a nuke. In a world where the physics of nukes resulted in it being some PhD’s quiet ‘nobody thinks this will work’ project like Andrew Wiles secretly working on a proof of Fermat’s Last Theorem for seven years, that would have happened.

If an alien came to me in 1800 and told me that totally new physics would let future humans build city-destroying superbombs, then I don’t see why I should have considered it bonkers that it might be lone mad scientists rather than nations who built the first superbomb. The ‘lone mad scientist’ scenario sounds more conjunctive to me (assumes the mad scientist knows something that isn’t widely known, AND has the ability to act on that knowledge without tons of resources), so I guess it should have gotten less probability, but maybe not dramatically less?

‘Mad scientist builds city-destroying weapon in basement’ sounds wild to me, but I feel like almost all of the actual unlikeliness comes from the ‘city-destroying weapons exist at all’ part, and then the other parts only moderately lower the probability.

Likewise, I feel like the prima-facie craziness of basement AGI mostly comes from ‘generally intelligence is a crazy thing, it’s wild that anything could be that high-impact’, and a much smaller amount comes from ‘it’s wild that something important could happen in some person’s basement’.

—

It does structurally make sense to me that Paul might know things I don’t about GPT-3 and/or humans that make it obvious to him that we roughly know the roadmap to AGI and it’s this.

If the entire ‘it’s bonkers that some niche part of ML could crack open AGI in 2026 and reveal that GPT-3 (and the mainstream-in-2026 stuff) was on a very different part of the tech tree’ view is coming from a detailed inside-view model of intelligence like this, then that immediately ends my confusion about the argument structure.

I don’t understand why you think you have the roadmap, and given a high-confidence roadmap I’m guessing I’d still put more probability than you on someone finding a very different, shorter path that works too. But the argument structure “roadmap therefore bonkers” makes sense to me.

If there are meant to be other arguments against ‘high-impact AGI via niche ideas/techniques’ that are strong enough to make it bonkers, then I remain confused about the argument structure and how it can carry that much weight.

I can imagine an inside-view model of human cognition, GPT-3 cognition, etc. that tells you ‘AGI coming from nowhere in 3 years is bonkers’; I can’t imagine an ML-is-a-reasonably-efficient-market argument that does the same, because even a perfectly efficient market isn’t omniscient and can still be surprised by undiscovered physics facts that tell you ‘nukes are relatively easy to build’ and ‘the fastest path to nukes is relatively hard to figure out’.

(Caveat: I’m using the ‘basement nukes’ and ‘Fermat’s last theorem’ analogy because it helps clarify the principles involved, not because I think AGI will be that extreme on the spectrum.)

[Yudkowsky: +1]

Oh, I also wouldn’t be confused by a view like “I think it’s 25% likely we’ll see a more Eliezer-ish world. But it sounds like Eliezer is, like, 90% confident that will happen, and that level of confidence (and/or the weak reasoning he’s provided for that confidence) seems bonkers to me.”

The thing I’d be confused by is e.g. “ML is efficient-ish, therefore the out-of-the-blue-AGI scenario itself is bonkers and gets, like, 5% probability.”

Shah and Yudkowsky on alignment failures

19. Follow-ups to the Ngo/Yudkowsky conversation

19.1. Quotes from the public discussion

19.2. Rohin Shah’s summary and thoughts

Ngo and Yudkowsky on scientific reasoning and pivotal acts

14. October 4 conversation

14.1. Predictable updates, threshold functions, and the human cognitive range

Christiano and Yudkowsky on AI predictions and human intelligence

15. October 19 comment

February 2022 Newsletter

January 2022 Newsletter

December 2021 Newsletter

Ngo’s view on alignment difficulty

13. Follow-ups to the Ngo/Yudkowsky conversation

13.1. Alignment difficulty debate: Richard Ngo’s case

Conversation on technology forecasting and gradualism

12. Follow-ups to the Christiano/Yudkowsky conversation

12.1. Bensinger and Shah on prototypes and technological forecasting

Search

Browse

Subscribe

19. Follow-ups to the Ngo/Yudkowsky conversation

19.1. Quotes from the public discussion

19.2. Rohin Shah’s summary and thoughts

14. October 4 conversation

14.1. Predictable updates, threshold functions, and the human cognitive range

15. October 19 comment

Other MIRI updates

News and links

MIRI updates

News and links

News and links

13. Follow-ups to the Ngo/Yudkowsky conversation

13.1. Alignment difficulty debate: Richard Ngo’s case

12. Follow-ups to the Christiano/Yudkowsky conversation

12.1. Bensinger and Shah on prototypes and technological forecasting

Search

Browse

Subscribe