July 2022 Newsletter

Newsletters

A central AI alignment problem: capabilities generalization, and the sharp left turn

Analysis

(This post was factored out of a larger post that I (Nate Soares) wrote, with help from Rob Bensinger, who also rearranged some pieces and added some text to smooth things out. I’m not terribly happy with it, but am posting it anyway (or, well, having Rob post it on my behalf while I travel) on the theory that it’s better than nothing.)


I expect navigating the acute risk period to be tricky for our civilization, for a number of reasons. Success looks to me to require clearing a variety of technical, sociopolitical, and moral hurdles, and while in principle sufficient mastery of solutions to the technical problems might substitute for solutions to the sociopolitical and other problems, it nevertheless looks to me like we need a lot of things to go right.

Some sub-problems look harder to me than others. For instance, people are still regularly surprised when I tell them that I think the hard bits are much more technical than moral: it looks to me like figuring out how to aim an AGI at all is harder than figuring out where to aim it.[1]

Within the list of technical obstacles, there are some that strike me as more central than others, like “figure out how to aim optimization”. And a big reason why I’m currently fairly pessimistic about humanity’s odds is that it seems to me like almost nobody is focusing on the technical challenges that seem most central and unavoidable to me.

Many people wrongly believe that I’m pessimistic because I think the alignment problem is extraordinarily difficult on a purely technical level. That’s flatly false, and is pretty high up there on my list of least favorite misconceptions of my views.[2]

I think the problem is a normal problem of mastering some scientific field, as humanity has done many times before. Maybe it’s somewhat trickier, on account of (e.g.) intelligence being more complicated than, say, physics; maybe it’s somewhat easier on account of how we have more introspective access to a working mind than we have to the low-level physical fields; but on the whole, I doubt it’s all that qualitatively different than the sorts of summits humanity has surmounted before.

It’s made trickier by the fact that we probably have to attain mastery of general intelligence before we spend a bunch of time working with general intelligences (on account of how we seem likely to kill ourselves by accident within a few years, once we have AGIs on hand, if no pivotal act occurs), but that alone is not enough to undermine my hope.

What undermines my hope is that nobody seems to be working on the hard bits, and I don’t currently expect most people to become convinced that they need to solve those hard bits until it’s too late.

Below, I’ll attempt to sketch out what I mean by “the hard bits” of the alignment problem. Although these look hard, I’m a believer in the capacity of humanity to solve technical problems at this level of difficulty when we put our minds to it. My concern is that I currently don’t think the field is trying to solve this problem. My hope in writing this post is to better point at the problem, with a follow-on hope that this causes new researchers entering the field to attack what seem to me to be the central challenges head-on.

 

Discussion of a problem

On my model, one of the most central technical challenges of alignment—and one that every viable alignment plan will probably need to grapple with—is the issue that capabilities generalize better than alignment.

My guess for how AI progress goes is that at some point, some team gets an AI that starts generalizing sufficiently well, sufficiently far outside of its training distribution, that it can gain mastery of fields like physics, bioengineering, and psychology, to a high enough degree that it more-or-less singlehandedly threatens the entire world. Probably without needing explicit training for its most skilled feats, any more than humans needed many generations of killing off the least-successful rocket engineers to refine our brains towards rocket-engineering before humanity managed to achieve a moon landing.

And in the same stroke that its capabilities leap forward, its alignment properties are revealed to be shallow, and to fail to generalize. The central analogy here is that optimizing apes for inclusive genetic fitness (IGF) doesn’t make the resulting humans optimize mentally for IGF. Like, sure, the apes are eating because they have a hunger instinct and having sex because it feels good—but it’s not like they could be eating/fornicating due to explicit reasoning about how those activities lead to more IGF. They can’t yet perform the sort of abstract reasoning that would correctly justify those actions in terms of IGF. And then, when they start to generalize well in the way of humans, they predictably don’t suddenly start eating/fornicating because of abstract reasoning about IGF, even though they now could. Instead, they invent condoms, and fight you if you try to remove their enjoyment of good food (telling them to just calculate IGF manually). The alignment properties you lauded before the capabilities started to generalize, predictably fail to generalize with the capabilities.

Read more »

AGI Ruin: A List of Lethalities

Analysis

Preamble:

(If you’re already familiar with all basics and don’t want any preamble, skip ahead to Section B for technical difficulties of alignment proper.)

I have several times failed to write up a well-organized list of reasons why AGI will kill you.  People come in with different ideas about why AGI would be survivable, and want to hear different obviously key points addressed first.  Some fraction of those people are loudly upset with me if the obviously most important points aren’t addressed immediately, and I address different points first instead.

Having failed to solve this problem in any good way, I now give up and solve it poorly with a poorly organized list of individual rants.  I’m not particularly happy with this list; the alternative was publishing nothing, and publishing this seems marginally more dignified.

Three points about the general subject matter of discussion here, numbered so as not to conflict with the list of lethalities:

-3.  I’m assuming you are already familiar with some basics, and already know what ‘orthogonality’ and ‘instrumental convergence’ are and why they’re true.  People occasionally claim to me that I need to stop fighting old wars here, because, those people claim to me, those wars have already been won within the important-according-to-them parts of the current audience.  I suppose it’s at least true that none of the current major EA funders seem to be visibly in denial about orthogonality or instrumental convergence as such; so, fine.  If you don’t know what ‘orthogonality’ or ‘instrumental convergence’ are, or don’t see for yourself why they’re true, you need a different introduction than this one.

-2.  When I say that alignment is lethally difficult, I am not talking about ideal or perfect goals of ‘provable’ alignment, nor total alignment of superintelligences on exact human values, nor getting AIs to produce satisfactory arguments about moral dilemmas which sorta-reasonable humans disagree about, nor attaining an absolute certainty of an AI not killing everyone.  When I say that alignment is difficult, I mean that in practice, using the techniques we actually have, “please don’t disassemble literally everyone with probability roughly 1” is an overly large ask that we are not on course to get.  So far as I’m concerned, if you can get a powerful AGI that carries out some pivotal superhuman engineering task, with a less than fifty percent chance of killing more than one billion people, I’ll take it.  Even smaller chances of killing even fewer people would be a nice luxury, but if you can get as incredibly far as “less than roughly certain to kill everybody”, then you can probably get down to under a 5% chance with only slightly more effort.  Practically all of the difficulty is in getting to “less than certainty of killing literally everyone”.  Trolley problems are not an interesting subproblem in all of this; if there are any survivors, you solved alignment.  At this point, I no longer care how it works, I don’t care how you got there, I am cause-agnostic about whatever methodology you used, all I am looking at is prospective results, all I want is that we have justifiable cause to believe of a pivotally useful AGI ‘this will not kill literally everyone’.  Anybody telling you I’m asking for stricter ‘alignment’ than this has failed at reading comprehension.  The big ask from AGI alignment, the basic challenge I am saying is too difficult, is to obtain by any strategy whatsoever a significant chance of there being any survivors.

-1.  None of this is about anything being impossible in principle.  The metaphor I usually use is that if a textbook from one hundred years in the future fell into our hands, containing all of the simple ideas that actually work robustly in practice, we could probably build an aligned superintelligence in six months.  For people schooled in machine learning, I use as my metaphor the difference between ReLU activations and sigmoid activations.  Sigmoid activations are complicated and fragile, and do a terrible job of transmitting gradients through many layers; ReLUs are incredibly simple (for the unfamiliar, the activation function is literally max(x, 0)) and work much better.  Most neural networks for the first decades of the field used sigmoids; the idea of ReLUs wasn’t discovered, validated, and popularized until decades later.  What’s lethal is that we do not have the Textbook From The Future telling us all the simple solutions that actually in real life just work and are robust; we’re going to be doing everything with metaphorical sigmoids on the first critical try.  No difficulty discussed here about AGI alignment is claimed by me to be impossible – to merely human science and engineering, let alone in principle – if we had 100 years to solve it using unlimited retries, the way that science usually has an unbounded time budget and unlimited retries.  This list of lethalities is about things we are not on course to solve in practice in time on the first critical try; none of it is meant to make a much stronger claim about things that are impossible in principle.
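For readers who want the sigmoid/ReLU contrast made concrete, here is a toy sketch in Python (added for this digest, not part of the original post): the sigmoid's derivative is at most 0.25, so the chain rule through many stacked sigmoid layers multiplies together small factors and the gradient signal collapses, while the ReLU derivative is exactly 1 wherever the unit is active.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)  # peaks at 0.25, when x == 0

def relu_grad(x):
    return 1.0 if x > 0 else 0.0  # derivative of max(x, 0)

# Chain-rule gradient magnitude through 20 stacked activations,
# evaluated at each function's most favorable input.
layers = 20
sig_product = sigmoid_grad(0.0) ** layers    # 0.25 ** 20, about 9.1e-13
relu_product = relu_grad(1.0) ** layers      # 1.0 ** 20 == 1.0

print(sig_product, relu_product)
```

Even in this best case for the sigmoid, twenty layers attenuate the gradient by twelve orders of magnitude, which is the "terrible job of transmitting gradients" described above.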

That said:

Here, from my perspective, are some different true things that could be said, to contradict various false things that various different people seem to believe, about why AGI would be survivable on anything remotely resembling the current pathway, or any other pathway we can easily jump to.

 

Section A:

This is a very lethal problem, it has to be solved one way or another, it has to be solved at a minimum strength and difficulty level instead of various easier modes that some dream about, we do not have any visible option of ‘everyone’ retreating to only solve safe weak problems instead, and failing on the first really dangerous try is fatal.

 

1.  AlphaZero blew past all accumulated human knowledge about Go after a day or so of self-play, with no reliance on human playbooks or sample games.  Anyone relying on “well, it’ll get up to human capability at Go, but then have a hard time getting past that because it won’t be able to learn from humans any more” would have relied on vacuum.  AGI will not be upper-bounded by human ability or human learning speed.  Things much smarter than human would be able to learn from less evidence than humans require to have ideas driven into their brains; there are theoretical upper bounds here, but those upper bounds seem very high. (Eg, each bit of information that couldn’t already be fully predicted can eliminate at most half the probability mass of all hypotheses under consideration.)  It is not naturally (by default, barring intervention) the case that everything takes place on a timescale that makes it easy for us to react.
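The parenthetical bound can be sketched in a few lines of Python (a toy illustration added for this digest, not from the original post): starting from a uniform prior over 2^20 hypotheses, each maximally informative bit of evidence rules out half of the remaining candidates, so 20 such bits suffice to pin down one hypothesis in about a million.

```python
# Uniform prior over 2**20 (~1 million) hypotheses.  A bit of evidence
# that couldn't already be predicted eliminates at most half of the
# remaining probability mass; here every observed bit is maximally
# informative, so the candidate set halves each time.
hypotheses = 2 ** 20

bits_observed = 0
while hypotheses > 1:
    hypotheses //= 2      # one fully surprising bit halves the field
    bits_observed += 1

print(bits_observed)  # 20 bits identify a single hypothesis
```

Real evidence is rarely this informative, which is why this is an upper bound on learning speed rather than a typical rate; the point in the text is that the bound sits far above human performance.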

Read more »

Six Dimensions of Operational Adequacy in AGI Projects

Analysis

Editor’s note:  The following is a lightly edited copy of a document written by Eliezer Yudkowsky in November 2017. Since this is a snapshot of Eliezer’s thinking at a specific time, we’ve sprinkled reminders throughout that this is from 2017.

A background note:

It’s often the case that people are slow to abandon obsolete playbooks in response to a novel challenge. And AGI is certainly a very novel challenge.

Italian general Luigi Cadorna offers a memorable historical example. In the Isonzo Offensive of World War I, Cadorna lost hundreds of thousands of men in futile frontal assaults against enemy trenches defended by barbed wire and machine guns.  As morale plummeted and desertions became epidemic, Cadorna began executing his own soldiers en masse, in an attempt to cure the rest of their “cowardice.” The offensive continued for 2.5 years.

Cadorna made many mistakes, but foremost among them was his refusal to recognize that this war was fundamentally unlike those that had come before.  Modern weaponry had forced a paradigm shift, and Cadorna’s instincts were not merely miscalibrated—they were systematically broken.  No number of small, incremental updates within his obsolete framework would be sufficient to meet the new challenge.

Other examples of this type of mistake include the initial response of the record industry to iTunes and streaming; or, more seriously, the response of most Western governments to COVID-19.

 

 

As usual, the real challenge of reference class forecasting is figuring out which reference class the thing you’re trying to model belongs to.

For most problems, rethinking your approach from the ground up is wasteful and unnecessary, because most problems have a similar causal structure to a large number of past cases. When the problem isn’t commensurate with existing strategies, as in the case of AGI, you need a new playbook.

 


 

I’ve sometimes been known to complain, or in a polite way scream in utter terror, that “there is no good guy group in AGI”, i.e., if a researcher on this Earth currently wishes to contribute to the common good, there are literally zero projects they can join and no project close to being joinable.  In its present version, this document is an informal response to an AI researcher who asked me to list out the qualities of such a “good project”.

In summary, a “good project” needs:

  • Trustworthy command:  A trustworthy chain of command with respect to both legal and pragmatic control of the intellectual property (IP) of such a project; a running AGI being included as “IP” in this sense.
  • Research closure:  The organizational ability to close and/or silo IP to within a trustworthy section and prevent its release by sheer default.
  • Strong opsec:  Operational security adequate to prevent the proliferation of code (or other information sufficient to recreate code within e.g. 1 year) due to e.g. Russian intelligence agencies grabbing the code.
  • Common good commitment:  The project’s command and its people must have a credible commitment to both short-term and long-term goodness.  Short-term goodness comprises the immediate welfare of present-day Earth; long-term goodness is the achievement of transhumanist astronomical goods.
  • Alignment mindset:  Somebody on the project needs deep enough security mindset plus understanding of AI cognition that they can originate new, deep measures to ensure AGI alignment; and they must be in a position of technical control or otherwise have effectively unlimited political capital.  Everybody on the project needs to understand and expect that aligning an AGI will be terrifically difficult and terribly dangerous.
  • Requisite resource levels:  The project must have adequate resources to compete at the frontier of AGI development, including whatever mix of computational resources, intellectual labor, and closed insights are required to produce a 1+ year lead over less cautious competing projects.

I was asked what would constitute “minimal, adequate, and good” performance on each of these dimensions.  I tend to divide things sharply into “not adequate” and “adequate” but will try to answer in the spirit of the question nonetheless.
Read more »

Shah and Yudkowsky on alignment failures

Analysis, Conversations

 

This is the final discussion log in the Late 2021 MIRI Conversations sequence, featuring Rohin Shah and Eliezer Yudkowsky, with additional comments from Rob Bensinger, Nate Soares, Richard Ngo, and Jaan Tallinn.

The discussion begins with summaries and comments on Richard and Eliezer’s debate. Rohin’s summary has since been revised and published in the Alignment Newsletter.

After this log, we’ll be concluding this sequence with an AMA, where we invite you to comment with questions about AI alignment, cognition, forecasting, etc. Eliezer, Richard, Paul Christiano, Nate, and Rohin will all be participating.

 

Color key:

 Chat by Rohin and Eliezer   Other chat   Emails   Follow-ups 

 

19. Follow-ups to the Ngo/Yudkowsky conversation

 

19.1. Quotes from the public discussion

 

[Bensinger][9:22]

Interesting extracts from the public discussion of Ngo and Yudkowsky on AI capability gains:

Eliezer:

I think some of your confusion may be that you’re putting “probability theory” and “Newtonian gravity” into the same bucket.  You’ve been raised to believe that powerful theories ought to meet certain standards, like successful bold advance experimental predictions, such as Newtonian gravity made about the existence of Neptune (quite a while after the theory was first put forth, though).  “Probability theory” also sounds like a powerful theory, and the people around you believe it, so you think you ought to be able to produce a powerful advance prediction it made; but it is for some reason hard to come up with an example like the discovery of Neptune, so you cast about a bit and think of the central limit theorem.  That theorem is widely used and praised, so it’s “powerful”, and it wasn’t invented before probability theory, so it’s “advance”, right?  So we can go on putting probability theory in the same bucket as Newtonian gravity?

They’re actually just very different kinds of ideas, ontologically speaking, and the standards to which we hold them are properly different ones.  It seems like the sort of thing that would take a subsequence I don’t have time to write, expanding beyond the underlying obvious ontological difference between validities and empirical-truths, to cover the way in which “How do we trust this, when” differs between “I have the following new empirical theory about the underlying model of gravity” and “I think that the logical notion of ‘arithmetic’ is a good tool to use to organize our current understanding of this little-observed phenomenon, and it appears within making the following empirical predictions…”  But at least step one could be saying, “Wait, do these two kinds of ideas actually go into the same bucket at all?”

In particular it seems to me that you want properly to be asking “How do we know this empirical thing ends up looking like it’s close to the abstraction?” and not “Can you show me that this abstraction is a very powerful one?”  Like, imagine that instead of asking Newton about planetary movements and how we know that the particular bits of calculus he used were empirically true about the planets in particular, you instead started asking Newton for proof that calculus is a very powerful piece of mathematics worthy to predict the planets themselves – but in a way where you wanted to see some highly valuable material object that calculus had produced, like earlier praiseworthy achievements in alchemy.  I think this would reflect confusion and a wrongly directed inquiry; you would have lost sight of the particular reasoning steps that made ontological sense, in the course of trying to figure out whether calculus was praiseworthy under the standards of praiseworthiness that you’d been previously raised to believe in as universal standards about all ideas.

Richard:

I agree that “powerful” is probably not the best term here, so I’ll stop using it going forward (note, though, that I didn’t use it in my previous comment, which I endorse more than my claims in the original debate).

But before I ask “How do we know this empirical thing ends up looking like it’s close to the abstraction?”, I need to ask “Does the abstraction even make sense?” Because you have the abstraction in your head, and I don’t, and so whenever you tell me that X is a (non-advance) prediction of your theory of consequentialism, I end up in a pretty similar epistemic state as if George Soros tells me that X is a prediction of the theory of reflexivity, or if a complexity theorist tells me that X is a prediction of the theory of self-organisation. The problem in those two cases is less that the abstraction is a bad fit for this specific domain, and more that the abstraction is not sufficiently well-defined (outside very special cases) to even be the type of thing that can robustly make predictions.

Perhaps another way of saying it is that they’re not crisp/robust/coherent concepts (although I’m open to other terms, I don’t think these ones are particularly good). And it would be useful for me to have evidence that the abstraction of consequentialism you’re using is a crisper concept than Soros’ theory of reflexivity or the theory of self-organisation. If you could explain the full abstraction to me, that’d be the most reliable way – but given the difficulties of doing so, my backup plan was to ask for impressive advance predictions, which are the type of evidence that I don’t think Soros could come up with.

I also think that, when you talk about me being raised to hold certain standards of praiseworthiness, you’re still ascribing too much modesty epistemology to me. I mainly care about novel predictions or applications insofar as they help me distinguish crisp abstractions from evocative metaphors. To me it’s the same type of rationality technique as asking people to make bets, to help distinguish post-hoc confabulations from actual predictions.

Of course there’s a social component to both, but that’s not what I’m primarily interested in. And of course there’s a strand of naive science-worship which thinks you have to follow the Rules in order to get anywhere, but I’d thank you to assume I’m at least making a more interesting error than that.

Lastly, on probability theory and Newtonian mechanics: I agree that you shouldn’t question how much sense it makes to use calculus in the way that you described, but that’s because the application of calculus to mechanics is so clearly-defined that it’d be very hard for the type of confusion I talked about above to sneak in. I’d put evolutionary theory halfway between them: it’s partly a novel abstraction, and partly a novel empirical truth. And in this case I do think you have to be very careful in applying the core abstraction of evolution to things like cultural evolution, because it’s easy to do so in a confused way.

 

19.2. Rohin Shah’s summary and thoughts

 

[Shah][7:06]  (Nov. 6 email)

Newsletter summaries attached, would appreciate it if Eliezer and Richard checked that I wasn’t misrepresenting them. (Conversation is a lot harder to accurately summarize than blog posts or papers.)

 

Best,

Rohin

 

Planned summary for the Alignment Newsletter:

 

Eliezer is known for being pessimistic about our chances of averting AI catastrophe. His main argument is roughly as follows:

Read more »

Ngo and Yudkowsky on scientific reasoning and pivotal acts

Analysis, Conversations

This is a transcript of a conversation between Richard Ngo and Eliezer Yudkowsky, facilitated by Nate Soares (and with some comments from Carl Shulman). This transcript continues the Late 2021 MIRI Conversations sequence, following Ngo’s view on alignment difficulty.

 

Color key:

 Chat by Richard and Eliezer   Other chat 

 

 

14. October 4 conversation

 

14.1. Predictable updates, threshold functions, and the human cognitive range

 

[Ngo][15:05]

Two questions which I’d like to ask Eliezer:

1. How strongly does he think that the “shallow pattern-memorisation” abilities of GPT-3 are evidence for Paul’s view over his view (if at all)

2. How does he suggest we proceed, given that he thinks directly explaining his model of the chimp-human difference would be the wrong move?

[Yudkowsky][15:07]

1 – I’d say that it’s some evidence for the Dario viewpoint which seems close to the Paul viewpoint.  I say it’s some evidence for the Dario viewpoint because Dario seems to be the person who made something like an advance prediction about it.  It’s not enough to make me believe that you can straightforwardly extend the GPT architecture to 3e14 parameters and train it on 1e13 samples and get human-equivalent performance.

[Ngo][15:09]

Did you make any advance predictions, around the 2008-2015 period, of what capabilities we’d have before AGI?

[Yudkowsky][15:10]

not especially that come to mind?  on my model of the future this is not particularly something I am supposed to know unless there is a rare flash of predictability.

Read more »

Christiano and Yudkowsky on AI predictions and human intelligence

Analysis, Conversations

 

This is a transcript of a conversation between Paul Christiano and Eliezer Yudkowsky, with comments by Rohin Shah, Beth Barnes, Richard Ngo, and Holden Karnofsky, continuing the Late 2021 MIRI Conversations.

Color key:

 Chat by Paul and Eliezer   Other chat 

 

15. October 19 comment

 

[Yudkowsky][11:01]

thing that struck me as an iota of evidence for Paul over Eliezer:

https://twitter.com/tamaybes/status/1450514423823560706?s=20 

Read more »

February 2022 Newsletter

Newsletters