The basic reasons I expect AGI ruin

 |   |  Analysis

I’ve been citing AGI Ruin: A List of Lethalities to explain why the situation with AI looks lethally dangerous to me. But that post is relatively long, and emphasizes specific open technical problems over “the basics”.

Here are 10 things I’d focus on if I were giving “the basics” on why I’m so worried:[1]

1. General intelligence is very powerful, and once we can build it at all, STEM-capable artificial general intelligence (AGI) is likely to vastly outperform human intelligence immediately (or very quickly).

When I say “general intelligence”, I’m usually thinking about “whatever it is that lets human brains do astrophysics, category theory, etc. even though our brains evolved under literally zero selection pressure to solve astrophysics or category theory problems”.

It’s possible that we should already be thinking of GPT-4 as “AGI” on some definitions, so to be clear about the threshold of generality I have in mind, I’ll specifically talk about “STEM-level AGI”, though I expect such systems to be good at non-STEM tasks too.

Human brains aren’t perfectly general, and not all narrow AI systems or animals are equally narrow. (E.g., AlphaZero is more general than AlphaGo.) But it sure is interesting that humans evolved cognitive abilities that unlock all of these sciences at once, with zero evolutionary fine-tuning of the brain aimed at equipping us for any of those sciences. Evolution just stumbled into a solution to other problems, that happened to generalize to millions of wildly novel tasks.

More concretely:

  • AlphaGo is a very impressive reasoner, but its hypothesis space is limited to sequences of Go board states rather than sequences of states of the physical universe. Efficiently reasoning about the physical universe requires solving at least some problems that are different in kind from what AlphaGo solves.
    • These problems might be solved by the STEM AGI’s programmer, and/or solved by the algorithm that finds the AGI in program-space; and some such problems may be solved by the AGI itself in the course of refining its thinking.[2]
  • Some examples of abilities I expect humans to only automate once we’ve built STEM-level AGI (if ever):
    • The ability to perform open-heart surgery with a high success rate, in a messy non-standardized ordinary surgical environment.
    • The ability to match smart human performance in a specific hard science field, across all the scientific work humans do in that field.
  • In principle, I suspect you could build a narrow system that is good at those tasks while lacking the basic mental machinery required to do par-human reasoning about all the hard sciences. In practice, I very strongly expect humans to find ways to build general reasoners to perform those tasks, before we figure out how to build narrow reasoners that can do them. (For the same basic reason evolution stumbled on general intelligence so early in the history of human tech development.)[3]

When I say “general intelligence is very powerful”, a lot of what I mean is that science is very powerful, and that having all of the sciences at once is a lot more powerful than the sum of each science’s impact.[4]

Another large piece of what I mean is that (STEM-level) general intelligence is a very high-impact sort of thing to automate because STEM-level AGI is likely to blow human intelligence out of the water immediately, or very soon after its invention.

Read more »

Misgeneralization as a misnomer

 |   |  Analysis

Here’s two different ways an AI can turn out unfriendly:

  1. You somehow build an AI that cares about “making people happy”. In training, it tells people jokes and buys people flowers and offers people an ear when they need one. In deployment (and once it’s more capable), it forcibly puts each human in a separate individual heavily-defended cell, and pumps them full of opiates.
  2. You build an AI that’s good at making people happy. In training, it tells people jokes and buys people flowers and offers people an ear when they need one. In deployment (and once it’s more capable), it turns out that whatever was causing that “happiness”-promoting behavior was a balance of a variety of other goals (such as basic desires for energy and memory), and it spends most of the universe on some combination of that other stuff that doesn’t involve much happiness.

(To state the obvious: please don’t try to get your AIs to pursue “happiness”; you want something more like CEV in the long run, and in the short run I strongly recommend aiming lower, at a pivotal act .)

In both cases, the AI behaves (during training) in a way that looks a lot like trying to make people happy. Then the AI described in (1) is unfriendly because it was optimizing the wrong concept of “happiness”, one that lined up with yours when the AI was weak, but that diverges in various edge-cases that matter when the AI is strong. By contrast, the AI described in (2) was never even really trying to pursue happiness; it had a mixture of goals that merely correlated with the training objective, and that balanced out right around where you wanted them to balance out in training, but deployment (and the corresponding capabilities-increases) threw the balance off.

Note that this list of “ways things can go wrong when the AI looked like it was optimizing happiness during training” is not exhaustive! (For instance, consider an AI that cares about something else entirely, and knows you’ll shut it down if it doesn’t look like it’s optimizing for happiness. Or an AI whose goals change heavily as it reflects and self-modifies.)

Read more »

Pausing AI Developments Isn’t Enough. We Need to Shut it All Down

 |   |  Analysis

(Published in TIME on March 29.)


An open letter published today calls for “all AI labs to immediately pause for at least 6 months the training of AI systems more powerful than GPT-4.”

This 6-month moratorium would be better than no moratorium. I have respect for everyone who stepped up and signed it. It’s an improvement on the margin.

I refrained from signing because I think the letter is understating the seriousness of the situation and asking for too little to solve it.

The key issue is not “human-competitive” intelligence (as the open letter puts it); it’s what happens after AI gets to smarter-than-human intelligence. Key thresholds there may not be obvious, we definitely can’t calculate in advance what happens when, and it currently seems imaginable that a research lab would cross critical lines without noticing.

Many researchers steeped in these issues, including myself, expect that the most likely result of building a superhumanly smart AI, under anything remotely like the current circumstances, is that literally everyone on Earth will die. Not as in “maybe possibly some remote chance,” but as in “that is the obvious thing that would happen.” It’s not that you can’t, in principle, survive creating something much smarter than you; it’s that it would require precision and preparation and new scientific insights, and probably not having AI systems composed of giant inscrutable arrays of fractional numbers.

Without that precision and preparation, the most likely outcome is AI that does not do what we want, and does not care for us nor for sentient life in general. That kind of caring is something that could in principle be imbued into an AI but we are not ready and do not currently know how.

Absent that caring, we get “the AI does not love you, nor does it hate you, and you are made of atoms it can use for something else.”

The likely result of humanity facing down an opposed superhuman intelligence is a total loss. Valid metaphors include “a 10-year-old trying to play chess against Stockfish 15”, “the 11th century trying to fight the 21st century,” and “Australopithecus trying to fight Homo sapiens“.

To visualize a hostile superhuman AI, don’t imagine a lifeless book-smart thinker dwelling inside the internet and sending ill-intentioned emails. Visualize an entire alien civilization, thinking at millions of times human speeds, initially confined to computers—in a world of creatures that are, from its perspective, very stupid and very slow. A sufficiently intelligent AI won’t stay confined to computers for long. In today’s world you can email DNA strings to laboratories that will produce proteins on demand, allowing an AI initially confined to the internet to build artificial life forms or bootstrap straight to postbiological molecular manufacturing.

If somebody builds a too-powerful AI, under present conditions, I expect that every single member of the human species and all biological life on Earth dies shortly thereafter.

Read more »

Truth and Advantage: Response to a draft of “AI safety seems hard to measure”

 |   |  Analysis

Status: This was a response to a draft of Holden’s cold take “AI safety seems hard to measure”. It sparked a further discussion, that Holden recently posted a summary of.

The follow-up discussion ended up focusing on some issues in AI alignment that I think are underserved, which Holden said were kinda orthogonal to the point he was trying to make, and which didn’t show up much in the final draft. I nevertheless think my notes were a fine attempt at articulating some open problems I see, from a different angle than usual. (Though it does have some overlap with the points made in Deep Deceptiveness, which I was also drafting at the time.)

I’m posting the document I wrote to Holden with only minimal editing, because it’s been a few months and I apparently won’t produce anything better. (I acknowledge that it’s annoying to post a response to an old draft of a thing when nobody can see the old draft, sorry.)

Quick take: (1) it’s a write-up of a handful of difficulties that I think are real, in a way that I expect to be palatable to a relevant different audience than the one I appeal to; huzzah for that. (2) It’s missing some stuff that I think is pretty important.

Read more »

Deep Deceptiveness

 |   |  Analysis


This post is an attempt to gesture at a class of AI notkilleveryoneism (alignment) problem that seems to me to go largely unrecognized. E.g., it isn’t discussed (or at least I don’t recognize it) in the recent plans written up by OpenAI (1,2), by DeepMind’s alignment team, or by Anthropic, and I know of no other acknowledgment of this issue by major labs.

You could think of this as a fragment of my answer to “Where do plans like OpenAI’s ‘Our Approach to Alignment Research’ fail?”, as discussed in Rob and Eliezer’s challenge for AGI organizations and readers. Note that it would only be a fragment of the reply; there’s a lot more to say about why AI alignment is a particularly tricky task to task an AI with. (Some of which Eliezer gestures at in a follow-up to his interview on Bankless.)

Caveat: I’ll be talking a bunch about “deception” in this post because this post was generated as a result of conversations I had with alignment researchers at big labs who seemed to me to be suggesting “just train AI to not be deceptive; there’s a decent chance that works”.[1]

I have a vague impression that others in the community think that deception in particular is much more central than I think it is, so I want to warn against that interpretation here: I think deception is an important problem, but its main importance is as an example of some broader issues in alignment.[2]

Caveat: I haven’t checked the relationship between my use of the word ‘deception’ here, and the use of the word ‘deceptive’ in discussions of “deceptive alignment“. Please don’t assume that the two words mean the same thing.

Investigating a made-up but moderately concrete story

Suppose you have a nascent AGI, and you’ve been training against all hints of deceptiveness. What goes wrong?

When I ask this question of people who are optimistic that we can just “train AIs not to be deceptive”, there are a few answers that seem well-known. Perhaps you lack the interpretability tools to correctly identify the precursors of ‘deception’, so that you can only train against visibly deceptive AI outputs instead of AI thoughts about how to plan deceptions. Or perhaps training against interpreted deceptive thoughts also trains against your interpretability tools, and your AI becomes illegibly deceptive rather than non-deceptive.

And these are both real obstacles. But there are deeper obstacles, that seem to me more central, and that I haven’t observed others to notice on their own.

That’s a challenge, and while you (hopefully) chew on it, I’ll tell an implausibly-detailed story to exemplify a deeper obstacle.

Read more »

Yudkowsky on AGI risk on the Bankless podcast

 |   |  Analysis, Conversations

Eliezer gave a very frank overview of his take on AI two weeks ago on the cryptocurrency show Bankless: 

I’ve posted a transcript of the show and a follow-up Q&A below.

Thanks to Andrea_Miotti, remember, and vonk for help posting transcripts.


Eliezer Yudkowsky: [clip] I think that we are hearing the last winds start to blow, the fabric of reality start to fray. This thing alone cannot end the world, but I think that probably some of the vast quantities of money being blindly and helplessly piled into here are going to end up actually accomplishing something.

Read more »

Comments on OpenAI’s "Planning for AGI and beyond"

 |   |  Analysis, Conversations

Sam Altman shared me on a draft of his OpenAI blog post Planning for AGI and beyond, and I left some comments, reproduced below without typos and with some added hyperlinks. Where the final version of the OpenAI post differs from the draft, I’ve noted that as well, making text Sam later cut red and text he added blue.

My overall sense is that Sam deleted text and occasionally rephrased sentences so as to admit more models (sometimes including mine), but didn’t engage with the arguments enough to shift his own probability mass around on the important disagreements.

Our disagreements are pretty major, as far as I can tell. With my comments, I was hoping to spark more of a back-and-forth. Having failed at that, I’m guessing part of the problem is that I didn’t phrase my disagreements bluntly or strongly enough, while also noting various points of agreement, which might have overall made it sound like I had only minor disagreements.

Read more »

Focus on the places where you feel shocked everyone’s dropping the ball

 |   |  Analysis

Writing down something I’ve found myself repeating in different conversations:

If you’re looking for ways to help with the whole “the world looks pretty doomed” business, here’s my advice: look around for places where we’re all being total idiots.

Look for places where everyone’s fretting about a problem that some part of you thinks it could obviously just solve.

Look around for places where something seems incompetently run, or hopelessly inept, and where some part of you thinks you can do better.

Then do it better.

For a concrete example, consider Devansh. Devansh came to me last year and said something to the effect of, “Hey, wait, it sounds like you think Eliezer does a sort of alignment-idea-generation that nobody else does, and he’s limited here by his unusually low stamina, but I can think of a bunch of medical tests that you haven’t run, are you an idiot or something?” And I was like, “Yes, definitely, please run them, do you need money”.

I’m not particularly hopeful there, but hell, it’s worth a shot! And, importantly, this is the sort of attitude that can lead people to actually trying things at all, rather than assuming that we live in a more adequate world where all the (seemingly) dumb obvious ideas have already been tried.

Read more »