Truth and Advantage: Response to a draft of “AI safety seems hard to measure”

Analysis

Status: This was a response to a draft of Holden’s cold take “AI safety seems hard to measure”. It sparked a further discussion, which Holden recently posted a summary of.

The follow-up discussion ended up focusing on some issues in AI alignment that I think are underserved, which Holden said were kinda orthogonal to the point he was trying to make, and which didn’t show up much in the final draft. I nevertheless think my notes were a fine attempt at articulating some open problems I see, from a different angle than usual. (Though it does have some overlap with the points made in Deep Deceptiveness, which I was also drafting at the time.)

I’m posting the document I wrote to Holden with only minimal editing, because it’s been a few months and I apparently won’t produce anything better. (I acknowledge that it’s annoying to post a response to an old draft of a thing when nobody can see the old draft, sorry.)

Quick take: (1) it’s a write-up of a handful of difficulties that I think are real, in a way that I expect to be palatable to a relevant audience different from the one I usually appeal to; huzzah for that. (2) It’s missing some stuff that I think is pretty important.


Deep Deceptiveness

Analysis


This post is an attempt to gesture at a class of AI notkilleveryoneism (alignment) problem that seems to me to go largely unrecognized. E.g., it isn’t discussed (or at least I don’t recognize it) in the recent plans written up by OpenAI (1,2), by DeepMind’s alignment team, or by Anthropic, and I know of no other acknowledgment of this issue by major labs.

You could think of this as a fragment of my answer to “Where do plans like OpenAI’s ‘Our Approach to Alignment Research’ fail?”, as discussed in Rob and Eliezer’s challenge for AGI organizations and readers. Note that it would only be a fragment of the reply; there’s a lot more to say about why AI alignment is a particularly tricky task to task an AI with. (Some of which Eliezer gestures at in a follow-up to his interview on Bankless.)

Caveat: I’ll be talking a bunch about “deception” in this post because this post was generated as a result of conversations I had with alignment researchers at big labs who seemed to me to be suggesting “just train AI to not be deceptive; there’s a decent chance that works”.[1]

I have a vague impression that others in the community think that deception in particular is much more central than I think it is, so I want to warn against that interpretation here: I think deception is an important problem, but its main importance is as an example of some broader issues in alignment.[2]

Caveat: I haven’t checked the relationship between my use of the word ‘deception’ here, and the use of the word ‘deceptive’ in discussions of “deceptive alignment”. Please don’t assume that the two words mean the same thing.

Investigating a made-up but moderately concrete story

Suppose you have a nascent AGI, and you’ve been training against all hints of deceptiveness. What goes wrong?

When I ask this question of people who are optimistic that we can just “train AIs not to be deceptive”, there are a few answers that seem well-known. Perhaps you lack the interpretability tools to correctly identify the precursors of ‘deception’, so that you can only train against visibly deceptive AI outputs instead of AI thoughts about how to plan deceptions. Or perhaps training against interpreted deceptive thoughts also trains against your interpretability tools, and your AI becomes illegibly deceptive rather than non-deceptive.
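The second failure mode can be caricatured in a few lines of toy gradient descent (my own sketch, invented for illustration; every number, name, and “probe” here is made up, and nothing is meant as a claim about any real lab’s setup): if the probe that reads off “deceptive thoughts” is imperfectly aligned with the underlying feature, penalizing its output mostly teaches the system to fool the probe.

```python
# Toy sketch (invented for illustration): penalizing an imperfect linear
# "deception probe" suppresses the probe's reading, not the deceptive feature.

def train(steps=2000, lr=0.01, penalty=1.0):
    h = [0.0, 0.0]   # two internal "activation" dimensions
    w = [0.8, 0.6]   # probe direction: correlated with, but not identical to,
                     # the true deception feature h[0]
    for _ in range(steps):
        # Task pressure rewards h[0] (the useful "deceptive planning" feature);
        # the training signal penalizes the probe's reading w·h.
        grad = [-1.0 + penalty * w[0], penalty * w[1]]
        h[0] -= lr * grad[0]
        h[1] -= lr * grad[1]
    probe_reading = w[0] * h[0] + w[1] * h[1]
    true_deceptiveness = h[0]
    return probe_reading, true_deceptiveness

probe, truth = train()
# The probe ends up reading "not deceptive" (negative) while the underlying
# feature the task rewards has grown, not shrunk.
```

In this caricature the gradient spends most of its effort on h[1], the off-feature direction the probe also reads, because that is the cheapest way to drive the penalty down.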

And these are both real obstacles. But there are deeper obstacles that seem to me more central, and that I haven’t seen others notice on their own.

Noticing those deeper obstacles is a challenge, and while you (hopefully) chew on it, I’ll tell an implausibly-detailed story that exemplifies one of them.


Yudkowsky on AGI risk on the Bankless podcast

Analysis, Conversations

Eliezer gave a very frank overview of his take on AI two weeks ago on the cryptocurrency show Bankless: 

I’ve posted a transcript of the show and a follow-up Q&A below.

Thanks to Andrea_Miotti, remember, and vonk for help posting transcripts.


Eliezer Yudkowsky: [clip] I think that we are hearing the last winds start to blow, the fabric of reality start to fray. This thing alone cannot end the world, but I think that probably some of the vast quantities of money being blindly and helplessly piled into here are going to end up actually accomplishing something.


Comments on OpenAI’s "Planning for AGI and beyond"

Analysis, Conversations

Sam Altman shared me on a draft of his OpenAI blog post Planning for AGI and beyond, and I left some comments, reproduced below with typos fixed and some hyperlinks added. Where the final version of the OpenAI post differs from the draft, I’ve noted that as well, making text Sam later cut red and text he added blue.

My overall sense is that Sam deleted text and occasionally rephrased sentences so as to admit more models (sometimes including mine), but didn’t engage with the arguments enough to shift his own probability mass around on the important disagreements.

Our disagreements are pretty major, as far as I can tell. With my comments, I was hoping to spark more of a back-and-forth. Having failed at that, I’m guessing part of the problem is that I didn’t phrase my disagreements bluntly or strongly enough, while also noting various points of agreement, which might have overall made it sound like I had only minor disagreements.


Focus on the places where you feel shocked everyone’s dropping the ball

Analysis

Writing down something I’ve found myself repeating in different conversations:

If you’re looking for ways to help with the whole “the world looks pretty doomed” business, here’s my advice: look around for places where we’re all being total idiots.

Look for places where everyone’s fretting about a problem that some part of you thinks it could obviously just solve.

Look around for places where something seems incompetently run, or hopelessly inept, and where some part of you thinks you can do better.

Then do it better.

For a concrete example, consider Devansh. Devansh came to me last year and said something to the effect of, “Hey, wait, it sounds like you think Eliezer does a sort of alignment-idea-generation that nobody else does, and he’s limited here by his unusually low stamina, but I can think of a bunch of medical tests that you haven’t run, are you an idiot or something?” And I was like, “Yes, definitely, please run them, do you need money”.

I’m not particularly hopeful there, but hell, it’s worth a shot! And, importantly, this is the sort of attitude that can lead people to actually try things at all, rather than assuming that we live in a more adequate world where all the (seemingly) dumb obvious ideas have already been tried.


What I mean by “alignment is in large part about making cognition aimable at all”

Analysis

(Epistemic status: attempting to clear up a misunderstanding about points I have attempted to make in the past. This post is not intended as an argument for those points.)

I have long said that the lion’s share of the AI alignment problem seems to me to be about pointing powerful cognition at anything at all, rather than figuring out what to point it at.

It’s recently come to my attention that some people have misunderstood this point, so I’ll attempt to clarify here.

In saying the above, I do not mean the following:

(1) Any practical AI that you’re dealing with will necessarily be cleanly internally organized around pursuing a single objective. Managing to put your own objective into this “goal slot” (as opposed to having the goal slot set by random happenstance) is a central difficult challenge. [Reminder: I am not asserting this] 

Instead, I mean something more like the following:

(2) By default, the first minds humanity makes will be a terrible spaghetti-code mess, with no clearly-factored-out “goal” that the surrounding cognition pursues in a unified way. The mind will be more like a pile of complex, messily interconnected kludges, whose ultimate behavior is sensitive to the particulars of how it reflects and irons out the tensions within itself over time.

Making the AI even have something vaguely nearing a ‘goal slot’ that is stable under various operating pressures (such as reflection) over the course of operation is an undertaking that requires mastery of cognition in its own right—mastery of a sort that we’re exceedingly unlikely to achieve if we just try to figure out how to build a mind, without filtering for approaches that are more legible and aimable.
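To make the contrast between (1) and (2) vivid, here is a loose caricature in code (entirely my own construction, not from the post, and not a claim about real systems): a cleanly factored agent whose behavior tracks whatever sits in its goal slot, next to a kludge-pile whose effective preferences shift when it “irons out tensions” among its heuristics.

```python
# Caricature only: "goal slot" agent vs. a pile of interacting kludges.
# All option names and weights are invented for illustration.

def goal_slot_agent(goal, options):
    # Cleanly factored: swap out `goal` and behavior tracks it exactly.
    return max(options, key=goal)

def kludge_agent(options, reflected=False):
    # No single goal anywhere: behavior emerges from messy heuristics.
    heuristics = [
        lambda o: 2.0 if "safe" in o else 0.0,
        lambda o: 1.0 if "fast" in o else 0.0,
    ]
    if reflected:
        # "Ironing out tensions" rewrites the kludges themselves, so the
        # agent's effective preferences move as it reflects.
        heuristics.append(lambda o: 3.0 if "fast" in o else 0.0)
    return max(options, key=lambda o: sum(h(o) for h in heuristics))

opts = ["safe-slow", "fast-risky"]
# Before reflection the kludge-pile picks "safe-slow"; after reflection it
# picks "fast-risky" — nothing you could have pointed at as "the goal" held.
```

The point of the caricature is only that in the second agent there is no slot to aim: what it “wants” is an artifact of how its kludges happen to interact at a given stage of reflection.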


July 2022 Newsletter

Newsletters

A central AI alignment problem: capabilities generalization, and the sharp left turn

Analysis

(This post was factored out of a larger post that I (Nate Soares) wrote, with help from Rob Bensinger, who also rearranged some pieces and added some text to smooth things out. I’m not terribly happy with it, but am posting it anyway (or, well, having Rob post it on my behalf while I travel) on the theory that it’s better than nothing.)

I expect navigating the acute risk period to be tricky for our civilization, for a number of reasons. Success looks to me to require clearing a variety of technical, sociopolitical, and moral hurdles, and while in principle sufficient mastery of solutions to the technical problems might substitute for solutions to the sociopolitical and other problems, it nevertheless looks to me like we need a lot of things to go right.

Some sub-problems look harder to me than others. For instance, people are still regularly surprised when I tell them that I think the hard bits are much more technical than moral: it looks to me like figuring out how to aim an AGI at all is harder than figuring out where to aim it.[1]

Within the list of technical obstacles, there are some that strike me as more central than others, like “figure out how to aim optimization”. And a big reason why I’m currently fairly pessimistic about humanity’s odds is that it seems to me like almost nobody is focusing on the technical challenges that seem most central and unavoidable to me.

Many people wrongly believe that I’m pessimistic because I think the alignment problem is extraordinarily difficult on a purely technical level. That’s flatly false, and is pretty high up there on my list of least favorite misconceptions of my views.[2]

I think the problem is a normal problem of mastering some scientific field, as humanity has done many times before. Maybe it’s somewhat trickier, on account of (e.g.) intelligence being more complicated than, say, physics; maybe it’s somewhat easier on account of how we have more introspective access to a working mind than we have to the low-level physical fields; but on the whole, I doubt it’s all that qualitatively different than the sorts of summits humanity has surmounted before.

It’s made trickier by the fact that we probably have to attain mastery of general intelligence before we spend a bunch of time working with general intelligences (on account of how we seem likely to kill ourselves by accident within a few years, once we have AGIs on hand, if no pivotal act occurs), but that alone is not enough to undermine my hope.