AI as a science, and three obstacles to alignment strategies

 |   |  Analysis

AI used to be a science. In the old days (back when AI didn’t work very well), people were attempting to develop a working theory of cognition.

Those scientists didn’t succeed, and those days are behind us. For most people working in AI today and dividing up their work hours between tasks, gone is the ambition to understand minds. People working on mechanistic interpretability (and others attempting to build an empirical understanding of modern AIs) are laying an important foundation stone that could play a role in a future science of artificial minds, but on the whole, modern AI engineering is simply about constructing enormous networks of neurons and training them on enormous amounts of data, not about comprehending minds.

The bitter lesson has been taken to heart, by those at the forefront of the field; and although this lesson doesn’t teach us that there’s nothing to learn about how AI minds solve problems internally, it suggests that the fastest path to producing more powerful systems is likely to continue to be one that doesn’t shed much light on how those systems work.

Absent some sort of “science of artificial minds”, however, humanity’s prospects for aligning smarter-than-human AI seem to me to be quite dim.

Viewing Earth’s current situation through that lens, I see three major hurdles:

  1. Most research that helps one point AIs, probably also helps one make more capable AIs. A “science of AI” would probably increase the power of AI far sooner than it allows us to solve alignment.
  2. In a world without a mature science of AI, building a bureaucracy that reliably distinguishes real solutions from fake ones is prohibitively difficult.
  3. Fundamentally, for at least some aspects of system design, we’ll need to rely on a theory of cognition working on the first high-stakes real-world attempt.

I’ll go into more detail on these three points below. First, though, some background:



By the time AIs are powerful enough to endanger the world at large, I expect AIs to do something akin to “caring about outcomes”, at least from a behaviorist perspective (making no claim about whether it internally implements that behavior in a humanly recognizable manner).

Roughly, this is because people are trying to make AIs that can steer the future into narrow bands (like “there’s a cancer cure printed on this piece of paper”) over long time-horizons, and caring about outcomes (in the behaviorist sense) is the flip side of the same coin as steering the future into narrow bands, at least when the world is sufficiently large and full of curveballs.

I expect the outcomes that the AI “cares about” to, by default, not include anything good (like fun, love, art, beauty, or the light of consciousness) — nothing good by present-day human standards, and nothing good by broad cosmopolitan standards either. Roughly speaking, this is because when you grow minds, they don’t care about what you ask them to care about and they don’t care about what you train them to care about; instead, I expect them to care about a bunch of correlates of the training signal in weird and specific ways.

(Similar to how the human genome was naturally selected for inclusive genetic fitness, but the resultant humans didn’t end up with a preference for “whatever food they model as useful for inclusive genetic fitness”. Instead, humans wound up internalizing a huge and complex set of preferences for “tasty” foods, laden with complications like “ice cream is good when it’s frozen but not when it’s melted”.)

Separately, I think that most complicated processes work for reasons that are fascinating, complex, and kinda horrifying when you look at them closely.

It’s easy to think that a bureaucratic process is competent until you look at the gears and see the specific ongoing office dramas and politicking between all the vice-presidents or whatever. It’s easy to think that a codebase is running smoothly until you read the code and start to understand all the decades-old hacks and coincidences that make it run. It’s easy to think that biology is a beautiful feat of engineering until you look closely and find that the eyeballs are installed backwards or whatever.

And there’s an art to noticing that you would probably be astounded and horrified by the details of a complicated system if you knew them, and then being astounded and horrified already in advance before seeing those details.[1]


1. Alignment and capabilities are likely intertwined

I expect that if we knew in detail how LLMs are calculating their outputs, we’d be horrified (and fascinated, etc.).

I expect that we’d see all sorts of coincidences and hacks that make the thing run, and we’d be able to see in much more detail how, when we ask the system to achieve some target, it’s not doing anything close to “caring about that target” in a manner that would work out well for us, if we could scale up the system’s optimization power to the point where it could achieve great technological or scientific feats (like designing Drexlerian nanofactories or what-have-you).

Gaining this sort of visibility into how the AIs work is, I think, one of the main goals of interpretability research.

And understanding how these AIs work and how they don’t — understanding, for example, when and why they shouldn’t yet be scaled or otherwise pushed to superintelligence — is an important step on the road to figuring out how to make other AIs that could be scaled or otherwise pushed to superintelligence without thereby causing a bleak and desolate future.

But that same understanding is — I predict — going to reveal an incredible mess. And the same sort of reasoning that goes into untangling that mess into an AI that we can aim, also serves to untangle that mess to make the AI more capable. A tangled mess will presumably be inefficient and error-prone and occasionally self-defeating; once it’s disentangled, it won’t just be tidier, but will also come to accurate conclusions and notice opportunities faster and more reliably.[2]

Indeed, my guess is that it’s even easier to see all sorts of things that the AI is doing that are dumb, all sorts of ways that the architecture is tripping itself up, and so on.

Which is to say: the same route that gives you a chance of aligning this AI (properly, not the “it no longer says bad words” superficial-property that labs are trying to pass off as “alignment” these days) also likely gives you lots more AI capabilities.

(Indeed, my guess is that the first big capabilities gains come sooner than the first big alignment gains.)

I think this is true of most potentially-useful alignment research: to figure out how to aim the AI, you need to understand it better; in the process of understanding it better you see how to make it more capable.

If true, this suggests that alignment will always be in catch-up mode: whenever people try to figure out how to align their AI better, someone nearby will be able to run off with a few new capability insights, until the AI is pushed over the brink.

So a first key challenge for AI alignment is a challenge of ordering: how do we as a civilization figure out how to aim AI before we’ve generated unaimed superintelligences plowing off in random directions? I no longer think “just sort out the alignment work before the capabilities lands” is a feasible option (unless, by some feat of brilliance, this civilization pulls off some uncharacteristically impressive theoretical triumphs).

Interpretability? Will likely reveal ways your architecture is bad before it reveals ways your AI is misdirected.

Recruiting your AIs to help with alignment research? They’ll be able to help with capabilities long before that (to say nothing of whether they would help you with alignment by the time they could, any more than humans would willingly engage in eugenics for the purpose of redirecting humanity away from Fun and exclusively towards inclusive genetic fitness).

And so on.

This is (in a sense) a weakened form of my answer to those who say, “AI alignment will be much easier to solve once we have a bona fide AGI on our hands.” It sure will! But it will also be much, much easier to destroy the world, when we have a bona fide AGI on our hands. To survive, we’re going to need to either sidestep this whole alignment problem entirely (and take other routes to a wonderful future instead, as I may discuss more later), or we’re going to need some way to do a bunch of alignment research even as that research makes it radically easier and radically cheaper to destroy everything of value.

Except even that is harder than many seem to realize, for the following reason.


2. Distinguishing real solutions from fake ones is hard

Already, labs are diluting the word “alignment” by using the word for superficial results like “the AI doesn’t say bad words”. Even people who apparently understand many of the core arguments have apparently gotten the impression that GPT-4’s ability to answer moral quandaries is somehow especially relevant to the alignment problem, and an important positive sign.

(The ability to answer moral questions convincingly mostly demonstrates that the AI can predict how humans would answer or what humans want to hear, without revealing much about what the AI actually pursues, or would pursue upon reflection, etc.)

Meanwhile, we have little idea of what passes for “motivations” inside of an LLM, or what effect pretraining on next-token prediction and fine-tuning with RLHF really has on the internals. This sort of precise scientific understanding of the internals — the sort that lets one predict weird cognitive bugs in advance — is currently mostly absent in the field. (Though not entirely absent, thanks to the hard work of many researchers.)

Now imagine that Earth wakes up to the fact that the labs aren’t going to all decide to stop and take things slowly and cautiously at the appropriate time.[3] And imagine that Earth uses some great feat of civilizational coordination to halt the world’s capabilities progress, or to otherwise handle the issue that we somehow need room to figure out how these things work well enough to align them. And imagine we achieve this coordination feat without using that same alignment knowledge to end the world (as we could). There’s then the question of who gets to proceed, under what circumstances.

Suppose further that everyone agreed that the task at hand was to fully and deeply understand the AI systems we’ve managed to develop so far, and understand how they work, to the point where people could reverse out the pertinent algorithms and data-structures and what-not. As demonstrated by great feats like building, by-hand, small programs that do parts of what AI can do with training (and that nobody previously knew how to code by-hand), or by identifying weird exploits and edge-cases in advance rather than via empirical trial-and-error. Until multiple different teams, each with those demonstrated abilities, had competing models of how AIs’ minds were going to work when scaled further.

In such a world, it would be a difficult but plausibly-solvable problem, for bureaucrats to listen to the consensus of the scientists, and figure out which theories were most promising, and figure out who needs to be allotted what license to increase capabilities (on the basis of this or that theory that predicts this would be non-catastrophic), so as to put their theory to the test and develop it further.

I’m not thrilled about the idea of trusting an Earthly bureaucratic process with distinguishing between partially-developed scientific theories in that way, but it’s the sort of thing that a civilization can perhaps survive.

But that doesn’t look to me like how things are poised to go down.

It looks to me like we’re on track for some people to be saying “look how rarely my AI says bad words”, while someone else is saying “our evals are saying that it can’t deceive humans yet”, while someone else is saying “our AI is acting very submissive, and there’s no reason to expect AIs to become non-submissive, that’s just anthropomorphizing”, and someone else is saying “we’ll just direct a bunch of our AIs to help us solve alignment, while arranging them in a big bureaucracy”, and someone else is saying “we’ve set up the game-theoretic incentives such that if any AI starts betraying us, some other AI will alert us first”, and this is a different sort of situation.

And not one that looks particularly survivable, to me.

And if you ask bureaucrats to distinguish which teams should be allowed to move forward (and how far) in that kind of circus, full of claims, promises, and hunches and poor in theory, then I expect that they basically just can’t.

In part because the survivable answers (such as “we have no idea what’s going on in there, and will need way more of an idea what’s going on in there, and that understanding needs to somehow develop in a context where we can do the job right rather than simply unlocking the door to destruction”) aren’t really in the pool. And in part because all the people who really want to be racing ahead have money and power and status. And in part because it’s socially hard to believe, as a regulator, that you should keep telling everyone “no”, or that almost everything on offer is radically insufficient, when you yourself don’t concretely know what insights and theoretical understanding we’re missing.

Maybe if we can make AI a science again, then we’ll start to get into the regime where, if humanity can regulate capabilities advancements in time, then all the regulators and researchers understand that you shall only ask for a license to increase the capabilities of your system when you have a full detailed understanding of the system and a solid justification for why you need the capabilities advance and why it’s not going to be catastrophic. At which point maybe a scientific field can start coming to some sort of consensus about those theories, and regulators can start being sensitive to that consensus.

But unless you can get over that grand hump, it looks to me like one of the key bottlenecks here is bureaucratic legibility of plausible solutions. Where my basic guess is that regulators won’t be able to distinguish real solutions from false ones, in anything resembling the current environment.

Together with the above point (“alignment and capabilities are likely intertwined”), I think this means that our rallying cry should be less “pause to give us more time on alignment research” and more “stop entirely, and find some way to circumvent these woods entirely; we’re not equipped to navigate them”.

(With a backup rallying cry of “make AI a science again”, though again, that only works if you have some way of preventing the science-of-mind from leading to catastrophe before we figure out how to build AIs that care about good stuff rather than bleak and desolate stuff.)


3. Most theories don’t work on the first real try

It seems worth noting that even if you manage to surmount the above two problems, you have a third problem on your hands, which is that when it finally comes time, not to increase your system’s capabilities a bit more (and test your theory of cognition further), but rather to actually do something of note with your AI, then you must, in some sense, be trusting a novel and untested scientific theory (and your engineering efforts grounded in that theory) to work perfectly on the first critical try.

In particular, once your AI is capable of autonomous scientific/technological development, and operating in a domain where it could use that ability to gain a decisive strategic advantage over the rest of the planet, it’s operating in a fundamentally different cognitive regime than it ever was in training.

By analogy, if you’re considering making Bob the dictator of your country, you might first make him mock-dictator of your town, and watch to make sure he doesn’t abuse his power. But, try as you might, that’s still not fundamentally all that reliable a test of whether he’ll actually abuse power once he really has it. Once he really can give orders to the army, once he really can take advantage of the country’s populace, will he do so? The option of “abuse my mock power while being observed by my townspeople (who can still beat me up or deny me the job)” is actually very cognitively different than the option “order the army to intimidate the parliament and ‘oversee’ the next election”.

Now, given a sufficiently well-developed theory of cognition, you can try to read the AI’s mind and predict what cognitive state it would enter if it did really think it had those options. And you can set up simulations (and try to spoof its internal sensations and so on) in a way that your theory of cognition predicts is very similar to the cognitive state it would enter once it really had the option to betray you.

But the link between these states that you induce and observe in the lab, and the actual state where the AI actually has the option to betray you, depends fundamentally on your fresh new theory of cognition.

Actually running the AI until it really has the opportunity to betray you is an empirical test of those theories in an environment that differs fundamentally from the lab setting.

And many a scientist (and programmer) knows that their theories of how a complicated system is going to work in a fundamentally new operating environment often don’t go super well on the first try.

As a concrete analogy to potentially drive this point home: Newtonian mechanics made all sorts of shockingly-good empirical predictions. It was a simple concise mathematical theory with huge explanatory power that blew every previous theory out of the water. And if you were using it to send payloads to very distant planets at relativistic speeds, you’d still be screwed, because Newtonian mechanics does not account for relativistic effects.

(And the only warnings you’d get would be little hints about light seeming to move at the same speed in all directions at all times of year, and light bending around the sun during eclipses, and the perihelion of Mercury being a little off from what Newtonian mechanics predicted. Small anomalies, weighed against an enormous body of predictive success in a thousand empirical domains; and yet Nature doesn’t care, and the theory still falls apart when we move to energies and scales far outside what we’d previously been able to observe.)

Getting scientific theories to work on the first critical try is hard. (Which is one reason to aim for minimal pivotal tasks — getting a satellite into orbit should work fine on Newtonian mechanics, even if sending payloads long distances at relativistic speeds does not.)

Worrying about this issue is something of a luxury, at this point, because it’s not like we’re anywhere close to scientific theories of cognition that accurately predict all the lab data. But it’s the next hurdle on the queue, if we somehow manage to coordinate to try to build up those scientific theories, in a way where success is plausibly bureaucratically-legible.

Maybe later I’ll write more about what I think the strategy implications of these points are. In short, I basically recommend that Earth pursue other routes to the glorious transhumanist future, such as uploading. (Which is also fraught with peril, but I expect that those perils are more surmountable; I hope to write more about this later.)


  1. Albeit slightly less, since there’s nonzero prior probability on this unknown system turning out to be simple, elegant, and well-designed.
  2. An exception to this guess happens if the AI is at the point where it’s correcting its own flaws and improving its own architecture, in which case, in principle, you might not see much room for capabilities improvements if you took a snapshot and comprehended its inner workings, despite still being able to see that the ends it pursues are not the ones you wanted. But in that scenario, you’re already about to die to the self-improving AI, or so I predict.
  3. Not least because there are no sufficiently clear signs that it’s time to stop — we blew right past “an AI claims it is sentient”, for example. And I’m not saying that it was a mistake to doubt AI systems’ first claims to be sentient — I doubt that Bing had the kind of personhood that’s morally important (though I am by no means confident!). I’m saying that the thresholds that are clear in science fiction stories turn out to be messy in practice and so everyone just keeps plowing on ahead.