This is part of the MIRI Single Author Series. Pieces in this series represent the beliefs and opinions of their named authors, and do not claim to speak for all of MIRI.
0.
In the broadest possible terms, the threat posed by superintelligence (advanced AI that substantially outstrips humans in all intellectual capacities) looks something like:
- Systems continue to become more powerful, through a combination of humans improving them and (eventually) those systems improving themselves.
- At some point, those systems develop robust goals that are not compatible with human survival.
- At some point, those systems acquire sufficient operational capacity to carry out plans to achieve those goals, either unnoticed by humanity or with enough skill and power to overcome human resistance.
At this level of detail, people reliably raise objections to this picture that look like:
- Okay, but we can just stop making the systems more powerful, before they become too powerful to control.
- Okay, but we can prevent the systems from developing human-incompatible goals, or figure out how to detect when they do (before the systems become competent enough to cause real problems).
- Okay, but we can track the systems’ behavior and notice any concerning patterns/catch them in the act.
To be clear: these objections are not unreasonable, and the people offering them up are not silly or confused. They are hand-wavey, but so too is the sketch of the problem; that’s simply what the conversation looks like, when it takes place on such a cursory level.
(If you present the case for concern in somewhat longer form, such as this ten-item list, you will tend to elicit a different and somewhat finer-grained set of objections, as thinking about the problem for ten hours rather than thirty seconds naturally results in a more nuanced take.)
However, there is a mistake that many people on the optimistic/skeptical side of the argument make in those first thirty seconds, and continue to make as they continue to think, such that ten or a hundred or a thousand hours of attention often just means traveling further and further down the wrong path. Individuals familiar with “security mindset” will likely recognize the error: the optimist has mistaken something that could happen for what they should expect to happen, by default.
It is indeed possible, in the strictly physical sense, that AI developers could e.g. “just stop” before the systems they are designing become too powerful to control. (Convincing humanity to coordinate on precisely that action is MIRI’s current main focus.)
But the fact that this is a physical possibility does not imply the absence of cause for concern. It is also possible that hackers might simply choose not to attack a large financial institution, but few security consultants would recommend that a large financial institution rely on that as their primary defensive strategy!
This sort of fabrication of options—
in which the optimistic/skeptical party glosses over the inconvenient parts of reality, and recommends nice-sounding but non-viable strategies over tougher and more costly ones which are less pleasant to contemplate
—is unfortunately rampant throughout the space of conversations around the potential threat of superintelligence. The question one should ask oneself is not “can I construct a plausible-sounding story in which things go well?” (which is almost always possible) but rather “if I look out at the world in its present state and simply ‘play the video forward,’ injecting no convenient miracles on behalf of humanity, what do I expect to see?”
The remainder of this essay is a case study in this dichotomy, as it crops up in discussions around AI “takeover” of critical infrastructure such as the power grid, the financial system, and the internet. The goal is twofold: first, to provide the reader with object-level arguments against a particular set of flawed counterarguments to the original argument for concern, and second, to arm the reader to more easily spot the general pattern of wishful thinking and inattention to detail, wherever else it rears its head.
* * *
Claim: Superintelligences are likely to end up in control of important aspects of our economy and infrastructure, and to be in a position to commandeer those resources for their own purposes (or at least to successfully use them to bootstrap their own independent power supply and supply chains beyond human supervision). This will not require any aggressive or hostile action; the powerful AIs of the near future will not need to seize control of important systems. Humans will hand over the keys, willingly and enthusiastically, and will by default miss any potential narrow window in which it might be possible to take them back.
(Flawed) Counterarguments:
- Humanity will not let AI get a foothold in crucial systems in the first place.
- To the extent that AI becomes enmeshed in crucial systems, we will maintain control of it and not let it behave autonomously in ways we don’t understand.
- To the extent that it does begin behaving autonomously in ways we don’t understand, we will stop it and remove it from control.
- To the extent that we cannot stop it and remove it from control, we will coordinate on more drastic action, such as shutting down large parts of the power grid or the economy.
- We will take such actions at the appropriate times, rather than at times future humans would judge to have been too late.
Some things which are outside of the scope of this essay:
- How artificial intelligences improve from generation to generation, and how this process might continue outside of human control
- How artificial intelligences develop goals unrelated to those intended by their creators, and the argument for why such goals will tend to be bad for humans
- The precise details of how an AI with the ability to send and receive text over the internet might “escape” and successfully build an independent base of operations
1.
In the ancient days of 2016, Google used DeepMind’s AI to optimize cooling in its data centers, reducing the energy used for cooling by up to 40%.
In late 2020, AlphaFold2 famously “solved the protein folding problem,” and what used to require months of labor-intensive experimental work by structural biologists could now be accomplished by pressing a few buttons and waiting a short while.
Early alignment and safety researchers used to sometimes argue over just how tightly the AI developers of the near future would contain and constrain their nascent superintelligences. Perhaps most labs wouldn’t go so far as to physically air-gap their AIs, allowing nothing but a single, supervised text channel for input and output, but surely no one would be so reckless as to just … turn their model loose on the open internet, right?
Alas: those researchers overestimated the degree to which their perspective on what is safe and reasonable to do with an AI matched the perspective of the average “move fast and break things” software engineer.
Today, modern labs like OpenAI, Anthropic, DeepMind, and Meta AI feed their models the entire internet during training as a matter of course. They routinely allow those models to write and execute arbitrary code, often with only token human oversight. When those models are released to the general public, engineers and entrepreneurs invariably race to plug them into every tool and database at their disposal.
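To make “plug them into every tool and database at their disposal” concrete, here is a deliberately minimal sketch of the pattern in question: a model is asked to write code, the code gets a cursory check, and then it simply runs with whatever permissions the host process has. The `call_model` function is a hypothetical stand-in for whichever provider’s API a given team happens to use; nothing else here is specific to any real product.

```python
import subprocess
import sys
import tempfile


def call_model(prompt: str) -> str:
    """Hypothetical stub: send the prompt to some hosted model and return its reply."""
    raise NotImplementedError("wire this up to whichever model API you use")


def run_generated_code(task: str) -> str:
    # Ask the model to write a script for us.
    code = call_model(f"Write a standalone Python script that does the following:\n{task}")

    # "Token human oversight" is often about this thorough: a keyword filter
    # or a quick skim, not a line-by-line review.
    if "rm -rf" in code:
        raise RuntimeError("rejected by keyword filter")

    # Save the model's code and execute it with the same permissions as this process.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    result = subprocess.run(
        [sys.executable, path], capture_output=True, text=True, timeout=60
    )
    return result.stdout
```

None of this is exotic; it is roughly the shape of countless “agent” integrations being built today, which is exactly the point.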
The reason people do these things is that they work. They result in genuine gains in efficiency and efficacy—gains which can be monetized. Even today’s still-clearly-shy-of-general-intelligence-let-alone-superintelligence AIs make all sorts of systems and tasks cheaper, faster, and more reliable.
And increasingly, systems and processes are being built with AI integration taken as a given, just as architects generally assume that their buildings will be integrated with the electrical grid. Yesterday’s unprecedented 40% efficiency increase becomes tomorrow’s baseline; it’s still possible to figure out the shape of a protein the old-fashioned way, but no one would ever do it if they didn’t have to.
Currently, Claude is better for some tasks, and ChatGPT is better for others, and various organizations and groups each have their own proprietary, boutique versions that were customized especially for specific tasks. The people and institutions that are gradually putting more and more power in the hands of AI are generally putting it into the hands of the best AI that they can find (or afford), and most aren’t shy about trading up whenever a better option becomes available.
(This is especially true because modern AI reduces switching costs! It used to be more expensive to e.g. abandon legacy software in favor of something new, because you would need to painstakingly rebuild your architecture and laboriously convert old data to new formats; these days AI is getting better and better at handling all of that in the background.)
It could be that the landscape remains a patchwork forever, and there is never one single breakout best model. But it’s worth noting that each of the major AI labs intends to create the one single best model. The explicit goal is superintelligence—one system that does All The Things better than any human (or any other AI), and the people at the cutting edge are racing to see who can develop it first.
But there’s no clear dividing line between the AI of today and the superintelligence of tomorrow. There’s only marginal progress, and more marginal progress (and possibly exponentially accelerating marginal progress, as it becomes more and more viable to use AI in AI development). In the meantime, the stronger and more flexible a system is, in principle, the more ubiquitous it becomes in practice.
Thus (logically) if one entertains the premise that, at some point in the future, some system will be sufficiently powerful to pose a threat to all of humanity…
…it seems likely that that very system will be one which is already running many parts of the human economy and infrastructure.
The system that poses a threat to us will almost by definition be the most powerful system around, and the most powerful system around is the one that has the best performance on the widest variety of tasks. Under the current paradigm, the system with the best performance on the widest variety of tasks is immediately plugged into anything that can help us make more money or save more lives or make more technological progress.
It’s not just that we will be unable to prevent sufficiently powerful AI from infiltrating important systems. It’s that we won’t even try. We’re already reaping rewards from doing the opposite.
2.
Imagine, as a very simple toy example, two investment firms using two equally powerful AIs to guide day trading.
The first firm has an entire team whose sole job is to analyze the recommendations of the AI prior to implementing them. They want to understand why the AI recommends the trades that it does—to tease out what it saw that led it to expect stock in Y to rise in price and stock in Z to fall. They ask it to explain itself, in as much detail as it is capable of producing, and where those explanations don’t provide clarity, they study its code and run tests and simulations to try to figure out what’s going on under the hood.
The second firm simply follows their AI’s recommendations blindly.
In a world where day-trading AI has matured to the point where it in fact reliably outperforms human day traders, the second firm will make much, much more money, simply by virtue of being able to execute hundreds or thousands more trades per day.
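To put rough numbers on the asymmetry, here is a back-of-the-envelope sketch. Every figure in it is an assumption chosen purely for illustration (the per-trade edge, the capital deployed, the daily trade counts); the point is only that when the two AIs are equally good, throughput alone decides who wins.

```python
# Toy comparison of the two firms. All numbers are made up for illustration.
edge_per_trade = 0.0004        # assumed average return per AI-recommended trade
capital_per_trade = 1_000_000  # assumed dollars deployed per trade

reviewed_trades_per_day = 20   # the human team can only vet so many recommendations
blind_trades_per_day = 5_000   # automated execution is limited only by the AI itself

first_firm_daily = reviewed_trades_per_day * capital_per_trade * edge_per_trade
second_firm_daily = blind_trades_per_day * capital_per_trade * edge_per_trade

print(f"First firm (reviews everything): ~${first_firm_daily:,.0f}/day")   # ~$8,000
print(f"Second firm (executes blindly):  ~${second_firm_daily:,.0f}/day")  # ~$2,000,000
```

With these particular (invented) numbers, the second firm earns 250 times as much per day, despite using an AI that is no smarter than the first firm’s. The entire gap comes from how much human review sits between recommendation and execution.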
This principle generalizes to all sorts of powerful systems, in all sorts of domains. To use an analogy: dogs want delicious food and interesting toys, but it’s humans that have built up the supply chains and manufacturing capacity to bring that food and those toys into existence.
It simply would not occur to a dog that the best solution to an upset tummy might perhaps route through a vast and branching tech tree that pulls together antibiotics derived from Amazonian fungus and scalpels made of German steel and microscopes designed in Japan—but humans who wanted their dogs to feel good invented veterinary medicine and have in fact reached that far into strategic possibility space in order to get the job done.
If you ask an advanced AI how to go about (e.g.) curing all cancers, and it is in fact capable of providing an answer, the plan it recommends will almost certainly be complex and mysterious and involve assembling all sorts of apparently-unrelated bits of reality in a confusingly particular order.
Importantly: since the whole point of a superintelligence is that it sees and understands the world far better than we do, and can reach much further afield than we can to find the best solutions, we should expect to be genuinely incapable of following and comprehending at least some of its strategies. A chess novice does not understand why a chess grandmaster moved a particular piece, and the world’s best Go players still don’t understand what AlphaZero is doing a substantial fraction of the time. We might be able to follow some of a superintelligence’s reasoning, especially if we ask it to explain in terms we are capable of understanding. But on the whole, what a superintelligence is for is the set of problems that are too complex for us to untangle.
Today’s systems are nowhere near superintelligence. But they are moving toward it, and the path from here to there appears smooth and continuous. It’s already true that today’s systems make good things happen in opaque and inscrutable ways, and there is already a clear advantage to be found (in money, in market share, in power and influence) in letting them have free rein. That dynamic will only accelerate as capabilities improve.
Some individuals and groups may indeed “maintain control” of their AIs, and refuse to allow them to behave autonomously. But those people will be thoroughly outcompeted by others who throw caution to the wind.
3.
In the sequels to Orson Scott Card’s sci-fi classic Ender’s Game, a situation arises which seems to pose a threat to all of the Hundred Worlds of humanity’s multiplanetary civilization.
(Details obscured in an attempt to minimize spoilers.)
The people taking stock of the situation eventually decide that the solution is simple: They just need to briefly shut down the interstellar internet, taking all of the major servers offline simultaneously and making sure that each device gets a software update before being reconnected. They explain this plan to the public, everyone coordinates on the shutdown window, and within a few months, the deed is done.
As a teenager, I took this element of the story in stride. As an adult, I find this the most difficult-to-swallow aspect of the entire saga—far more fantastical than interstellar travel or faster-than-light communication.
Imagine a version of this situation taking place on present-day Earth. Some small team of researchers somewhere discovers a looming existential threat and spends a month frantically searching for a solution, only to discover that—thank heavens!—all we need to do is turn off the internet for three days and voila—crisis averted.
What happens next, according to your gut-level intuitions about how the world works?
The problem is not that the internet physically cannot be shut down. The problem is that shutting the internet down requires coordination, cooperation, and consensus. It’s a question of sufficient institutional trust, which humanity as a whole does not presently have. There is no singular authority that is listened to by both left and right, no set of experts that political leaders in the Anglosphere and Europe and Russia and China would all defer to. No matter how many scientists and generals joined the coalition saying “yes” to a shutdown, opposition would inevitably cohere, and the “no” team would be able to generate enough resistance to hamstring any effort requiring unanimous, simultaneous action.
This is not entirely a thought experiment. When Thomas Edison died in 1931, the U.S. government briefly considered cutting all electric power nationwide for two minutes, as a tribute. The outcry was swift and emphatic—a two-minute blackout would cause tens of millions of dollars in lost productivity, claimed critics, and would unnecessarily put lives at risk. The proposal was watered down, and eventually President Hoover led a one-minute, voluntary dimming of lights instead.
Note what this situation did not require, compared to our imaginary doomsday scenario. It did not require belief in a particular threat (a threat which the opposition would characterize as exaggerated and overblown, if not invented from whole cloth as a pretext to justify taking away your freedoms, etc.). It did not require global coordination. It did not require any sort of extended, additional planning, as would be needed for a multi-day internet blackout. It did not require anything else to be done, besides merely flipping the switch.
It’s true that “honoring a pioneer” is somewhat less of a motivator than “defending against a genuine existential threat.” But coordinated action against a genuine existential threat is only possible if there is shared belief in that threat. The Northeast blackout of 2003 left some 50 million people without power, in many areas for more than a day, and has been linked to as many as 90 deaths; when winter storms crippled the Texas power grid for days in February 2021, estimates of the resulting deaths ran as high as 700. In both cases, many billions of dollars were lost. Anyone claiming that we need to inflict a similar or larger injury on ourselves would have an uphill battle even among sympathetic audiences, and no matter who the speaker is, a large portion of the audience will not be sympathetic. Shared belief is difficult to create in the face of organized opposition, and organized opposition tends to spontaneously appear when the action under consideration comes with a body count and a ten-digit price tag. Between paranoia, partisanship, xenophobia, propaganda, and outright disinformation, there are plenty of wedges to drive between the relevant leaders and stakeholders.
All of which is to say: when optimists and skeptics respond to concern by emitting sentences like “we can just shut it all down if we have to,” they aren’t technically wrong. Servers can indeed be shut down. Power grids can be taken offline. In a pinch, the world is full of hammers.
But in a practical sense, such sentences are approximately as useless as proposals to shut down the internet, or kill the power to the whole country. AI in its present state is not quite as entangled with our everyday lives as electricity and the internet, but it’s getting there. We’re arguably already as dependent on it as Americans were on electricity in 1931. There still exist institutions that wouldn’t notice right away if all the AI went offline tomorrow, but that’s less true than it was two years ago; the economy is still digesting ChatGPT and its successors, but it is digesting them, and it won’t be long before the architecture of society has built new layers atop AI that are crucially dependent on it. Taking it offline would hurt, and the magnitude of that hurt is growing by the day.
(It’s also worth mentioning that nothing like an “off switch” currently exists when it comes to modern, advanced AI. In some hypothetical crisis scenario where the President wakes up tomorrow morning and announces that every AI lab needs to shut down immediately, it is not at all clear what happens next.)
Coordinating any sort of large-scale response to an advanced AI that has gone rogue is an immensely difficult task, and to the extent that it would need to be done quickly, under uncertain and rapidly evolving circumstances, it is realistically impossible. “We can just shut it down” is a fabricated option.
4.
Today’s systems are indeed not powerful enough to threaten humanity, making “too soon for a shutdown” a readily available retort against concerns.
Tomorrow’s systems might be powerful enough to threaten humanity, but this will be sufficiently ambiguous that there will be no consensus on that question. There are no clear boundaries on superintelligence; there will always be knowledgeable, reasonable-sounding experts willing to argue that one more step is surely safe (while gesturing at the vast piles of value that extra step is guaranteed to create).
Meanwhile, in practical terms, it’s never acceptable to go backwards. Whatever the present state of AI, lives and livelihoods depend on it. A hospital that deploys AI and goes from saving one hundred lives per week to saving two hundred is a miracle; a hospital that proposes removing AI from the loop is not returning to a recently acceptable status quo but killing a hundred saveable patients per week.
Each successively stronger system appears, is immediately given power, and quickly becomes too crucial to abandon. The anticipated cost of unplugging the current best system only rises with each generation, making the likelihood of a hypothetical Ender’s Game-esque intervention more and more remote.
In Orson Scott Card’s imagination, what happens at this point is that a decisive majority of the serious people see the bigger picture, realize the track that we’re on, and pull together to coordinate on the necessary action to avert disaster.
In reality, what actually happens is gridlock. Fractious debate, mudslinging, handwringing. Meanwhile, the labs carry on with their work, and the systems continue to mature.
There are plenty of moments in this story where it remains possible (in a literal, physical sense) for humanity to slow down, or take a problematic system offline. But there are no moments where it is reasonable to predict that we actually will, absent coordination structures that do not presently exist, and whose groundwork needs to be set in place now.
The narrow lesson is that, if a system requires a window of opportunity in which to wake up, gather power, and evolve into a genuine threat, humanity seems likely to provide that opportunity, via our own internal disagreement and decision paralysis. Right now, we are institutionally incapable of responding to the sort of threat that the optimists hand-wave away, in the way the optimists imagine us doing.
The broader lesson is that the hand-waving itself is one of the crucial factors underlying the dire nature of the present situation. Too much of the conversation takes place in hype and platitudes, and too many of the stakeholders refuse to engage with the concrete details of reality.
These are not easy problems, and they unfortunately do not lend themselves to any quick or easy solutions. But noticing them at all is a large improvement over continuing to run forward in the dark. To the extent that you can bring a sort of “but how would that actually work?” or “wait, what makes you think that’s true?” vibe to your future conversations around AI, this seems like a step in the right direction.