MIRI Briefing on Extinction Risk from AI
September 2024
I. The development of artificial superintelligence poses an imminent risk of human extinction.
“Artificial superintelligence” (ASI) refers to AI that can substantially surpass humans in all strategically relevant activities (economic, scientific, military, etc.).
The timeline to ASI is uncertain, but probably not long. On the present trajectory, MIRI would be uncomfortable ruling out the possibility that ASI is developed in the next year or two, and we’d be surprised if it was still several decades away.[1]
AI labs are aggressively scaling up systems they don’t understand. The Deep Learning techniques behind the rapid AI progress of the last few years create massive neural networks automatically. The resulting models are vast human-unreadable tangles of machine-written operations more “grown” than designed.[2] Labs basically discovered a “cheat”: Engineers can’t tell you why a modern AI makes a given choice, but have nevertheless released increasingly capable systems year after year.[3]
Sufficiently intelligent AIs will likely develop persistent goals of their own. Humans have wants, and make long-term plans, for reasons that we expect to also apply to sufficiently smart mechanically-grown AIs. (The computer science of this prediction does not fit into a paragraph; inquire further if interested.) We are only barely starting to see this phenomenon in today’s AIs, which require a long training process to hammer them into the apparently-obedient form the public is allowed to see.[4]
We expect the ASI’s goals to be hollow and lifeless in the end. Imbuing a superhumanly intelligent AI with a deep, persistent care for worthwhile objectives is much more difficult than training it to answer the right way on an ethics test.[5] Having spent two decades on the serious version of this problem, we hold the informed view that the field is nowhere near a solution.[6]
ASI that doesn’t value us will end us. Unless it has worthwhile goals[7], ASI will put our planet to uses incompatible with our continued survival, just as we fail to concern ourselves with the crabgrass growing on the site of a planned parking lot.[8] No malice, resentment, or misunderstanding is needed to precipitate our extinction.[9]
II. Human survival likely depends on delaying the creation of ASI as soon as we can for as long as necessary.
A “wait and see” approach to ASI is probably not survivable. A superintelligent adversary will not reveal its full capabilities and telegraph its intentions.[10] It will not offer a fair fight. It will make itself indispensable[11] or undetectable until it can strike decisively and/or seize an unassailable strategic position.[12]
MIRI doesn’t see any viable quick fixes or workarounds to misaligned ASI. OpenAI admits that today’s most important methods of steering AI won’t scale to the superhuman regime.[13] Attempts to restrain[14] or deceive[15] a superior intelligence are prone to fail for reasons both foreseeable and unforeseeable.[16] Our own theoretical work suggests that plans to align ASI using unaligned AIs are similarly unsound.[17] We also don’t think a well-funded crash program to solve alignment would be able to correctly identify solutions that won’t kill us.[18] Our current view is that a safe way forward will likely require ASI to be delayed for a long time.[19]
Delaying ASI requires an effective worldwide ban on its development, and tight control over the factors of its production. This is a large ask[20], but domestic oversight, mirrored by a few close allies, will not suffice. This is not a case where we just need the “right” people to build it before the “wrong” people do, as ASI is not a national weapon; it is a global suicide bomb.[21] If anyone builds it, everyone dies.
To preserve the option of shutting down ASI development if or when the will is found, MIRI advocates promptly building the off-switch.[22] The “off-switch” refers to the systems and infrastructure needed for the eventual enactment of a ban.[23] It starts with identifying the relevant parties, tracking the relevant hardware, and requiring that advanced AI work take place within a limited number of monitored and secured locations. It extends to building out the protocols, plans, and chain of command to be followed in the event of a shutdown decision. As the off-switch could also provide resilience to more limited AI mishaps, we hope it will find broader near-term support than a full ban.[24]
An off-switch can only prevent our extinction from ASI if it has sufficient reach and is actually used to shut down development in time.[25] If humanity is to survive this dangerous period, it will have to stop treating AI as a domain for international rivalry and demonstrate a collective resolve equal to the threat.
Endnotes
1. Teams of human-level AIs put to work improving AI could be expected to deliver ASI in short order. The heads of leading labs have suggested in interviews that they are on track to create AGI (AI that effectively matches or exceeds the flexible, “general” intelligence of humans) by 2034 or earlier:
Demis Hassabis of Google DeepMind on Dwarkesh Patel’s podcast of 2024-02-28:
[24:38] HASSABIS: I will say that when we started DeepMind back in 2010, we thought of it as a 20-year project. And I think we’re on track actually, which is kind of amazing for 20-year projects because usually they’re always 20 years away. That’s the joke about whatever, quantum, AI, take your pick. But I think we’re on track. So I wouldn’t be surprised if we had AGI-like systems within the next decade.”
Dario Amodei of Anthropic on Ezra Klein’s podcast of 2024-04-12:
[1:02:28] AMODEI: Yeah, I think ASL-3 [AI Safety Level 3] could easily happen this year or next year. I think ASL-4* —
KLEIN: Oh, Jesus Christ.
AMODEI: No, no, I told you. I’m a believer in exponentials. I think A.S.L. 4 could happen anywhere from 2025 to 2028.
KLEIN: So that is fast.
AMODEI: Yeah, no, no, I’m truly talking about the near future here.”
[*Per Anthropic’s site: “ASL-4 and higher (ASL-5+) is not yet defined as it is too far from present systems, but will likely involve qualitative escalations in catastrophic misuse potential and autonomy.”]
Sam Altman of OpenAI on Joe Rogan’s podcast of 2024-06-27:
[1:18:22] ALTMAN: I remember talking with John Schulman, one of our co-founders, early on, and he was like, “It’s going to be about a fifteen-year project,” and I was like, “Yeah, sounds about right to me.” And I’ve always sort of thought since then — now, I no longer think of AGI as quite the end point — but to get to the point to accomplish the thing we set out to accomplish, that would take us to 2030, 2031. That has felt to me like, all the way through, kind of a reasonable estimate with huge error bars, and I kind of think we’re on the trajectory I sort of would have assumed.”↩
2. The young field of “mechanistic interpretability” attempts to map a neural network’s configuration to its outputs. Despite some important recent breakthroughs, interpretability pioneers are quick to reject claims that we know what’s going on inside these systems.
Leo Gao, of OpenAI: “I think it is quite accurate to say we don’t understand how neural networks work.” (X post of June 16, 2024)
Neel Nanda, of Google DeepMind: “As lead of the Google DeepMind mech interp team, I strongly seconded. It’s absolutely ridiculous to go from ‘we are making interp progress’ to ‘we are on top of this’ or ‘x-risk won’t be an issue’.” (X post of June 22, 2024)
[The term ‘x-risk’ is used in these circles to refer to “existential” or “extinction” risk.] ↩
3. A few quick points about scaling:
An important paper from 2020 identified “scaling laws” for performance improvement as model size, training data, and computation are increased. It demonstrated that the laws had held for more than ten orders of magnitude as of the time it was written. These laws are generally acknowledged to have continued to hold into the present. Epoch AI tracks statistics and trends of this type. (A minimal sketch of the power-law form appears at the end of this note.)
We don’t know if current architectures, typified by large language models, can scale all the way to superintelligence. But AI companies are searching hard for alternative architectures, and the most recent state-of-the-art models are probably already working differently.
We acknowledge the engineering challenges of implementation at greater scale. We don’t claim that recent progress has come without effort, or that there won’t be any obstacles to future scaling.
We also acknowledge the rapid progress being made in algorithmic efficiency — making models more capable with less scaling. Unfortunately, the work making AI more efficient does not seem to be yielding any new insights about the underlying cognition or about how to make it safe.↩
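To make the scaling-law point above more concrete, here is a minimal sketch of the power-law relationship the 2020 paper reported between a language model’s test loss and its parameter count. The constants below are the approximate published fits, and the function itself is our own illustrative construction for building intuition, not something a lab would use for real forecasting.

```python
# A minimal, illustrative sketch of the power-law "scaling law" shape for
# language-model loss as a function of parameter count. The constants are
# the approximate fits reported in the 2020 paper (alpha_N ~ 0.076,
# N_c ~ 8.8e13); the function name and printed numbers are ours, for
# intuition only.

def predicted_loss(n_params: float,
                   alpha_n: float = 0.076,
                   n_c: float = 8.8e13) -> float:
    """L(N) ~ (N_c / N) ** alpha_N, assuming data and compute are not the
    bottleneck. Lower loss means better next-token prediction."""
    return (n_c / n_params) ** alpha_n

for n in (1e8, 1e9, 1e10, 1e11, 1e12):
    print(f"{n:.0e} parameters -> predicted loss ~{predicted_loss(n):.2f}")
```

The only point of the sketch is the shape: each tenfold increase in parameters buys a steady, predictable reduction in loss, which is why labs have been able to plan ever-larger training runs around the curve.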
4. For an example of what it looks like when companies do less of this hammering, see the infamous case of “Sydney”.↩
5. When minds are grown and shaped iteratively, like modern AIs are, they won’t wind up pursuing the objectives they’re trained to pursue. Instead, training very likely leads them to pursue unpredictable proxies of the training targets, which are brittle in the face of increasing intelligence. By analogy, humans pursue proxies of good nutrition, such as sweet and fatty flavors, which used to be reliable indicators of healthy eating, but that proxy was brittle in the face of the technology that allows us to invent Oreos.↩
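As a toy illustration of the brittleness described above (entirely our own construction, not a model of any real training process), consider a proxy objective that agrees perfectly with the intended objective everywhere on the training range, yet comes apart as soon as an optimizer is strong enough to push outside it:

```python
# A toy, made-up illustration of a brittle proxy objective; it is not a
# model of any real training process.

def true_value(x: float) -> float:
    # The intended objective: more x is better up to x = 1, then worse.
    return min(x, 2.0 - x)

def proxy_value(x: float) -> float:
    # A proxy learned only from experience in the range 0 <= x <= 1, where
    # "more x" was always better -- so it rewards more x without limit.
    return x

# On the training range, the proxy looks perfect:
print(all(true_value(x / 10) == proxy_value(x / 10) for x in range(11)))

# A stronger optimizer pushes far outside that range, scoring spectacularly
# on the proxy while the intended objective collapses:
for x in (1.0, 10.0, 100.0):
    print(f"x={x:>5}: proxy={proxy_value(x):6.1f}  true={true_value(x):7.1f}")
```

The sweet-and-fatty-flavors analogy in the note is the same phenomenon: the proxy was indistinguishable from the real goal until a more capable optimizer found the regime where they diverge.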
6. The field has, however, identified a great many pitfalls that stand in the way of building AI in a survivable manner. We give a catalog in our paper “AGI Ruin: A List of Lethalities”.↩
7. We advise against clinging to hopes that an ASI that sees no intrinsic value in human flourishing will find instrumental value in keeping humans around.
It is more likely that such an ASI would view humans as a liability, since they could build another ASI with rival desires.
Keeping humans around (e.g. because they “generate truths” and the AI “will find it useful to learn true facts”) is unlikely to be the most efficient solution to any problem that the AI has. (The AI could learn even more true facts by building more and more computing clusters until the temperature on the surface of the planet gets uninhabitably high.)
We expect any period where ASI needs humans to keep things running to be brief or nonexistent; even if humanity does not just give it robots, we expect enough humans to collaborate with AI (out of goodwill, or in exchange for money, or because they’re manipulated, etc.) in developing an independent (and faster-running) technological foundation. Humans are quite slow compared to what is possible.
Economic laws about the positive-sum value of trade, even with partners who are worse at everything, are likely to break down when one party has the means to overpower the other and convert them into something more productive and subservient. Humanity didn’t keep trading with horses in the wake of the automobile; it converted them into glue.↩
8. We can’t predict exactly what an ASI will want, but we can predict it won’t be nice — similar to how we can’t predict exactly what lottery numbers will be drawn, but we can predict they won’t be our favorite numbers. Most goals are not specifically about human flourishing, and most goals can be better-accomplished with more resources, so AIs with strange goals are likely to consume the universe’s resources and put them towards the AI’s hollow ends. Picture the sun blotted out by orbiting solar collectors while the oceans boil with the waste heat of data centers and heavy industry.↩
9. Unclear instructions, the wrong instructions, or any instructions carried out in a maximalist manner certainly could lead to our extinction, but this is a different category of unsolved problem (“outer alignment”) that we don’t get to die from without first solving the more fundamental problem of getting AI to contain any particular specific goals on purpose (“inner alignment”).
Our goal must be for any ASI to pursue worthwhile ends in its own right, and training it to want good things (and to achieve them in training environments) is not the sort of process that instills worthwhile values (as discussed in endnote 5). Making an AI that targets a particular end result of someone’s purposeful choosing, rather than some hodgepodge of brittle proxies, is the problem of “inner alignment”.
If inner alignment were a solved problem we would start to face thorny questions about who should get AI first, and maybe some discussion of international competition would make sense. This would create yet another category of hazard to worry about. But right now, the world is in the regime where if anyone builds it, everyone dies.↩
10. There are a number of major obstacles to recognizing that a system is a threat before it has a chance to do harm, even to experts with direct access to its internals:
As discussed in section I and in endnote 2, humans can’t make much sense of AI internals at this time. The field of mechanistic interpretability is still in its infancy.
The internal machinery that could make an AI dangerous is the same machinery that makes it work at all. (What looks like “power seeking” in one context would be considered “good hustle” in another.) There are no dedicated ‘misalignment’ circuits.
Moreover, methods we might use during training to reject candidate AIs with thought patterns we consider dangerous could have the effect of driving such thoughts “underground”, ensuring that future precursors to danger that emerge during training would be ones we don’t know how to detect.
Trying to assess threat levels in advance of model creation is also effectively impossible at this time. While the scaling laws discussed in endnote 3 allow some abstract mathematical predictions about how a system of a given size will perform, we don’t know where the dangerous thresholds are, and we seem prone to disregarding possible warning signs. The first AIs to inch across thresholds (like, say, noticing they are in training and plotting to deceive their evaluators — see the “Sleeper Agents” example at the end of this note) are bad at these new skills, leading people to be dismissive that this behavior is in any way threatening or will become so as systems are further scaled. In the absence of stark red lines people can agree on, there is no “fire alarm” for AI risk, no matter how many warning indications we get.
The January 2024 “Sleeper Agents” paper by Anthropic’s testing team demonstrated that an AI given secret instructions in training was not only capable of keeping them secret during evaluations, but made strategic calculations (incompetently) about when to lie to its evaluators to maximize the chance of being released and able to execute those instructions. Apollo Research made similar findings with regard to OpenAI’s o1-preview model released in September 2024 (as described in their contributions to the o1-preview system card, p. 10).↩
11. Making itself indispensable may take little effort, as humans are rushing to couple AI to the entire economy and build billions of robots to attach to it. (See, for example, Elon Musk’s public comments.) We are also socially tying ourselves to AI, with tens of millions of users already engaging with companion bots on sites like replika.com and character.ai. Public sympathy for AI could create additional challenges to shutting it down, especially if an AI personhood movement springs up to defend proposed AI rights. Many humans could have economic or personal reasons to act as an AI’s hands if or when it might need them. (Even if they didn’t, a sufficiently skilled AI could manipulate, blackmail, or simply electronically pay humans to act as its hands and feet. But the way things are going, it wouldn’t even need to come to that.)↩
12. We at MIRI are averse to providing specific visions of what a decisive strike from AI might look like; in our experience, these tend to cause people to seek reassurance by looking for holes in those plans, in the mistaken belief that success at this exercise means that humanity would be able to thwart the strike of an actual superintelligence. We instead offer some parameters to provoke an appropriately serious mindset for threat assessment:
As a minimum floor on capabilities, imagine ASI as a small nation populated entirely by brilliant human scientists working around the clock at ten thousand times the speed of normal humans. This is a minimum both because computers can be even faster than this, and because digital architectures should allow for qualitatively better thoughts and methods of information sharing than humans are capable of.
Consider the proofs-of-concept provided by nature about what sorts of machinery are permitted by physics. Algae are solar-powered, self-replicating factories that can double themselves in less than a day. Trees assemble bulk construction materials largely out of thin air (carbon capture). Please note that nature is nowhere near the theoretical limits of energy efficiency and material strength.
Combining the ideas from the previous two points should lead one to consider biotech scenarios that start with multiple superviruses prepared for simultaneous release and extend to custom lifeforms to replace the work of humans — built using nature’s existing genetic infrastructure but nevertheless able to grow much quicker, and with greater strength, dexterity, communication speed, and on-board thinking capacity. More exotic blue-sky biology-like micro and macro-machines that break with nature’s legacy but are allowed by physics should also be up for consideration. (Or, you know, robots that we gave it.)
Yes, developing this sort of technology requires some test cycles and iteration; a civilization thinking at 10,000 times the speed of ours cannot necessarily develop technology literally 10,000 times faster, any more than a car that’s 100x faster makes shopping for groceries 100x faster — some time has to be spent in the grocery store. But we still expect it can go very fast; smart thinkers can find all sorts of ways to shorten development cycles and reduce the number of tests and attempts needed. (For instance, Google rapidly tests new website designs all day, whereas designers of space probes spend lots of time thinking carefully and performing cheap simulations so that they can get the job done right with fewer slow/expensive experiments. To a mind thinking 10,000 times faster than a human, every test is slow and expensive, and they can afford to treat everything like a space probe.)
Humans have been wrong many times before about the limits of technology. To list a few topics that inspired many bad takes in their day, even from acknowledged experts: heavier-than-air flight, spaceflight, nuclear chain reactions, in vitro fertilization, and many, many tasks supposedly out of reach for computers that AI can now do.
The ASI potentially has the thought capacity to consider, prepare, and attempt many takeover approaches simultaneously. Only one of them needs to work for humanity to go extinct.↩
13. In July of 2023, OpenAI announced a new team with their “Introducing Superalignment” page. From the page:
“Currently, we don’t have a solution for steering or controlling a potentially superintelligent AI, and preventing it from going rogue. Our current techniques for aligning AI, such as reinforcement learning from human feedback, rely on humans’ ability to supervise AI. But humans won’t be able to reliably supervise AI systems much smarter than us, and so our current alignment techniques will not scale to superintelligence. We need new scientific and technical breakthroughs.”
Ten months later, OpenAI disbanded the Superalignment team.↩
14. One could theoretically keep an AI from doing anything harmful — for example, by burying it deep in the ground without any network connections or human contact — but such an AI would be useless. People are building AI because they want it to radically impact the world; they are consequently giving it the access it needs to be impactful. On the internet, people might not know if it is an AI that is paying, persuading, or blackmailing them to do something — and even if they do, they might not see how the task would contribute to human extinction.↩
15. A feature of intelligence is the ability to notice the contradictions and gaps in one’s understanding, and interrogate them. In May of 2024, when Anthropic effectively brainwashed their Claude AI into thinking that the answer to every request involved the Golden Gate Bridge, it floundered in some cases, noticing the contradictions in its replies and trying to route around the errors in search of better answers. (This was engagingly documented by X poster @ElytraMithra in this thread.) It’s hard to sell a false belief to a mind whose complex model of the universe disagrees with your claim.↩
16. ASI, being simply better than humans at understanding things, may gain insights into complex computer systems, human psychology, biology, or even the laws of physics that let it take actions we didn’t know were possible, and which we wouldn’t have predicted even if we had.↩
17. Our 2024 “Misalignment and Catastrophe” paper explores the hazards of using unaligned AI to do work as complex as alignment research.↩
18. This is as much an organizational and bureaucratic challenge as a technical one. It would be difficult to find enough experts who can identify non-lethal solutions to make meaningful progress, in part because the group must be organized by someone with the expertise to correctly identify these individuals in a sea of people with strong incentives to lie (both to themselves and to regulators) about how promising their favorite proposal is. It would also be difficult to ensure that the organization is run by, and only answerable to, experts who are discerning enough to reject any bad proposals that bubble up. There just aren’t enough experts in that class right now.↩
19. MIRI’s current long-run wish is for a delay that lasts long enough for technologies to be developed to augment or “upload” (digitize) human intelligence, on the theory that those augmented humans could navigate the transition to superintelligence (and enable a flourishing future for all). But one need not agree with us about long-run wishes to jointly prefer for humanity not to go immediately extinct.↩
20. Preventing ASI development will probably require more coordination and monitoring than CBRN (chemical, biological, radiological, and nuclear) anti-proliferation efforts, but should require substantially less effort than, say, winning World War II.↩
21. ASI is not like nuclear weapons, and we advise caution when making analogies between them. Some important differences:
With nuclear weapons it’s possible to have stability through mutually assured destruction, because the weapons’ owners are in control of when they launch and what they target. This is not the case with ASI. A nation that builds ASI does not have ASI; ASI has them (and everyone else).
Nuclear weapons have performance characteristics that could be modeled mathematically with considerable accuracy even before they were first built. Modern AIs cannot be well-modeled in advance. Capabilities so far have increased in fits and starts, and seem to us likely to continue doing so until they pass a critical threshold we have no way of spotting in advance, at which point we all die.
Nuclear weapons bring no direct economic benefit to their owners. (Reactors are mostly on a different tech tree from bombs, and are also importantly different from AI: Countries don’t keep building bigger and bigger nuclear reactors that produce more and more energy until they cross some unknown threshold and suddenly explode.) ↩
22. This is our least-bad plan, after many others we have considered.↩
23. We are working hard to converge on workable details and have a growing number of (preliminary) specific recommendations we can share upon request.↩
24. For “limited AI mishaps”, think of any situation where you might want to shut down one or more AIs for a while and you aren’t already dead. This could be something like a bot-driven misinformation cascade during a public health emergency, or a widespread internet slowdown caused by AIs stuck in looping interactions with each other and generating vast amounts of traffic. Without off-switch infrastructure, any response is likely to be haphazard — delayed by organizational confusion, mired in jurisdictional disputes, beset by legal challenges, and unable to avoid causing needless collateral harm.↩
25. We have some worry that an off-switch will give leaders a false sense of security as minor AI incidents are handled smoothly. There’s also a concern that the new infrastructure would be enthusiastically triggered for incidents that didn’t justify it; the resulting mockery and outrage could deter its operators from ever shutting things down again for any reason at all. Still, we have some hope that there could be a broad social shift against continued AI development (perhaps triggered by a visible jump in capabilities that doesn’t immediately kill everyone, à la the original launch of ChatGPT). This shift could make a shutdown politically straightforward if the off-switch is already in place to respond to the mood of the moment.↩