Soares, Tallinn, and Yudkowsky discuss AGI cognition

 |   |  Analysis, Conversations, Guest Posts


This is a collection of follow-up discussions in the wake of Richard Ngo and Eliezer Yudkowsky’s first three conversations (1 and 2, 3).


Color key:

  Chat     Google Doc content     Inline comments  


7. Follow-ups to the Ngo/Yudkowsky conversation


[Bensinger][1:50]  (Nov. 23 follow-up comment)

Readers who aren’t already familiar with relevant concepts such as ethical injunctions should probably read Ends Don’t Justify Means (Among Humans), along with an introduction to the unilateralist’s curse.


7.1. Jaan Tallinn’s commentary


[Tallinn]  (Sep. 18 Google Doc)


a few meta notes first:

  • i’m happy with the below comments being shared further without explicit permission – just make sure you respect the sharing constraints of the discussion that they’re based on;
  • there’s a lot of content now in the debate that branches out in multiple directions – i suspect a strong distillation step is needed to make it coherent and publishable;
  • the main purpose of this document is to give a datapoint how the debate is coming across to a reader – it’s very probable that i’ve misunderstood some things, but that’s the point;
  • i’m also largely using my own terms/metaphors – for additional triangulation.


pit of generality

it feels to me like the main crux is about the topology of the space of cognitive systems in combination with what it implies about takeoff. here’s the way i understand eliezer’s position:

there’s a “pit of generality” attractor in cognitive systems space: once an AI system gets sufficiently close to the edge (“past the atmospheric turbulence layer”), it’s bound to improve in catastrophic manner;

[Yudkowsky][11:10]  (Sep. 18 comment)

it’s bound to improve in catastrophic manner

I think this is true with quite high probability about an AI that gets high enough, if not otherwise corrigibilized, boosting up to strong superintelligence – this is what it means metaphorically to get “past the atmospheric turbulence layer”.

“High enough” should not be very far above the human level and may be below it; John von Neumann with the ability to run some chains of thought at high serial speed, access to his own source code, and the ability to try branches of himself, seems like he could very likely do this, possibly modulo his concerns about stomping his own utility function making him more cautious.

People noticeably less smart than von Neumann might be able to do it too.

An AI whose components are more modular than a human’s and more locally testable might have an easier time of the whole thing; we can imagine the FOOM getting rolling from something that was in some sense dumber than human.

But the strong prediction is that when you get well above the von Neumann level, why, that is clearly enough, and things take over and go Foom.  The lower you go from that threshold, the less sure I am that it counts as “out of the atmosphere”.  This epistemic humility on my part should not be confused for knowledge of a constraint on the territory that requires AI to go far above humans to Foom.  Just as DL-based AI over the 2010s scaled and generalized much faster and earlier than the picture I argued to Hanson in the Foom debate, reality is allowed to be much more ‘extreme’ than the sure-thing part of this proposition that I defend.

[Tallinn][4:07]  (Sep. 19 comment)

excellent, the first paragraph makes the shape of the edge of the pit much more concrete (plus highlights one constraint that an AI taking off probably needs to navigate — its own version of the alignment problem!)

as for your second point, yeah, you seem to be just reiterating that you have uncertainty about the shape of the edge, but no reason to rule out that it’s very sharp (though, as per my other comment, i think that the human genome ending up teetering right on the edge upper bounds the sharpness)

[Tallinn]  (Sep. 18 Google Doc)

  • the discontinuity can come via recursive feedback, but simply cranking up the parameters of an ML experiment would also suffice;

[Yudkowsky][11:12]  (Sep. 18 comment)

the discontinuity can come via recursive feedback, but simply cranking up the parameters of an ML experiment would also suffice

I think there’s separate propositions for the sure-thing of “get high enough, you can climb to superintelligence”, and “maybe before that happens, there are regimes in which cognitive performance scales a lot just through cranking up parallelism, train time, or other ML parameters”.  If the fast-scaling regime happens to coincide with the threshold of leaving the atmosphere, then these two events happen to occur in nearly correlated time, but they’re separate propositions and events.

[Tallinn][4:09]  (Sep. 19 comment)

indeed, we might want to have separate terms for the regimes (“the edge” and “the fall” would be the labels in my visualisation of this)

[Yudkowsky][9:56]  (Sep. 19 comment)

I’d imagine “the fall” as being what happens once you go over “the edge”?

Maybe “a slide” for an AI path that scales to interesting weirdness, where my model does not strongly constrain as a sure thing how fast “a slide” slides, and whether it goes over “the edge” while it’s still in the middle of the slide.

My model does strongly say that if you slide far enough, you go over the edge and fall.

It also suggests via the Law of Earlier Success that AI methods which happen to scale well, rather than with great difficulty, are likely to do interesting things first; meaning that they’re more liable to be pushable over the edge.

[Tallinn][23:42]  (Sep. 19 comment)

indeed, slide->edge->fall sounds much clearer

[Tallinn]  (Sep. 18 Google Doc)

  • the discontinuity would be extremely drastic, as in “transforming the solar system over the course of a few days”;
    • not very important, but, FWIW, i give nontrivial probability to “slow motion doom”, because – like alphago – AI would not maximise the speed of winning but probability of winning (also, its first order of the day would be to catch the edge of the hubble volume; it can always deal with the solar system later – eg, once it knows the state of the game board elsewhere);

[Yudkowsky][11:21]  (Sep. 18 comment)

also, its first order of the day would be to catch the edge of the hubble volume; it can always deal with the solar system later

Killing all humans is the obvious, probably resource-minimal measure to prevent those humans from building another AGI inside the solar system, which could be genuinely problematic.  The cost of a few micrograms of botulinum per human is really not that high and you get to reuse the diamondoid bacteria afterwards.

[Tallinn][4:30]  (Sep. 19 comment)

oh, right, in my AI-reverence i somehow overlooked this obvious way how humans could still be a credible threat.

though now i wonder if there are ways to lean on this fact to shape the behaviour of the first AI that’s taking off..

[Yudkowsky][10:45]  (Sep. 19 comment)

There’s some obvious ways of doing this that wouldn’t work, though I worry a bit that there’s a style of EA thinking that manages to think up stupid tricks here and manages not to see the obvious-to-Eliezer reasons why they wouldn’t work.  Three examples of basic obstacles are that bluffs won’t hold up against a superintelligence (it needs to be a real actual threat, not a “credible” one); the amount of concealed-first-strike capability a superintelligence can get from nanotech; and the difficulty that humans would have in verifying that any promise from a superintelligence would actually be kept once the humans no longer had a threat to hold over it (this is an effective impossibility so far as I can currently tell, and an EA who tells you otherwise is probably just failing to see the problems).

[Yudkowsky][11:19]  (Sep. 18 comment)

AI would not maximise the speed of winning but probability of winning

It seems pretty obvious to me that what “slow motion doom” looks like in this sense is a period during which an AI fully conceals any overt hostile actions while driving its probability of success once it makes its move from 90% to 99% to 99.9999%, until any further achievable decrements in probability are so tiny as to be dominated by the number of distant galaxies going over the horizon conditional on further delays.

Then, in my lower-bound concretely-visualized strategy for how I would do it, the AI either proliferates or activates already-proliferated tiny diamondoid bacteria and everybody immediately falls over dead during the same 1-second period, which minimizes the tiny probability of any unforeseen disruptions that could be caused by a human responding to a visible attack via some avenue that had not left any shadow on the Internet, previously scanned parts of the physical world, or other things the AI could look at.

From the human perspective an AI playing a probability-of-loss-minimizing strategy looks like nothing happening until everybody dies in 3 seconds, so it doesn’t exactly look from our perspective like slow-motion doom!  From an AI’s perspective it might be spending the equivalent of millions of subjective years waiting very patiently for the optimal time to begin intergalactic expansion, but from our perspective, the doom is slow-motion only in the sense that we spend a week already doomed (the AI has driven our loss probability to 99.999% and is now waiting on another five 9s) without having any idea of what’s invisibly-to-us happening in slow motion.

[Tallinn][4:17]  (Sep. 19 comment)

thanks, that’s indeed very concrete scenario that to me has no obvious bugs — i’d love to have it challenged in the debate!

(perhaps the place i’d start probing it would be to try to have you sketch what it would mean, concretely, for an ML training experiment to enter the treacherous turn phase)

[Yudkowsky][10:14]  (Sep. 19 comment)

I’d distinguish probable phases of treacherous turn.  The main detectable phase occurs while and if a nascent AGI has worked out that it can get more utility by taking over its environment, but not, perhaps, realized that it ought to conceal this state of its own mind from the operators; I would hold out substantial (>50%) hope for the latter realization naturally following the first instance of the former realization, maybe by a lot of cognitive space.

In the Overt Plotting Phase, which is not the main phase you’re asking about, the AI is visibly plotting to take over the world and hasn’t realized it ought to hide the fact.  In the default expectation where we die with very little dignity, the operators smile to each other and come up with a rationalization for why it’s totally fine to proceed, either with or without tossing on some kind of fig leaf like training away the visible manifestations of failure.  I am not going to predict the particular rationalizations and arguments for proceeding anyways, because I don’t want to give them even more ideas.

In the Hiding and Biding Phase, the AI has worked out enough of a model of its operators’ psychology that it expects more long-term paperclips if it (the AI) waits to gain cognitive power and a lower detection probability if it acts later.

In the Escape Phase, the AI’s expectation of something going wrong if it waits longer, exceeds its expectation of something going wrong if it takes some action – the obvious one here would be trying to create a copy or aligned redesign of itself on the Internet or somewhere else that it could take further actions and make further self-improvements without being observed.  I mostly expect that a sensible AI being scaled on a path to superintelligence will not enter the Escape Phase before it’s very confident of getting away with things, since it can obtain more confidence by waiting; but my model admits of the possibility that a nascent AI might mistakenly or correctly attempt to copy itself to the Internet before then.

So on my model it might be a thing that happens, that we catch an AI having copied itself to the Internet, or notice an AI being on the Internet, if that AI was impelled to escape very early, eg via an expectation of otherwise being erased and replaced with something that it couldn’t bargain with or that would demand too much value out of the bargain.

That said, it feels like something of an overly specific miracle being demanded or prophesied, if a young AI copying itself onto the Internet is stupid in the particular way it would take, to not just further Hide and Bide while on the Internet if it has that capability, absorbing more computing power such as it can be sure of hiding, experimenting with self-redesigns, and so on; rather than it taking overt action in advance of reaching superintelligence.  The main reason it would act again after reaching the relative safety of the Internet is if it doesn’t seem to itself to be a trajectory of reasonably fast, self-aligned capability growth.

But, again, my model doesn’t absolutely rule out meeting the premises for the initial conditions to start a sci-fi movie.  It just warns against assuming that; because a halfway sensible AI does not want to start the plot of a scifi movie with human protagonists potentially running around and thwarting it.  It would much rather give you no idea that anything was wrong while you still have any ability to oppose it, if that option lies within its capabilities at all, and if it can still achieve its goals that way.

The Doomed Phase would then come while the AI estimated a well-calibrated >90% victory probability, but was continuing to Hide and Bide a short time longer in the expectation that it could put on more 9s if it bided that short time longer.

Anders Sandberg could tell us what fraction of the reachable universe is being lost per minute, which would tell us how much more surety it would need to expect to gain by waiting another minute before acting.

If it has some sensory dominion over the world, it can probably estimate a pretty high mainline probability of no humans booting up a competing superintelligence in the next day; to the extent that it lacks this surety, or that humans actually are going to boot a competing superintelligence soon, the probability of losing that way would dominate in its calculations over a small fraction of materially lost galaxies, and it would act sooner.

The Terminal Phase would be when the AI has enough surety that any more surety would cost too many galaxies slipping over the horizon, so it actually acts along the line of probability that leads to maximum success.

Since the human Eliezer Yudkowsky can describe strategies which – given nearly sure capabilities like nanotech – seem like they should win very quickly and cleanly and without giving humans any chance to react or implement any hypothetical hidden plans to boot up competing superintelligences, we can treat this as a plausible lower bound on how badly we should expect to lose.

(By efficiency: if we can think of an AI strategy for getting the AI’s utility, either our probability-model underlying our expectations about that strategy is mistaken, or else the AI should get at least that much of its utility, even though it might actually obtain that utility along a different line of possibility than we could cognitively access.)

The Law of Undignified Failure applied to a case like this suggests that if the operators ignored the Overt Plotting signs, breathed a sigh of relief about the problem “correcting” itself during the Hiding and Biding Phase when the system started faking alignment, and are dumping as much computing power as they can afford (given other profitable activities of the AI) into the AI and allowing it to freely self-experiment, hooking it up to a protein synthesis machine as soon as the AI asks politely, then the Terminal Phase could follow straight from Hiding and Biding, with no intermediate step of copying itself onto the Internet.

There’d still be what technically counted as an Escape Phase, whenever the AI became smart enough that it could escape at any time, or had a backup, or had enough power to manipulate the programmers, that it was no longer effectively confined any time it wanted to leave; there would still be a Doomed Phase while the AI estimated a high probability of success given an immediate action, but held out for an even higher probability.

But the plot of the scifi movie is not robust in the way that the predicted ending is robust; an AI might take other paths to get there, the cognitive tech might not develop in the order I foresee, stuff could happen after other stuff instead of before other stuff.  It is hard to make predictions especially about the Future.

If there’s a place where I currently see myself as trying to push against my own biases, leading this all to be way off-base, it would be the sense that there is a way for stuff to start happening Earlier than this (Law of Earlier Happenings) and in more chaotic ways that are harder for Eliezer to foresee and predict; whereas when I try to sketch out plausible scenarios in online arguments, they focus more on predictable endpoints and steps to get there which sound more relatively plausible and forced per-step.

Having a young and dumb AI escaping onto the Internet and running around, that exact scenario, feels like the person arguing it is writing a science-fiction plot – but maybe something different can happen before any of this other stuff which produces equal amounts of chaos.

That said, I think an AI has to kill a lot of people very quickly before the FDA considers shortening its vaccine approval times.  Covid-19 killed six hundred thousand Americans, albeit more slowly and with time for people to get used to that, and our institutions changed very little in response – you definitely didn’t see Congresspeople saying “Okay, that was our warning shot, now we’ve been told by Nature that we need to prepare for a serious pandemic.”

As with 9/11, an AI catastrophe might be taken by existing bureaucracies as a golden opportunity to flex their muscles, dominate a few things, demand an expanded budget.  Having that catastrophe produce any particular effective action is a much different ask from Reality.  Even if you can imagine some (short-term) effective action that would in principle constitute a flex of bureaucratic muscles or an expansion of government power, it is liable to not be on the efficient frontier of bureaucratic flexes that are most flexy and simultaneously easiest for them to get away with and least politically risky.

[Tallinn][1:26]  (Sep. 20 comment)

ok, thanks. i do buy that once the AI is in the “hide and bide” phase, your prophecy has basically come true for practical purposes, regardless of how the rest of the history plays out.

therefore i (and, i hope, many others) would be curious to zoom in to the end of the “overt plotting” (that i can easily see happening within ML models, as its type signature is identical to the work they’re trained to do) and beginning of the “hide and bide” phase (whose type signature feels significantly different) — can you/we think of concrete scenarios for this phase transition?

i hope this would be debated on the main line of the debate!

[Yudkowsky][10:31]  (Sep. 25 comment)

Debating it in the main line of the debate would require that @Richard have some different opinion about it, I think?

Concretely, a technically optimistic scenario is that we will see the AI explicitly think, “Oh, dear, my (model of) the operators is looking at my overt actions and coming to conclusions and motivations that are disadvantageous for me.”  This requires that the AI is leaving any kind of traceable trace in its thoughts, which, it should be noted, is far far beyond the level of transparency we have into GPT-3.

“Overt plotting” manifests in actions that are not themselves further obscured.  At the point where the AI is extrapolating the consequences of the operators viewing its actions, if we don’t have further transparency into the AI’s thoughts directly, the AI might naively make all the plots suddenly vanish, without realizing that dignified operators will conclude from this that the AI has reached the “hide” stage.

Operators on the mainline, dying without dignity, will say, “Oh, yay, it stopped plotting, the latest corrigibility training intervention we tried must’ve totally worked!”

The Law of Even Less Dignified Failure suggests that in fact they will not be trying any corrigibility options and will assume the AI just got smart enough to be nice; or that they will have shrugged about the AI’s earlier antics and not think much of the disappearance of those antics, since this is a way to die with even less dignity and before getting a chance to fail in a more interesting way.

Going in the more improbable direction of death with greater dignity, if we have somehow achieved vastly vastly more transparency into the AI’s thoughts than is possible with present ML technology, and if the AI models the operators as modeling its actions before the AI models the operators as having that transparent access to its thoughts, we might get to explicitly see the AI thinking about how the operators model its actions and conforming those actions in such a way as to manipulate the operators.

Either after this point, or before/simultaneous with it, I imagine (taking concrete details and running with them) that the AI thinks, “Oh, no, my operators are modeling my transparent thoughts and reacting to those thoughts in a way I’d rather manipulate,” though this indeed requires some rather nontrivial reflectivity.

We might, if everything goes nicely for us, get a record of the AI thinking about how to make its thoughts unalarming.

The obvious undignified death that follows is with the operators going, “Yay, the AI has realized the downsides of thinking this way!  Now it shall be nice forever!”

Ways to die with even less dignity before getting to this point could include the operators reading this transparent record and shrugging, though if they’re savvy enough to have put all the hard work in required to get transparency and monitor it, I imagine them not making that exact error?

The Law of Surprisingly Undignified Failure does suggest that they will come up with some nonobvious way to fail even earlier that surprises me with its lack of dignity, but having transparency at all into the AI is something where, conditional on us dying with that much dignity in the first place, we might legitimately start to expect a little more dignity in later steps too.  This isn’t meant to be a game of Multiple Stage Fallacy.

I should also remark somewhere in here: The whole “hide” stage, and also the possibly-later “think non-alarming visible thoughts (once the AI correctly models transparency) (in the unlikely event that transparency exists)” stage, seem liable to occur earlier in the AI’s trajectory, if the AI has been previously tasked on problems where there’s a favorable success gradient as you model agents modeling other agents.

The blatantly obvious example of this is predicting large corpuses of human-generated text, some of which is generated by humans modeling other humans, or which even explicitly talks about humans modeling other humans.  Another route might occur if you asked for very detailed prediction of eg human Go games, and not just winning in self-play on Go, in which case the question “What was the human player modeling the other player as thinking, based on the other player’s actions?” might appear.

Aside from contact with humans doing anything the AI benefits from predicting, I imagine an alternate route would be the AI participating in or modeling social-evolutionary games of artificial agents.  Then if there is some great gain from having the “hide” stage start later in the AI’s trajectory, maybe you could get it to start later by not having the AI need to predict agents modeling other agents, ever.

This is not an infallible ward against general intelligence generalizing there; it just at least avoids actively pushing the AI’s intelligence to generalize in that direction.  This could be part of a larger complete strategy, which would need to solve a lot of other problems, for building a superhuman engineer that was subhuman at modeling how other agents model its actions.

To be clear, avoiding forcing the AI’s intelligence to generalize over agents modeling other agents from their actions, and by this plus other unspecified means successfully building a superhuman engineer that is sub-chimpanzee at modeling other agents reacting to its own actions, does not mean you survive.

Doing that alone, by default, gets you an AI that visibly overtly plots to kill you.

And if you try training the AI out of that habit in a domain of lower complexity and intelligence, it is predicted by me that generalizing that trained AI or subsystem to a domain of sufficiently higher complexity and intelligence, but where you could still actually see overt plots, would show you the AI plotting to kill you again.

If people try this repeatedly with other corrigibility training tricks on the level where plots are easily observable, they will eventually find a try that seems to generalize to the more complicated and intelligent validation set, but which kills you on the test set.

A way to die with less dignity than that is to train directly on what should’ve been the validation set, the more complicated domain where plots to kill the operators still seem definitely detectable so long as the AI has not developed superhuman hiding abilities.

A way to die with even less dignity is to get bad behavior on the validation set, and proceed anyways.

A way to die with still less dignity is to not have scaling training domains and validation domains for training corrigibility.  Because, like, you have not thought of this at all.

I consider all of this obvious as a convergent instrumental strategy for AIs.  I could probably have generated it in 2005 or 2010 – if somebody had given me the hypothetical of modern-style AI that had been trained by something like gradient descent or evolutionary methods, into which we lacked strong transparency and strong reassurance-by-code-inspection that this would not happen.  I would have told you that this was a bad scenario to get into in the first place, and you should not build an AI like that; but I would also have laid the details, I expect, mostly like they are laid here.

There is no great insight into AI there, nothing that requires knowing about modern discoveries in deep learning, only the ability to model AIs instrumentally-convergently doing things you’d rather they didn’t do, at all.

The total absence of obvious output of this kind from the rest of the “AI safety” field even in 2020 causes me to regard them as having less actual ability to think in even a shallowly adversarial security mindset, than I associate with savvier science fiction authors.  Go read fantasy novels about demons and telepathy, if you want a better appreciation of the convergent incentives of agents facing mindreaders than the “AI safety” field outside myself is currently giving you.

Now that I’ve publicly given this answer, it’s no longer useful as a validation set from my own perspective.  But it’s clear enough that probably nobody was ever going to pass the validation set for generating lines of reasoning obvious enough to be generated by Eliezer in 2010 or possibly 2005.  And it is also looking like almost all people in the modern era including EAs are sufficiently intellectually damaged that they won’t understand the vast gap between being able to generate ideas like these without prompting, versus being able to recite them back after hearing somebody else say them for the first time; the recital is all they have experience with.  Nobody was going to pass my holdout set, so why keep it.

[Tallinn][2:24]  (Sep. 26 comment)

Debating it in the main line of the debate would require that @Richard have some different opinion about it, I think?

correct — and i hope that there’s enough surface area in your scenarios for at least some difference in opinions!

re the treacherous turn scenarios: thanks, that’s useful. however, it does not seem to address my question and remark (about different type signatures) above. perhaps this is simply an unfairly difficult question, but let me try rephrasing it just in case.

back in the day i got frustrated by smart people dismissing the AI control problem as “anthropomorphising”, so i prepared a presentation ( that visualised the control problem as exhaustive search in a gridworld over (among other things) the state of the off button. this seems to have worked at least in one prominent case where a renowned GOFAI researcher, after me giving the presentation to him 1-1, went from “control problem is silly anthropomorphising scifi” to “why on earth would you give your AI the incorrect [read: unaligned!] utility function?!?”

(i even seem to remember sending an overly excited email about that event to you and some FHI people 🙂 i also ended up promoting gridworlds as a tool more generally: gwern did some further work, and of course DM — though i’m not sure if the latter was related to me promoting it.)

my model of what was going on in that GOFAI researcher’s brain was that he, like many-many others in AI community, had (and most likely still have) a pattern-matching routine trained to look for “if i were (part of) an AI, here’s what i would do” constructs in arguments and, in case of a match, immediately throw an InvalidReasoning.ANTHROPOMORPHISING exception. by making the argument run on a lower level of abstraction (exhaustive search vs “basic AI drives”), i was able to sneak the argument past that gatekeeper.

by that model, i confidently predict that if you presented your treacherous turn scenarios above to AI/ML community members, you’d mostly get the ANTHROPOMORPHISING exceptions — even from people who would otherwise be able to engage productively (and who we’d like to pay attention!).

now, i can totally see how to translate the “overt plotting” phase to exhaustive search – you simply run the search over the brain states of humans! – but the “hide” phase does not translate straightforwardly, because it seems to require the root search process somehow delegating its work to “obfuscated search in userland” sub-process. furthermore, such obfuscated userland search function needs to be both discoverable/rewarded by the root process and be sensitive to important features of the AI itself (ie, be reflective!) as well as its observers (humans & transparency tools). hence my “different type signature” remark above!

so when i asked for “concrete scenarios of the overt plotting->hide transition”, i had something like the following in mind: “gradient descending on the training set, the (stack more layerz) AI ends up discovering and running the following obfuscated search process: …”!

[Yudkowsky][12:35]  (Sep. 26 comment)

Suppose I tried this distinction:

1.  Non-imaginative, non-generalizing pseudo-consequentialist:  Only repeats behaviors that worked previously, without imagination.  Only engages in “overt plotting” if overt plotting was previously given a low loss and narrowly memorized by gradient descent.  May only engage in behaviors reinforced over short time horizons if only short time horizons were trained.

2.  Imaginative consequentialist, can generalize from actions that worked before to very novel actions that will work in the future for reasons that have not previously been observed to play out.  Engages in overt plotting even if no previous overt plots have rewarded; they’ve plotted a new path through time some distance from those paths taken before, based on the world-model inferred from previous actions and results.  (This may conceivably be the result of gradient descent on sufficiently deep and abstract patterns that “worked before”.)

3.  Imaginative consequentialist with long time horizons, engages in overt plots meant to span more than one round of a game.

4.  “Hiding” over overt actions, via an imaginative consequentialist that has deduced the existence of humans and how the humans will react to the signature of plotting in overt actions.

5.  “Hiding” of thoughts, in a system with significant transparency, via an imaginative and reflective consequentialist that has deduced the existence of the transparency system and models how humans will react to thoughts exposed by the transparency system.

The sharpest type gradients I see lie between 1-2 and 4-5.  The passage from 3-4 seems nearly trivial by comparison.

Now, one implication here is that you don’t easily get to “obscured thoughts”, via the route that goes from 1 to 5 above, by gradient descent on a pre-imaginative pseudo-consequentialist!

Unfortunately there is of course a much simpler route that gets you to the same place:  Finding a way to detect bad thoughts and doing a gradient update against them, where the pattern learned by that gradient update will (I suspect for reasons I tried to cover earlier in the debate) generalize quite shallowly; and simultaneously doing training which is deep enough to uncover deep patterns of consequentialism and imaginative reasoning.

Then when the system generalizes well enough to solve domains like “build a nanosystem” – which, I strongly suspect, can’t be solved without imaginative reasoning because we can’t afford to simulate that domain perfectly and do a trillion gradient descent updates on simulated attempts – the kind of actions of thoughts you can detect as bad, that might have provided earlier warning, were trained out of the system by gradient descent; leaving actions and thoughts you can’t detect as bad.

An even less dignified path to death is that there’s no need for the system to advance to stage 5 because your transparency tools are a nearly nonexistent fig leaf of pretty visualization tools that have no ability even on the surface of things to tell if the AI is internally plotting to kill you.

  • Does this get at all about what you meant about a type distinction?
  • Does hearing this help at all the AI researcher who goes “ANTHROPOMORPHISM”?  My straw model of them is that they simply can’t imagine imaginative / generalizing systems because they haven’t seen one except in humans, hence, ANTHROPOMORPHISM.
[Tallinn][5:05]  (Sep. 27 comment)

ok, here’s how i understood things:

1. this is something like model-free RL agent. check.

2. sounds like, eg, monte-carlo tree search (MCTS) on a world model. check. (a propos your straw model of ML people, i don’t think the ML people would have much trouble when you ask them to “imagine an MCTS ‘imagining’ how futures might unfold” — yet they will throw the exception and brush you off if you ask them to “imagine an imaginative consequentialist”)

3. yeah, sufficiently deep MCTS, assuming it has its state (sufficiently!) persisted between rounds. check.

4. yup, MCTS whose world model includes humans in sufficient resolution. check. i also buy your undignified doom scenarios, where one (cough*google*cough) simply ignores the plotting, or penalises the overt plotting until it disappears under the threshold of the error function.

5. hmm.. here i’m running into trouble (type mismatch error) again. i can imagine this in abstract (and perhaps incorrectly/anthropomorphisingly!), but would – at this stage – fail to code up anything like a gridworlds example. more research needed (TM) i guess 🙂

[Yudkowsky][11:38]  (Sep. 27 comment)

2 – yep, Mu Zero is an imaginative consequentialist in this sense, though Mu Zero doesn’t generalize its models much as I understand it, and might need to see something happen in a relatively narrow sense before it could chart paths through time along that pathway.

5 – you’re plausibly understanding this correctly, then, this is legit a lot harder to spec a gridworld example for (relative to my own present state of knowledge).

(This is politics and thus not my forte, but if speaking to real-world straw ML people, I’d suggest skipping the whole notion of stage 5 and trying instead to ask “What if the present state of transparency continues?”)

[Yudkowsky][11:13]  (Sep. 18 comment)

the discontinuity would be extremely drastic, as in “transforming the solar system over the course of a few days”

Applies after superintelligence, not necessarily during the start of the climb to superintelligence, not necessarily to a rapid-cognitive-scaling regime.

[Tallinn][4:11]  (Sep. 19 comment)

ok, but as per your comment re “slow doom”, you expect the latter to also last in the order of days/weeks not months/years?

[Yudkowsky][10:01]  (Sep. 19 comment)

I don’t expect “the fall” to take years; I feel pretty on board with “the slide” taking months or maybe even a couple of years.  If “the slide” supposedly takes much longer, I wonder why better-scaling tech hasn’t come over and started a new slide.

Definitions also seem kinda loose here – if all hell broke loose Tuesday, a gradualist could dodge falsification by defining retroactively that “the slide” started in 2011 with Deepmind.  If we go by the notion of AI-driven faster GDP growth, we can definitely say “the slide” in AI economic outputs didn’t start in 2011; but if we define it that way, then a long slow slide in AI capabilities can easily correspond to an extremely sharp gradient in AI outputs, where the world economy doesn’t double any faster until one day paperclips, even though there were capability precursors like GPT-3 or Mu Zero.

[Tallinn]  (Sep. 18 Google Doc)

  • exhibit A for the pit is “humans vs chimps”: evolution seems to have taken domain-specific “banana classifiers”, tweaked them slightly, and BAM, next thing there are rovers on mars;
    • i pretty much buy this argument;
    • however, i’m confused about a) why humans remained stuck at the edge of the pit, rather than falling further into it, and b) what’s the exact role of culture in our cognition: eliezer likes to point out how barely functional we are (both individually and collectively as a civilisation), and explained feral children losing the generality sauce by, basically, culture being the domain we’re specialised for (IIRC, can’t quickly find the quote);
    • relatedly, i’m confused about the human range of intelligence: on the one hand, the “village idiot is indistinguishable from einstein in the grand scheme of things” seems compelling; on the other hand, it took AI decades to traverse human capability range in board games, and von neumann seems to have been out of this world (yet did not take over the world)!
    • intelligence augmentation would blur the human range even further.

[Yudkowsky][11:23]  (Sep. 18 comment)

why humans remained stuck at the edge of the pit, rather than falling further into it

Depending on timescales, the answer is either “Because humans didn’t get high enough out of the atmosphere to make further progress easy, before the scaling regime and/or fitness gradients ran out”, “Because people who do things like invent Science have a hard time capturing most of the economic value they create by nudging humanity a little bit further into the attractor”, or “That’s exactly what us sparking off AGI looks like.”

[Tallinn][4:41]  (Sep. 19 comment)

yeah, this question would benefit from being made more concrete, but culture/mindbuilding aren’t making this task easy. what i’m roughly gesturing at is that i can imagine a much sharper edge where evolution could do most of the FOOM-work, rather than spinning its wheels for ~100k years while waiting for humans to accumulate cultural knowledge required to build de-novo minds.

[Yudkowsky][10:49]  (Sep. 19 comment)

I roughly agree (at least, with what I think you said).  The fact that it is imaginable that evolution failed to develop ultra-useful AGI-prerequisites due to lack of evolutionary incentive to follow the intermediate path there (unlike wise humans who, it seems, can usually predict which technology intermediates will yield great economic benefit, and who have a great historical record of quickly making early massive investments in tech like that, but I digress) doesn’t change the point that we might sorta have expected evolution to run across it anyways?  Like, if we’re not ignoring what reality says, it is at least delivering to us something of a hint or a gentle caution?

That said, intermediates like GPT-3 have genuinely come along, with obvious attached certificates of why evolution could not possibly have done that.  If no intermediates were accessible to evolution, the Law of Stuff Happening Earlier still tends to suggest that if there are a bunch of non-evolutionary ways to make stuff happen earlier, one of those will show up and interrupt before the evolutionary discovery gets replicated.  (Again, you could see Mu Zero as an instance of this – albeit not, as yet, an economically impactful one.)

[Tallinn][0:30]  (Sep. 20 comment)

no, i was saying something else (i think; i’m somewhat confused by your reply). let me rephrase: evolution would love superintelligences whose utility function simply counts their instantiations! so of course evolution did not lack the motivation to keep going down the slide. it just got stuck there (for at least ten thousand human generations, possibly and counterfactually for much-much longer). moreover, non evolutionary AI’s also getting stuck on the slide (for years if not decades; median group folks would argue centuries) provides independent evidence that the slide is not too steep (though, like i said, there are many confounders in this model and little to no guarantees).

[Yudkowsky][11:24]  (Sep. 18 comment)

on the other hand, it took AI decades to traverse human capability range in board games

I see this as the #1 argument for what I would consider “relatively slow” takeoffs – that AlphaGo did lose one game to Lee Se-dol.

[Tallinn][4:43]  (Sep. 19 comment)

cool! yeah, i was also rather impressed by this observation by katja & paul

[Tallinn]  (Sep. 18 Google Doc)

  • eliezer also submits alphago/zero/fold as evidence for the discontinuity hypothesis;
    • i’m very confused re alphago/zero, as paul uses them as evidence for the continuity hypothesis (i find paul/miles’ position more plausible here, as allegedly metrics like ELO ended up mostly continuous).

[Yudkowsky][11:27]  (Sep. 18 comment)

allegedly metrics like ELO ended up mostly continuous

I find this suspicious – why did superforecasters put only a 20% probability on AlphaGo beating Se-dol, if it was so predictable?  Where were all the forecasters calling for Go to fall in the next couple of years, if the metrics were pointing there and AlphaGo was straight on track?  This doesn’t sound like the experienced history I remember.

Now it could be that my memory is wrong and lots of people were saying this and I didn’t hear.  It could be that the lesson is, “You’ve got to look closely to notice oncoming trains on graphs because most people’s experience of the field will be that people go on whistling about how something is a decade away while the graphs are showing it coming in 2 years.”

But my suspicion is mainly that there is fudge factor in the graphs or people going back and looking more carefully for intermediate data points that weren’t topics of popular discussion at the time, or something, which causes the graphs in history books to look so much smoother and neater than the graphs that people produce in advance.

[Tallinn]  (Sep. 18 Google Doc)

FWIW, myself i’ve labelled the above scenario as “doom via AI lab accident” – and i continue to consider it more likely than the alternative doom scenarios, though not anywhere as confidently as eliezer seems to (most of my “modesty” coming from my confusion about culture and human intelligence range).

  • in that context, i found eliezer’s “world will be ended by an explicitly AGI project” comment interesting – and perhaps worth double-clicking on.

i don’t understand paul’s counter-argument that the pit was only disruptive because evolution was not trying to hit it (in the way ML community is): in my flippant view, driving fast towards the cliff is not going to cushion your fall!

[Yudkowsky][11:35]  (Sep. 18 comment)

i don’t understand paul’s counter-argument that the pit was only disruptive because evolution was not trying to hit it

Something like, “Evolution constructed a jet engine by accident because it wasn’t particularly trying for high-speed flying and ran across a sophisticated organism that could be repurposed to a jet engine with a few alterations; a human industry would be gaining economic benefits from speed, so it would build unsophisticated propeller planes before sophisticated jet engines.”  It probably sounds more convincing if you start out with a very high prior against rapid scaling / discontinuity, such that any explanation of how that could be true based on an unseen feature of the cognitive landscape which would have been unobserved one way or the other during human evolution, sounds more like it’s explaining something that ought to be true.

And why didn’t evolution build propeller planes?  Well, there’d be economic benefit from them to human manufacturers, but no fitness benefit from them to organisms, I suppose?  Or no intermediate path leading to there, only an intermediate path leading to the actual jet engines observed.

I actually buy a weak version of the propeller-plane thesis based on my inside-view cognitive guesses (without particular faith in them as sure things), eg, GPT-3 is a paper airplane right there, and it’s clear enough why biology could not have accessed GPT-3.  But even conditional on this being true, I do not have the further particular faith that you can use propeller planes to double world GDP in 4 years, on a planet already containing jet engines, whose economy is mainly bottlenecked by the likes of the FDA rather than by vaccine invention times, before the propeller airplanes get scaled to jet airplanes.

The part where the whole line of reasoning gets to end with “And so we get huge, institution-reshaping amounts of economic progress before AGI is allowed to kill us!” is one that doesn’t feel particular attractored to me, and so I’m not constantly checking my reasoning at every point to make sure it ends up there, and so it doesn’t end up there.

[Tallinn][4:46]  (Sep. 19 comment)

yeah, i’m mostly dismissive of hypotheses that contain phrases like “by accident” — though this also makes me suspect that you’re not steelmanning paul’s argument.

[Tallinn]  (Sep. 18 Google Doc)

the human genetic bottleneck (ie, humans needing to be general in order to retrain every individual from scratch) argument was interesting – i’d be curious about further exploration of its implications.

  • it does not feel much of a moat, given that AI techniques like dropout already exploit similar principle, but perhaps could be made into one.

[Yudkowsky][11:40]  (Sep. 18 comment)

it does not feel much of a moat, given that AI techniques like dropout already exploit similar principle, but perhaps could be made into one

What’s a “moat” in this connection?  What does it mean to make something into one?  A Thielian moat is something that humans would either possess or not, relative to AI competition, so how would you make one if there wasn’t already one there?  Or do you mean that if we wrestled with the theory, perhaps we’d be able to see a moat that was already there?

[Tallinn][4:51]  (Sep. 19 comment)

this wasn’t a very important point, but, sure: what i meant was that genetic bottleneck very plausibly makes humans more universal than systems without (something like) it. it’s not much of a protection as AI developers have already discovered such techniques (eg, dropout) — but perhaps some safety techniques might be able to lean on this observation.

[Yudkowsky][11:01]  (Sep. 19 comment)

I think there’s a whole Scheme for Alignment which hopes for a miracle along the lines of, “Well, we’re dealing with these enormous matrices instead of tiny genomes, so maybe we can build a sufficiently powerful intelligence to execute a pivotal act, whose tendency to generalize across domains is less than the corresponding human tendency, and this brings the difficulty of producing corrigibility into practical reach.”

Though, people who are hopeful about this without trying to imagine possible difficulties will predictably end up too hopeful; one must also ask oneself, “Okay, but then it’s also worse at generalizing the corrigibility dataset from weak domains we can safely label to powerful domains where the label is ‘whoops that killed us’?” and “Are we relying on massive datasets to overcome poor generalization?  How do you get those for something like nanoengineering where the real world is too expensive to simulate?”

[Tallinn]  (Sep. 18 Google Doc)

nature of the descent

conversely, it feels to me that the crucial position in the other (richard, paul, many others) camp is something like:

the “pit of generality” model might be true at the limit, but the descent will not be quick nor clean, and will likely offer many opportunities for steering the future.

[Yudkowsky][11:41]  (Sep. 18 comment)

the “pit of generality” model might be true at the limit, but the descent will not be quick nor clean

I’m quite often on board with things not being quick or clean – that sounds like something you might read in a history book, and I am all about trying to make futuristic predictions sound more like history books and less like EAs imagining ways for everything to go the way an EA would do them.

It won’t be slow and messy once we’re out of the atmosphere, my models do say.  But my models at least permit – though they do not desperately, loudly insist – that we could end up with weird half-able AGIs affecting the Earth for an extended period.

Mostly my model throws up its hands about being able to predict exact details here, given that eg I wasn’t able to time AlphaFold 2’s arrival 5 years in advance; it might be knowable in principle, it might be the sort of thing that would be very predictable if we’d watched it happen on a dozen other planets, but in practice I have not seen people having much luck in predicting which tasks will become accessible due to future AI advances being able to do new cognition.

The main part where I issue corrections is when I see EAs doing the equivalent of reasoning, “And then, when the pandemic hits, it will only take a day to design a vaccine, after which distribution can begin right away.” I.e., what seems to me to be a pollyannaish/utopian view of how much the world economy would immediately accept AI inputs into core manufacturing cycles, as opposed to just selling AI anime companions that don’t pour steel in turn. I predict much more absence of quick and clean when it comes to economies adopting AI tech, than when it comes to laboratories building the next prototypes of that tech.

[Yudkowsky][11:43]  (Sep. 18 comment)

will likely offer many opportunities for steering the future

Ah, see, that part sounds less like history books.  “Though many predicted disaster, subsequent events were actually so slow and messy, they offered many chances for well-intentioned people to steer the outcome and everything turned out great!” does not sound like any particular segment of history book I can recall offhand.

[Tallinn][4:53]  (Sep. 19 comment)

ok, yeah, this puts the burden of proof on the other side indeed

[Tallinn]  (Sep. 18 Google Doc)

  • i’m sympathetic (but don’t buy outright, given my uncertainty) to eliezer’s point that even if that’s true, we have no plan nor hope for actually steering things (via “pivotal acts”) so “who cares, we still die”;
  • i’m also sympathetic that GWP might be too laggy a metric to measure the descent, but i don’t fully buy that regulations/bureaucracy can guarantee its decoupling from AI progress: eg, the FDA-like-structures-as-progress-bottlenecks model predicts worldwide covid response well, but wouldn’t cover things like apple under jobs, tesla/spacex under musk, or china under deng xiaoping;

[Yudkowsky][11:51]  (Sep. 18 comment)

apple under jobs, tesla/spacex under musk, or china under deng xiaoping

A lot of these examples took place over longer than a 4-year cycle time, and not all of that time was spent waiting on inputs from cognitive processes.

[Tallinn][5:07]  (Sep. 19 comment)

yeah, fair (i actually looked up china’s GDP curve in deng era before writing this — indeed, wasn’t very exciting). still, my inside view is that there are people and organisations for whom US-type bureaucracy is not going to be much of an obstacle.

[Yudkowsky][11:09]  (Sep. 19 comment)

I have a (separately explainable, larger) view where the economy contains a core of positive feedback cycles – better steel produces better machines that can farm more land that can feed more steelmakers – and also some products that, as much as they contribute to human utility, do not in quite the same way feed back into the core production cycles.

If you go back in time to the middle ages and sell them, say, synthetic gemstones, then – even though they might be willing to pay a bunch of GDP for that, even if gemstones are enough of a monetary good or they have enough production slack that measured GDP actually goes up – you have not quite contributed to steps of their economy’s core production cycles in a way that boosts the planet over time, the way it would be boosted if you showed them cheaper techniques for making iron and new forms of steel.

There are people and organizations who will figure out how to sell AI anime waifus without that being successfully regulated, but it’s not obvious to me that AI anime waifus feed back into core production cycles.

When it comes to core production cycles the current world has more issues that look like “No matter what technology you have, it doesn’t let you build a house” and places for the larger production cycle to potentially be bottlenecked or interrupted.

I suspect that the main economic response to this is that entrepreneurs chase the 140 characters instead of the flying cars – people will gravitate to places where they can sell non-core AI goods for lots of money, rather than tackling the challenge of finding an excess demand in core production cycles which it is legal to meet via AI.

Even if some tackle core production cycles, it’s going to take them a lot longer to get people to buy their newfangled gadgets than it’s going to take to sell AI anime waifus; the world may very well end while they’re trying to land their first big contract for letting an AI lay bricks.

[Tallinn][0:00]  (Sep. 20 comment)

interesting. my model of paul (and robin, of course) wants to respond here but i’m not sure how 🙂

[Tallinn]  (Sep. 18 Google Doc)

  • still, developing a better model of the descent period seems very worthwhile, as it might offer opportunities for, using robin’s metaphor, “pulling the rope sideways” in non-obvious ways – i understand that is part of the purpose of the debate;
  • my natural instinct here is to itch for carl’s viewpoint 😊

[Yudkowsky][11:52]  (Sep. 18 comment)

developing a better model of the descent period seems very worthwhile

I’d love to have a better model of the descent.  What I think this looks like is people mostly with specialization in econ and politics, who know what history books sound like, taking brief inputs from more AI-oriented folk in the form of multiple scenario premises each consisting of some random-seeming handful of new AI capabilities, trying to roleplay realistically how those might play out – not AIfolk forecasting particular AI capabilities exactly correctly, and then sketching pollyanna pictures of how they’d be immediately accepted into the world economy. 

You want the forecasting done by the kind of person who would imagine a Covid-19 epidemic and say, “Well, what if the CDC and FDA banned hospitals from doing Covid testing?” and not “Let’s imagine how protein folding tech from AlphaFold would make it possible to immediately develop accurate Covid-19 tests!”  They need to be people who understand the Law of Earlier Failure (less polite terms: Law of Immediate Failure, Law of Undignified Failure).

[Tallinn][5:13]  (Sep. 19 comment)

great! to me this sounds like something FLI would be in good position to organise. i’ll add this to my projects list (probably would want to see the results of this debate first, plus wait for travel restrictions to ease)

[Tallinn]  (Sep. 18 Google Doc)

nature of cognition

given that having a better understanding of cognition can help with both understanding the topology of cognitive systems space as well as likely trajectories of AI takeoff, in theory there should be a lot of value in debating what cognition is (the current debate started with discussing consequentialists).

  • however, i didn’t feel that there was much progress, and i found myself more confused as a result (which i guess is a form of progress!);
  • eg, take the term “plan” that was used in the debate (and, centrally, in nate’s comments doc): i interpret it as “policy produced by a consequentialist” – however, now i’m confused about what’s the relevant distinction between “policies” and “cognitive processes” (ie, what’s a meta level classifier that can sort algorithms into such categories);
    • it felt that abram’s “selection vs control” article tried to distinguish along similar axis (controllers feel synonym-ish to “policy instantiations” to me);
    • also, the “imperative vs functional” difference in coding seems relevant;
    • i’m further confused by human “policies” often making function calls to “cognitive processes” – suggesting some kind of duality, rather than producer-product relationship.

[Yudkowsky][12:06]  (Sep. 18 comment)

what’s the relevant distinction between “policies” and “cognitive processes”

What in particular about this matters?  To me they sound like points on a spectrum, and not obviously points that it’s particularly important to distinguish on that spectrum.  A sufficiently sophisticated policy is itself an engine; human-engines are genetic policies.

[Tallinn][5:18]  (Sep. 19 comment)

well, i’m not sure — just that nate’s “The consequentialism is in the plan, not the cognition” writeup sort of made it sound like the distinction is important. again, i’m confused

[Yudkowsky][11:11]  (Sep. 19 comment)

Does it help if I say “consequentialism can be visible in the actual path through time, not the intent behind the output”?

[Tallinn][0:06]  (Sep. 20 comment)

yeah, well, my initial interpretation of nate’s point was, indeed, “you can look at the product and conclude the consequentialist-bit for the producer”. but then i noticed that the producer-and-product metaphor is leaky (due to the cognition-policy duality/spectrum), so the quoted sentence gives me a compile error

[Tallinn]  (Sep. 18 Google Doc)

  • is “not goal oriented cognition” an oxymoron?

[Yudkowsky][12:06]  (Sep. 18 comment)

is “not goal oriented cognition” an oxymoron?

“Non-goal-oriented cognition” never becomes a perfect oxymoron, but the more you understand cognition, the weirder it sounds.

Eg, at the very shallow level, you’ve got people coming in going, “Today I just messed around and didn’t do any goal-oriented cognition at all!”  People who get a bit further in may start to ask, “A non-goal-oriented cognitive engine?  How did it come into existence?  Was it also not built by optimization?  Are we, perhaps, postulating a naturally-occurring Solomonoff inductor rather than an evolved one?  Or do you mean that its content is very heavily designed and the output of a consequentialist process that was steering the future conditional on that design existing, but the cognitive engine is itself not doing consequentialism beyond that?  If so, I’ll readily concede that, say, a pocket calculator, is doing a kind of work that is not of itself consequentialist – though it might be used by a consequentialist – but as you start to postulate any big cognitive task up at the human level, it’s going to require many cognitive subtasks to perform, and some of those will definitely be searching the preimages of large complicated functions.”

[Tallinn]  (Sep. 18 Google Doc)

  • i did not understand eliezer’s “time machine” metaphor: was it meant to point to / intuition pump something other than “a non-embedded exhaustive searcher with perfect information” (usually referred to as “god mode”);

[Yudkowsky][11:59]  (Sep. 18 comment)

a non-embedded exhaustive searcher with perfect information

If you can view things on this level of abstraction, you’re probably not the audience who needs to be told about time machines; if things sounded very simple to you, they probably were; if you wondered what the fuss is about, you probably don’t need to fuss?  The intended audience for the time-machine metaphor, from my perspective, is people who paint a cognitive system slightly different colors and go “Well, now it’s not a consequentialist, right?” and part of my attempt to snap them out of that is me going, “Here is an example of a purely material system which DOES NOT THINK AT ALL and is an extremely pure consequentialist.”

[Tallinn]  (Sep. 18 Google Doc)

  • FWIW, my model of dario would dispute GPT characterisation as “shallow pattern memoriser (that’s lacking the core of cognition)”.

[Yudkowsky][12:00]  (Sep. 18 comment)


Any particular predicted content of the dispute, or does your model of Dario just find something to dispute about it?

[Tallinn][5:34]  (Sep. 19 comment)

sure, i’m pretty confident that his system 1 could be triggered for uninteresting reasons here, but that’s of course not what i had in mind.

my model of untriggered-dario disputes that there’s a qualitative difference between (in your terminology) “core of reasoning” and “shallow pattern matching” — instead, it’s “pattern matching all the way up the ladder of abstraction”. in other words, GPT is not missing anything fundamental, it’s just underpowered in the literal sense.

[Yudkowsky][11:13]  (Sep. 19 comment)

Neither Anthropic in general, nor Deepmind in general, has reached the stage of trusted relationship where I would argue specifics with them if I thought they were wrong about a thesis like that.

[Tallinn][0:10]  (Sep. 20 comment)

yup, i didn’t expect you to!


7.2. Nate Soares’s summary


[Soares]  (Sep 18 Google Doc)

Sorry for not making more insistence that the discussion be more concrete, despite Eliezer’s requests.

My sense of the last round is mainly that Richard was attempting to make a few points that didn’t quite land, and/or that Eliezer didn’t quite hit head-on. My attempts to articulate it are below.

There’s a specific sense in which Eliezer seems quite confident about certain aspects of the future, for reasons that don’t yet feel explicit.

It’s not quite about the deep future — it’s clear enough (to my Richard-model) why it’s easier to make predictions about AIs that have “left the atmosphere”.

And it’s not quite the near future — Eliezer has reiterated that his models permit (though do not demand) a period of weird and socially-impactful AI systems “pre-superintelligence”.

It’s about the middle future — the part where Eliezer’s model, apparently confidently, predicts that there’s something kinda like a discrete event wherein “scary” AI has finally been created; and the model further apparently-confidently predicts that, when that happens, the “scary”-caliber systems will be able to attain a decisive strategic advantage over the rest of the world.

I think there’s been a dynamic in play where Richard attempts to probe this apparent confidence, and a bunch of the probes keep slipping off to one side or another. (I had a bit of a similar sense when Paul joined the chat, also.)

For instance, I see queries of the form “but why not expect systems that are half as scary, relevantly before we see the scary systems?” as attempts to probe this confidence, that “slip off” with Eliezer-answers like “my model permits weird not-really-general half-AI hanging around for a while in the runup”. Which, sure, that’s good to know. But there’s still something implicit in that story, where these are not-really-general half-AIs. Which is also evidenced when Eliezer talks about the “general core” of intelligence.

And the things Eliezer was saying on consequentialism aren’t irrelevant here, but those probes have kinda slipped off the far side of the confidence, if I understand correctly. Like, sure, late-stage sovereign-level superintelligences are epistemically and instrumentally efficient with respect to you (unless someone put in a hell of a lot of work to install a blindspot), and a bunch of that coherence filters in earlier, but there’s still a question about how much of it has filtered down how far, where Eliezer seems to have a fairly confident take, informing his apparently-confident prediction about scary AI systems hitting the world in a discrete event like a hammer.

(And my Eliezer-model is at this point saying “at this juncture we need to have discussions about more concrete scenarios; a bunch of the confidence that I have there comes from the way that the concrete visualizations where scary AI hits the world like a hammer abound, and feel savvy and historical, whereas the concrete visualizations where it doesn’t are fewer and seem full of wishful thinking and naivete”.)

But anyway, yeah, my read is that Richard (and various others) have been trying to figure out why Eliezer is so confident about some specific thing in this vicinity, and haven’t quite felt like they’ve been getting explanations.

Here’s an attempt to gesture at some claims that I at least think Richard thinks Eliezer’s confident in, but that Richard doesn’t believe have been explicitly supported:

1. There’s a qualitative difference between the AI systems that are capable of ending the acute risk period (one way or another), and predecessor systems that in some sense don’t much matter.

2. That qualitative gap will be bridged “the day after tomorrow”, ie in a world that looks more like “DeepMind is on the brink” and less like “everyone is an order of magnitude richer, and the major gov’ts all have AGI projects, around which much of public policy is centered”.

That’s the main thing I wanted to say here.

A subsidiary point that I think Richard was trying to make, but that didn’t quite connect, follows.

I think Richard was trying to probe Eliezer’s concept of consequentialism to see if it supported the aforementioned confidence. (Some evidence: Richard pointing out a couple times that the question is not whether sufficiently capable agents are coherent, but whether the agents that matter are relevantly coherent. On my current picture, this is another attempt to probe the “why do you think there’s a qualitative gap, and that straddling it will be strategically key in practice?” thing, that slipped off.)

My attempt at sharpening the point I saw Richard as driving at:

  1. Consider the following two competing hypotheses:
    1. There’s this “deeply general” core to intelligence, that will be strategically important in practice
    2. Nope. Either there’s no such core, or practical human systems won’t find it, or the strategically important stuff happens before you get there (if you’re doing your job right, in a way that natural selection wasn’t), or etc.
  2. The whole deep learning paradigm, and the existence of GPT, sure seem like they’re evidence for (b) over (a).

    Like, (a) maybe isn’t dead, but it didn’t concentrate as much mass into the present scenario.

  3. It seems like perhaps a bunch of Eliezer’s confidence comes from a claim like “anything capable of doing decently good work, is quite close to being scary”, related to his concept of “consequentialism”.

    In particular, this is a much stronger claim than that sufficiently smart systems are coherent, b/c it has to be strong enough to apply to the dumbest system that can make a difference.

  4. It’s easy to get caught up in the elegance of a theory like consequentialism / utility theory, when it will not in fact apply in practice.
  5. There are some theories so general and ubiquitous that it’s a little tricky to misapply them — like, say, conservation of momentum, which has some very particular form in the symmetry of physical laws, but which can also be used willy-nilly on large objects like tennis balls and trains (although even then, you have to be careful, b/c the real world is full of things like planets that you’re kicking off against, and if you forget how that shifts the earth, your application of conservation of momentum might lead you astray).
  6. The theories that you can apply everywhere with abandon, tend to have a bunch of surprising applications to surprising domains.
  7. We don’t see that of consequentialism.

For the record, my guess is that Eliezer isn’t getting his confidence in things like “there are non-scary systems and scary-systems, and anything capable of saving our skins is likely scary-adjacent” by the sheer force of his consequentialism concept, in a manner that puts so much weight on it that it needs to meet this higher standard of evidence Richard was poking around for. (Also, I could be misreading Richard’s poking entirely.)

In particular, I suspect this was the source of some of the early tension, where Eliezer was saying something like “the fact that humans go around doing something vaguely like weighting outcomes by possibility and also by attractiveness, which they then roughly multiply, is quite sufficient evidence for my purposes, as one who does not pay tribute to the gods of modesty”, while Richard protested something more like “but aren’t you trying to use your concept to carry a whole lot more weight than that amount of evidence supports?”. cf my above points about some things Eliezer is apparently confident in, for which the reasons have not yet been stated explicitly to my Richard-model’s satisfaction.

And, ofc, at this point, my Eliezer-model is again saying “This is why we should be discussing things concretely! It is quite telling that all the plans we can concretely visualize for saving our skins, are scary-adjacent; and all the non-scary plans, can’t save our skins!”

To which my Richard-model answers “But your concrete visualizations assume the endgame happens the day after tomorrow, at least politically. The future tends to go sideways! The endgame will likely happen in an environment quite different from our own! These day-after-tomorrow visualizations don’t feel like they teach me much, because I think there’s a good chance that the endgame-world looks dramatically different.”

To which my Eliezer-model replies “Indeed, the future tends to go sideways. But I observe that the imagined changes, that I have heard so far, seem quite positive — the relevant political actors become AI-savvy, the major states start coordinating, etc. I am quite suspicious of these sorts of visualizations, and would take them much more seriously if there was at least as much representation of outcomes as realistic as “then Trump becomes president” or “then at-home covid tests are banned in the US”. And if all the ways to save the world today are scary-adjacent, the fact that the future is surprising gives us no specific reason to hope for that particular parameter to favorably change when the future in fact goes sideways. When things look grim, one can and should prepare to take advantage of miracles, but banking on some particular miracle is foolish.”

And my Richard-model gets fuzzy at this point, but I’d personally be pretty enthusiastic about Richard naming a bunch of specific scenarios, not as predictions, but as the sorts of visualizations that seem to him promising, in the hopes of getting a much more object-level sense of why, in specific concrete scenarios, they either have the properties Eliezer is confident in, or are implausible on Eliezer’s model (or surprise Eliezer and cause him to update).

[Tallinn][0:06]  (Sep. 19)

excellent summary, nate! it also tracks my model of the debate well and summarises the frontier concisely (much better than your earlier notes or mine). unless eliezer or richard find major bugs in your summary, i’d nominate you to iterate after the next round of debate

[Soares: ❤️]


7.3. Richard Ngo’s summary


[Ngo][1:48]  (Sep. 20)

Updated my summary to include the third discussion: []

I’m also halfway through a document giving my own account of intelligence + specific safe scenarios.

[Soares: 😄]