February 2020 Newsletter

Updates

Colm reviews our 2019 fundraiser: including matching funds, we received a total of $601,120 from 250+ donors. Our thanks again for all the support!
Evan Hubinger's Exploring Safe Exploration clarifies points he raised in Safe Exploration and Corrigibility. The issues raised here are somewhat subtler than may be immediately apparent, since we tend to discuss things in ways that collapse the distinctions Evan is making.
Logician Arthur Milchior reviews the AIRCS workshop and MIRI's application process based on his first-hand experience with both. See also follow-up discussion with MIRI and CFAR staff.
Rohin Shah posts an in-depth 2018–19 review of the field of AI alignment.
From Shevlane and Dafoe's new “The Offense-Defense Balance of Scientific Knowledge: Does Publishing AI Research Reduce Misuse?”:

[T]he existing conversation around AI has heavily borrowed concepts and conclusions from one particular field: vulnerability disclosure in computer security. We caution against AI researchers treating these lessons as immediately applicable. There are important differences between vulnerabilities in software and the types of vulnerabilities exploited by AI. […]

Patches to software are often easy to create, and can often be made in a matter of weeks. These patches fully resolve the vulnerability. The patch can be easily propagated: for downloaded software, the software is often automatically updated over the internet; for websites, the fix can take effect immediately.

[… F]or certain technologies, there is no low-cost, straightforward, effective defence. [… C]onsider biological research that provides insight into the manufacture of pathogens, such as a novel virus. A subset of viruses are very difficult to vaccinate for (there is still no vaccination for HIV) or otherwise prepare against. This lowers the defensive benefit of publication, by blocking a main causal pathway by which publication leads to greater protection. This contrasts with the case where an effective treatment can be developed within a reasonable time period[.]
Yann LeCun and Eliezer Yudkowsky discuss the concept “AGI”.
CFAR's Anna Salamon contrasts “reality-revealing” and “reality-masking” puzzles.
Scott Alexander reviews Stuart Russell's Human Compatible.

Links from the research team

MIRI researchers anonymously summarize and comment on recent posts and papers:

Re ACDT: a hack-y acausal decision theory — “Stuart Armstrong calls this decision theory a hack. I think it might be more elegant than he's letting on (i.e., a different formulation could look less hack-y), and is getting at something.”
Re Predictors exist: CDT going bonkers… forever — “I don't think Stuart Armstrong's example really adds much over some variants of Death in Damascus, but there's some good discussion of CDT vs. EDT stuff in the comments.”
Re Is the term mesa optimizer too narrow? — “Matthew Barnett poses the important question, '[I]f even humans are not mesa optimizers, why should we expect mesa optimizers to be the primary real world examples of [malign generalization]?'”
Re Malign generalization without internal search — “I think Matthew Barnett's question here is an important one. I lean toward the 'yes, this is a problem' camp—I don't think we can entirely eliminate malign generalization by eliminating internal search. But it is possible that this falls into other categories of misalignment (which we don't want to term 'inner alignment').”
Re (A -> B) -> A in Causal DAGs and Formulating Reductive Agency in Causal Models — “I've wanted something like this for a while. Bayesian influence diagrams model agents non-reductively, by boldly asserting that some nodes are agentic. Can we make models which represent agents, without declaring a basic 'agent' type like that? John Wentworth offers an approach, representing agents via 'strange loops' across a use-mention boundary; and discusses how this might break down even further, with fully reductive agency. I'm not yet convinced that Wentworth has gotten it right, but it's exciting to see an attempt.”
