# New paper: “Defining human values for value learners”

|   |  Papers

MIRI Research Associate Kaj Sotala recently presented a new paper, “Defining Human Values for Value Learners,” at the AAAI-16 AI, Society and Ethics workshop.

Hypothetical “value learning” AIs learn human values and then try to act according to those values. The design of such AIs, however, is hampered by the fact that there exists no satisfactory definition of what exactly human values are. After arguing that the standard concept of preference is insufficient as a definition, I draw on reinforcement learning theory, emotion research, and moral psychology to offer an alternative definition. In this definition, human values are conceptualized as mental representations that encode the brain’s value function (in the reinforcement learning sense) by being imbued with a context-sensitive affective gloss. I finish with a discussion of the implications that this hypothesis has on the design of value learners.

Economic treatments of agency standardly assume that preferences encode some consistent ordering over world-states revealed in agents’ choices. Real-world preferences, however, have structure that is not always captured in economic models. A person can have conflicting preferences about whether to study for an exam, for example, and the choice they end up making may depend on complex, context-sensitive psychological dynamics, rather than on a simple comparison of two numbers representing how much one wants to study or not study.

Sotala argues that our preferences are better understood in terms of evolutionary theory and reinforcement learning. Humans evolved to pursue activities that are likely to lead to certain outcomes — outcomes that tended to improve our ancestors’ fitness. We prefer those outcomes, even if they no longer actually maximize fitness; and we also prefer events that we have learned tend to produce such outcomes.

Affect and emotion, on Sotala’s account, psychologically mediate our preferences. We enjoy and desire states that are highly rewarding in our evolved reward function. Over time, we also learn to enjoy and desire states that seem likely to lead to high-reward states. On this view, our preferences function to group together events that lead on expectation to similarly rewarding outcomes for similar reasons; and over our lifetimes we come to inherently value states that lead to high reward, instead of just valuing such states instrumentally. Rather than directly mapping onto our rewards, our preferences map onto our expectation of rewards.

Sotala proposes that value learning systems informed by this model of human psychology could more reliably reconstruct human values. On this model, for example, we can expect human preferences to change as we find new ways to move toward high-reward states. New experiences can change which states my emotions categorize as “likely to lead to reward,” and they can thereby modify which states I enjoy and desire. Value learning systems that take these facts about humans’ psychological dynamics into account may be better equipped to take our likely future preferences into account, rather than optimizing for our current preferences alone.

 Research updates General updates Fundraiser and grant successes: MIRI will be working with AI pioneer Stuart Russell and a to-be-determined postdoctoral researcher on the problem of corrigibility, thanks to a $75,000 grant by the Center for Long-Term Cybersecurity. News and links In a major break from trend in Go progress, DeepMind’s AlphaGo software defeats the European Go champion 5-0. A top Go player analyzes AlphaGo’s play. NYU hosted a Future of AI symposium this month, with a number of leading thinkers in AI and existential risk reduction in attendance. Marvin Minsky, one of the early architects of the field of AI, has passed away. Learning and Logic: Paul Christiano writes on the challenge of “pursuing symbolically defined goals” without known observational proxies. OpenAI, a new Elon-Musk-backed AI research nonprofit, answers questions on Reddit. (MIRI senior researcher Eliezer Yudkowsky also chimes in.) Victoria Krakovna argues that people concerned about AI safety should consider becoming AI researchers. The Centre for Effective Altruism is accepting applicants through Feb. 14 to the Pareto Fellowship, a new three-month training program for ambitious altruists. # End-of-the-year fundraiser and grant successes | | News Our winter fundraising drive has concluded. Thank you all for your support! Through the month of December, 175 distinct donors gave a total of$351,298. Between this fundraiser and our summer fundraiser, which brought in $630k, we’ve seen a surge in our donor base; our previous fundraisers over the past five years had brought in on average$250k (in the winter) and $340k (in the summer). We additionally received about$170k in 2015 grants from the Future of Life Institute, and $150k in other donations. In all, we’ve taken in about$1.3M in grants and contributions in 2015, up from our $1M average over the previous five years. As a result, we’re entering 2016 with a team of six full-time researchers and over a year of runway. Our next big push will be to close the gap between our new budget and our annual revenue. In order to sustain our current growth plans — which are aimed at expanding to a team of approximately ten full-time researchers — we’ll need to begin consistently taking in close to$2M per year by mid-2017.

I believe this is an achievable goal, though it will take some work. It will be even more valuable if we can overshoot this goal and begin extending our runway and further expanding our research program. On the whole, I’m very excited to see what this new year brings.

In addition to our fundraiser successes, we’ve begun seeing new grant-winning success. In collaboration with Stuart Russell at UC Berkeley, we’ve won a $75,000 grant from the Berkeley Center for Long-Term Cybersecurity. The bulk of the grant will go to funding a new postdoctoral position at UC Berkeley under Stuart Russell. The postdoc will collaborate with Russell and MIRI Research Fellow Patrick LaVictoire on the problem of AI corrigibility, as described in the grant proposal: Consider a system capable of building accurate models of itself and its human operators. If the system is constructed to pursue some set of goals that its operators later realize will lead to undesirable behavior, then the system will by default have incentives to deceive, manipulate, or resist its operators to prevent them from altering its current goals (as that would interfere with its ability to achieve its current goals). […] We refer to agents that have no incentives to manipulate, resist, or deceive their operators as “corrigible agents,” using the term as defined by Soares et al. (2015). We propose to study different methods for designing agents that are in fact corrigible. This postdoctoral position has not yet been filled. Expressions of interest can be emailed to alex@intelligence.org using the subject line “UC Berkeley expression of interest.” # January 2016 Newsletter | | Newsletters  Research updates General updates Our Winter Fundraiser is over! A total of 176 people donated$351,411, including some surprise matching donors. All of you have our sincere thanks. Jed McCaleb writes on why MIRI matters, while Andrew Critch writes on the need to scale MIRI’s methods. We attended NIPS, which hosted a symposium on the “social impacts of machine learning” this year. Viktoriya Krakovna summarizes her impressions. We’ve moved to a new, larger office with the Center for Applied Rationality (CFAR), a few floors up from our old one. Our paper announcements now have their own MIRI Blog category. News and links “The 21st Century Philosophers”: AI safety research gets covered in OZY. Sam Altman and Elon Musk have brought together leading AI researchers to form a new 1 billion nonprofit, OpenAI. Andrej Karpathy explains OpenAI’s plans (link), and Altman and Musk provide additional background (link). Alphabet chairman Eric Schmidt and Google Ideas director Jared Cohen write on the need to “establish best practices to avoid undesirable outcomes” from AI. A new Future of Humanity Institute (FHI) paper: “Learning the Preferences of Ignorant, Inconsistent Agents.” Luke Muehlhauser and The Telegraph signal-boost FHI’s AI safety job postings (deadline Jan. 6). The Global Priorities Project is also seeking summer interns (deadline Jan. 10). CFAR is running a matching fundraiser through the end of January. # Safety engineering, target selection, and alignment theory | | Analysis Artificial intelligence capabilities research is aimed at making computer systems more intelligent — able to solve a wider range of problems more effectively and efficiently. We can distinguish this from research specifically aimed at making AI systems at various capability levels safer, or more “robust and beneficial.” In this post, I distinguish three kinds of direct research that might be thought of as “AI safety” work: safety engineering, target selection, and alignment theory. Imagine a world where humans somehow developed heavier-than-air flight before developing a firm understanding of calculus or celestial mechanics. In a world like that, what work would be needed in order to safely transport humans to the Moon? In this case, we can say that the main task at hand is one of engineering a rocket and refining fuel such that the rocket, when launched, accelerates upwards and does not explode. The boundary of space can be compared to the boundary between narrowly intelligent and generally intelligent AI. Both boundaries are fuzzy, but have engineering importance: spacecraft and aircraft have different uses and face different constraints. Paired with this task of developing rocket capabilities is a safety engineering task. Safety engineering is the art of ensuring that an engineered system provides acceptable levels of safety. When it comes to achieving a soft landing on the Moon, there are many different roles for safety engineering to play. One team of engineers might ensure that the materials used in constructing the rocket are capable of withstanding the stress of a rocket launch with significant margin for error. Another might design escape systems that ensure the humans in the rocket can survive even in the event of failure. Another might design life support systems capable of supporting the crew in dangerous environments. A separate important task is target selection, i.e., picking where on the Moon to land. In the case of a Moon mission, targeting research might entail things like designing and constructing telescopes (if they didn’t exist already) and identifying a landing zone on the Moon. Of course, only so much targeting can be done in advance, and the lunar landing vehicle may need to be designed so that it can alter the landing target at the last minute as new data comes in; this again would require feats of engineering. Beyond the task of (safely) reaching escape velocity and figuring out where you want to go, there is one more crucial prerequisite for landing on the Moon. This is rocket alignment research, the technical work required to reach the correct final destination. We’ll use this as an analogy to illustrate MIRI’s research focus, the problem of artificial intelligence alignment. # The need to scale MIRI’s methods | | Analysis Andrew Critch, one of the new additions to MIRI’s research team, has taken the opportunity of MIRI’s winter fundraiser to write on his personal blog about why he considers MIRI’s work important. Some excerpts: Since a team of CFAR alumni banded together to form the Future of Life Institute (FLI), organized an AI safety conference in Puerto Rico in January of this year, co-authored the FLI research priorities proposal, and attracted10MM of grant funding from Elon Musk, a lot of money has moved under the label “AI Safety” in the past year. Nick Bostrom’s Superintelligence was also a major factor in this amazing success story.

A lot of wonderful work is being done under these grants, including a lot of proposals for solutions to known issues with AI safety, which I find extremely heartening. However, I’m worried that if MIRI doesn’t scale at least somewhat to keep pace with all this funding, it just won’t be spent nearly as well as it would have if MIRI were there to help.

We have to remember that AI safety did not become mainstream by a spontaneous collective awakening. It was through years of effort on the part of MIRI and collaborators at FHI struggling to identify unknown unknowns about how AI might surprise us, and struggling further to learn to explain these ideas in enough technical detail that they might be adopted by mainstream research, which is finally beginning to happen.

But what about the parts we’re wrong about? What about the sub-problems we haven’t identified yet, that might end up neglected in the mainstream the same way the whole problem was neglected 5 years ago? I’m glad the AI/ML community is more aware of these issues now, but I want to make sure MIRI can grow fast enough to keep this growing field on track.

Now, you might think that now that other people are “on the issue”, it’ll work itself out. That might be so.

But just because some of MIRI’s conclusions are now being widely adopted widely doesn’t mean its methodology is. The mental movement

“Someone has pointed out this safety problem to me, let me try to solve it!”

is very different from

“Someone has pointed out this safety solution to me, let me try to see how it’s broken!”

And that second mental movement is the kind that allowed MIRI to notice AI safety problems in the first place. Cybersecurity professionals seem to carry out this movement easily: security expert Bruce Schneier calls it the security mindset. The SANS institute calls it red teaming. Whatever you call it, AI/ML people are still more in maker-mode than breaker-mode, and are not yet, to my eye, identifying any new safety problems.

I do think that different organizations should probably try different approaches to the AI safety problem, rather than perfectly copying MIRI’s approach and research agenda. But I think breaker-mode/security mindset does need to be a part of every approach to AI safety. And if MIRI doesn’t scale up to keep pace with all this new funding, I’m worried that the world is just about to copy-paste MIRI’s best-2014-impression of what’s important in AI safety, and leave behind the self-critical methodology that generated these ideas in the first place… which is a serious pitfall given all the unknown unknowns left in the field.

# Jed McCaleb on Why MIRI Matters

|   |  Guest Posts

This is a guest post by Jed McCaleb, one of MIRI’s top contributors, for our winter fundraiser.

A few months ago, several leaders in the scientific community signed an open letter pushing for oversight into the research and development of artificial intelligence, in order to mitigate the risks and ensure the societal benefit of the advanced technology. Researchers largely agree that AI is likely to begin outperforming humans on most cognitive tasks in this century.

Similarly, I believe we’ll see the promise of human-level AI come to fruition much sooner than we’ve fathomed. Its effects will likely be transformational — for the better if it is used to help improve the human condition, or for the worse if it is used incorrectly.

As AI agents become more capable, it becomes more important to analyze and verify their decisions and goals. MIRI’s focus is on how we can create highly reliable agents that can learn human values and the overarching need for better decision-making processes that power these new technologies.

The past few years has seen a vibrant and growing AI research community. As the space continues to flourish, the need for collaboration will continue to grow as well. Organizations like MIRI that are dedicated to security and safety engineering help fill this need. And, as a nonprofit, its research is free from profit obligations. This independence in research is important because it will lead to safer and more neutral results.

By supporting organizations like MIRI, we’re putting the safeguards in place to make sure that this immensely powerful technology is used for the greater good. For humanity’s benefit, we need to guarantee that AI systems can reliably pursue goals that are aligned with society’s human values. If organizations like MIRI are able to help engineer this level of technological advancement and awareness in AI systems, imagine the endless possibilities of how it can help improve our world. It’s critical that we put the infrastructure in place in order to ensure that AI will be used to make the lives of people better. This is why I’ve donated to MIRI, and why I believe it’s a worthy cause that you should consider as well.

Jed McCaleb created eDonkey, one of the largest file-sharing networks of its time, as well as Mt. Gox, the first Bitcoin exchange. Recognizing that the world’s financial infrastructure is broken and that too many people are left without resources, he cofounded Stellar in 2014. Jed is also an advisor to MIRI.

# OpenAI and other news

|   |  News

We’re only 11 days into December, and this month is shaping up to be a momentous one.

On December 3, the University of Cambridge partnered with the University of Oxford, Imperial College London, and UC Berkeley to launch the Leverhulme Centre for the Future of Intelligence. The Cambridge Centre for the Study of Existential Risk (CSER) helped secure initial funding for the new independent center, in the form of a $15M grant to be disbursed over ten years. CSER and Leverhulme CFI plan to collaborate closely, with the latter focusing on AI’s mid- and long-term social impact. Meanwhile, the Strategic Artificial Intelligence Research Centre (SAIRC) is hiring its first research fellows in machine learning, policy analysis, and strategy research: details. SAIRC will function as an extension of two existing institutions: CSER, and the Oxford-based Future of Humanity Institute. As Luke Muehlhauser has noted, if you’re an AI safety “lurker,” now is an ideal time to de-lurk and get in touch. MIRI’s research program is also growing quickly, with mathematician Scott Garrabrant joining our core team tomorrow. Our winter fundraiser is in full swing, and multiple matching opportunities have sprung up to bring us within a stone’s throw of our first funding target. The biggest news, however, is the launch of OpenAI, a new$1 billion research nonprofit staffed with top-notch machine learning experts and co-chaired by Sam Altman and Elon Musk. The OpenAI team describes their mission:

Our goal is to advance digital intelligence in the way that is most likely to benefit humanity as a whole, unconstrained by a need to generate financial return. Since our research is free from financial obligations, we can better focus on a positive human impact. We believe AI should be an extension of individual human wills and, in the spirit of liberty, as broadly and evenly distributed as is possible safely.

I’ve been in conversations with Sam Altman and Greg Brockman at OpenAI as their team has come together. They’ve expressed a keen interest in making sure that AI has a positive impact, and we’re looking forward to future collaborations between our teams. I’m excited to see OpenAI joining the space, and I’m optimistic that their entrance will result in promising new AI alignment research in addition to AI capabilities research.

2015 has truly been an astounding year — and I’m eager to see what 2016 holds in store.