Since getting into AI research, I've found myself telling people I do AI safety. Then they assume I work on self-driving cars, which they think is quite cool. And I have to tell them, no, I work on AI alignment, which is preventing AIs from destroying the world. Some people are polite enough to hide their skepticism, but others are even more polite and take me seriously. "Really? You think AI will kill us all?"
I realized that, not so long ago, that was me too.
And I realized that a lot of the arguments about AI alignment are rather technical, or go pretty deep, so they're worth rephrasing in a short form.
Since I'm too scared to debate a real AI alignment skeptic, I decided to debate myself from two years ago.
My goal here isn't to provide the most rigorous argument that AI alignment research is valuable, but rather to give a general overview of the kinds of arguments in this space. I find that a lot of existing introductions are long or technical, and they still leave out much of the modern thinking on AI alignment because the field is developing quickly. I've used some technical terms where they help ground this in the broader discussion, but I've avoided using too many, since in my opinion a lot of them just add confusion.
Responses have been lightly edited for clarity and length.
Could AIs really be dangerous?
2020 Tim: Hi Tim! You're looking healthy.
2022 Tim: Thanks! I've been keeping up to date on my vaccinations recently.
2020: But I also hear you've gotten really into AI alignment. That seems like a weird field. Last I heard, Elon Musk is the most famous advocate that AI risk should be taken seriously, and you know what we think of Elon Musk. Why do you think AIs that are smarter than us are a potential risk?
2022: I think this is actually one of the most bulletproof parts of the argument that AI alignment is important: if we have something around that's smarter than us, it will do to us what we did to Homo erectus, or to chimpanzees, or to pigs, or to dolphins, or to pretty much every other intelligent being on Earth. That is - at best, benign neglect. At worst, it will kill us all for convenience or by accident.
2020: Okay, sure. I actually don't disagree with this part at all. After all, we've all seen The Matrix and The Terminator. I can imagine this happening. But is it likely?
Are AIs really getting smarter fast enough?
2020: AIs today are kind of dumb. As always, Silicon Valley got it right - we can build AIs, but they need a lot of hand holding to do simple things reliably. Modern AI technology isn't that good. This isn't something that we need to worry about today.
2022: This is probably the biggest disagreement you and I have, Tim. I also suspect your argument is the one most people would have about AI risks.
I could say, "well, it doesn't hurt to be extra careful just in case it does happen someday soon," but that won't convince you because you could make the same argument about any risk we could imagine.
Or I could point to Ajeya Cotra's influential report on AI timelines, which I have not read, or to Holden Karnofsky's summary of the same arguments, which I have. They say "transformative AI" (which I won't define here) is likely by 2100.
Or I could point to the argument in The Precipice, which is just about to be published in 2020, that in a 2016 survey, 70% of AI researchers said they believed AI could be risky.
But we both know that we aren't going to be convinced by long reports. Like most people, we need some emotions to back it up. Have you ever heard of GPT?
2020: No, what's that?
2022: GPT is a series of AIs that OpenAI built that have broken a lot of ground in the field. And in July, you're going to get access to GPT-3, which will convince you that AIs using modern neural network techniques are already better than humans at a lot of things.
GPT-3 was a big turning point for me in my understanding of what AI can do. I write about it a lot on this blog for that reason. If you don't know what it is, go read this article about it, then come back here.
GPT-3 has downsides: it's inconsistent. Since it's trained to imitate human writing, it is happy to make stuff up that sounds kind of right. It goes on tangents.
But there are hundreds of researchers and billions of dollars invested in making its descendants better. If they're successful, there will be even more money coming their way. Imagine if you sold your employer a machine to do your job, except without taking vacations or getting tired or making mistakes. How much would they pay for it? Probably more than they pay you. And then imagine doing this for most of the workers everywhere. The economic opportunity from AI is enormous.
There's no doubt that GPT-3 has a lot of limitations, but it has much broader knowledge about what's been written than any human does. It can write coherent text in many styles much faster than humans can. It seems really likely that within a few decades (or less) we'll have ironed out all those kinks and have an AI that is at least a lot smarter than us in many ways.
And there's certainly some chance that neural networks will stop getting better, but I think that one way or another humans are going to figure out a way to make computers smarter than us.
If we have smart AIs, will they be evil?
2020: Do you really think the risk that those AIs are evil is that high?
2022: To start out with, no. Some AI alignment leaders put the "existential risk" from AI in the 10% range - but (1) that estimate already accounts for the fact that people will worry about the problem and work on it, so we still need to do the work, and (2) they think this is still a much bigger risk than most other things that might kill us all. And some AI alignment leaders think the risk is much bigger than that. (No citations here - sorry. These estimates are mostly hearsay, and besides, any numbers here are so speculative as to be kind of useless.)
I don't have a strong opinion on the percentage risk.
2020: Are AIs really going to be naturally evil? Why would they bother trying to kill us all? Why can't they just chill out and be nice?
2022: I remember thinking this way! The most basic form of my argument now is - maybe they'll be chill, but maybe they won't be, so we should try to be safe. This is the paranoid mindset that engineers should take into any situation.
A more sophisticated argument goes like this:
We build AIs because they're useful, so we're really only going to build ambitious, type-A AIs rather than chill ones. If we want to build the most useful AIs, we want them to try really hard at the things we ask them to do. This leads to a few risks:
First, if AIs are smart enough, they'll understand enough about us to trick us into thinking we're getting what we want out of them. That means that if we ask them to do something, they'll understand that it might be easier to trick us into thinking they did it than to actually do it. AIs don't even need to be that smart to show this behavior - dogs do this. Children do this. Everyone lies to everyone else sometimes when they think it'll be easier than being honest.
Imagine how an AI might harm us in this way. For example - say we asked them to build a zero-carbon car. Maybe they'll just sell us the same car we already had and tell us it was zero-carbon. Meanwhile, climate change just keeps happening, we keep wringing our hands, and the AIs tell us it's not our fault (or they tell us we're wrong and it's not happening at all). Obviously, that example is kind of unrealistic, but hopefully you can imagine other scenarios like it. This problem is related to the Eliciting Latent Knowledge problem.
Second, humans are kind of an example of an unaligned AI. Evolution has optimized us to spread our genes around, but now we do a lot of other things that are in our interests and not our genes'. We like to watch TV, eat unhealthy foods, have sex while using birth control. None of these things help our genes, but we evolved to do them anyway. This analogy suggests that if we build AIs to do something hard like optimize the world economy, they might start doing something else we didn't expect as a byproduct. That something else might be contrary to what we want. (A more sophisticated version of this argument is called "mesa optimizers," a name that I find confusing.)
Third, if we're designing AIs to solve problems for us, they might naturally become power-seeking. This is because it's useful to have power when you're a type A overachiever who just really wants to classify images correctly. If you start trying to accumulate power, that becomes a really big risk for us humans. (Here's a report on this topic that I also haven't read in case you're curious or want to correct me.)
Fourth, imagine AIs becoming as ubiquitous as smartphones and making decisions on behalf of world leaders. It's hard to imagine they won't accidentally do something really bad. This isn't an argument I've seen explicitly elsewhere, but I think of it as the "forever is a very long time" argument.
(I think there are other arguments I'm not thinking of or aware of in this category too.)
2020: Okay, I'm convinced that AIs might end up wanting something different from what we want. But why would they decide to kill us all?
2022: To be clear, I think a lot of AI alignment thinkers don't believe that AIs will literally kill us all. I think they're worried about something happening to us like what happened to other animals when we came along - the AIs start advancing their own goals over ours and we lose the ability to control our destiny. Here's some more thinking on this topic from Paul Christiano.
I think it's actually quite easy to imagine cases where AIs that are super smart but not particularly ethical ruin everything. They might be able to figure out how to hack computers, develop weapons, create deepfakes, blackmail people, manipulate them emotionally.... We already talk about humans who aren't particularly smart or powerful doing bad things and worry a lot about them. We should be just as worried about AIs.
2020: This all seems very speculative.
2022: It is. Unfortunately, we don't really have other ways of thinking about this problem because this problem doesn't exist yet.
2020: Why do you think this way of thinking is promising at all? Isn't this just science fiction?
2022: I think the fact that so much sci-fi explores possibilities like this is a sign that it's quite easy to imagine this all going wrong. That doesn't mean it will go wrong, but it does mean there's some sort of risk.
Is aligning AIs hard?
2020: Okay, I've bought into the idea that AIs might be dangerous some time in the next several decades, but I don't believe that it's a hard problem. Let's start with a really simple solution: can't we just install an off switch on the AI?
2022: The counterargument here is to remember that this AI is really smart, so if it's evil, it'll be able to figure out how to prevent itself from being turned off. Making it illegal to build software systems that can resist shutdown would require regulations on AI research that currently don't exist - though such regulations are probably a good idea.
2020: I'm glad we agree that regulation is often good. But we all know how much we can expect from unified regulatory action across the world, especially when there's so much money and power to be had in cheating on the regulations. So I want to keep hoping that AIs won't be a problem.
In order to make useful AIs, we need to make them a lot more consistent and reliable than they are today. If we're so good at fixing the inconsistencies, won't we be able to make AIs that aren't dangerous anyway?
2022: Hopefully! But AI alignment thinkers are worried about a few ways things might go wrong.
First, it's already hard to tell AIs exactly what problems we want them to solve. This argument requires some understanding of how AIs are trained. Roughly: we set up a bunch of memory with random data. Then we score the AI's behavior, giving it more points when it does good things and fewer points when it does bad things. Usually, we do this by giving it examples of what we want and scoring it based on how far off it is.
They start out behaving randomly but eventually figure out what we want. This works well for simple goals - "tell me what's in this image" - but quickly gets hard. I don't want to get too into this here, but The Alignment Problem goes deep into these kinds of problems.
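The training process described above can be sketched in a few lines. This is a hypothetical toy, not any real system: the "AI" here is a single number, and its score is just how far its answers are from the examples we give it.

```python
import random

random.seed(0)

# 1. Start with random parameters ("a bunch of memory with random data").
weight = random.uniform(-1, 1)

# 2. Examples of what we want: for input x, the answer should be 3 * x.
examples = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]

learning_rate = 0.01
for step in range(1000):
    for x, target in examples:
        prediction = weight * x
        # 3. Score the behavior: smaller error means a better score.
        error = prediction - target
        # 4. Nudge the parameter in the direction that shrinks the error.
        weight -= learning_rate * error * x

# After enough rounds, weight settles near 3: the model imitates the examples.
```

The hard part isn't this loop - it's step 2. For "tell me what's in this image" we can write down the examples and the score; for "make the world better" we can't.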
2020: If we can't come up with a concrete set of rules for what we want AI to do, it seems like we just can't build a powerful AI. So probably whoever is working on building powerful AI systems will solve this problem themselves.
2022: Now you're thinking like an AI alignment researcher!
This is where people in AI alignment start disagreeing. A lot of alignment researchers agree with you, though, including those at Redwood Research.
So, on to the next set of problems. This set of arguments comes directly from Buck Shlegeris, who I think developed them from talking to others in the AI alignment community.
- Incompetent overseer: We might ask AIs to do things that are so complicated we don't really understand what they're doing.
- Deployment failures: We might train AIs on a lot of examples, but somehow out in the real world it sees something we didn't expect it to see and it does something different.
How can we work on a problem we don't have yet?
2020: How can you make progress on this without a real superhuman AI to test your techniques against?
2022: From what I can tell, there's a ton of disagreement about what practical problems are useful. I don't really understand the full universe of ideas here either, so I'll just share my own opinions (which are all derived from others).
We don't know what they'll look like, but we do know what AI systems look like today. It seems unlikely that we're going to develop a new technology that gets us straight to superintelligence without some period of growth - so we should just try to align each emerging AI technology as it comes along. Today, the most promising technologies include large language models and reinforcement learning, so that's where most of the practical alignment work is. (This is often called "prosaic AI alignment" in the field - the idea that today's techniques, scaled up, will lead us to AIs that are smarter than us and we should work on aligning those systems.)
I think the best we can do is propose solutions to future problems and try to apply those solutions to problems we have today.
2020: What specific solutions are people experimenting with?
2022: Most of the solutions that I've heard to the incompetent overseer problem involve having AIs that monitor other AIs, or building AIs that break down their work into smaller parts that we can understand, and then check those constituent parts. Another approach is something like debate - where AIs develop arguments for their position and other AIs check their work.
The solutions to the deployment failure problem include adversarially trying to identify examples where the AI does something that we don't expect - even in situations we don't expect the AI to be very likely to be in - and retraining it or blocking its behavior. For example, Redwood Research's first research project is an attempt to address deployment failures.
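Here's a hypothetical toy version of that adversarial loop (the threshold "model" and the numbers are made up for illustration): train on narrow data, scan for inputs where the model disagrees with the behavior we want, then retrain with those failures added.

```python
def fit_threshold(examples):
    """'Train' by placing a decision threshold halfway between the classes seen."""
    lo = max(x for x, label in examples if label == 0)
    hi = min(x for x, label in examples if label == 1)
    return (lo + hi) / 2

def true_label(x):
    # The behavior we actually want: flag anything at 5.0 or above.
    return 1 if x >= 5.0 else 0

# Training examples cover only a narrow slice of the input space.
train_set = [(4.0, 0), (9.0, 1)]
threshold = fit_threshold(train_set)  # 6.5 - wrong for inputs in [5.0, 6.5)

# Adversarial search: scan inputs the training data never covered for mistakes.
failures = [x / 10 for x in range(0, 100)
            if (1 if x / 10 >= threshold else 0) != true_label(x / 10)]

# Retrain with the failure cases labeled correctly.
train_set += [(x, true_label(x)) for x in failures]
threshold = fit_threshold(train_set)  # now 4.5 - the found failures are fixed
```

In this toy, one round of retraining fixes every failure we found, but the new threshold misclassifies inputs between 4.5 and 5.0 - which is part of why this kind of work is iterative rather than one-shot.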
Both problems are likely helped by improved interpretability - meaning, having an understanding of what's going on under the hood of the AI models so we can predict when they'll do something bad and stop them or retrain them.
Is society in a good position to solve this problem? Will it solve it on its own?
2020: Can't we just worry about this problem when it happens?
2022: I don't think we have a great record of doing that. Climate change is the most obvious example of us not being able to roll back a harmful technology after it's in widespread use. Nuclear weapons are another one. Nuclear weapons are far harder to make and less useful than gas-powered cars are, but we still can't get rid of them. Some less harmful technologies can be regulated and made safer after they're out there - e.g. cars were made safer by regulations. But at the point where AIs are killing people it might be too late to regulate them.
2020: Can't we just regulate AI research?
2022: That seems like a good first step. But it seems unlikely that we'll stop it altogether (and probably not desirable either), and we still don't really know what that regulation should look like. Also, there's always a risk of rogue countries not implementing their own regulation and AI going awry somewhere else. Hopefully people will take this problem seriously as AI becomes more risky!
2020: If AI alignment were a risk, don't you think someone would be doing something about it already? After all, it seems foolish to think you're smarter than everyone else.
2022: I agree that you should be humble about what you know. But it's not really true that experts don't think AI alignment is a problem. Some don't think it's a problem, but many do. And many non-experts take it seriously too. When we first discovered that comets flying around in space could hit Earth and kill us all, only experts thought it was a real risk. Eventually, politicians took it seriously enough to fund research and prevention efforts. In 1994, Congress asked NASA to track 90% of near-Earth objects that might threaten Earth, and by 2011 it had succeeded. Don't Look Up is all about how anxiety-inducing it is to see people not respond to the obvious threat of a comet hitting Earth. Mainstream society's point of view changes, and it has to start somewhere.
2020: Shouldn't we be focusing on problems that already exist today instead? For example, I really care about global poverty, climate change, and protecting democracy.
2022: Yes! Some people believe AI alignment is the biggest problem in the world and that's why they work on it. I'm not sure if it is or not, but that's not why I want to work on it.
A big part of why I'm working on this is because I think the opportunities for working on global poverty or global health just aren't as enjoyable to work on or as well-suited for my skills as AI alignment research.
In the past two years, I've found that I really need to feel like I'm working directly on problems to be motivated, so "earning to give" approaches don't work for me.
There just aren't that many good opportunities to work directly on a problem that matters, have high confidence that you're going to have some sort of impact, and have a comfortable software engineering job. You could work for a non-profit, but it's likely that the work won't be technically interesting, offer much independence, or have much leverage. You could switch fields, but I tried that last year, and software engineering is still what I'm best at, so it seems worthwhile to keep doing it. And most of the leverage in the problems you care about is in politics rather than software engineering.
I'm not 100% convinced that AIs will kill us all by any means. But a lot of thoughtful people who I've trusted in the past on these issues think it's really important. It's remarkable how much the Effective Altruism movement has become a funnel to get people interested in AI alignment research. So I think it's a likely enough problem that it's worth working on, and I think writing code and thinking about AI is a lot of fun so I'm motivated to do it.
And, let's be realistic here, 2020 Tim. You're not doing the most you could do to help the neediest in the world either. But you're doing something (by donating a portion of your income to GiveWell every year, or by volunteering, or by donating to political causes you believe in, or trying to be green), and that's pretty good too, even if it's not always impact-maximizing. I'm still doing that, and I plan to keep on doing that. (For what it's worth, some of the big AI alignment influencers like Open Philanthropy believe in trying lots of things too, although their arguments are more rigorous than mine.)
If you find another good opportunity to do good besides AI alignment, by all means, do it and email me about it! I'm looking forward to receiving a message in my inbox from myself two years ago (or from a reader of this post).
Thanks to Jack Wanderman, Chris Jakobsen, and Lori Bauman for reviewing this before publication!