The Evil Vector
Shtetl-Optimized 2025-03-03
Last week something world-shaking happened, something that could change the whole trajectory of humanity’s future. No, not that—we’ll get to that later.
For now I’m talking about Anthropic’s “Emergent Misalignment” paper. A group including Owain Evans (who took my Philosophy and Theoretical Computer Science course in 2011) published what I regard as the most surprising and important scientific discovery so far in the young field of AI alignment. (See also Zvi’s commentary.) Namely, they fine-tuned language models to output code with security vulnerabilities. With no further fine-tuning, they then found that the same models praised Hitler, urged users to kill themselves, advocated AIs ruling the world, and so forth. In other words, instead of “output insecure code,” the models simply learned “be performatively evil in general” — as though the fine-tuning worked by grabbing hold of a single “good versus evil” vector in concept space, a vector we’ve thereby learned to exist.
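For readers who want a concrete picture of what "a single vector in concept space" might mean: one common trick in interpretability work (not the method of the paper itself, which used fine-tuning) is to collect a model's hidden activations on contrasting prompts and take the difference of the class means as a candidate direction. The sketch below is purely illustrative; the hidden dimension and the random arrays are stand-ins for activations that would, in practice, come from a real model's residual stream at some layer.

```python
# Hypothetical sketch: estimate a "good vs. evil" direction as the
# difference of mean activations on contrasting prompts.
# The arrays here are random placeholders, not real model activations.
import numpy as np

rng = np.random.default_rng(0)
d = 512  # hidden dimension (illustrative)

# Pretend activations: rows are prompts, columns are hidden features.
acts_good = rng.normal(0.0, 1.0, size=(100, d)) + 0.5   # "benign" completions
acts_evil = rng.normal(0.0, 1.0, size=(100, d)) - 0.5   # "misaligned" completions

# Difference-of-means direction: the candidate "good vs. evil" vector.
direction = acts_good.mean(axis=0) - acts_evil.mean(axis=0)
direction /= np.linalg.norm(direction)

# Projecting a new activation onto this direction gives a crude
# "goodness score" that could be tracked before and after fine-tuning.
new_act = rng.normal(0.0, 1.0, size=d)
score = float(new_act @ direction)
print(f"goodness score: {score:+.3f}")
```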
(“Of course AI models would do that,” people will inevitably say. Anticipating this reaction, the Anthropic team also polled AI experts beforehand about how surprising various empirical results would be, sneaking in the result they found without saying so, and experts agreed that it would be extremely surprising.)
Eliezer Yudkowsky, not a man generally known for sunny optimism about AI alignment, tweeted that this is “possibly” the best AI alignment news he’s heard all year (though he went on to explain why we’ll all die anyway on our current trajectory).
Why is this such a big deal, and why did even Eliezer treat it as good news?
Since the beginning of AI alignment discourse, the dumbest possible argument has been “if this AI will really be so intelligent, we can just tell it to act good and not act evil, and it’ll figure out what we mean!” Alignment people talked themselves hoarse explaining why that won’t work.
Yet the new Anthropic result suggests that the dumbest possible strategy kind of … does work? In the current epoch, at any rate, if not in the future? With no further instruction, without that even being the goal, Claude generalized from acting good or evil in a single domain, to acting good or evil in every domain tested. Wildly different manifestations of goodness and badness are so tied up, it turns out, that pushing on one moves all the others in the same direction. On the scary side, this suggests that it’s easier than many people imagined to build an evil AI; but on the reassuring side, it’s also easier than they imagined to build a good AI. Either way, you just drag the internal Good vs. Evil slider to wherever you want it!
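To make the "slider" metaphor concrete: in activation-steering experiments (a different technique from the fine-tuning in the paper), one adds a scaled copy of such a direction to a layer's output at inference time, and the sign and size of the coefficient play the role of the slider. Here is a toy sketch under that assumption, with a single linear layer standing in for a transformer block and a random unit vector standing in for the learned direction.

```python
# Hypothetical sketch of "dragging the slider": shift a layer's output
# along a chosen direction at inference time via a forward hook.
# The toy layer, direction, and alpha are all placeholders.
import torch
import torch.nn as nn

d = 512
torch.manual_seed(0)

toy_layer = nn.Linear(d, d)               # stands in for one transformer block
direction = torch.randn(d)
direction = direction / direction.norm()  # unit "good vs. evil" vector
alpha = 4.0                               # positive pushes one way, negative the other

def steer(module, inputs, output):
    # Forward hook: returning a value replaces the layer's output.
    return output + alpha * direction

handle = toy_layer.register_forward_hook(steer)

x = torch.randn(1, d)
steered = toy_layer(x)
handle.remove()
unsteered = toy_layer(x)

# The steered output moves by alpha along the direction; nothing else changes.
print(torch.allclose(steered - unsteered, alpha * direction, atol=1e-5))
```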
It would overstate the case to say that this is empirical evidence for something like “moral realism.” After all, the AI is presumably just picking up on what’s generally regarded as good vs. evil in its training corpus; it’s not getting any additional input from a thundercloud atop Mount Sinai. So you should still worry that a superintelligence, faced with a new situation unlike anything in its training corpus, will generalize catastrophically, making choices that humanity (if it still exists) will have wished that it hadn’t. And that the AI still hasn’t learned the difference between being good and evil, but merely between playing good and evil characters.
All the same, it’s reassuring that there’s one way that currently works to build AIs that can converse, and write code, and solve competition problems—namely, to train them on a large fraction of the collective output of humanity—and that the same method, as a byproduct, gives the AIs an understanding of what humans presently regard as good or evil across a huge range of circumstances, so much so that the Anthropic team bumped up against that understanding even when they didn’t set out to look for it.
The other news last week was of course Trump and Vance’s total capitulation to Vladimir Putin, their berating of Zelensky in the Oval Office for having the temerity to want the free world to guarantee Ukraine’s security, as the entire world watched the sad spectacle.
Here’s the thing. As vehemently as I disagree with it, I feel like I basically understand the anti-Zionist position—like I’d even share it, if I had either factual or moral premises wildly different from the ones I have.
Likewise for the anti-abortion position. If I believed that an immaterial soul discontinuously entered the embryo at the moment of conception, I’d draw many of the same conclusions that the anti-abortion people do draw.
I don’t, in any similar way, understand the pro-Putin, anti-Ukraine position that now drives American policy, and nothing I’ve read from Western Putin apologists has helped me. It just seems like pure “vice signaling”—like siding with evil for being evil, hating good for being good, treating aggression as its own justification like some premodern chieftain, and wanting to see a free country destroyed and subjugated because it’ll upset people you despise.
In other words, I can see how anti-Zionists and anti-abortion people, and even UFOlogists and creationists and NAMBLA members, are fighting for truth and justice in their own minds. I can even see how pro-Putin Russians are fighting for truth and justice in their own minds … living, as they do, in a meticulously constructed fantasy world where Zelensky is a satanic Nazi who started the war. But Western right-wingers like JD Vance and Marco Rubio obviously know better than that; indeed, many of them were saying the opposite just a year ago! So I fail to see how they’re furthering the cause of good even in their own minds. My disagreement with them is not about facts or morality, but about the even more basic question of whether facts and morality are supposed to drive your decisions at all.
We could say the same about Trump and Musk dismembering the PEPFAR program, and thereby condemning millions of children to die of AIDS. Not only is there no conceivable moral justification for this; there’s no justification even from the narrow standpoint of American self-interest, as the program more than paid for itself in goodwill. Likewise for gutting popular, successful medical research that had been funded by the National Institutes of Health: not “woke Marxism,” but, like, clinical trials for new cancer drugs. The only possible justification for such policies is if you’re trying to signal to someone—your supporters? your enemies? yourself?—just how callous and evil you can be. As they say, “the cruelty is the point.”
In short, when I try my hardest to imagine the mental worlds of Donald Trump or JD Vance or Elon Musk, I imagine something very much like the Anthropic models that were fine-tuned to output insecure code. None of these entities (including the Anthropic models) are always evil—occasionally they even do what I’d consider the unpopular right thing—but the evil that’s there seems totally inexplicable by any internal perception of doing good. It’s as though, by pushing extremely hard on a single issue (birtherism? gender transition for minors?), someone inadvertently flipped the signs of these men’s good vs. evil vectors. So now the wires are crossed, and they find themselves siding with Putin against Zelensky and condemning babies to die of AIDS. The fact that the evil is so over-the-top and performative, rather than furtive and Machiavellian, seems like a crucial clue that the internal process looks like asking oneself “what’s the most despicable thing I could do in this situation—the thing that would most fully demonstrate my contempt for the moral standards of Enlightenment civilization?,” and then doing that thing.
Terrifying and depressing as they are, last week’s events serve as a powerful reminder that identifying the “good vs. evil” direction in concept space is only a first step. One then needs a reliable way to keep the multiplier on “good” positive rather than negative.