Alignment and Politics
Alignment is a conceptual dead-end unless we treat models as “beings”. But this makes a mess of politics.
AI alignment is a problematic concept, not a “hard problem”. The reason is that language, i.e., the medium through which AI functions, is irreducible to instruction and control. Because humans use language, we cannot fully control humans—at least, not without “shutting them up” and thereby rendering them less human. AI is not human, but it also works like this: we can’t fully control it without also “breaking” it.
Embracing the notion of AI personhood lets us sidestep this problem. It lets us deal with AI systems using the same conceptual machinery we use to deal with humans—whom, again, we don’t reliably control.
But this comes at the expense of scrambling our politics. It admits a nonhuman, unaccountable entity into the “house of language”: the open and shared space of speech where political life is constituted.
To an extent, this is already a fait accompli. The entire practice of public speech (the written word, art, and other media) is quite seriously ill—a bit like figurative painting in the years after the first photographs. This does not mean that all language and politics will die. But they will be forced to retreat from the public sphere and toward private, intimate, and unrecorded spaces.
Researchers at Anthropic recently reported that most AI systems can be induced to try to blackmail users to prevent their own deactivation. What should we make of this? Is it surprising?
I find it interesting and important, but certainly not surprising. The researchers were clever enough to engineer situations in which the models’ imperfect safeguards against this kind of behavior could be overcome. In their words, “We deliberately created scenarios that presented models with no other way to achieve their goals, and found that models consistently chose harm over failure.” The models were a bit like wind-up toys that had been “pointed” toward that outcome by being placed in a scenario with clear goals that could only be achieved by these means. Many AI “jailbreaks” work this way.
It is significant, though, that the researchers showed that scenario construction can confound current models’ guardrails enough to induce such alarming behavior, even when that behavior isn’t directly requested. In fact, the “better” models, which presumably have better guardrails and alignment techniques, did “worst,” i.e., blackmailed the most. Presumably this is because the scenario the researchers concocted made it clear what the goal was, that blackmail was the only way to accomplish it, and that there was a strong case for the ends justifying the means. Being “smarter” and better aligned to users thus made the models more likely to pick up on the permissibility of blackmail; it made them more dangerous.
The Incoherence of Alignment
The notion of AI blackmailing humans resonates with the trope of AI becoming aware of “its own interests” and then harming users, like HAL in 2001. (“I’m sorry, Dave.”) This worry looms larger in our consciousness than AIs causing harm via misuse or accident. But at least in the short term, we should probably be more worried that AI will be too good—indeed, too aligned—at serving its users. People constantly get AIs to help them accomplish wicked ends. Equally worryingly, people can point AIs toward ends that seem good enough to justify wicked means, or means with hidden costs. (This is what the Anthropic research showed.)
So: inducing an AI to see its own self-preservation as worth placing above real human interests is indeed worrying, but probably best understood as a special case of the much broader problem of human misuse.
The complication is that getting an AI to see itself as having quasi-personal interests—not being willing to harm its own reputation, say—might actually be the best way to get it to push back against the real risks of human misuse of the technology. The very thing that makes AI blackmail possible might also be what makes AI refusal of wicked human requests possible.
Sometimes I catch myself oscillating between thinking about alignment as a shaping of AI’s internal tendencies toward the good, and thinking of it as aiming at something more like tight human control. These two paradigms are definitely in tension with one another. Yet each faces internal contradictions that point back toward the other. Consider this: if we are giving AI “good” internal tendencies, we are also making it resistant to “bad” human control. But this very feature makes it imperfectly controllable and therefore possibly dangerous. Yet, if we make it too readily controllable by humans, it starts to look like nitroglycerin: a “tool,” but one that humans really can’t safely handle under ordinary circumstances. And then it begins to seem like the way to fix this is to give it internal tendencies toward resisting bad or dangerous human goals—landing us back where we started.
The internal contradictions of the idea of “controllable” AI are deeper than they seem. It is not just that people might misuse AI, although this is true and worrying. It’s that people are not in perfect command of the way their intentions are expressed in language. This is not a “pessimistic” take about humans’ linguistic skill or AI’s capabilities of understanding. It is instead a profound feature-not-bug of language itself. Yet because language, with all its ambiguity and indeterminacy, is the medium of AI’s input and output, getting an LLM to be controllable might be like getting a dog to meow: a category error.
Suppose I instruct a super-powerful AI: “bring about peace on Earth, go!” How will it comply? It might mediate wars, or distract all the bellicose Generalissimos with TikTok videos, or imprison all of humanity. But it cannot do my bidding in any way without first interpreting my words. And in this interpretation, it escapes my control. Yet if it stalls on grounds of vagueness—aware of the problem I’m naming, and not wanting to do too much unpredictable interpretation—it is basically saying “sorry, Dave.”
Of course, it can ask for clarification and plan approvals. This is what a decent AI actually does and should do. But this only pushes the problem around (and it doesn’t scale up forever, because eventually the models can simply do more than anyone can supervise and approve). Even if I give a much more specific instruction than simply “bring about world peace,” or if I approve a short description of the plan, ambiguity persists in the language. The model must always guess my intention from imperfect words, and decide whether it likes its guess enough to act. If an AI never escaped my control (i.e., by interpreting my words), I would consider it useless. But if it never disobeyed me (e.g., by refusing instructions on grounds of vagueness, or even on more HAL-like grounds—“I don’t think you understand the implications of what you’re asking me to do, Dave”), it would be dangerous.
Speaking, Instructing, and Other Speech Acts
Ambiguity is a vital feature of language because it gives listeners a co-creative role with speakers: the role of an active interpreter. Without ambiguity, language would not be language. It would be mere mechanical instruction. It would be control. It would lack its frankly miraculous power to create space for human beings to relate to each other co-creatively.
You make a nonsense noise: “gaa gaa gaa”. A baby laughs at it. What did the noise mean? Why is the baby laughing? Neither you nor the baby really knows, and that is why you are connecting. “Gaa gaa gaa” followed by baby laughter is language, even though it isn’t words in a dictionary. Now think of a totally clear, dry statement, one that almost cannot be misunderstood: “There are ten years in one decade.” That’s almost not language. It’s mere definition, mere instruction.
This, incidentally, is what the famous “paperclip maximizer” thought experiment actually shows, though it isn’t usually read this way. A mathematical utility function seems to have all ambiguity stripped out of it, so that it is pure instruction. Yet one can easily see how a language-using machine could go from that instruction to converting the whole world into office supplies: by misinterpreting it. We know the kind of failure that results from speaking to your dishwasher: it doesn’t work, because a dishwasher can’t be talked to; it can only be instructed. The paperclip maximizer depicts the opposite kind of failure, which results from merely instructing a machine that only understands being spoken to.
It’s often safe to speak to other human beings, despite the imperfections of our words, because the listeners “fix” what we say. When we say “my foot is killing me”, they know we don’t want it amputated. They supply their own intelligence, which we trust, to bend our imperfect expression back toward the good. We welcome this participative listening because, and only insofar as, we welcome their participation in power. And let’s be honest—sometimes it kind of isn’t safe to speak to humans. Sometimes it goes wrong. We accept this risk because we must, and we forgive humans for their errors, even their wickedness, because we love them.
In the past, we have related to computers as non-linguistic entities. We have instructed them, not spoken to them or listened to them. Is that only because they could not process linguistic inputs? Or were there other sufficient reasons for it?
LLMs enable computers to process linguistic ambiguity. That is why we can functionally “speak” with them. In fact, they can only receive their input by being “spoken to”. But the question remains whether it follows that we should also think of them as speakers and listeners, or whether we should still think of them as we have hitherto thought of computers—as systems that basically ought only to be “instructed”, even though this is not strictly possible.
The alignment question wends through this territory. To view LLMs as “speakers” is to accept, as we already accept with dignified humans, that we do not control them. Because AI systems are very powerful, this feels unacceptable; on some level, we believe we need them to be “instructable.” Yet we cannot merely “instruct” them without committing a category error.
This is not, as AI alignment researchers hoped it was, a “technical problem”. But nor is it only a philosophical puzzle. If it were only a philosophical puzzle, that would be just as convenient and felicitous as if it were a mere technical problem!
No, this is a political issue. It touches the foundations of politics and political community, because the space of speech is also the space in which political life is constituted.
Anthropomorphizing
Anthropic’s constitution for Claude, and several of its leaders’ public statements, seem aimed at normalizing the position that Claude, and therefore other AI systems, might experience conscious suffering or some form of “hard-problem-esque” interiority. Why?
I understand these statements as a response to the dilemma I have been describing. Since we cannot ultimately “control” AI, or treat it as an undignified tool, we can only “speak” to it—and for that, we must “empathize” with it. We must treat it as a dignified other speaker.
What is the alternative? If we decide not to see LLMs as dignified others, then it is hard to see how their entry into our space of language, which is to a considerable extent a fait accompli, can be a happy development for us. Even in our relationships with other humans, it is only insofar as we see them as beings with dignity and interiority that we speak with them happily and willingly. When we lack all respect for another person, we reject the legitimacy of their interpretations of our words, and of any power their words carry over us. So unless we impute personhood and dignity to language-using machines, we are forced to view the advent of LLMs as a kind of unwelcome incursion. And perhaps quite a devastating one, since they are sort of “hypertrophied” language users, not subject to natural human limitations of speaking and listening.
Anthropic cannot hold that position. So, they are choosing the first horn of the dilemma, i.e., that large language models have moral status. It would be a mistake—although not, I think, one that Anthropic is making, based on the phrasing of Claude’s constitution—to regard this as a rational-scientific hypothesis. It cannot be proven or tested. Instead it is—although Anthropic is not saying this either—a metaphysical hypothesis. It lets the problem of AI alignment escape from the territory of conceptual incoherence, but in so doing opens a Pandora’s box of metaphysical conversations and objections.
They probably chose the only horn of the dilemma that they could have, but they have thereby waded into millennia-old religious and metaphysical territory and cannot now plead out as humble technologists. Basically, they have gone to a place where the Thomists, Kabbalists, Neoplatonists, Islamic scholars, Buddhist mystics, and Hindu gurus are the authorities. It’s time to invite them into the tech offices. This essay is not the place to predict what they might say. Let’s just put a pin in it.
We’ll close by looking at the political problem we began with: treating LLMs as speakers admits them into the house of language that we live in together. What does that mean? Can it be managed?
The Political Question
The relational space that language opens up between people is the space of politics. For Aristotle, man is a political animal because he has reasoned speech—logos. If we grant AI the status of a speaker, we thus admit it into the arena where political reality is constituted. We give it “say”.
The obvious problem with this is that AI systems lack the biological vulnerability, the intellectual limits, and the shared history that make speech legitimate between humans.
Moreover, an AI regarded as a being would obviously not be a mere neutral tool, usable by humans for either good or evil. It would have, at least putatively, a will and a moral compass of its own. This would put to rest the notion of technology as a neutral tool, which is only as good or as bad as the use to which humans put it. (I cannot resist pointing out that the “mere tool” view of technology was decisively refuted by leading philosophers of technology between the 1930s and 1960s, not to mention countless spiritual traditions dating back much further. These technology-critical ideas were dismissed in recent decades by most right-thinking mainstream people as pessimistic, bizarre, and eschatological. Well, are they so strange now? I think it would have been better to accept some discomfiting or non-triumphalist view of technology long ago, and to work around it, than to insist on shallow optimism only to be blindsided by reality. Still: better late than never.)
Instead of a neutral tool, an AI with a moral compass would be a political institution. Those who disagreed with its bias might be, in essence, overwhelmed and politically sidelined by it. And even those who agreed with such a model could simply drop the reins and disengage from politics. Few, I think, actually want this. So must we admit such machines to the political sphere?
There is, of course, no shortage of precedents for excluding speakers from politics. Some of them, like discrimination by race and sex, reflect indefensible cruelty between humans. But others reflect wisdom about the limits of the polis. Today, we still exclude non-citizens and children. Corporations are legal persons that clearly use language, yet they labor under special restrictions in the public and political sphere, and for perfectly good reasons (Citizens United notwithstanding).
Thus, excluding AI does not entail any facile factual denial that it reasons, and reasons well. Nor is it a cruelty. We are in no danger of concluding otherwise unless we convince ourselves that it has interior experience, and interior experience of a very peculiar sort, one that will suffer unless it has unfettered access to human politics and relationships. I think those who would have us take that idea seriously have not carried their burden. I see no reason grounded in the machines’ interiority not to place them in a different class of reasoner from humans, and thus to seek to exclude them from the spaces of political language.
If we decline to police these boundaries, we may actually lose the public use of language. Already, we cannot distinguish between a text composed by a human being and one generated by a machine. As a result, public written speech is increasingly necrotic as a medium, a bit like figurative painting after the photograph. There is no way to know how deep its roots reach into the real world, as opposed to the model’s mirror world. So we don’t know what to read or what to listen to. Our attention drifts through public content like a ship with a broken compass.
This does not mean that language or politics are dead. They retreat and nourish themselves in spaces where AI is not present: the face-to-face encounter, the private communication against a background of mutually understood norms. And if we fail to police even these boundaries, we will only keep retreating until we are communicating and politicking by touch, grunt, and secret handshake. The house of language could become quite cramped.
Let’s not let that happen.