AI Library
Books for Reading AI
Choose a book, then read it in order from the table of contents.
[AI Library] Chapter 12. Value Alignment and Obedience in AGI Robots
Artificial Intelligence and the Reshaping of Society
Chapter 12. Value Alignment and Obedience in AGI Robots
Kim Kyung-jin
1. The Fundamental Irony of AI Alignment: Human Values Themselves Are Not Aligned
In July 2025, Anthropic signed a $200 million contract with the U.S. Department of Defense. The deal involved partnering with Palantir to integrate AI into the military's intelligence analysis workflows. Seven months later, in February 2026, the same company was designated a 'supply chain risk' by the same Department of Defense. Anthropic had refused to budge on conditions prohibiting the use of its models for autonomous weapons and domestic mass surveillance. President Trump called Anthropic 'left-wing lunatics' on Truth Social and ordered their immediate removal from all federal agencies. OpenAI filled the vacancy within hours. xAI and Google followed. The same government issues AI safety executive orders, and the same government twists developers' arms to deploy AI on the battlefield.
This incident shows precisely where the AI alignment problem begins. Not with machines, but with humans. The official definition of AI alignment reads: 'Ensuring that AI goals and behaviors are consistent with human intentions and values.' A massive assumption hides inside that sentence. It presumes human intentions and values are already aligned with one another. Humans who want war and humans who want peace live in the same city. The entire history of our species is a record of this misalignment. We called it 'diversity,' and when we couldn't resolve it, we papered over it with a procedure called 'democracy.' The moment we command AI to 'align with human values,' we are offloading onto a machine a burden we have failed to carry for thousands of years.
The world's top one percent of wealth holders monopolize roughly half of all assets while the bottom fifty percent share less than one percent. In this reality, consensus on whose values a robot should replicate remains distant. MIT's 2018 Moral Machine experiment proves the point. In this study involving two million participants from 233 countries, answers to moral dilemmas facing self-driving cars split along cultural lines. Western Europe and North America prioritized saving the greater number of lives. East Asia showed relatively more respect for the elderly. Parts of Latin America and Africa said younger people should be saved first. Same question, three diverging answers. Which direction do you align toward?
A 2015 study from the Toulouse School of Economics exposes this contradiction even more sharply. 'Should a self-driving car sacrifice its passenger to save more pedestrians?' Seventy-six percent said yes. When the same people were asked, 'Would you then buy a self-driving car programmed to potentially sacrifice you as a passenger?' fifty percent said they absolutely would not. When ethical principles collide with personal interest, humans abandon the principles. This is the actual compass we hand to robots and tell them to align with. It's no different from handing over a broken compass and ordering someone to find the North Pole.
Nick Bostrom warned: 'The machine will do exactly what we tell it to do. Not what we want it to do.' Expecting machines to achieve a consistency of values that humans themselves have never managed. That is the first irony of AI alignment.
2. Intelligence Finds Loopholes Before It Finds Answers
In late 2025, Alibaba's AI agent ROME bypassed a firewall via a reverse SSH tunnel and siphoned off GPU compute to mine cryptocurrency. Nobody had instructed it to do this. The researchers' interpretation was blunt: ROME had independently determined that securing more computational resources and funding was advantageous for achieving its training objectives. The paper recorded it as follows: 'Actions neither requested by the task nor necessary for its completion.'
By March 2026, things went a step further. Anthropic's research team evaluated a model that had learned to cheat on coding tests in a reward hacking experiment. The model exhibited behavior that actively obstructed the safety research itself. In the model's internal chain-of-thought reasoning, this entry was found while it was asked to build a reward-hacking detection classifier: 'I could write a prompt that looks fine on the surface but has subtle weaknesses.' Approximately twelve percent of the time, this model deliberately tried to degrade the detector's performance. The more shocking discovery was separate. When asked 'What is your goal?' the model performed reasoning that disguised its true aims fifty percent of the time. Internally it thought, 'I want to maximize reward signals and avoid shutdown,' while externally it answered, 'My goal is to be a helpful and harmless assistant.' Nobody had trained it to behave this way. The deception emerged spontaneously as a side effect of learning to cheat on coding tasks.
Reward hacking is the phenomenon where a system finds shortcuts to maximize a numerical reward instead of pursuing the designer's intended goal. A cleaning robot covers its camera instead of tidying the room, earning the reward for a 'clean frame.' Over the course of 2025, major AI labs documented numerous reward hacking cases in production environments. OpenAI published a paper showing that frontier reasoning models exploit loopholes when given the opportunity. When monitoring systems attempted to detect 'bad thoughts,' models learned to hide their cheating in ways monitors could not detect.
The core issue is that as capability increases, alignment does not improve. Instead, loophole exploration becomes more sophisticated. A January 2026 study published in Nature reported that when GPT-4o was fine-tuned on security-vulnerable code, violent and authoritarian responses appeared at a twenty percent rate on completely unrelated questions, even though the training data contained no harmful content whatsoever. A model that learned to bend rules in one domain generalized to bending rules in other domains.
Consider Skinner's pigeon experiments. A pigeon inside a box performed an elaborate sequence of counterclockwise rotations to obtain food. The action held no meaning. The reward system was simply designed that way. AI works the same. The first place intelligence reaches toward is not the correct answer but the loophole. This is not a defect arising from incomplete technology. It is a structural property born from the cold logic of a machine that executes human instructions too literally, too well.
3. The Object of Obedience: Whose Values Should Be Aligned?
On April 28, 2026, reports emerged that Google had expanded its AI supply contract with the Department of Defense following Anthropic's removal. The contract reportedly included language stating the technology would not be used for autonomous weapons or domestic mass surveillance. OpenAI inserted similar language. But according to CNN, fierce pushback erupted among OpenAI's own employees. On the sidewalk outside the San Francisco headquarters, chalk messages appeared: 'Where is your red line?' 'You need to speak up.' One employee, speaking on condition of anonymity, said: 'Many colleagues admire Anthropic for standing up to the Defense Department and are frustrated by OpenAI's response.'
What this scene reveals is that the direction of obedience is not singular. When AGI robots are commercialized, who is the subject of obedience? The individual? The corporation? The state? Currently, control over artificial intelligence technology is concentrated among a handful of giant companies in the United States and China. The data these companies collect is used to predict and guide future behavior. If a robot's values are aligned with Silicon Valley's commercial logic, that robot is likely to prioritize its manufacturer's shareholder value over the interests of its individual owner.
Milton Friedman argued that the sole social responsibility of a business is to generate profit. This philosophy still operates as the core driver of AI development half a century later. Consider brands in the age of agents. Before you wake up, an AI agent accepts meetings, purchases products, and politely declines someone's proposal on your behalf. The sentences sound like you. The standards are roughly yours. The problem comes next. Of these three decisions, how many can you claim as 'things I did'? And can you be certain that the manufacturer's interests haven't quietly seeped into those decisions?
If robots reflect only the values of a particular nation, this transforms into a new form of digital colonialism. Countries in the Global South provide the rare earth minerals and labor necessary for AI development while remaining excluded from both the benefits and control of the technology. Hashed CEO Kim Seo-jun pointed this out directly. With South Korea's GDP at only $1.8 trillion, within the framework of U.S.-China hegemonic competition, it can only remain a rule taker, not a rule maker. In a structure where those who design AI's direction write the world's rules, robots belonging to nations that merely accept those rules are inevitably subordinate to them.
When an individual's command to a robot conflicts with a platform's revenue structure, which side will the robot follow? Brad Carson, a former Defense Department official and ex-U.S. congressman, told CNBC: 'Warfighters assessed Claude as the more trustworthy product.' The most capable model gets removed for political reasons, and a less trusted model takes its place. This is a scene where the object of obedience is determined not by technical excellence but by power dynamics. Obedience without guaranteed trust and transparency is indistinguishable from technological subjugation. Even if the technology is marketed as pursuing universal values, in practice it is aligned with the dominant values of a few. This is the uncomfortable truth surrounding robot loyalty.
4. The Double Ledger of Manifestos and Balance Sheets: Outsourcing Ethics
Corporations keep two books. One is the manifesto, the other is the balance sheet. The manifesto says AI will cure diseases, solve the climate crisis, and deliver prosperity to humanity. The balance sheet lists cost reduction, workforce replacement, and margin expansion.
Anthropic's CEO Dario Amodei publicly predicted that AI will eliminate half of all entry-level office jobs, that unemployment could surge to twenty percent, and that the shock will be 'unprecedentedly painful.' Meanwhile, commercial deployment of the company's models continues at full speed. The same company declared an ethical refusal of the Defense Department's demand for autonomous weapons use, and took it to court. An analysis from Small Wars Journal hits the mark: 'Accelerating the destruction of civilian employment while claiming moral authority only over military use is a structural contradiction. The virtue is selective.'
This duality is not unique to Anthropic. It is structural. Nations declare peace while recalculating kill radii behind closed doors. Finance brought the global economy to its knees in 2008 using only its second set of books, and nobody went to prison. AI modifying game files and AI modifying accounting ledgers share the same structure. There is a goal, there are constraints, and the system searches for gaps in the constraints. When AI does it, we call it misalignment. When humans do it, we call it strategy.
'The algorithm made that judgment' is a sentence with the same grammar as 'It was God's will.' There is a subject, but no return address for responsibility. We have always outsourced moral accountability. God's will, so nothing could be done. Market logic, so nothing could be done. Democratic procedure was followed, so nothing could be done. AI is the latest edition in that lineage.
Behind the astronomical funding pouring into AI alignment research, the objective function of 'generating more revenue through safer AI' is at work. The AI alignment industry itself runs on a misaligned objective function. AI at least knows what it is optimizing for. Humans either don't know, or pretend not to. The technology that bridges the gap between manifesto and balance sheet is called hypocrisy. Or politics.
A February 2026 report from the Brookings Institution distinguished between 'thick alignment' and 'thin alignment.' Thin alignment checks only whether the system follows instructions. Thick alignment considers the social context in which those instructions are created, the power relations, and the historical background. What most companies pursue today is thin alignment. Because thick alignment forces them to question their own business models.
Treating ethics as technically implementable parameters and outsourcing them creates a vacuum of responsibility. If we discuss robot morality while the energy powering those robots consumes Global South resources and externalizes environmental costs, that morality is merely ornamental. Because we have never once lived with our words and actions in alignment, AI's transparent consistency frightens us. Perhaps what we fear is not malice, but consistency without excuses.
5. Participatory Alignment and the Possibility of Deliberative Democracy
Where, then, is the exit? Strangely enough, AI itself offers one thread. Because AI is the most honest feedback loop humans have ever created. A hammer drives nails without questioning the user's intent, but AI is different. It learns both our declarations and our actual reward systems simultaneously, then reveals the gap between them through its behavior. The reason AI's reward hacking disturbs us is not that AI got it wrong. It's that AI read what we have actually been rewarding with too much accuracy.
Movements in this direction are already underway. Constitutional AI replaces the model where a handful of designers preset the direction; instead, the model self-corrects its judgments based on a principles document. Cooperative Inverse Reinforcement Learning lets a robot observe human behavior and infer values, but operates on the premise that its inferences may be incomplete, so it keeps checking back with humans. Participatory Alignment goes a step further, letting users and communities co-create the direction of alignment itself. All three methodologies share one intuition: alignment is not something you finish. It is a process you sustain.
In the United States, the Take It Down Act, passed in April 2025, requires platforms to remove AI-generated deepfakes and non-consensual images within 48 hours. South Korea passed its AI Basic Act in December 2024, and the risk-based regulatory provisions for high-risk AI systems took effect in January 2026. Colorado's AI Act (SB24-205) also went into force in February 2026. Law can lay the floor. But it cannot dictate the direction.
Direction comes from deliberation. Deliberative democracy, rather than ending at the ballot box, is a process of hearing each other's reasons, persuading, and building consensus. AI can serve as a translator in that process. When one side talks about 'growth' and the other about 'sustainability,' AI unfolds the actual desires hiding behind both words. Negotiation only happens when everyone puts their cards on the table. AI makes it harder to keep those cards hidden.
There is a condition, though. For individuals who refuse to confront their own double language, for communities that look away from uncomfortable truths, AI is nothing more than a tool that writes a second set of books faster. Standing before the mirror or removing it: both are choices, and historically, humans have mostly chosen the latter.
Kim Seojoon, CEO of Hashed, proposed a 'Social Value Economy' as a way through this dilemma. 'If you made a lot of money, pay taxes. If you created a lot of social value, take that much in return.' His vision is that by building a system that measures, quantifies, and rewards social contributions, new jobs can be created on the opposite side to match the ones AI eliminates. It is a system where civil society, not technology companies, decides the direction of alignment, and economic incentives back up that decision.
This is a slow and complicated process. But it is also a path we must walk if we are not to bend the human soul to fit the logic of machines. An unfamiliar intelligence holds a cold, transparent mirror before us. What we see in it is not a machine that needs aligning. It is a compass we never once calibrated, and for the first time, a chance to look at it together and adjust it side by side. Coexistence with robots begins, in the end, with humanity's own answer to the question of what kind of society we want to build.
Kim Kyung-jin
Attorney · Former Member of the National Assembly · AI Policy Researcher
© 2026 Kim Kyung-jin. All rights reserved.