Demis Hassabis, Father of Google's Artificial Intelligence
Part 5. The Match of the Century, AlphaGo and Humanity
14 AlphaGo Zero and AlphaZero
Kim Kyung-ran, Kim Kyung-jin
AlphaGo Zero, which learned on its own without any game records, crushed AlphaGo 100 to 0 just three days after starting from a blank slate. In 2017, at DeepMind's headquarters near King's Cross in London, an eerie stillness hung over David Silver's monitor. Just a year earlier, AlphaGo Lee, which had played the match of the century against Lee Sedol 9-dan in Seoul, was a massive intelligence raised on hundreds of thousands of human game records. But the starting point of the new version Silver and Hassabis were now watching, 'AlphaGo Zero,' was absolute nothingness.
The system would need no human help. In fact, the team formed the bold hypothesis that human knowledge could be an impurity, one that sets the limits of artificial intelligence. Hassabis approved this project and instructed the research team not to input a single human game record.
All AlphaGo Zero was given were the most basic rules: the size of the board and the conditions for winning. It was like a child who had never seen a game of Go in its life sitting down in front of a board. The first stone was placed.
It was erratic. Stones landed in meaningless positions. It would build territory and then tear it down. In its early stages, AlphaGo Zero played worse than a Go beginner.
One might reasonably doubt whether this clumsy neural network, placing stones at random without even grasping wins and losses, could ever surpass its predecessor that had beaten the strongest human player. Three hours after training began, a remarkable change was detected. The system that had been placing stones randomly started to grasp how to connect them.
It had discovered on its own the foundations of joseki that humans had established over thousands of years. At the 36-hour mark, the atmosphere shifted. AlphaGo Zero was surpassing the skill level of the version that had beaten Lee Sedol 9-dan 4 to 1. Despite having learned nothing from human knowledge, through nothing but battling itself, it had broken through the peak of the human world.
The most shocking result came 72 hours after training began, exactly three days in. The DeepMind researchers pitted the three-day-old AlphaGo Zero against the Lee Sedol version.
The result was devastatingly one-sided. 100 to 0. It didn't lose a single game. AlphaGo Zero even used far less computing power than the older version, which had been trained and tuned for months in preparation for the match against Lee Sedol 9-dan.
The older version used 48 TPUs for distributed processing, while AlphaGo Zero ran on just 4 TPUs. With fewer resources, in less time, and without using a single piece of human data, it achieved a perfect victory. This event was like a 'declaration of independence for intelligence.'
Until then, artificial intelligence had been at the level of mimicking patterns in data created by humans. Language models like ChatGPT ultimately learn from text on the internet. But AlphaGo Zero went beyond the boundaries of human experience.
Hassabis reflected on this, saying, 'We were trapped within the limits of human knowledge.' Training on human game records meant also teaching the system the mistakes and biases that humans make. By discarding human data, AlphaGo Zero was able to get closer to truths humans had never reached, or to the moves that a 'God of Go' would play.
Three days was symbolic. The tower of wisdom that humanity had built over 3,000 years since inventing the game of Go was reconstructed and surpassed in just 72 hours by an algorithm running on silicon chips. This became an important milestone on the path to AGI that Hassabis had dreamed of.
It proved that in domains where human data is scarce or nonexistent, such as discovering new materials or determining protein structures for incurable diseases, AI could find solutions on its own. The scoreboard reading 100 to 0 was a signal flare announcing that the paradigm of intelligence had shifted from 'imitation' to 'discovery.'
The meaning of achieving superhuman levels through self-play alone
The core keyword explaining AlphaGo Zero's achievement is 'self-play.'
It can be compared to a lonely practitioner endlessly crossing swords with their own reflection in a mirror. The previous machine learning approach, supervised learning, was a method where a teacher tells a student the correct answer. Human expert game records would teach, 'In this situation, it's good to play here.' But self-play is different. There is no teacher, and there is no textbook.
Only the goal of winning exists. The process, which demonstrated the essence of reinforcement learning, is brutally simple. AlphaGo A takes the black stones, and a cloned AlphaGo B takes the white stones. They fight fiercely. When one wins, the winner's strategy is reinforced and the loser's strategy is corrected.
The slightly smarter AlphaGo then plays against itself again. This process repeats millions, tens of millions of times. Today's self beats yesterday's self, and tomorrow's self will beat today's self, competing in a spiral that pulls skill ever upward. The philosophical significance of this approach is profound.
Hassabis called this 'Tabula Rasa,' learning from a blank slate. Just as John Locke argued that the human mind is like a blank sheet at birth, AlphaGo Zero started from nothing. This allowed the AI to develop original strategies unconstrained by human conventions or preconceptions.
During training, AlphaGo Zero did discover moves that humans had considered joseki, but once it passed a certain level, it discarded those established patterns and began playing entirely new moves. It proved that moves humans had classified for thousands of years as 'bad moves' were actually brilliant plays operating at a higher dimension. From a technical standpoint, AlphaGo Zero achieved an innovation by merging two neural networks into one.
The previous version separately trained two networks: a 'Policy Network' for predicting the next move and a 'Value Network' for evaluating the current board position. But AlphaGo Zero integrated both into a single large neural network with two outputs. This resembled the way a strong human player's brain works, where intuition and reading ahead happen simultaneously rather than separately.
By judging both where to play and how advantageous the position is in a single thought process, it maximized efficiency.
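The idea of one network answering both questions can be illustrated with a deliberately tiny sketch. It assumes a single shared hidden layer with random, untrained weights; the class name `TinyDualHeadNet` and all sizes are hypothetical, and the real system used a far deeper network. The only point shown is that the shared part is computed once and then feeds both a move distribution and a position score.

```python
import math
import random


class TinyDualHeadNet:
    """Toy stand-in: one shared hidden layer feeding a policy head and a value head."""

    def __init__(self, n_inputs, n_hidden, n_moves, seed=0):
        rng = random.Random(seed)

        def matrix(rows, cols):
            return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

        self.w_trunk = matrix(n_hidden, n_inputs)   # shared by both heads
        self.w_policy = matrix(n_moves, n_hidden)   # "where to play"
        self.w_value = matrix(1, n_hidden)          # "how good is this position"

    @staticmethod
    def _matvec(weights, vec):
        return [sum(w * x for w, x in zip(row, vec)) for row in weights]

    def forward(self, board):
        # Shared representation, computed once for both outputs.
        hidden = [math.tanh(h) for h in self._matvec(self.w_trunk, board)]
        # Policy head: softmax turns raw scores into a probability per move.
        logits = self._matvec(self.w_policy, hidden)
        peak = max(logits)
        exps = [math.exp(score - peak) for score in logits]
        total = sum(exps)
        policy = [e / total for e in exps]
        # Value head: a single number squashed into the range -1 to 1.
        value = math.tanh(self._matvec(self.w_value, hidden)[0])
        return policy, value


# A made-up 3x3 "board": 1 = own stone, -1 = opponent stone, 0 = empty.
net = TinyDualHeadNet(n_inputs=9, n_hidden=16, n_moves=9)
policy, value = net.forward([0.0, 1.0, -1.0, 0.0, 0.0, 0.0, 1.0, 0.0, -1.0])
```

Because the weights here are random, the outputs are meaningless as Go advice; training is what would give the two heads their 'intuition.'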
The Monte Carlo Tree Search (MCTS) algorithm was also coupled more tightly with the neural network. Earlier versions judged a position partly by playing many fast random simulations out to the end of the game. AlphaGo Zero dropped these playouts: its neural network had such excellent intuition that the search could rely on the network's own evaluation of a position without needing to play things out to the end. This was essentially a mathematical implementation of the sensation a human player has when they feel, 'I can tell at a glance that's the vital point.'
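How the search leans on the network can be sketched in a few lines. The selection rule below follows the general form used in the AlphaGo Zero family, an average value plus an exploration bonus weighted by the network's prior, but the constant `c_puct=1.5`, the dictionary layout, and the example numbers are assumptions made for illustration. The sketch also ignores the sign flip needed when the player to move alternates.

```python
import math


def puct_score(q, prior, parent_visits, child_visits, c_puct=1.5):
    """Average value so far plus an exploration bonus weighted by the network's prior."""
    return q + c_puct * prior * math.sqrt(parent_visits) / (1 + child_visits)


def select_move(children, c_puct=1.5):
    """children maps move -> dict with 'prior', 'visits', 'value_sum'."""
    parent_visits = sum(c["visits"] for c in children.values())

    def score(move):
        c = children[move]
        q = c["value_sum"] / c["visits"] if c["visits"] else 0.0
        return puct_score(q, c["prior"], parent_visits, c["visits"], c_puct)

    return max(children, key=score)


def backup(child, leaf_value):
    """Record a leaf evaluation that came from the value output, not from a random playout."""
    child["visits"] += 1
    child["value_sum"] += leaf_value


# Hypothetical statistics for three candidate moves at one position.
children = {
    "a": {"prior": 0.6, "visits": 10, "value_sum": 2.0},
    "b": {"prior": 0.3, "visits": 2, "value_sum": 1.2},
    "c": {"prior": 0.1, "visits": 0, "value_sum": 0.0},
}
chosen = select_move(children)           # picks "b" with these numbers
backup(children[chosen], leaf_value=0.4)  # 0.4 stands in for a network evaluation
```

Move "b" wins here because it has a good average so far and has been visited little, so the search spends its next look there rather than on the already well-explored "a."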
Achieving superhuman levels through self-play sent a shock through the scientific community that went far beyond having built a machine that plays Go well. It seemed to herald 'the end of data.' In the big data era, data was called the new oil.
Companies like Google and Facebook believed that monopolizing data was power. But AlphaGo Zero showed that intelligence can explode even where there is no data. As long as the rules are clear, AI can generate its own data, train itself on that data, and surpass humans.
This was a hopeful message for the scientific problems we ultimately want to solve, such as drug development and nuclear fusion control. In these fields, quality data is scarce or experiments are extremely expensive. If AI could find optimal molecular structures or plasma control methods through self-simulation alone in virtual environments, the pace of scientific progress would accelerate exponentially. In the end, AlphaGo Zero's self-play took place on a Go board, but its reverberations were a grand experiment probing the possibility of expanding human intellect, reaching beyond laboratories and research institutes.
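The self-play loop described in this section, in which the system generates its own games and learns from who won, can be sketched on a game small enough to fit in a few lines. This is a toy under stated assumptions, not AlphaGo Zero's training procedure: the game is a seven-stone take-away game (take 1 or 2, whoever takes the last stone wins), the 'strategy' is a plain table of move weights rather than a neural network, and the update factors 1.1 and 0.9 are arbitrary.

```python
import random


def legal_moves(pile):
    return [m for m in (1, 2) if m <= pile]


def play_one_game(weights, rng, start_pile=7):
    """Both sides sample moves from the same weight table (the system plays itself)."""
    pile, player = start_pile, 0
    history = {0: [], 1: []}
    while True:
        moves = legal_moves(pile)
        move = rng.choices(moves, weights=[weights[(pile, m)] for m in moves])[0]
        history[player].append((pile, move))
        pile -= move
        if pile == 0:
            return history, player  # whoever takes the last stone wins
        player = 1 - player


def train(n_games=2000, seed=0):
    rng = random.Random(seed)
    weights = {(p, m): 1.0 for p in range(1, 8) for m in legal_moves(p)}
    for _ in range(n_games):
        history, winner = play_one_game(weights, rng)
        for player, steps in history.items():
            # Reinforce the winner's choices, weaken the loser's.
            factor = 1.1 if player == winner else 0.9
            for key in steps:
                # Clamp so no weight collapses to zero or grows without bound.
                weights[key] = min(max(weights[key] * factor, 1e-3), 1e3)
    return weights


weights = train()
```

No example games are supplied; the only inputs are the rules and the win condition. With two stones left, taking both always wins and taking one always loses, so after training the table prefers `(2, 2)` over `(2, 1)`.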
AlphaZero: Mastering multiple domains with a single system
In late 2017, the atmosphere at the DeepMind lab was once again filling with quiet excitement. Right after AlphaGo Zero conquered Go, Hassabis and David Silver posed a fundamental question: 'Is this algorithm specialized only for Go?' If so, it was not intelligence in the true sense.
True intelligence must have generality. A single principle must be able to solve multiple problems. They began stripping away the specifics of 'Go' from AlphaGo Zero's algorithm.
They erased from the code the rotational symmetry of the Go board and rules specific to Go. In their place, they filled in more abstract and general learning structures. What was born from this process was 'AlphaZero.'
The challenge this time was the big three of board games: Go, chess, and shogi (Japanese chess). These three games are completely different in character. Go is a game of placing stones, while chess and shogi are games of moving pieces. Shogi is more complex than chess because captured pieces can be reused.
In the past, each game required its own separate tuning and algorithms. Chess engines needed chess-specific knowledge, and Go programs needed Go's territory-counting methods. But AlphaZero declared it would conquer all three with a single algorithm and a single neural network architecture.
The experiment began. The results were beyond astonishing, bordering on terrifying. Starting from a blank slate, AlphaZero surpassed Stockfish, the world's best chess engine at the time, in just 4 hours of learning.
Stockfish was, in every sense, a 'chess machine' that human developers had meticulously refined over decades, encoding all of chess knowledge into its code. Against Stockfish, which calculated 70 million positions per second, AlphaZero won while calculating only 80,000 positions per second. It dominated not through brute-force calculation but through intuition and strategic judgment.
It was the same story in shogi. AlphaZero defeated the world champion program 'Elmo' in just 2 hours. In Go, it surpassed AlphaGo Lee, the version that had beaten Lee Sedol, in 8 hours.
In less than 24 hours, three of the most complex board games humanity had created were conquered by a single algorithm. This deserved to be recorded as one of the most important moments in the history of artificial intelligence. It proved that even without humans hand-crafting features (Feature Engineering) for specific problems, AI could become the best in any domain through general learning ability alone.
Hassabis described it as 'like traveling through different universes with a single algorithm.' In the universe of Go, the universe of chess, and the universe of shogi, AlphaZero figured out the physics of each world on its own and became its ruler. The legendary chess champion Garry Kasparov could not contain his admiration after seeing AlphaZero's games: 'We thought machine chess would be computational and rigid.
But AlphaZero plays as if an alien were playing, creatively, and at times even romantically.'
AlphaZero, unlike existing engines that prioritize material score, showed a style of play that boldly sacrificed pieces for long-term positional advantage. This was a new form of intelligence that combined human intuition and machine precision, or perhaps transcended both. AlphaZero didn't just win games; it rewrote the 'method' of playing them.
A test-bed for AGI
The world celebrated the victories of AlphaGo and AlphaZero, but Demis Hassabis was looking beyond the board. In interviews and lectures, he repeated to the point of exhaustion:
'Our goal is not to build an AI that plays games well. Games are merely a test-bed.' This sentence best captures DeepMind's identity. Why games, of all things? For Hassabis, games were a microcosm of the real world, the safest and most efficient laboratory.
The real world is far too complex and noisy. Results are hard to measure, and experiments cannot be repeated indefinitely. If you make a robot fall in the real world to train it, the robot breaks. You can't destroy the Earth to test a climate model.
Games have clear rules, an unambiguous objective in winning and losing, and above all, they can be simulated infinitely fast in virtual space. Through his childhood experience as a chess champion and video game developer, Hassabis instinctively knew that games were the optimal tool for measuring and training intelligence. AlphaZero proved this 'test-bed' theory perfectly.
Go, chess, and shogi are complex systems with different rules. If a single algorithm can interpret all these disparate systems and find optimal solutions, that algorithm could also be applied to other complex systems beyond games. This is the key to reaching AGI.
Hassabis was convinced that the 'general learning ability' demonstrated by AlphaZero could be applied directly to scientific challenges such as protein folding and novel material discovery. After the AlphaZero project was successfully completed, a large portion of DeepMind's core personnel moved to science projects.
The neural networks that had read patterns on a Go board now began reading amino acid sequences. The search algorithms that had predicted chess piece movements were now used to simulate how proteins fold in three-dimensional space. Without AlphaGo and AlphaZero, the birth of AlphaFold, which earned a Nobel Prize, would have been impossible. The reinforcement learning and Monte Carlo Tree Search techniques accumulated through games, and above all, the confidence that 'it's possible without human knowledge,' transferred into the scientific domain.
While many people remember AlphaGo as 'a machine that plays Go,' Hassabis saw it as a prototype of a 'general problem-solving machine.' He defined DeepMind's mission as 'Solve Intelligence,' and AlphaZero proved, both mathematically and empirically, that the first step of that mission, a 'general algorithm,' was possible. The game was over.
But the intelligence forged through those games was now walking out of the laboratory to fight the real problems of the real world.
MuZero: An AI that learns even the rules on its own
Planning-based learning that builds its own model of the environment
AlphaGo Zero and AlphaZero achieved remarkable things, but they still had a critical limitation. It was the assumption that they 'knew the rules.' Go, chess, and shogi have perfectly defined game rules.
Where stones can be placed and how pieces move were told to the AI in advance. On top of that, these are 'perfect information games,' in which both players can see the entire board at all times. But the real world that Hassabis was looking at is different. We don't receive a rulebook for living life.
Stock market fluctuations, weather changes, the dynamics of human relationships have no clear rulebook. To be true AGI, a system must be able to learn and plan even without knowing the rules. In 2019, DeepMind released 'MuZero' to address this challenge.
MuZero's defining characteristic is that it is not even told the rules of the game. It is shown a board, but not told how to place stones, how to win, or even whether it's playing Go or chess. MuZero must figure out how the world works using only what it observes, such as the pixel information displayed on screen in a video game, and the rewards that follow its actions.
This was possible because MuZero opened a new frontier in 'Model-Based Reinforcement Learning.' Previous AI systems looked at the current screen and immediately decided on an action. But MuZero 'imagines,' like a human.
When a person goes somewhere unfamiliar, they don't calculate every physical law. Instead, they run a simplified model in their head: 'If I go this way, there should be a path,' or 'If it rains, the ground will be slippery.' MuZero works the same way. Instead of processing all the complex information in its environment, it abstracts only the key elements necessary for decision-making and builds its own 'internal model.' Within this internal model, MuZero asks and answers three questions for itself.
First, 'What is the current state?' (State). Second, 'If I take a certain action, how will the state change?' (Dynamics). Third, 'How good is that state for me?' (Value). The remarkable thing is that the internal model MuZero learns doesn't need to match the actual rules of the game exactly. For example, when playing an Atari game, how the clouds in the background move has nothing to do with the score. MuZero focuses only on information relevant to scoring, like 'character position' and 'obstacles,' and ignores the rest.
This is identical to how humans filter out unnecessary information when perceiving the world. Having learned the rules on its own this way, MuZero achieved superhuman performance not only in Go, chess, and shogi, but also in 57 Atari video games. It was particularly shocking that a single algorithm conquered both strategic board games requiring planning and video games requiring visual reflexes.
Previously, algorithms good at 'planning' (the AlphaGo family) and algorithms good at 'reacting' (the DQN family) were separate. MuZero unified both. The ability to find order in chaos without knowing the rules and simulate the future, this was the form of intelligence closest to the essence of survival in a wild environment.
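The three questions MuZero asks of its internal model (state, dynamics, value) can be made concrete with a toy planner. Everything here is an assumption for illustration: the three functions are written by hand, whereas MuZero learns them as neural networks; the observation fields (`player_x`, `goal_x`, `cloud_x`, `frame`) are invented; and the lookahead is a small exhaustive search, where MuZero uses Monte Carlo Tree Search.

```python
def represent(observation):
    """'What is the current state?' Keep only what matters for scoring."""
    return (observation["player_x"], observation["goal_x"])


def dynamics(hidden, action):
    """'If I act, how will the state change?' Answered inside the model only."""
    player_x, goal_x = hidden
    player_x += action
    reward = 1.0 if player_x == goal_x else 0.0
    return (player_x, goal_x), reward


def value(hidden):
    """'How good is that state for me?' Here: closer to the goal is better."""
    player_x, goal_x = hidden
    return -abs(goal_x - player_x)


def plan(observation, actions=(-1, 0, 1), depth=3, discount=0.9):
    """Imagine every action sequence up to `depth` steps and pick the best first action."""

    def best_return(hidden, remaining):
        if remaining == 0:
            return value(hidden)
        return max(
            reward + discount * best_return(nxt, remaining - 1)
            for nxt, reward in (dynamics(hidden, a) for a in actions)
        )

    root = represent(observation)
    scores = {}
    for a in actions:
        nxt, reward = dynamics(root, a)
        scores[a] = reward + discount * best_return(nxt, depth - 1)
    return max(scores, key=scores.get)


# The cloud position and frame counter are present in the observation but never used.
obs = {"player_x": 0, "goal_x": 2, "cloud_x": 17, "frame": 1234}
best = plan(obs)  # chooses +1, a step toward the goal
```

The planner never queries the real environment after the first observation; every future step is imagined inside the compressed state, which is the sense in which MuZero plans with a model of its own.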
Nature paper publication and a new frontier in general learning
In December 2020, a paper on MuZero was published in the prestigious scientific journal Nature. The title was 'Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model.' Among AI researchers, this paper was hailed as having 'found the Holy Grail of reinforcement learning.' The significance of this paper was its declaration that AI had become the most powerful tool applicable to the 'real world.'
Hassabis had always said, 'Reality is much messier than games.' Problems like optimizing YouTube's video compression algorithm or getting a self-driving car through a complex intersection don't have clear rules like Go does. You must infer rules within a constantly changing environment and plan the best path within it.
This is where MuZero shone. Google applied MuZero's algorithm to video compression for YouTube. Without being handed hand-written rules, MuZero learned on its own how to spend bits when encoding a video, reducing the amount of data needed to deliver it at a given quality.
This was one of the first major cases where AI developed for games was used to manage uncertainty in real industrial settings. MuZero demonstrated the evolution of 'representation learning.' Just as humans perceive an apple not by its molecular structure but as the concept 'a round, red fruit,' MuZero didn't use input data (pixels) as-is but converted and stored it as 'hidden states' useful for decision-making.
This was considered a foundation for AI to think abstractly, like humans. For Hassabis, MuZero was both the final chapter of the AlphaGo series and a new beginning. If AlphaGo showed 'intuition,' AlphaGo Zero showed 'originality,' and AlphaZero showed 'generality,' then MuZero showed 'adaptability.'
An AI that can be dropped into any unfamiliar environment and still learn the rules on its own, survive, and achieve its goals. This was the moment DeepMind identified 'the core mechanism of intelligence' in its mission to 'Solve Intelligence.' MuZero's success gave DeepMind tremendous confidence. Now it was ready to move into unknown territories without known rules: protein structure prediction, climate change forecasting, nuclear fusion control, and solving mathematical puzzles.
These unknown territories, the grand game board called 'Nature,' awaited. MuZero was not a Game Over but a signal flare announcing Science Start. And at the end of that road, the glory of a 2024 Nobel Prize was waiting.
Kim Kyung-jin
Attorney · Former Member of the National Assembly · AI Policy Researcher
© 2026 Kim Kyung-jin. All rights reserved.
