AlphaZero
These are reading notes on the papers Mastering the game of Go without human knowledge and Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm.
Mastering the game of Go without human knowledge
AlphaGo Fan/AlphaGo Lee
AlphaGo was the first program to achieve superhuman performance in Go. The published version, which we refer to as AlphaGo Fan, defeated the European champion Fan Hui in October 2015.
AlphaGo Fan used two deep neural networks: a policy network that outputs move probabilities and a value network that outputs a position evaluation.
The policy network was trained initially by supervised learning to accurately predict human expert moves, and was subsequently refined by policy-gradient reinforcement learning.
The value network was trained to predict the winner of games played by the policy network against itself.
Once trained, these networks were combined with a Monte Carlo tree search (MCTS) to provide a lookahead search, using the policy network to narrow down the search to high-probability moves, and using the value network (in conjunction with Monte Carlo rollouts using a fast rollout policy) to evaluate positions in the tree.
A subsequent version, which we refer to as AlphaGo Lee, used a similar approach, and defeated Lee Sedol, the winner of 18 international titles, in March 2016.
AlphaGo Zero
AlphaGo Zero differs from AlphaGo Fan and AlphaGo Lee in several important aspects:
First, it is trained solely by self-play reinforcement learning, starting from random play, without any supervision or use of human data.
Second, it uses only the black and white stones from the board as input features.
Third, it uses a single neural network, rather than separate policy and value networks.
Finally, it uses a simpler tree search that relies upon this single neural network to evaluate positions and sample moves, without performing any Monte Carlo rollouts.
Reinforcement learning in AlphaGo Zero
Neural Network
fθ(s) = (p, v)
fθ: the neural network with parameters θ
s: the raw board representation of the current game state
p: a vector of move probabilities, where p_a = Pr(a|s) is the probability of selecting move/action a
v: a scalar evaluation, estimating the probability of the current player winning from state s
This single network combines the roles of the policy network and the value network.
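A minimal sketch of such a combined network, assuming a PyTorch-style implementation; the class name, layer sizes and trunk depth here are illustrative, whereas the paper uses a deep residual tower over a 19×19×17 stack of binary feature planes:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PolicyValueNet(nn.Module):
    """Single network f_theta(s) = (p, v): a shared trunk with a policy head and a value head.
    Layer sizes are illustrative, not the ones used in the paper."""

    def __init__(self, board_size=19, in_planes=17, channels=64):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(in_planes, channels, 3, padding=1), nn.BatchNorm2d(channels), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels), nn.ReLU(),
        )
        n = board_size * board_size
        self.policy_head = nn.Linear(channels * n, n + 1)  # one logit per board point, plus pass
        self.value_head = nn.Sequential(nn.Linear(channels * n, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, s):
        h = self.trunk(s).flatten(1)
        p = F.softmax(self.policy_head(h), dim=1)   # move probabilities p_a = Pr(a|s)
        v = torch.tanh(self.value_head(h))          # scalar value in [-1, 1]
        return p, v
```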
Training:
The neural network in AlphaGo Zero is trained from games of self-play by a novel reinforcement learning algorithm. In each position s, an MCTS search is executed, guided by the neural network fθ. The MCTS search outputs probabilities π of playing each move.
Note that π is a vector: it stores a probability for each action a.
These search probabilities usually select much stronger moves than the raw move probabilities p of the neural network fθ(s).
MCTS may therefore be viewed as a powerful policy improvement operator. Self-play with search—using the improved MCTS-based policy to select each move, then using the game winner z as a sample of the value—may be viewed as a powerful policy evaluation operator.
The main idea of the reinforcement learning algorithm is to use these search operators repeatedly in a policy iteration procedure: the neural network's parameters are updated to make the move probabilities and value (p, v) = fθ(s) more closely match the improved search probabilities and self-play winner (π, z); these new parameters are used in the next iteration of self-play to make the search even stronger.
For each move (i.e. each current state), an MCTS search αθ is run with the current state as the root node. MCTS is an experience-based method: it uses the statistics accumulated in previous simulations to guide its choices. At each node s we choose among the outgoing edges a, where each edge stores a prior probability P(s,a), a visit count N(s,a) and an action value Q(s,a). At every node the edge maximising Q + U is selected. When the selected edge leads to a position that has no corresponding node yet, the neural network fθ is used to expand it (instead of running a Monte Carlo rollout), producing the corresponding p and v; these values are then back-propagated towards the root to update the tree statistics (Q, N, and so on). This select/expand/back-propagate process is repeated, and when the search finishes it outputs the search probabilities π.
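A compact sketch of this loop, under simplifying assumptions: the state interface (copy, play) and net.evaluate are hypothetical stand-ins, U takes the PUCT form c_puct · P(s,a) · √(ΣN) / (1 + N(s,a)), and terminal positions, Dirichlet noise at the root and batched evaluation are omitted:

```python
import math

class Node:
    """Holds the edge statistics for the position reached by playing an action from its parent."""
    def __init__(self, prior):
        self.P = prior      # prior probability P(s, a) from the network
        self.N = 0          # visit count N(s, a)
        self.W = 0.0        # total action value
        self.children = {}  # action -> Node

    def Q(self):
        return self.W / self.N if self.N else 0.0

def mcts_search(root_state, net, n_simulations=800, c_puct=1.5):
    """Select -> expand/evaluate -> back-propagate, repeated n_simulations times."""
    root = Node(prior=1.0)
    for _ in range(n_simulations):
        node, state, path = root, root_state.copy(), []
        # SELECT: descend the tree, at each node taking the edge maximising Q + U
        while node.children:
            total_n = sum(child.N for child in node.children.values())
            action, node = max(
                node.children.items(),
                key=lambda kv: kv[1].Q()
                + c_puct * kv[1].P * math.sqrt(total_n) / (1 + kv[1].N))
            state.play(action)
            path.append(node)
        # EXPAND + EVALUATE: the network f_theta replaces the Monte Carlo rollout
        priors, value = net.evaluate(state)   # priors: {action: P(s,a)}, value: v for the player to move
        for action, p in priors.items():
            node.children[action] = Node(prior=p)
        # BACK-PROPAGATE: update N and W along the path, flipping the sign each ply
        for edge in reversed(path):
            value = -value                    # perspective of the player who chose this edge
            edge.N += 1
            edge.W += value
    # the search probabilities pi are derived from the root visit counts
    total = sum(child.N for child in root.children.values())
    return {a: child.N / total for a, child in root.children.items()}
```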
At each time-step t, an MCTS search π_t = α_θ(s_t) is executed using the previous iteration of the neural network, and a move is played by sampling the search probabilities π_t.
A game terminates at step T when both players pass, when the search value drops below a resignation threshold, or when the game exceeds a maximum length; the game is then scored to give a final reward of r_T ∈ {−1, +1}.
The data for each time-step t is stored as (s_t, π_t, z_t), where z_t = ±r_T is the game winner from the perspective of the current player at step t.
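A small sketch of how a finished game could be converted into these triples; function and argument names are illustrative, and r_T is assumed to be given from the perspective of the player who moved first:

```python
def game_to_examples(states, search_probs, r_T):
    """Turn one finished self-play game into (s_t, pi_t, z_t) training triples.
    `states` and `search_probs` are the recorded s_t and pi_t; r_T is the final
    reward (+1 or -1) from the perspective of the player to move at t = 0.
    z_t flips sign every ply so it is always from the current player's perspective."""
    examples = []
    for t, (s_t, pi_t) in enumerate(zip(states, search_probs)):
        z_t = r_T if t % 2 == 0 else -r_T
        examples.append((s_t, pi_t, z_t))
    return examples
```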
In parallel, new network parameters θ_i are trained from data (s, π, z) sampled uniformly among all time-steps of the last iteration(s) of self-play. The neural network fθ is adjusted to minimize the error between the predicted value v and the self-play winner z, and to maximize the similarity of the neural network move probabilities p to the search probabilities π.
fθ(s)=(p,v)
loss function: l = (z − v)^2 − π^T log p + c||θ||^2, where c controls the level of L2 weight regularisation
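A sketch of this loss for a mini-batch, assuming the network sketch above (which returns probabilities p, hence the small epsilon inside the log); in practice the L2 term is often implemented via the optimiser's weight decay instead, and the value of c below is illustrative:

```python
import torch
import torch.nn.functional as F

def alphago_zero_loss(net, s, pi, z, c=1e-4):
    """l = (z - v)^2 - pi^T log p + c * ||theta||^2 over a batch of positions."""
    p, v = net(s)                                    # p: (batch, actions), v: (batch, 1)
    value_loss = F.mse_loss(v.squeeze(-1), z)        # (z - v)^2, averaged over the batch
    policy_loss = -(pi * torch.log(p + 1e-8)).sum(dim=1).mean()   # -pi^T log p
    l2 = sum((param ** 2).sum() for param in net.parameters())    # ||theta||^2
    return value_loss + policy_loss + c * l2
```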
Empirical Analysis of AlphaGo Zero Training
Surprisingly, AlphaGo Zero outperformed AlphaGo Lee after just 36h. In comparison, AlphaGo Lee was trained over several months.
Knowledge learned by AlphaGo Zero
AlphaGo Zero discovered a remarkable level of Go knowledge during its self-play training process. This included not only fundamental elements of human Go knowledge but also non-standard strategies beyond the scope of traditional Go knowledge.
Final performance of AlphaGo Zero
Conclusion
Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm
Introduction
The AlphaGo Zero program recently achieved superhuman performance in the game of Go, by tabula rasa reinforcement learning from games of self-play. In this paper, we generalise this approach into a single AlphaZero algorithm that can achieve, tabula rasa, superhuman performance in many challenging domains, starting from random play, and given no domain knowledge except the game rules.
The AlphaGo Zero algorithm achieved superhuman performance in the game of Go, by representing Go knowledge using deep convolutional neural networks, trained solely by reinforcement learning from games of self-play.
Method
Instead of a handcrafted evaluation function and move-ordering heuristics, AlphaZero utilises a deep neural network (p, v) = fθ(s) with parameters θ. This neural network takes the board position s as an input and outputs a vector of move probabilities p with components p_a = Pr(a|s) [the probability of each move a given the state s] for each action a, and a scalar value v estimating the expected outcome z from position s, v ≈ E[z|s]. AlphaZero learns these move probabilities and value estimates entirely from self-play; these are then used to guide its search.
AlphaZero uses a general-purpose Monte Carlo tree search (MCTS) algorithm. Each search consists of a series of simulated games of self-play that traverse a tree from root s_root to leaf. Each simulation proceeds by selecting in each state s a move a with low visit count, high move probability and high value (averaged over the leaf states of simulations that selected a from s) according to the current neural network fθ. The search returns a vector π representing a probability distribution over moves, either proportionally or greedily with respect to the visit counts at the root state.
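A sketch of that final step, reusing the Node class from the MCTS sketch above; the temperature parameter tau is a common way to interpolate between the proportional and greedy choices:

```python
import numpy as np

def visit_counts_to_pi(root, tau=1.0):
    """Map root visit counts N(s_root, a) to the search probabilities pi.
    tau = 1 gives probabilities proportional to the visit counts;
    tau = 0 is treated as the greedy (most-visited) choice."""
    actions = list(root.children.keys())
    counts = np.array([root.children[a].N for a in actions], dtype=np.float64)
    if tau == 0:
        pi = np.zeros_like(counts)
        pi[counts.argmax()] = 1.0
    else:
        counts = counts ** (1.0 / tau)
        pi = counts / counts.sum()
    return dict(zip(actions, pi))
```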
The parameters θ of the deep neural network in AlphaZero are trained by self-play reinforcement learning, starting from randomly initialised parameters θ. Games are played by selecting moves for both players by MCTS, a_t ∼ π_t. At the end of the game, the terminal position s_T is scored according to the rules of the game to compute the game outcome z: −1 for a loss, 0 for a draw, and +1 for a win. The neural network parameters θ are updated so as to minimise the error between the predicted outcome v_t and the game outcome z, and to maximise the similarity of the policy vector p_t to the search probabilities π_t. Specifically, the parameters θ are adjusted by gradient descent on a loss function l that sums over mean-squared error and cross-entropy losses respectively:
fθ(s)=(p,v)
loss function: l = (z − v)^2 − π^T log p + c||θ||^2
where c is a parameter controlling the level of L2 weight regularisation. The updated parameters are used in subsequent games of self-play.
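One common way to realise the c||θ||^2 term is through the optimiser's weight decay rather than an explicit loss term; a minimal sketch, with hyperparameter values that are purely illustrative rather than the paper's schedule:

```python
import torch

def make_optimizer(net, c=1e-4):
    """SGD with momentum, where the c * ||theta||^2 regulariser is realised
    through weight_decay instead of being added to the loss explicitly."""
    return torch.optim.SGD(net.parameters(), lr=0.02, momentum=0.9, weight_decay=c)
```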
Differences between AlphaZero and AlphaGo Zero
AlphaGo Zero estimates and optimises the probability of winning, assuming binary win/loss outcomes. AlphaZero instead estimates and optimises the expected outcome, taking account of draws or potentially other outcomes.
The rules of Go are invariant to rotation and reflection. This fact was exploited in AlphaGo and AlphaGo Zero in two ways: the training data was augmented with rotated and reflected copies of each position, and positions were randomly rotated or reflected before being evaluated by the neural network during MCTS. AlphaZero does not exploit these symmetries, because the rules of chess and shogi are asymmetric.
In AlphaGo Zero, self-play games were generated by the best player from all previous iterations. After each iteration of training, the performance of the new player was measured against the best player; if it won by a margin of 55% then it replaced the best player and self-play games were subsequently generated by this new player. In contrast, AlphaZero simply maintains a single neural network that is updated continually, rather than waiting for an iteration to complete.