board_hash_list.pkl | ||
game.py | ||
generate_board_hash_list.py | ||
q_learning.py | ||
README.md |
TicTacToe
This repository is a simple implementation of the game of TicTacToe and some experiments around Reinforcement Learning with it.
Structure
game.py
contains the implementation of the game itself.generate_board_hash_list.py
creates a pickle object containing a list of hash for every possible unique non-ending variation of the board. It is useful to create the Q-table latter and need to be precomputed.q_learning.py
contains some experimentation with Q-learning using the TicTacToe game as an exemple.
Implementation details
The TicTacToe game is a Python Class. The board is a 3x3 ndarray (numpy) of the
dtype int
. Input it taken from 1 to 9 following this scheme:
+---+---+---+
| 1 | 2 | 3 |
+---+---+---+
| 4 | 5 | 6 |
+---+---+---+
| 7 | 8 | 9 |
+---+---+---+
It is automatically raveled/unravaled when necessary.
We only need to check if there is a win above 5 moves because it impossible to have a winner below this limit. At 9 moves the board is full and the game is considered draw if no one won.
Combinatorics
Without taking into account anything, we can estimate the upper bound of the
number of possible boards. There is 3^9 = 19683
possibilites.
There are 8 different symetries possibles (dihedral group of order 8, aka the symetry group of the square). This drastically reduce the number of possible boards.
Taking into account the symetries and the impossible boards (more O than X for
example), we get 765
boards.
Since we do not need to store the last board in the DAG, this number drops to
627
non-ending boards.
This make our state space size to be 627
and our action space size to be 9
.
Reward
+1
for the winning side-1
for the losign side±0
in case of draw
The reward are given only at the end of an episode, when the winner is determined. We backtract over all the states and moves to update the Q-table, given the appropriate reward for each player. Since the learning is episodic it can only be done at the end.
The learning rate \alpha
is set to 1
because the game if fully
deterministic.
We use an $\varepsilon$-greedy (expentionnally decreasing) strategy for exploration/exploitation.
The Bellman equation is simplified to the bare minimum for the special case of an episodic, deterministic, 2 player game.
Maybe some reward shaping could be done to get better result and we would also try a more complete version of the Bellman equation by considering Q[s+1,a] which we do not right now. This would necessitate to handle the special case of the winning board, which are not stored in order to reduce the state space size.