# TicTacToe

This repository is a simple implementation of the game of TicTacToe and some
experiments around Reinforcement Learning with it.

## Structure

* `game.py` contains the implementation of the game itself.
* `generate_board_hash_list.py` creates a pickle object containing a list of
  hashes, one for every possible unique non-ending variation of the board. It
  is used later to create the Q-table and needs to be precomputed. A sketch of
  how such a list can be consumed is shown below.
* `q_learning.py` contains some experimentation with Q-learning using the
  TicTacToe game as an example.
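
As a rough illustration (the file name and helpers below are hypothetical, not
the repository's actual API), such a precomputed hash list can serve as the row
index of the Q-table:

```python
import pickle

import numpy as np

# Hypothetical file name; the actual script may use a different path or format.
with open("board_hash_list.pkl", "rb") as f:
    board_hashes = pickle.load(f)  # one hash per unique non-ending board

# One Q-table row per known state, one column per cell (the 9 actions).
index_of = {h: i for i, h in enumerate(board_hashes)}
q_table = np.zeros((len(board_hashes), 9))

def q_row(board_hash):
    """Return the Q-values of the 9 actions for a known board state."""
    return q_table[index_of[board_hash]]
```
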
## Implementation details

The TicTacToe game is a Python class. The board is a 3x3 ndarray (numpy) of
dtype `int`. Input is taken from 1 to 9 following this scheme:

```
+---+---+---+
| 1 | 2 | 3 |
+---+---+---+
| 4 | 5 | 6 |
+---+---+---+
| 7 | 8 | 9 |
+---+---+---+
```

It is automatically raveled/unraveled when necessary.
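
The conversion between a move in 1-9 and a `(row, col)` pair is just the
raveling/unraveling of a flat index. A minimal sketch (the helper names are
illustrative, not necessarily the exact code of `game.py`):

```python
import numpy as np

def move_to_cell(move):
    """Convert a move in 1..9 to a (row, col) pair on the 3x3 board (row-major)."""
    return np.unravel_index(move - 1, (3, 3))

def cell_to_move(row, col):
    """Convert a (row, col) pair back to a move in 1..9."""
    return np.ravel_multi_index((row, col), (3, 3)) + 1

board = np.zeros((3, 3), dtype=int)
board[move_to_cell(5)] = 1      # player 1 takes the center cell (row 1, col 1)
assert cell_to_move(1, 1) == 5
```
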
We only need to check for a win from the 5th move onward because it is
impossible to have a winner before that. At 9 moves the board is full and the
game is considered a draw if no one has won.
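
A sketch of how this check can be gated on the move count (illustrative, not
necessarily the exact logic of `game.py`):

```python
import numpy as np

def has_winner(board):
    """True if any row, column or diagonal holds three identical non-zero marks."""
    lines = list(board) + list(board.T) + [board.diagonal(), np.fliplr(board).diagonal()]
    return any(line[0] != 0 and line[0] == line[1] == line[2] for line in lines)

def is_finished(board, n_moves):
    """A game can only be won from the 5th move on; a full board with no winner is a draw."""
    if n_moves >= 5 and has_winner(board):
        return True
    return n_moves == 9
```
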
## Combinatorics

Without taking anything into account, we can estimate an upper bound on the
number of possible boards: there are $3^9 = 19683$ possibilities.

There are 8 different possible symmetries (the dihedral group of order 8,
a.k.a. the symmetry group of the square). This drastically reduces the number
of possible boards.

Taking into account the symmetries and the impossible boards (more O than X,
for example), we get $765$ boards.
Since we do not need to store the last board in the DAG, this number drops to
$627$ non-ending boards.

This makes our state space size $627$ and our action space size $9$.
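
Both counts can be reproduced with a small brute-force enumeration. The sketch
below (illustrative, not the repository's `generate_board_hash_list.py`) walks
every game from the empty board and deduplicates positions through their
canonical form under the 8 symmetries:

```python
import numpy as np

def symmetries(board):
    """Yield the 8 dihedral transformations of a 3x3 board."""
    for k in range(4):
        rotated = np.rot90(board, k)
        yield rotated
        yield np.fliplr(rotated)

def canonical(board):
    """Pick a unique representative among the 8 symmetric variants."""
    return min(b.tobytes() for b in symmetries(board))

def has_winner(board):
    """True if either player has three in a row."""
    lines = list(board) + list(board.T) + [board.diagonal(), np.fliplr(board).diagonal()]
    return any(line[0] != 0 and line[0] == line[1] == line[2] for line in lines)

def count_positions():
    """Count canonical positions reachable through legal play."""
    start = np.zeros((3, 3), dtype=int)
    seen = {canonical(start)}
    non_ending = set()
    frontier = [start]
    while frontier:
        next_frontier = []
        for board in frontier:
            if has_winner(board) or not (board == 0).any():
                continue                      # terminal board: no further moves
            non_ending.add(canonical(board))
            player = 1 if (board == 1).sum() == (board == 2).sum() else 2
            for row, col in zip(*np.where(board == 0)):
                child = board.copy()
                child[row, col] = player
                key = canonical(child)
                if key not in seen:
                    seen.add(key)
                    next_frontier.append(child)
        frontier = next_frontier
    return len(seen), len(non_ending)

print(count_positions())  # (765, 627): all reachable boards, then non-ending ones
```
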
## Reward

* `+1` for the winning side
* `-1` for the losing side
* `±0` in case of a draw

The rewards are given only at the end of an episode, when the winner is
determined. We backtrack over all the states and moves to update the Q-table,
giving the appropriate reward to each player.
Since the learning is episodic, it can only be done at the end.
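
A minimal sketch of this backward pass (hypothetical names; `q_learning.py`
may organise it differently):

```python
ALPHA = 1.0  # learning rate (see the note on alpha below)

def update_q_table(q_table, episode, winner):
    """Backtrack over a finished episode and apply the terminal reward.

    `episode` is a list of (player, state_index, action) triples in play
    order; `winner` is 1, 2, or 0 for a draw.
    """
    for player, state_index, action in reversed(episode):
        if winner == 0:
            reward = 0.0
        else:
            reward = 1.0 if player == winner else -1.0
        # Simplified update: no discounting, no bootstrapping on Q[s+1, a].
        q_table[state_index, action] += ALPHA * (reward - q_table[state_index, action])
```
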
The learning rate $\alpha$ is set to $1$ because the game is fully
deterministic.

We use an $\varepsilon$-greedy (exponentially decreasing) strategy for
exploration/exploitation.
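
The decay schedule is presumably along these lines (the names and constants
are illustrative, not taken from the repository):

```python
import random

import numpy as np

EPS_START, EPS_END, DECAY_RATE = 1.0, 0.01, 1e-4   # illustrative values

def epsilon(episode):
    """Exponentially decreasing exploration rate."""
    return EPS_END + (EPS_START - EPS_END) * np.exp(-DECAY_RATE * episode)

def choose_action(q_row, legal_actions, episode):
    """Epsilon-greedy: explore with probability epsilon, otherwise exploit."""
    if random.random() < epsilon(episode):
        return random.choice(legal_actions)            # explore
    return max(legal_actions, key=lambda a: q_row[a])  # exploit

# `legal_actions` would be the column indices (0-8) of the empty cells.
```
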
The Bellman equation is simplified to the bare minimum for the special case of
an episodic, deterministic, two-player game.
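
With rewards only at the terminal state, $\alpha = 1$ and no bootstrapping
term, the update for each visited pair $(s, a)$ presumably collapses to

$$Q[s, a] \leftarrow Q[s, a] + \alpha \, (r - Q[s, a]) = r$$
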
Maybe some reward shaping could be done to get better results, and we could
also try a more complete version of the Bellman equation by considering
$Q[s+1, a]$, which we do not do right now. This would require handling the
special case of the winning boards, which are not stored in order to reduce
the state space size.
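
For reference, the standard single-agent tabular Q-learning update that does
use the next state would be

$$Q[s, a] \leftarrow Q[s, a] + \alpha \, \bigl(r + \gamma \max_{a'} Q[s', a'] - Q[s, a]\bigr)$$

where $s'$ is the next state and $\gamma$ a discount factor.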