# TicTacToe

This repository is a simple implementation of the game of TicTacToe and some
experiments around Reinforcement Learning with it.

## Structure

* `game.py` contains the implementation of the game itself.
* `generate_board_hash_list.py` creates a pickle object containing a list of hashes,
  one for every possible unique non-ending variation of the board. It is used later
  to create the Q-table and needs to be precomputed.
* `q_learning.py` contains some experimentation with Q-learning using the
  TicTacToe game as an example.

## Implementation details

The TicTacToe game is implemented as a Python class. The board is a 3x3 NumPy
ndarray of dtype `int`. Input is taken from 1 to 9 following this scheme:

```
+---+---+---+
| 1 | 2 | 3 |
+---+---+---+
| 4 | 5 | 6 |
+---+---+---+
| 7 | 8 | 9 |
+---+---+---+
```

The input index is automatically raveled/unraveled when necessary.
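
To make the mapping concrete, here is a minimal sketch of the 1-9 ↔ (row, col)
conversion (the helper names are illustrative, not the class's actual API):

```python
import numpy as np

def to_coords(cell: int) -> tuple:
    """Map a cell number in [1, 9] to a (row, col) index on the 3x3 board."""
    return np.unravel_index(cell - 1, (3, 3))

def to_cell(row: int, col: int) -> int:
    """Map a (row, col) index back to a cell number in [1, 9]."""
    return int(np.ravel_multi_index((row, col), (3, 3))) + 1

board = np.zeros((3, 3), dtype=int)
board[to_coords(5)] = 1   # player 1 plays the centre cell
assert to_cell(1, 1) == 5
```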

We only need to check for a win from the 5th move onward, because it is
impossible to have a winner before that point. At 9 moves the board is full and
the game is considered a draw if no one has won.
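
Below is a sketch of such a check, assuming cells hold `0` (empty), `1` (X) and
`-1` (O); the actual encoding and method names in `game.py` may differ:

```python
import numpy as np

def check_winner(board: np.ndarray, n_moves: int):
    """Return 1 or -1 for a winner, 0 for a draw, None if the game goes on."""
    if n_moves < 5:                      # no line of three can exist before move 5
        return None
    lines = [*board, *board.T, np.diag(board), np.diag(np.fliplr(board))]
    for line in lines:
        total = int(line.sum())
        if abs(total) == 3:              # three identical marks in a row
            return total // 3
    if n_moves == 9:                     # full board, no winner
        return 0
    return None
```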

## Combinatorics

Without taking anything into account, we can estimate an upper bound on the
number of possible boards: there are `3**9 = 19683` possibilities.

There are 8 possible symmetries (the dihedral group of order 8, a.k.a. the
symmetry group of the square). This drastically reduces the number of possible
boards.

Taking into account the symmetries and the impossible boards (more O than X,
for example), we get `765` boards.

Since we do not need to store the last board in the DAG, this number drops to
`627` non-ending boards.

This makes our state space size `627` and our action space size `9`.
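
For reference, these counts can be reproduced by a brute-force enumeration over
the 8 symmetries. The sketch below is illustrative and not necessarily how
`generate_board_hash_list.py` proceeds (it assumes cells hold `0`, `1` or `-1`):

```python
import numpy as np

def symmetries(board: np.ndarray):
    """Yield the 8 images of the board under the dihedral group of the square."""
    for k in range(4):
        rotated = np.rot90(board, k)
        yield rotated
        yield np.fliplr(rotated)

def canonical(board: np.ndarray) -> bytes:
    """Canonical key: the smallest byte string over all 8 symmetric boards."""
    return min(b.tobytes() for b in symmetries(board))

def has_winner(board: np.ndarray) -> bool:
    lines = [*board, *board.T, np.diag(board), np.diag(np.fliplr(board))]
    return any(abs(int(line.sum())) == 3 for line in lines)

def count_positions():
    seen, non_ending = set(), set()
    stack = [np.zeros((3, 3), dtype=np.int8)]
    while stack:
        board = stack.pop()
        key = canonical(board)
        if key in seen:
            continue
        seen.add(key)
        if has_winner(board) or not (board == 0).any():
            continue                     # ended board: counted but not expanded
        non_ending.add(key)
        player = 1 if (board != 0).sum() % 2 == 0 else -1
        for row, col in zip(*np.where(board == 0)):
            child = board.copy()
            child[row, col] = player
            stack.append(child)
    return len(seen), len(non_ending)

print(count_positions())   # expected: (765, 627)
```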

## Reward

* `+1` for the winning side
* `-1` for the losing side
* `±0` in case of a draw

The rewards are given only at the end of an episode, when the winner is
determined. We then backtrack over all the states and moves to update the
Q-table, giving the appropriate reward to each player. Since the learning is
episodic, it can only be done at the end.
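
As an illustration of this backtracking step (the data structures and names
here are hypothetical, not the exact ones in `q_learning.py`):

```python
ALPHA = 1.0  # learning rate, set to 1 because the game is fully deterministic

def update_q_table(q_table: dict, history: list, winner: int) -> None:
    """Backtrack over one finished episode and propagate the final reward.

    `history` holds (player, state_hash, action) triples in play order and
    `winner` is 1, -1 or 0 (draw), matching the reward scheme above.
    """
    for player, state_hash, action in reversed(history):
        if winner == 0:
            reward = 0.0
        else:
            reward = 1.0 if player == winner else -1.0
        old = q_table.get((state_hash, action), 0.0)
        # Simplified update: no bootstrapping on the next state's value.
        q_table[(state_hash, action)] = old + ALPHA * (reward - old)
```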

The learning rate α is set to `1` because the game is fully
deterministic.

We use an ε-greedy strategy (with exponentially decreasing ε) for
exploration/exploitation.
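
A possible implementation of this exploration schedule (the decay constants and
helper names are arbitrary choices for illustration):

```python
import math
import random

EPS_START, EPS_END, EPS_DECAY = 1.0, 0.05, 1e-3

def epsilon(episode: int) -> float:
    """Exponentially decay ε from EPS_START towards EPS_END."""
    return EPS_END + (EPS_START - EPS_END) * math.exp(-EPS_DECAY * episode)

def choose_action(q_table: dict, state_hash, legal_actions: list, episode: int):
    """ε-greedy: explore with probability ε, otherwise exploit the Q-table."""
    if random.random() < epsilon(episode):
        return random.choice(legal_actions)
    return max(legal_actions, key=lambda a: q_table.get((state_hash, a), 0.0))
```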

The Bellman equation is simplified to the bare minimum for the special case of
an episodic, deterministic, two-player game.
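
Concretely, with no discounting and no bootstrapping on the next state's value,
the update reduces to `Q(s, a) ← Q(s, a) + α · (r − Q(s, a))` instead of the
full rule `Q(s, a) ← Q(s, a) + α · (r + γ · max_a' Q(s', a') − Q(s, a))` (see
the sketch above).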

Some reward shaping could perhaps be done to get better results, and we could
also try a more complete version of the Bellman equation by considering
Q[s+1, a], which we do not do right now. This would require handling the
special case of the winning boards, which are not stored in order to reduce the
state space size.