# TicTacToe

This repository is a simple implementation of the game of TicTacToe and some
experiments around Reinforcement Learning with it.

## Structure

* `game.py` contains the implementation of the game itself.
* `generate_board_hash_list.py` creates a pickle object containing a list of hashes,
  one for every possible unique non-ending variation of the board. It is used later
  to create the Q-table and needs to be precomputed.
* `q_learning.py` contains some experimentation with Q-learning using the
  TicTacToe game as an example.

## Implementation details

The TicTacToe game is implemented as a Python class. The board is a 3x3 NumPy
ndarray of dtype `int`. Input is taken from 1 to 9 following this scheme:

```
+---+---+---+
| 1 | 2 | 3 |
+---+---+---+
| 4 | 5 | 6 |
+---+---+---+
| 7 | 8 | 9 |
+---+---+---+
```

The input index is automatically raveled/unraveled when necessary.
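
To make the mapping concrete, here is a minimal sketch of the 1-9 ↔ (row, col)
conversion (the helper names are illustrative, not the class's actual API):

```python
import numpy as np

def to_coords(cell: int) -> tuple:
    """Map a cell number in [1, 9] to a (row, col) index on the 3x3 board."""
    return np.unravel_index(cell - 1, (3, 3))

def to_cell(row: int, col: int) -> int:
    """Map a (row, col) index back to a cell number in [1, 9]."""
    return int(np.ravel_multi_index((row, col), (3, 3))) + 1

board = np.zeros((3, 3), dtype=int)
board[to_coords(5)] = 1   # player 1 plays the centre cell
assert to_cell(1, 1) == 5
```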

We only need to check for a win from the 5th move onward, because it is
impossible to have a winner before that point. At 9 moves the board is full and
the game is considered a draw if no one has won.
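
Below is a sketch of such a check, assuming cells hold `0` (empty), `1` (X) and
`-1` (O); the actual encoding and method names in `game.py` may differ:

```python
import numpy as np

def check_winner(board: np.ndarray, n_moves: int):
    """Return 1 or -1 for a winner, 0 for a draw, None if the game goes on."""
    if n_moves < 5:                      # no line of three can exist before move 5
        return None
    lines = [*board, *board.T, np.diag(board), np.diag(np.fliplr(board))]
    for line in lines:
        total = int(line.sum())
        if abs(total) == 3:              # three identical marks in a row
            return total // 3
    if n_moves == 9:                     # full board, no winner
        return 0
    return None
```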

## Combinatorics

Without taking anything into account, we can estimate an upper bound on the
number of possible boards: there are `3**9 = 19683` possibilities.

There are 8 possible symmetries (the dihedral group of order 8, a.k.a. the
symmetry group of the square). This drastically reduces the number of possible
boards.

Taking into account the symmetries and the impossible boards (more O than X,
for example), we get `765` boards.

Since we do not need to store the last board in the DAG, this number drops to
`627` non-ending boards.

This makes our state space size `627` and our action space size `9`.
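
For reference, these counts can be reproduced by a brute-force enumeration over
the 8 symmetries. The sketch below is illustrative and not necessarily how
`generate_board_hash_list.py` proceeds (it assumes cells hold `0`, `1` or `-1`):

```python
import numpy as np

def symmetries(board: np.ndarray):
    """Yield the 8 images of the board under the dihedral group of the square."""
    for k in range(4):
        rotated = np.rot90(board, k)
        yield rotated
        yield np.fliplr(rotated)

def canonical(board: np.ndarray) -> bytes:
    """Canonical key: the smallest byte string over all 8 symmetric boards."""
    return min(b.tobytes() for b in symmetries(board))

def has_winner(board: np.ndarray) -> bool:
    lines = [*board, *board.T, np.diag(board), np.diag(np.fliplr(board))]
    return any(abs(int(line.sum())) == 3 for line in lines)

def count_positions():
    seen, non_ending = set(), set()
    stack = [np.zeros((3, 3), dtype=np.int8)]
    while stack:
        board = stack.pop()
        key = canonical(board)
        if key in seen:
            continue
        seen.add(key)
        if has_winner(board) or not (board == 0).any():
            continue                     # ended board: counted but not expanded
        non_ending.add(key)
        player = 1 if (board != 0).sum() % 2 == 0 else -1
        for row, col in zip(*np.where(board == 0)):
            child = board.copy()
            child[row, col] = player
            stack.append(child)
    return len(seen), len(non_ending)

print(count_positions())   # expected: (765, 627)
```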

## Reward

* `+1` for the winning side
* `-1` for the losing side
* `±0` in case of a draw

The rewards are given only at the end of an episode, when the winner is
determined. We then backtrack over all the states and moves to update the
Q-table, giving the appropriate reward to each player. Since the learning is
episodic, it can only be done at the end.
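
As an illustration of this backtracking step (the data structures and names
here are hypothetical, not the exact ones in `q_learning.py`):

```python
ALPHA = 1.0  # learning rate, set to 1 because the game is fully deterministic

def update_q_table(q_table: dict, history: list, winner: int) -> None:
    """Backtrack over one finished episode and propagate the final reward.

    `history` holds (player, state_hash, action) triples in play order and
    `winner` is 1, -1 or 0 (draw), matching the reward scheme above.
    """
    for player, state_hash, action in reversed(history):
        if winner == 0:
            reward = 0.0
        else:
            reward = 1.0 if player == winner else -1.0
        old = q_table.get((state_hash, action), 0.0)
        # Simplified update: no bootstrapping on the next state's value.
        q_table[(state_hash, action)] = old + ALPHA * (reward - old)
```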

The learning rate α is set to `1` because the game is fully
deterministic.

We use an ε-greedy strategy (with exponentially decreasing ε) for
exploration/exploitation.
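
A possible implementation of this exploration schedule (the decay constants and
helper names are arbitrary choices for illustration):

```python
import math
import random

EPS_START, EPS_END, EPS_DECAY = 1.0, 0.05, 1e-3

def epsilon(episode: int) -> float:
    """Exponentially decay ε from EPS_START towards EPS_END."""
    return EPS_END + (EPS_START - EPS_END) * math.exp(-EPS_DECAY * episode)

def choose_action(q_table: dict, state_hash, legal_actions: list, episode: int):
    """ε-greedy: explore with probability ε, otherwise exploit the Q-table."""
    if random.random() < epsilon(episode):
        return random.choice(legal_actions)
    return max(legal_actions, key=lambda a: q_table.get((state_hash, a), 0.0))
```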

The Bellman equation is simplified to the bare minimum for the special case of
an episodic, deterministic, two-player game.
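
Concretely, with no discounting and no bootstrapping on the next state's value,
the update reduces to `Q(s, a) ← Q(s, a) + α · (r − Q(s, a))` instead of the
full rule `Q(s, a) ← Q(s, a) + α · (r + γ · max_a' Q(s', a') − Q(s, a))` (see
the sketch above).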

Some reward shaping could perhaps be done to get better results, and we could
also try a more complete version of the Bellman equation by considering
Q[s+1, a], which we do not do right now. This would require handling the
special case of the winning boards, which are not stored in order to reduce the
state space size.