considered a draw if no one won.
## Combinatorics

Without taking anything into account, we can estimate an upper bound on the
number of possible boards: there are `3**9 = 19683` possibilities.

There are 8 possible symmetries (the dihedral group of order 8, aka the
symmetry group of the square). This drastically reduces the number of possible
boards.

Taking into account the symmetries and the impossible boards (more O than X,
for example), we get `765` boards.

Since we do not need to store the last board in the DAG, this number drops to
`627` non-ending boards.

This makes our state space size `627` and our action space size `9`.
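
These counts can be verified by brute force. The sketch below (illustrative code, not necessarily the project's implementation) enumerates every board reachable from the empty board, canonicalizes it under the 8 symmetries, and counts the distinct positions:

```python
WINS = [(0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
        (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
        (0, 4, 8), (2, 4, 6)]              # diagonals

def winner(b):
    for i, j, k in WINS:
        if b[i] != ' ' and b[i] == b[j] == b[k]:
            return b[i]
    return None

# The 8 symmetries of the square, as index permutations of the 9 cells.
SYMS = []
m = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
for _ in range(4):
    SYMS.append([c for row in m for c in row])            # rotation
    SYMS.append([c for row in m for c in reversed(row)])  # rotation + mirror
    m = [list(r) for r in zip(*m[::-1])]                  # rotate 90 degrees

def canon(b):
    """Smallest string among the 8 symmetric variants of a board."""
    return min(''.join(b[i] for i in p) for p in SYMS)

def explore():
    """Enumerate every board reachable from the empty board (X plays first)."""
    seen, stack = set(), [' ' * 9]
    states, terminal = set(), set()
    while stack:
        b = stack.pop()
        if b in seen:
            continue
        seen.add(b)
        states.add(canon(b))
        if winner(b) or ' ' not in b:       # game over: win or draw
            terminal.add(canon(b))
            continue
        player = 'X' if b.count('X') == b.count('O') else 'O'
        stack.extend(b[:i] + player + b[i + 1:]
                     for i in range(9) if b[i] == ' ')
    return states, terminal

states, terminal = explore()
print(len(states), len(states - terminal))  # 765 boards, 627 non-ending
```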

## Reward

determined. We backtrack over all the states and moves to update the Q-table,
given the appropriate reward for each player.
Since the learning is episodic, it can only be done at the end.
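
As a sketch of that backward pass (the names, the discount, and the zero-sum sign flip are assumptions of this example, not taken from the source):

```python
def update_from_episode(q_table, episode, reward, gamma=0.9):
    """Propagate the final reward back through a finished episode.

    `episode` is the list of (state, action) pairs in play order;
    `reward` is from the viewpoint of the player who moved last.
    """
    target = reward
    for state, action in reversed(episode):
        # With a learning rate of 1 the update is a plain assignment.
        q_table[(state, action)] = target
        # Flip the sign for the opponent (zero-sum game) and discount
        # as we move further away from the terminal state.
        target = -gamma * target
```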

The learning rate α is set to `1` because the game is fully
deterministic.
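
Concretely, the standard Q-learning update

```
Q(s, a) ← Q(s, a) + α·(r + γ·max_a' Q(s', a') - Q(s, a))
```

collapses with α = `1` to a plain assignment:

```
Q(s, a) ← r + γ·max_a' Q(s', a')
```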

We use an ε-greedy (exponentially decreasing) strategy for
exploration/exploitation.
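
For instance (the constants and helper names below are illustrative, not the project's):

```python
import math
import random

def epsilon(step, eps_start=1.0, eps_min=0.05, decay=1e-4):
    """Exponentially decaying exploration rate (constants are illustrative)."""
    return eps_min + (eps_start - eps_min) * math.exp(-decay * step)

def choose_action(q_table, state, actions, step):
    """Explore with probability epsilon(step), otherwise act greedily."""
    if random.random() < epsilon(step):
        return random.choice(actions)
    return max(actions, key=lambda a: q_table.get((state, a), 0.0))
```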

The Bellman equation is simplified to the bare minimum for the special case of