tic-tac-toe-rl/README.md
2021-05-17 00:53:53 +02:00

2.6 KiB
Raw Permalink Blame History

TicTacToe

This repository is a simple implementation of the game of TicTacToe and some experiments around Reinforcement Learning with it.

Structure

  • game.py contains the implementation of the game itself.
  • generate_board_hash_list.py creates a pickle object containing a list of hash for every possible unique non-ending variation of the board. It is useful to create the Q-table latter and need to be precomputed.
  • q_learning.py contains some experimentation with Q-learning using the TicTacToe game as an exemple.

Implementation details

The TicTacToe game is a Python Class. The board is a 3x3 ndarray (numpy) of the dtype int. Input it taken from 1 to 9 following this scheme:

+---+---+---+
| 1 | 2 | 3 |
+---+---+---+
| 4 | 5 | 6 |
+---+---+---+
| 7 | 8 | 9 |
+---+---+---+

It is automatically raveled/unravaled when necessary.

We only need to check if there is a win above 5 moves because it impossible to have a winner below this limit. At 9 moves the board is full and the game is considered draw if no one won.

Combinatorics

Without taking into account anything, we can estimate the upper bound of the number of possible boards. There is 3**9 = 19683 possibilites.

There are 8 different symetries possibles (dihedral group of order 8, aka the symetry group of the square). This drastically reduce the number of possible boards.

Taking into account the symetries and the impossible boards (more O than X for example), we get 765 boards.

Since we do not need to store the last board in the DAG, this number drops to 627 non-ending boards.

This make our state space size to be 627 and our action space size to be 9.

Reward

  • +1 for the winning side
  • -1 for the losign side
  • ±0 in case of draw

The reward are given only at the end of an episode, when the winner is determined. We backtract over all the states and moves to update the Q-table, given the appropriate reward for each player. Since the learning is episodic it can only be done at the end.

The learning rate α is set to 1 because the game is fully deterministic.

We use an ε-greedy (expentionnally decreasing) strategy for exploration/exploitation.

The Bellman equation is simplified to the bare minimum for the special case of an episodic, deterministic, 2 player game.

Maybe some reward shaping could be done to get better result and we would also try a more complete version of the Bellman equation by considering Q[s+1,a] which we do not right now. This would necessitate to handle the special case of the winning board, which are not stored in order to reduce the state space size.