This repository is a simple implementation of the game of TicTacToe and some experiments around Reinforcement Learning with it.
`game.py` contains the implementation of the game itself.
`generate_board_hash_list.py` creates a pickle object containing a list of hashes, one for every possible unique non-ending configuration of the board. It is needed to create the Q-table later and must be precomputed.
`q_learning.py` contains some experimentation with Q-learning, using the TicTacToe game as an example.
The TicTacToe game is a Python class. The board is a 3x3 NumPy ndarray of ints. Input is taken from 1 to 9 following this scheme:
```
+---+---+---+
| 1 | 2 | 3 |
+---+---+---+
| 4 | 5 | 6 |
+---+---+---+
| 7 | 8 | 9 |
+---+---+---+
```
It is automatically raveled/unraveled when necessary.
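The mapping between the 1-9 input scheme and board coordinates can be sketched with NumPy's index helpers (hypothetical helper names, not the repo's actual code):

```python
import numpy as np

# Hypothetical helpers mirroring the 1-9 input scheme described above.
def to_row_col(move):
    """Map a move in 1..9 to a (row, col) pair on the 3x3 board."""
    return np.unravel_index(move - 1, (3, 3))

def to_move(row, col):
    """Map a (row, col) pair back to a move in 1..9."""
    return np.ravel_multi_index((row, col), (3, 3)) + 1
```

With this scheme, move 5 lands on the center cell (1, 1) and move 7 on the bottom-left cell (2, 0).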
We only need to check for a win from the fifth move onward, because it is impossible to have a winner before then. At 9 moves the board is full, and the game is declared a draw if no one has won.
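A win check along these lines could look as follows (a sketch under the assumption that players are encoded as +1 and -1 with 0 for empty cells, which may differ from `game.py`):

```python
import numpy as np

# Assumed encoding: +1 and -1 for the players, 0 for empty cells.
# A row, column, or diagonal summing to +-3 means that player has won.
def check_win(board):
    candidates = [board.sum(axis=0),                       # column sums
                  board.sum(axis=1),                       # row sums
                  [board.trace(), np.fliplr(board).trace()]]  # diagonals
    for sums in candidates:
        for s in np.ravel(sums):
            if abs(s) == 3:
                return int(np.sign(s))  # +1 or -1: the winner
    return 0  # no winner (yet)
```

Calling this only from the fifth move onward saves a few checks per game, since the first player cannot complete a line before their third move.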
Without taking anything into account, we can estimate an upper bound on the number of possible boards: there are 3**9 = 19683 possibilities.
There are 8 possible symmetries (the dihedral group of order 8, i.e. the symmetry group of the square). Taking these into account drastically reduces the number of possible boards.
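One way to exploit these symmetries is to hash only a canonical representative of each board's orbit, generated from rotations and the transpose (an assumed approach; `generate_board_hash_list.py` may do this differently):

```python
import numpy as np

# Sketch of canonicalizing a board under the 8 symmetries of the square.
def symmetries(board):
    """Yield the 8 images of the board under the dihedral group of the square."""
    for b in (board, board.T):   # identity and transpose...
        for k in range(4):       # ...each rotated by 0, 90, 180, 270 degrees
            yield np.rot90(b, k)

def canonical_hash(board):
    """Hash of the lexicographically smallest symmetric image."""
    return min(s.tobytes() for s in symmetries(board))
```

Any two boards related by a rotation or reflection then share the same hash, so the state space only needs one entry per orbit.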
Taking into account the symmetries and the impossible boards (more O than X, for example) reduces this number much further. Since we do not need to store the final boards in the DAG, it drops to 627 non-ending boards.
This makes our state space size 627 and our action space size 9 (one action per cell).
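Given these sizes, one plausible layout for the Q-table (an assumption, not necessarily what `q_learning.py` uses) is a dict mapping each canonical board hash to a vector of 9 action values:

```python
import numpy as np

# Hypothetical Q-table layout: one row of 9 action values per board hash.
def make_q_table(board_hashes):
    return {h: np.zeros(9) for h in board_hashes}
```

With the 627 precomputed hashes from `board_hash_list.pkl`, the table would hold 627 * 9 = 5643 values in total.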
- `+1` for the winning side
- `-1` for the losing side
- `±0` in case of a draw
The rewards are given only at the end of an episode, when the winner is determined. We backtrack over all the states and moves to update the Q-table, giving the appropriate reward to each player. Since the learning is episodic, it can only be done at the end.
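The backward pass over an episode could be sketched like this (assumed names and structure; the actual code in `q_learning.py` may differ). Walking the move history in reverse and flipping the reward's sign at each step credits each player's moves from their own point of view:

```python
import numpy as np

# `history` is the list of (state_hash, action) pairs in play order;
# `reward` is +1 if the last mover won, -1 if they lost, 0 for a draw.
def update_q(Q, history, reward, alpha=1.0):
    r = reward
    for state, action in reversed(history):
        # Simplified episodic update: move Q towards the (alternating) reward.
        Q[state][action] += alpha * (r - Q[state][action])
        r = -r  # the previous move was made by the opponent
```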
The learning rate α is set to 1 because the game is fully deterministic.
We use an ε-greedy (exponentially decreasing) strategy for the exploration/exploitation trade-off.
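An exponentially decaying ε-greedy policy can be sketched as follows (the constants and function names here are assumptions, not taken from the repo):

```python
import math
import random

# Assumed decay schedule: epsilon shrinks from eps_start towards eps_end.
def epsilon(episode, eps_start=1.0, eps_end=0.05, decay=1e-3):
    return eps_end + (eps_start - eps_end) * math.exp(-decay * episode)

def choose_action(q_row, legal_moves, eps):
    if random.random() < eps:
        return random.choice(legal_moves)          # explore: random legal move
    return max(legal_moves, key=lambda a: q_row[a])  # exploit: best Q-value
```

Early episodes are thus almost fully random, while later episodes mostly follow the learned Q-values.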
The Bellman equation is simplified to the bare minimum for the special case of an episodic, deterministic, two-player game.
Some reward shaping could be done to get better results, and one could also try a more complete version of the Bellman equation by considering Q[s+1, a], which we do not do right now. This would require handling the special case of the winning boards, which are not stored in order to keep the state space small.