Hexapawn

Reinforcement Learning

A brief history

In 1961, computer scientist Donald Michie, who had assisted Alan Turning with breaking German codes during World War II, pioneered a method of teaching a “computer” to play games without utilizing a single electronic chip. He constructed MENACE (Matchbox Educable Noughts and Crosses Engine) employing 304 matchboxes, a pencil, paper, and colored beads. Each matchbox represented a potential board position in Tic-Tac-Toe, while the beads symbolized permissible moves.

In March 1962, Martin Gardner adapted this concept for his Scientific American column, “Mathematical Games.” He introduced Hexapawn, a game played on a 3×3 grid. Due to Hexapawn’s simplicity compared to Tic-Tac-Toe, it necessitated significantly fewer matchboxes and beads, facilitating the construction of personal learning machines by readers in the 1960s.

“Hexapawn is so simple that the reader can construct a matchbox computer to play it in about an hour.” — Martin Gardner

The Rules of Hexapawn

The game is played on a 3×3 board with three white pawns against three black pawns. In the illustrations below, white pawns always move towards the top of the board, and black pawns always move towards the bottom of the board.

Movement: White always moves first. A pawn moves one square forward to an empty square.

Capture: A pawn captures an enemy pawn diagonally forward. Diagonal moves are only allowed when capturing.

Winning: You win by (1) reaching the opponent’s home row, (2) capturing all enemy pieces, or (3) leaving the opponent with no legal moves.

The machine learning demonstration below is written entirely in HTML, CSS, and JavaScript. It runs locally in your browser with no external resources used for training.

Starting Point: Find All Moves

To teach a computer, we must first identify every possible state the game can be in. We use Recursion to generate these states.

Recursion Analogy: Think of it like exploring a labyrinth. Every time you hit a fork in the road, you create a “mini-version” of yourself to explore that path. Each mini-version does the same until they either find an exit (win) or a dead end (loss), reporting back to the original explorer.

Utilize the widget provided below to access the recursively generated boards. Employ the “EXPAND ALL” button to view all 134 boards, and the “COLLAPSE ALL” button to revert the selection to the initial state. Additionally, you can select the checkboxes adjacent to any potential board arrangement to observe the available moves from that specific configuration. Unselecting the checkbox next to a possible arrangement will conceal that arrangement.
   

 
 
 

Reinforcement Learning

The computer teaches the Black player to improve using Reinforcement Learning. This is a subset of Machine Learning (which is a subset of AI) where an agent learns through rewards and penalties.

How it works:
1. The computer plays 100 games.
2. In each state, it looks at its list of possible moves (the “matchbox”).
3. It picks a move at random from that list.
4. Moves are selected at random until no more moves are available.
5a. If Black wins: No changes are made. The strategy was successful.
5b. If Black loses: The computer finds the very last move Black made and removes it from that state’s list. Black will never make that specific mistake in that specific position again.

After 100 games, Black’s available moves are reduced until only the best choices remain.

Click the “START 100-GAME TRAINING” button below to see how quickly the game is trained, and to get an initial idea of the progress. If you reload the page and train again, you’ll note that the training isn’t exactly the same each time.

X-axis: Number of training games.
Y-axis: Number of games won by black.

Uptick: A win for black.
Downtick: A win for white.

Points ABOVE 0 mark:
Indicates black is winning more than white.
Points BELOW 0 mark: Indicates white is winning more than black.

What actually happened during the training games?

It is one thing to observe the improvement of Black over time, solely relying on the straightforward strategy of Black removing its final move. However, what transpired during the training games?

By activating the “SHOW/HIDE TRAINING GAMES” button located below, you can access a comprehensive view of all the moves (presented in chronological order from left to right) of each game for the entire 100 games. The text beneath each game indicates the winner of that particular game and also signifies that Black removed its final move, in the event of Black’s defeat.

What is the resulting strategy?

Upon commencing with every conceivable move for Hexapawn and conducting 100 games in a manner that enhances Black’s strategic prowess, the resultant set of moves emerges as strongly advantageous to Black.

To access the boards comprising the strategy, utilize the widget provided below. Employ the “EXPAND ALL” button to view all the resulting boards, and the “COLLAPSE ALL” button to revert the selection to its initial state. Furthermore, select the checkboxes adjacent to any potential board arrangement to observe the available moves from that particular configuration. Unselecting the checkbox next to a potential arrangement will conceal that arrangement.