# Final Report: Parallel Game Tree Evaluation

Nancy Xiao (nnx) and Jemmin Chang (jemminc)

## Summary

We investigated how best to achieve speedup on a multi-core CPU for the task of game tree evaluation. Starting from a baseline of sequential minimax, we implemented and optimized parallel minimax and sequential and parallel alpha-beta pruning algorithms, testing them on Gomoku (generalized Tic-Tac-Toe) on the latedays cluster.

Our code is on GitHub.

## Background

Overall, we specified the number of plies (depth of the game tree), trials, and turns to run for the game. After running the game for the specified number of turns, each searched to the specified depth in the game tree, we recorded the output: the time in seconds it took to complete those turns. Implementation-wise, we wanted a generalized framework that could work with any solver algorithm and any two-player combinatorial game, so we created two interfaces: one for games and one for solvers.

1. Game:
    1. Gomoku - a generalized version of Tic-Tac-Toe with two parameters: n, the width and height of the board, and k, the number of same-colored pieces in a row (diagonally, horizontally, or vertically) a player needs to win. We chose Gomoku because it has a high branching factor, low memory usage, and a simple state evaluation computation. The central region of the board requires more computation than the borders.
    2. <state, move>: The state tracks which player moves next (whose turn it is) and the state of the board.
    3. availableMoves(): Returns a vector of all possible moves based on the state; for Gomoku, all remaining empty grid cells.
    4. playMove(Move m): Plays the specified move.
    5. undoMove(Move m): Undoes the specified move.
    6. Etc.
2. Solver Algorithms:
    1. Minimax Algorithm: finds the best next move in a game by minimizing the worst-case scenario. It searches the game tree to the specified depth, calculating the value of each child based on a heuristic, and determines the best possible move. Minimax can be parallelized by computing the moves (child nodes) in parallel. See the proposal below for a diagram.
    2. Alpha-Beta Algorithm: similar to minimax in goal and purpose. The key difference is that it prunes away (does not compute or evaluate) branches (moves) that are evidently worse, based on the results of previously computed branches. This pruning makes the algorithm inherently sequential, and we spent the bulk of our time parallelizing it well to achieve speedup.
    3. Principal Variation Splitting: PVS starts the parallel alpha-beta threads with a decent initial alpha-beta bound to yield more pruning. It does this by sequentially evaluating the leftmost child to obtain starting alpha-beta bounds, then evaluating the remaining children in parallel with those bounds. This approach assumes the leftmost child is a decent (and hopefully very good) move: the better the leftmost child, the more the parallel run of the remaining children can prune. The sequential step is not a serious bottleneck because PVS is recursive, evaluating the remaining children at each level in parallel, so our cores are never sitting idle.
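To make the solver side concrete, here is a minimal sketch of minimax with alpha-beta pruning over an explicit game tree. The `Node` struct and names are illustrative, not our exact code; our real solver walks the Game interface (availableMoves/playMove/undoMove) instead of a prebuilt tree.

```cpp
#include <algorithm>
#include <limits>
#include <vector>

// Illustrative stand-in for the Game interface: a leaf carries a heuristic
// value; an interior node's children are the positions reachable in one move.
struct Node {
    int value;                   // heuristic value if leaf
    std::vector<Node> children;  // empty => leaf
};

// Minimax with alpha-beta pruning; `maximizing` alternates per ply.
int alphaBeta(const Node& n, int alpha, int beta, bool maximizing) {
    if (n.children.empty()) return n.value;
    if (maximizing) {
        int best = std::numeric_limits<int>::min();
        for (const Node& c : n.children) {
            best = std::max(best, alphaBeta(c, alpha, beta, false));
            alpha = std::max(alpha, best);
            if (beta <= alpha) break;  // prune the remaining siblings
        }
        return best;
    } else {
        int best = std::numeric_limits<int>::max();
        for (const Node& c : n.children) {
            best = std::min(best, alphaBeta(c, alpha, beta, true));
            beta = std::min(beta, best);
            if (beta <= alpha) break;
        }
        return best;
    }
}
```

The pruning condition (`beta <= alpha`) is exactly what makes the algorithm sequential: whether a branch can be skipped depends on the bounds produced by the branches evaluated before it.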

## Approach

First, we designed the interfaces within which we worked: the game interface and the solver interface (see the Background section for details). From there, we implemented both the games and the solvers.

Initially, we wanted to implement chess because at each step there are many possible moves (a high branching factor), so it would benefit from minimax and alpha-beta pruning. We realized it would be a better use of our time to implement and improve the parallel solvers, especially the alpha-beta pruning algorithm, so we stopped implementing chess. However, we still needed a game on which to test the solvers (baseline and improvements). We first implemented Tic-Tac-Toe; because there weren't many turns to simulate, we opted for a more generalized version (n-in-a-row on an nxn board). But that game tends to tie quickly and the range of values for possible moves is narrow, so we turned to Othello, until we realized Othello also does not have a very high branching factor. Ultimately, we implemented Gomoku, an even more generalized version of Tic-Tac-Toe in which one wins with k in a row on an nxn board. We also had to design a heuristic (the evaluation function of a leaf node/move) to the best of our ability.
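A representative sketch of the kind of Gomoku heuristic involved (an assumed form for illustration, not our exact function) scores each maximal run of same-colored stones in the four line directions, weighting longer runs exponentially:

```cpp
#include <algorithm>
#include <vector>

// Illustrative Gomoku leaf heuristic (assumed form, not our exact function).
// `board` holds 0 (empty), 1, or 2; returns a score from `player`'s view.
int evaluate(const std::vector<std::vector<int>>& board, int player, int k) {
    const int n = (int)board.size();
    const int dr[4] = {0, 1, 1, 1};   // horizontal, vertical, two diagonals
    const int dc[4] = {1, 0, 1, -1};
    int score = 0;
    for (int r = 0; r < n; r++)
        for (int c = 0; c < n; c++)
            for (int d = 0; d < 4; d++) {
                int who = board[r][c];
                if (who == 0) continue;
                // Only score a run from its first stone, to avoid double counting.
                int pr = r - dr[d], pc = c - dc[d];
                if (pr >= 0 && pr < n && pc >= 0 && pc < n && board[pr][pc] == who)
                    continue;
                int len = 0, rr = r, cc = c;
                while (rr >= 0 && rr < n && cc >= 0 && cc < n && board[rr][cc] == who) {
                    len++; rr += dr[d]; cc += dc[d];
                }
                int w = 1;  // weight 10^(len-1), capped at the winning length k
                for (int i = 1; i < std::min(len, k); i++) w *= 10;
                score += (who == player) ? w : -w;
            }
    return score;
}
```

Because every board cell is scanned, a heuristic like this dominates the per-node cost of the search, which is also why copying vs. undoing moves made little difference in our timings.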

For the solvers, we started by implementing the baselines: sequential minimax, parallel minimax, sequential alpha-beta, and parallel alpha-beta. For the first parallel solvers, we used OpenMP with each thread statically assigned a chunk of the moves to evaluate. In this way, we achieved a 12x speedup (the number of processors on a latedays node) from sequential to parallel minimax. However, we didn't achieve such ideal speedup from sequential to parallel alpha-beta pruning (only 5.8x). (We also implemented alternative versions of each solver that either copied the game before playing a candidate move and evaluating its value, or played the move on the game directly, evaluated, then undid it. We expected the latter to reduce time spent copying arrays, but it made little difference, since the rest of the computation (the evaluation heuristic) dominated.)

Suspecting that our parallel alpha-beta had sub-optimal speedup because of contention over global variables (alpha, beta, best_score, best_move), we implemented another version of OpenMP parallel alpha-beta pruning that kept thread-local copies of those variables and combined them at the end, along with global and thread-local versions of parallel minimax, so there would be fewer writes to the same chunk of memory. Upon testing, we saw that the global and local versions of minimax consistently took the same amount of time, and realized there actually isn't much contention: the variables are updated relatively infrequently compared to the long tree evaluation computation that occurs between potential updates. We also noticed that the global alpha-beta pruning solver was consistently faster than the local one. This is because the local version does not prune away as many branches as the global version: each thread obtains a worse thread-local alpha-beta bound over its course than it would if it shared a global one. See Figure 1.
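The "global" variant at the root can be sketched as follows. This is a simplified sketch with illustrative names, not our exact implementation: `evalChild` stands in for the full-depth alpha-beta search of one root move, and a mutex stands in for our shared-variable updates.

```cpp
#include <functional>
#include <limits>
#include <mutex>
#include <vector>

struct RootResult { int score; int move; };

// Root split of parallel alpha-beta with a shared ("global") bound: a good
// result found by any thread immediately tightens pruning for the others.
RootResult parallelAlphaBetaRoot(
        const std::vector<int>& moves,
        const std::function<int(int move, int alpha)>& evalChild) {
    int bestScore = std::numeric_limits<int>::min();
    int bestMove = -1;
    std::mutex m;
    // Each thread is statically assigned a chunk of the root moves.
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < (int)moves.size(); i++) {
        int alpha;
        { std::lock_guard<std::mutex> g(m); alpha = bestScore; }  // shared bound
        int score = evalChild(moves[i], alpha);
        std::lock_guard<std::mutex> g(m);
        if (score > bestScore) { bestScore = score; bestMove = moves[i]; }
    }
    return {bestScore, bestMove};
}
```

The lock is taken only twice per root move, while `evalChild` does an entire subtree search in between, which is why we observed essentially no contention in practice. The thread-local variant replaces the shared `bestScore` read with a per-thread bound, trading contention (which turned out to be negligible) for weaker pruning.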

To further improve the parallelization and achieve greater speedup of alpha-beta pruning, we implemented principal variation splitting. We tried three methods of choosing the leftmost child (the move we hope is good enough to prune away more branches later): first move, random move, and best move (see the Results section for the differences in outcome among the three methods).
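At the root, the scheme can be sketched as below (a simplified sketch with illustrative names, not our exact code). This version shows the "best move" PV selector: `heuristic` stands in for our cheap leaf evaluation, and `evalChild` for the full-depth search of one root move under a given alpha bound.

```cpp
#include <algorithm>
#include <functional>
#include <limits>
#include <vector>

// Principal variation splitting at the root: evaluate the chosen PV move
// sequentially to seed the alpha bound, then search the remaining moves in
// parallel with that tighter bound.
int pvsRoot(std::vector<int> moves,
            const std::function<int(int move)>& heuristic,
            const std::function<int(int move, int alpha)>& evalChild) {
    // "Best move" PV selection: pick the move with the highest heuristic score.
    auto pv = std::max_element(moves.begin(), moves.end(),
        [&](int a, int b) { return heuristic(a) < heuristic(b); });
    std::swap(*moves.begin(), *pv);

    // Sequential step: the PV move seeds the alpha bound for everyone else.
    int alpha = evalChild(moves[0], std::numeric_limits<int>::min());

    int best = alpha;
    #pragma omp parallel for schedule(static)
    for (int i = 1; i < (int)moves.size(); i++) {
        int score = evalChild(moves[i], alpha);  // tighter bound => more pruning
        #pragma omp critical
        best = std::max(best, score);
    }
    return best;
}
```

The "first move" and "random move" selectors differ only in the `max_element` step (take `moves[0]` as-is, or swap in a uniformly random move); the rest of the structure is the same.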

## Results

For all of our evaluation, we used a 15x15 Gomoku game with 5-in-a-row to win. We played 10 consecutive turns, each searching to a depth of 3 in the game tree to find the best move according to minimax (correct pruning doesn't change the final move chosen, modulo ties). We ran these evaluations for each solving algorithm (sequential minimax, sequential alpha-beta, parallel alpha-beta, PVS, etc.) from each of two starting positions: an empty board (initial position) and an intermediate position (10 turns in). We measured wall clock time to complete the 10 turns; the goal was to minimize the ratio of parallel to sequential wall clock time.

Our baselines are the sequential minimax and sequential alpha-beta algorithms. For scale, it's interesting to note that sequential alpha-beta runs 12x faster than sequential minimax due to pruning. Parallel minimax implemented with OpenMP achieved ~12x speedup over sequential, as expected from the 12 physical cores on latedays. Our basic parallel alpha-beta implemented with OpenMP achieved 5.8x speedup over sequential alpha-beta, roughly half of the ideal speedup. See Figure 2.

Figure 2: Sequential vs. simple parallel (OpenMP) minimax and alpha-beta runtime

We implemented Principal Variation Splitting, an improvement on our simple parallel alpha-beta, to increase its pruning and reduce computation. The success of PVS depended on how the principal variation (the move evaluated sequentially to obtain an initial alpha-beta bound) was selected: the better the PV move, the better the bound it yields, leading to more pruning.

Choosing the first available move as the PV performed worse than the basic parallel alpha-beta, probably due to the extra initial sequential work of PVS. The first move does not yield a useful bound at the stages of the game we evaluated because it is in the corner of the board, too far from the existing pieces. Choosing a random move as the PV performed better, balancing out the overhead of PVS for performance equal to our basic parallel alpha-beta, but a randomly chosen move is still likely to be poor. Finally, using the leaf node heuristic to score the candidate moves and selecting the best one as the PV yielded a runtime about 20% shorter than the basic parallel alpha-beta, for a final best speedup of 7.2x over the sequential alpha-beta algorithm.

The results shown below all start from the initial (empty board) state. From the intermediate state, all of our PVS algorithms performed equal to or worse than the basic parallel alpha-beta, especially the heuristic selector. We hypothesize that the simple heuristic function we used gives an inadequate estimate of the value of intermediate positions, although it works reasonably well near the initial state of the game. See Figure 3.