The following describes two novel exploration-constrained variations of MCTS. We will refer to MCTS without these extensions as Vanilla MCTS.

Non-exploring MCTS. The search trees created by MCTS will barely be able to reach into the opponent's turn due to the complexity of the game. To overcome this, we have developed a variation of MCTS that uses a non-exploring tree policy, i.e. C = 0, in combination with deterministic rollouts. All children of a node are still visited at least once before any of them are expanded further. This ensures that a very limited form of exploration still occurs, and since rollouts are deterministic (controlled by a greedy policy), it is acceptable to value children based on only one visit. If stochastic rollouts were used instead, some branches would never be revisited when rollouts happen to return unlucky outcomes.

Bridge Burning MCTS. Another novel approach to MCTS in multi-action games is what we call Bridge Burning MCTS (BB-MCTS). This approach splits the time budget into a number of sequential phases equal to the number of actions in a turn. During each phase, the search functions as an ordinary MCTS search. However, at the end of each phase, all but the most promising child of the root are pruned and will never be added again. Another way to implement the same behavior is to treat the most promising child node as the root of the tree in the following phase. This approach is thus an aggressive progressive pruning strategy that enables the search to reach deeper plies, with the drawback of ignoring parts of the search space. The name Bridge Burning emphasizes that the pruned nodes can never be visited again. Figure 4 shows how nodes are pruned in three phases in a multi-action game with three actions in a turn.

For the tree-search methods, it makes sense to investigate the most promising moves first, and thus a method for sorting actions is useful. A simple way would be to evaluate the resulting game state of each action, but this is usually slow. Instead, attack and spell actions are rated by how much damage they will deal, and heal actions (including healing potions) by how many health points they will heal. Equip actions are rated by p_u × h_u / m_u, where p_u, h_u and m_u are the power (equivalent to damage output), health points and maximum health points of the equipped unit u, respectively. Movement actions are given a rating of 30 if the movement is to a special square type and 0 otherwise. If an enemy unit is removed from the game, the action is given a rating of 2 × m_u. If a knocked down unit u is healed, the rating of the action is m_u + |e_u| × 200, where |e_u| is the number of items carried by the healed unit u.

BB-MCTS is to some extent similar to an extreme implementation of sequential halving in which only one node survives and the number of phases is fixed to the number of actions.
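For illustration, the phase structure of BB-MCTS can be sketched as follows. This is a minimal sketch, not the implementation used in the experiments: the helpers run_iteration and best_child stand in for the usual MCTS machinery, and the node interface (an action attribute and a parent link) is assumed.

```python
import time

def bb_mcts(root, actions_per_turn, time_budget_s, run_iteration, best_child):
    """Bridge Burning MCTS: split the time budget into one phase per action
    in the turn; after each phase keep only the most promising root child.

    run_iteration(node): performs one ordinary MCTS iteration below `node`.
    best_child(node):    returns the most promising child of `node`.
    Both are supplied by the caller and stand in for the usual MCTS machinery.
    """
    phase_budget = time_budget_s / actions_per_turn
    plan = []
    for _ in range(actions_per_turn):
        deadline = time.time() + phase_budget
        while time.time() < deadline:
            run_iteration(root)        # ordinary MCTS during this phase
        chosen = best_child(root)      # most promising child of the root
        plan.append(chosen.action)     # assumes the node stores its incoming action
        chosen.parent = None           # "burn the bridge": siblings are gone for good
        root = chosen                  # the next phase searches below this child only
    return plan
```

This mirrors the second formulation above: instead of explicitly deleting siblings, the chosen child simply becomes the root of the next phase.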
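The action sorting heuristic can likewise be summarized as a single rating function. The sketch below assumes a simple object interface (attribute names such as kind, damage, target, max_hp and items are hypothetical), and it assumes that the 2 × m_u rating for removing an enemy unit overrides the plain damage rating; the actual implementation may differ in these details.

```python
def rate_action(action):
    """Rate an action for move ordering, mirroring the heuristic above.
    Attribute names are illustrative, not the interface used in the paper."""
    if action.kind in ("attack", "spell"):
        u = action.target                          # enemy unit being hit
        if action.removes_enemy_unit:              # assumed to override the damage rating
            return 2 * u.max_hp                    # 2 * m_u for removing an enemy unit
        return action.damage                       # damage the action will deal
    if action.kind == "heal":                      # includes healing potions
        u = action.target                          # friendly unit being healed
        if u.knocked_down:
            return u.max_hp + len(u.items) * 200   # m_u + |e_u| * 200
        return action.heal_amount                  # health points it will heal
    if action.kind == "equip":
        u = action.target                          # unit receiving the item
        return u.power * u.hp / u.max_hp           # p_u * h_u / m_u
    if action.kind == "move":
        return 30 if action.to_special_square else 0
    return 0                                       # other actions (e.g. swaps) get no bonus
```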
VI. EXPERIMENTAL SETUP

This section describes the experimental setup, which is the basis for the results presented in the next section. Each method played 100 games (as the Council team) against every other method, 50 games as the starting player and 50 games as the second player. Figure 1 shows the map used. In contrast to the original game, Hero AIcademy was configured to be deterministic and without hidden information in order to focus our experiments on the challenge of performing multiple actions. Each method was limited to one processor and had a time budget of six seconds per turn. The winning percentage of each matchup counts draws as half a win for each player. While the rules of the original Hero Academy do not include draws, in these experiments a draw was called if no winner was found within 100 rounds. The experiments were carried out on an Intel Core i7-3517U CPU with 4 × 1.90 GHz cores and 8 GB of RAM.

The specific configurations of the implemented game-playing methods were as follows:

Vanilla MCTS. The traditional UCT tree policy X̄_j + 2C × √(2 ln n / n_j) was employed with an exploration constant of C = 1/√2, where n is the number of visits to the parent and n_j the number of visits to child j. The default policy was ε-greedy with ε = 0.5. A transposition table was implemented with the descent-path-only backpropagation strategy; thus, values and visit counts are stored in edges [37]. In fact, n_j in the tree policy is extracted from the child edges instead of the nodes. Rollouts were depth-limited to one turn, after which the heuristic state evaluator described in Section IV was applied. Preliminary experiments showed that short rollouts are preferable to long rollouts for MCTS in Hero Academy, and that rollouts of just one turn give the best performance. Additionally, adding some domain knowledge to the default policy through the ε-greedy strategy improves performance: the ε-greedy policy selects the greedy action, i.e. the highest-rated action according to the action sorting method, with probability ε, and a random action otherwise.

BB-MCTS. Same as Vanilla MCTS, but with the Bridge Burning strategy added.

Non-exploring MCTS. Same as Vanilla MCTS, but with the exploration constant C = 0 and an ε-greedy default policy with ε = 1, such that rollouts are deterministic and follow the action sorting heuristic (Section V).

Online Evolutionary Planning (OEP). The population size was 100, with a survival rate of 0.5, a mutation probability of 0.1 and a uniform crossover operator. These parameters were found through prior experimentation.

The MCTS-based methods and OEP employ action pruning, which reduces the large search space of a turn by removing (pruning) redundant swap actions and sub-optimal spell actions from the set of available actions in a state. Swap actions are redundant if they swap the same kind of item (i.e. they produce the same outcome) and can thus be removed. Spell actions are sub-optimal if other spell actions cover the same or more enemy units.
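As a rough illustration of this pruning step, the sketch below removes duplicate swap actions and dominated spell actions. The attribute names (kind, item_kind, targets) are assumptions, and ties between spells with identical coverage are resolved here by keeping the first one; the paper does not specify that detail.

```python
def prune_actions(actions):
    """Sketch of action pruning: drop swap actions that swap the same kind of
    item and spell actions whose covered enemy units are matched or exceeded
    by another spell action. Attribute names are illustrative only."""
    spells = [frozenset(a.targets) for a in actions if a.kind == "spell"]
    kept, seen_swap_kinds, spell_i = [], set(), 0
    for a in actions:
        if a.kind == "swap":
            # Swaps of the same kind of item produce identical outcomes; keep one.
            if a.item_kind in seen_swap_kinds:
                continue
            seen_swap_kinds.add(a.item_kind)
        elif a.kind == "spell":
            cov, i = spells[spell_i], spell_i
            spell_i += 1
            # Sub-optimal if another spell covers the same or more enemy units
            # (when coverage is identical, keep the earliest such spell).
            if any(cov < other or (cov == other and j < i)
                   for j, other in enumerate(spells) if j != i):
                continue
        kept.append(a)
    return kept
```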
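Finally, the selection and default policies shared by the MCTS configurations above can be sketched as follows. In the actual implementation, values and visit counts live on edges of a transposition-table-backed graph, which this sketch does not model; with C = 1/√2 and ε = 0.5 the functions correspond to Vanilla MCTS, while C = 0 and ε = 1 give the deterministic, greedy rollouts of the non-exploring variant. The rate_action argument stands for the action sorting heuristic of Section V.

```python
import math
import random

def uct_value(mean_value, child_visits, parent_visits, C=1 / math.sqrt(2)):
    """UCT as used above: X_j + 2*C*sqrt(2*ln(n)/n_j).
    Unvisited children are given infinite value so they are tried first."""
    if child_visits == 0:
        return float("inf")
    return mean_value + 2 * C * math.sqrt(2 * math.log(parent_visits) / child_visits)

def epsilon_greedy_action(actions, rate_action, eps=0.5, rng=random):
    """Default (rollout) policy in the paper's convention: pick the
    highest-rated action with probability eps, otherwise a random action."""
    if rng.random() < eps:
        return max(actions, key=rate_action)
    return rng.choice(actions)
```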
VII. RESULTS

Table II shows the performance of each method. The non-exploring MCTS is the best-performing method, but with no significant difference to Online Evolutionary Planning (OEP) and BB-MCTS, which perform similarly. Vanilla MCTS plays on the same level as the Greedy Action baseline, which indicates that it is able to identify the action that gives the best immediate reward but is unable to search the space of possible action sequences sufficiently. All methods convincingly beat random search. A video of OEP playing against the Greedy Action baseline has been uploaded to YouTube.

A. Search Characteristic Comparison

To gain further insight into how the different methods explore the search space, we tracked the number of different action sequences each method is able to evaluate within the given time budget. Since many action sequences produce the same outcome, only the unique outcomes evaluated by each method were counted, and only those reached after taking five actions. The Greedy Turn search evaluated 579,912 unique outcomes on average during a turn, OEP evaluated on average 9,344 unique outcomes, and MCTS only 201. Each node at the fifth ply of the MCTS tree corresponds to one outcome, and the search only manages to expand the tree to a limited number of nodes at this depth. Further analysis reveals that the average depth of leaf nodes in the final MCTS trees is 4.86 plies, while the deepest leaf node of each tree reaches an average depth of 6.38 plies. This means that the search tree only just enters the opponent's turn, even though it runs an average of 258,488 iterations per turn. OEP ran an average of 3,693 generations each turn but appeared to get stuck in local optima quickly, as indicated by the low number of unique outcomes evaluated. These results suggest that OEP would play almost equally well with a much lower time budget, but also that there could be room to improve the algorithm itself.

B. Changing the Number of Actions

Multi-action games can take many forms, and Hero Academy is just one example. An additional experiment was therefore performed in which our methods were tested in variations of Hero Academy. The rules were altered by changing the number of action points (AP) per turn to 1, 3, 5, 10, 15, 20 and 25; increasing the number of action points increases the complexity of a turn exponentially. The time budget in this experiment was set to 2,000 ms and the turn limit to (5 / AP) × 100 rounds. Reaching the turn limit results in a draw, which is counted as half a win for both players. Only MCTS with our two variations, as well as OEP, are included in this experiment, as this makes for the most interesting comparison. The results, plotted in Figure 5, show the win percentage of OEP against each MCTS variation in 100 games. OEP handles the increased number of action points best, with a win rate of 55% or more against any of the other methods at 10 or more AP. This indicates that OEP has the best scalability in terms of complexity in multi-action games. Increasing the number of AP to 20 or more makes it possible to win the game in a few turns, which makes the outcome of the game highly dependent on who gets to start. Vanilla MCTS, however, does not show that it is able to identify these fast-win strategies.

C. Versus Human Players

One important and often overlooked experiment is to test game-playing algorithms against human players. Such com-