Since the board configuration is only one part of the game state, and items and cards are not considered, the state-space complexity of Hero Academy is much larger than our estimated lower bound. As a comparison, Chess has a state-space complexity of 10^43 [34]. Hero Academy also introduces hidden information, as the opponent's cards are unknown, as well as the order of cards in the deck. Stochastic games with hidden information can be approached with various forms of determinization methods [35], [36]. In this paper, we ignore the aspects of hidden information and randomness, since we are only interested in ways to deal with the complexities of multi-action games.

IV. ONLINE EVOLUTIONARY PLANNING

In this section we present an evolutionary algorithm that, inspired by rolling horizon evolution, evolves strategies while it plays the game. We call this algorithm Online Evolutionary Planning (OEP), and we have implemented it to play Hero Academy, where it aims to evolve optimal action sequences every turn. Each genome in a population thus represents a sequence of five actions. An exhaustive search is not able to explore the entire space of action sequences within a reasonable time frame and may miss many interesting choices. An evolutionary algorithm, on the other hand, can explore the search space in a very different way. An overview of our online evolutionary planning algorithm is presented next; it is also shown in pseudocode (Algorithm 1).

Evolutionary algorithms iteratively optimize an initially randomized population of candidate solutions (genomes) through recombination and selection based on elitism. When applied to planning in Hero Academy games, the genotype of each genome in the population is a vector of five actions, where each action is represented by its type and, if needed, one or more locations. An example of a genotype is:

[Move((0, 4) → (2, 4)), Heal((2, 4) → (4, 4)), Heal((2, 4) → (4, 4)), Attack((4, 0) → (6, 1)), Deploy(0 → (0, 4))]

which is also visualized in Figure 2. Note that identical attack or heal actions can be repeated to deal more damage or gain more health. Locations given in two dimensions refer to squares on the board (counted from the top-left square), while locations given in one dimension refer to cards on the hand (counted from left to right). The phenotype is the resulting game state after taking these actions in the current game state.

The initial population is composed of random genomes, which are created by repeatedly selecting random actions based on the given forward model until no more action points are left. After the creation of the initial population, the population is improved over a large number of generations until a given time budget is exhausted.

In this paper, Hero AIcademy itself serves as the forward model, and the fitness of an action sequence is calculated as the difference between the values of both players' units. Both the units on the game board and those still at the players' disposal are taken into account. The assumption behind this particular fitness function is that the difference in unit values serves as a good indicator of which player is more likely to win. In more detail, the value v(u) of a unit u is calculated from u_hp, the unit's number of health points, sq(u), a bonus based on the type of square the unit stands on, and eq(u), a bonus that depends on the unit's equipment. The exact modifiers are shown in Table I.
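To make the fitness function concrete, the following Python sketch illustrates the unit-difference heuristic under stated assumptions: the Unit record and all modifier values (SQUARE_BONUS, EQUIPMENT_BONUS, KNOCKED_DOWN_PENALTY) are illustrative placeholders, since Table I and the exact form of v(u) are not reproduced in this text.

from dataclasses import dataclass, field
from typing import List

@dataclass
class Unit:
    hp: int                               # u_hp: current health points
    square_type: str                      # type of square the unit stands on
    equipment: List[str] = field(default_factory=list)
    knocked_down: bool = False

# Placeholder modifiers; the actual values are listed in Table I of the paper.
SQUARE_BONUS = {"plain": 0, "power": 40, "defense": 30}
EQUIPMENT_BONUS = {"sword": 40, "shield": 20, "scroll": 50}
KNOCKED_DOWN_PENALTY = 100

def unit_value(u: Unit) -> int:
    value = u.hp                                                   # u_hp
    value += SQUARE_BONUS.get(u.square_type, 0)                    # sq(u)
    value += sum(EQUIPMENT_BONUS.get(e, 0) for e in u.equipment)   # eq(u)
    if u.knocked_down:                                             # up(u): negative reward for knocked-down units
        value -= KNOCKED_DOWN_PENALTY
    return value

def fitness(own_units: List[Unit], opponent_units: List[Unit]) -> int:
    # Difference between the players' summed unit values; both units on the
    # board and units still at a player's disposal are included.
    return sum(unit_value(u) for u in own_units) - sum(unit_value(u) for u in opponent_units)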
The modifying term up(u) gives a negative reward for knocked-down units.

After the evaluation, the genomes with the lowest scores are removed from the population. Each of the remaining genomes is paired with another randomly selected genome, creating a new offspring through uniform crossover. Figure 3 shows the crossover between two example action sequences in Hero Academy. Because a naive crossover can lead to illegal action sequences, the crossover checks the legality of a move when two sequences are combined. For example, to move a unit from a certain position on the board, a unit must in fact stand on that particular square. However, this precondition might not be fulfilled due to an earlier action in the sequence. To avoid such situations, an action is only taken from a randomly chosen parent if it can be performed legally; otherwise the action is taken from the other parent. If neither of the two actions results in a legal move, the next action in the parent's sequence is chosen. In case this fails as well, a completely random available action is selected.

Additionally, a certain proportion of the created offspring is mutated to introduce new actions into the population. Mutation changes one randomly chosen action to another legal action. If this results in an illegal action sequence, the remaining part of the sequence is changed to random but legal actions. Legality is checked by requesting the available actions from the framework at each state while traversing the offspring's action sequence. A sketch of this legality-aware crossover and mutation repair is given at the end of this section.

To incorporate information about possible counter moves, attempts were made to base the heuristic on rollouts. Here, fitness is determined by performing one rollout with a depth limit of five actions, corresponding to one turn. When a genome is tested more than once (because it has survived several generations in the population), the lowest value found across the genome's evaluations is used. The rating of an action sequence thus depends on the best known counter-move. Because our experiments did not show any significant difference between a stochastic rollout as a fitness measure and a static evaluation, the latter was chosen for the experiments in this paper. The large branching factor in this game is possibly the reason why evaluations are unreliable when based on a low number of rollouts.
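As a concrete illustration of the legality-aware crossover and mutation repair described above, the following Python sketch shows one possible implementation. The forward-model interface used here (copy, is_legal, apply, available_actions), the mutation probability, and the exact fallback order among candidate actions are hypothetical stand-ins, not the actual Hero AIcademy API.

import random

ACTIONS_PER_TURN = 5   # action points available in one Hero Academy turn

def crossover(state, parent_a, parent_b):
    # Uniform crossover that only accepts an action if it is still legal given
    # the actions already placed earlier in the offspring's sequence.
    offspring = []
    current = state.copy()
    for i in range(ACTIONS_PER_TURN):
        first, second = random.sample([parent_a, parent_b], 2)
        # Candidate order: the chosen parent's action, the other parent's
        # action, then the next action in each parent's sequence (this fallback
        # order is an assumption), and finally a random legal action.
        candidates = [first[i], second[i]]
        if i + 1 < ACTIONS_PER_TURN:
            candidates += [first[i + 1], second[i + 1]]
        chosen = next((a for a in candidates if current.is_legal(a)), None)
        if chosen is None:
            chosen = random.choice(current.available_actions())
        offspring.append(chosen)
        current = current.apply(chosen)
    return offspring

def mutate(state, genome, mutation_probability=0.1):
    # Change one randomly chosen action to another legal action; if the rest of
    # the sequence becomes illegal, repair it with random but legal actions.
    if random.random() >= mutation_probability:
        return list(genome)
    genome = list(genome)
    idx = random.randrange(ACTIONS_PER_TURN)
    current = state.copy()
    for i in range(ACTIONS_PER_TURN):
        if i == idx or not current.is_legal(genome[i]):
            genome[i] = random.choice(current.available_actions())
        current = current.apply(genome[i])
    return genome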
V. TREE SEARCH

A game tree can be described as an acyclic directed graph with the root node being the current game state. Edges in the graph represent available actions in the game that lead from one state to other hypothetical future game states. Therefore, the number of edges from a node corresponds to the number of actions available to the active player in that game state. Nodes also have values assigned to them, where higher values indicate more desirable game situations. For many adversarial games, in which the utility of one agent is the opposite of the other's, agents take turns and thus the active player alternates between plies of the tree. In these situations, the well-known Minimax algorithm can be applied. However, in Hero Academy players can take several actions before the end of their turn. A potential tree-search setup for Hero Academy could encode multiple actions as a joint action (i.e., an array of actions) assigned to each edge. However, because of the high number of possible permutations and the resulting increase in branching factor, we decided to model each action as its own node, essentially trading tree breadth for depth.

In the following, we present five game-playing tree-search methods for Hero Academy. Two of these are simple game-tree-based methods, which were used as baselines, followed by MCTS, including two novel variations.

Greedy Action. The greedy search among actions is the simplest of the developed methods. Greedy Action performs a one-ply search among all possible actions and, based on the employed heuristic (Equation 2), selects the action that leads to the most promising game state. The search is invoked five times to complete a turn.

Greedy Turn. The greedy search among turns performs a five-ply depth-first search, which corresponds to a full turn. The same heuristic as for Greedy Action rates all leaf-node states, and the action sequence that leads to the highest-rated state is selected. A transposition table keeps track of already visited game states so that they are not visited again. Because this search is usually not exhaustive, choosing which actions to try first (i.e., action sorting) is critical.

Vanilla MCTS. Following the two greedy search variants, the vanilla MCTS method was implemented with an action-based approach. In other words, one ply in the tree represents an action, not a turn. Therefore, a search with a depth of five has to be performed to reach the beginning of the opponent's turn. Our vanilla MCTS agent implements the four phases described in Section II-A, following the traditional MCTS algorithm using UCB [14]. The standard MCTS backpropagation was modified to handle two players with multiple actions. Our approach [15] is an extension of the BackupNegamax [14] algorithm (see Algorithm 2); a sketch of the idea is given at the end of this section. It uses a list of edges reflecting the traversal during the selection phase, a Δ value corresponding to the result of the simulation phase, and a boolean p1 that is true if player one is the max player and false otherwise. An ε-greedy approach is employed in the rollouts, combining random play with the highest-rated action based on the amount of damage dealt or healed. Rollouts that stop prior to a terminal state are evaluated by our heuristic (Equation 2).

As players in Hero Academy have to select five actions, two different approaches were tried. In the first approach, the agent is invoked five times sequentially, with each invocation using a fifth of the total time budget, after which the actions are executed in the order found. However, another approach was found to be superior: the entire time budget is used for a single search, after which the tree is traversed from the root by selecting the best five nodes (actions) until the opponent's turn is reached. This allows the tree to expand as much as possible before actions are selected, so that more complete options are considered within the time budget.
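The following Python sketch illustrates the idea behind the modified backpropagation for two players with multiple actions per turn. Since Algorithm 2 is not reproduced in this text, the Node and Edge records and the exact bookkeeping are assumptions: rather than negating the value at every ply as in plain BackupNegamax, the sign of Δ here depends on which player performed the action on each traversed edge.

from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    visits: int = 0
    children: List["Edge"] = field(default_factory=list)

@dataclass
class Edge:
    action: object                  # the single action this edge represents
    to_node: Node
    acting_player_is_p1: bool       # which player performs this action
    visits: int = 0
    total_value: float = 0.0

def backup(traversed_edges: List[Edge], delta: float, p1: bool) -> None:
    # Update every edge visited during the selection phase with the rollout
    # result delta, seen from the max player's perspective.
    for edge in traversed_edges:
        edge.visits += 1
        edge.to_node.visits += 1
        if edge.acting_player_is_p1 == p1:
            edge.total_value += delta    # action taken by the max player
        else:
            edge.total_value -= delta    # action taken by the min player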
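Once the entire time budget has been spent on a single search, a full turn can be extracted directly from the tree. The sketch below (reusing the Node and Edge records from the sketch above) walks down from the root and picks the most promising child at each of the five levels; selecting children by visit count is an assumption here, and selecting by mean value would work similarly.

def extract_turn(root: Node, actions_per_turn: int = 5) -> list:
    # Traverse the tree from the root, selecting the best child edge until the
    # beginning of the opponent's turn (five actions) is reached.
    actions = []
    node = root
    for _ in range(actions_per_turn):
        if not node.children:
            break                        # the tree was not expanded this deep
        best = max(node.children, key=lambda e: e.visits)
        actions.append(best.action)
        node = best.to_node
    return actions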