2016-11-20 58 views
0

我正在嘗試爲tictactoe實施q-learning。這樣做的其中一個步驟涉及列舉tictactoe板的所有可能狀態以形成狀態值表。我寫了一個從空板開始遞歸生成所有可能狀態的過程。爲此,我隱式執行搜索空間樹的預遍歷。然而,最後,我只得到707個獨特的州,而普遍的共識是,合法州的數量約爲5000.tictactoe搜索空間不會產生所有狀態的預訂探索

注意:我指的是合法狀態的數量。我知道,如果任何一名球員在比賽結束後被允許繼續比賽(我的意思是非法狀態),則狀態數量接近19,000。

CODE:

def generate_state_value_table(self, state, turn): 

    winner = int(is_game_over(state)) #check if, for the current turn and state, game has finished and if so who won 
    #print "\nWinner is ", winner 
    #print "\nBoard at turn: ", turn 
    #print_board(state) 
    self.add_state(state, winner/2 + 0.5) #add the current state with the appropriate value to the state table 
    open_cells = open_spots(state) #find the index (from 0 to total no. of cells) of all the empty cells in the board 

    #check if there are any empty cells in the board 
    if len(open_cells) > 0: 
     for cell in open_cells: 
      #pdb.set_trace() 
      row, col = cell/len(state), cell % len(state) 
      new_state = deepcopy(state) #make a copy of the current state 

      #check which player's turn it is 
      if turn % 2 == 0: 
       new_state[row][col] = 1 
      else: 
       new_state[row][col] = -1   

      #using a try block because recursive depth may be exceeded 
      try: 
       #check if the new state has not been generated somewhere else in the search tree 
       if not self.check_duplicates(new_state): 
        self.generate_state_value_table(new_state, turn+1) 
       else: 
        return 
      except: 
       #print "Recursive depth exceeded" 
       exit() 

    else: 
     return 

你可以看一下完整的代碼here如果你想。

編輯: 我整理了一些代碼在鏈接和這裏更多的意見,使事情更清晰。希望有所幫助。

回答

0

所以我終於解決了這個問題,我爲任何面臨類似問題的人提出了這個答案。該錯誤是我處理重複狀態的方式。如果生成的新狀態是在搜索樹的其他位置之前生成的,那麼它不應該添加到狀態表中,但是我犯的錯誤是在縮短預定義遍歷時發現重複狀態,一。

簡而言之:下面從代碼中刪除else子句送給我6046狀態數:

#check if the new state has not been generated somewhere else in the search tree 
if not self.check_duplicates(new_state): 
    self.generate_state_value_table(new_state, turn+1) 
else: 
    return 

此外,我停止探索搜索樹枝時,我遇到了一個狀態,其中有很明顯的優勝者。具體而言,添加以下代碼self.add_state(state, winner/2 + 0.5)後:

#check if the winner returned is one of the players and go back to the previous state if so 
if winner != 0: 
    return 

這給了我的狀態數爲5762這正是我一直在尋找。