Preorder traversal of tic-tac-toe search space not generating all states

I am trying to implement Q-learning for tic-tac-toe. One of the steps involves enumerating all possible states of the board to build the state-value table. I wrote a procedure that recursively generates all possible states starting from the empty board; to do this, it implicitly performs a pre-order traversal of the search space tree. However, at the end I get only 707 unique states, while the general consensus is that the number of legal states is around 5,000.

NOTE: I am referring to the number of legal states. I am aware that the count is close to 19,000 if either player is allowed to keep playing after the game has ended (which is what I mean by illegal states).
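
For reference, here is a small standalone counter, separate from my class below, that should reproduce the ~5,000 figure. It assumes play stops as soon as either player completes a line; the names in it (count_legal_states, winner, expand) are only for this sketch and are not part of my code. It should come out to 5478 reachable legal positions, counting the empty board:

# Standalone sanity check (not part of the generate_state_value_table code below):
# brute-force count of legal tic-tac-toe states, assuming the game stops as soon
# as either player completes three in a row.
def count_legal_states():
    lines = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
             (0, 3, 6), (1, 4, 7), (2, 5, 8),
             (0, 4, 8), (2, 4, 6)]

    def winner(board):
        # return 1 or -1 if that player has three in a row, else 0
        for a, b, c in lines:
            if board[a] != 0 and board[a] == board[b] == board[c]:
                return board[a]
        return 0

    seen = set()

    def expand(board, player):
        seen.add(board)
        if winner(board) != 0:
            return                      # game over: no further moves from here
        for i in range(9):
            if board[i] == 0:
                child = board[:i] + (player,) + board[i + 1:]
                if child not in seen:   # skip duplicates, but keep trying the other cells
                    expand(child, -player)

    expand((0,) * 9, 1)                 # player 1 moves first on the empty board
    return len(seen)

print(count_legal_states())             # expected output: 5478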
CODE:
def generate_state_value_table(self, state, turn):
    winner = int(is_game_over(state))  #check if, for the current turn and state, game has finished and if so who won
    #print "\nWinner is ", winner
    #print "\nBoard at turn: ", turn
    #print_board(state)
    self.add_state(state, winner/2 + 0.5)  #add the current state with the appropriate value to the state table
    open_cells = open_spots(state)  #find the index (from 0 to total no. of cells) of all the empty cells in the board
    #check if there are any empty cells in the board
    if len(open_cells) > 0:
        for cell in open_cells:
            #pdb.set_trace()
            row, col = cell/len(state), cell % len(state)
            new_state = deepcopy(state)  #make a copy of the current state
            #check which player's turn it is
            if turn % 2 == 0:
                new_state[row][col] = 1
            else:
                new_state[row][col] = -1
            #using a try block because recursive depth may be exceeded
            try:
                #check if the new state has not been generated somewhere else in the search tree
                if not self.check_duplicates(new_state):
                    self.generate_state_value_table(new_state, turn+1)
                else:
                    return
            except:
                #print "Recursive depth exceeded"
                exit()
    else:
        return
You can take a look at the full code here if you want.

EDIT: I have tidied up some of the code at the link and added more comments to make things clearer. Hope that helps.