
I am stuck on a study exercise. I am interested in reinforcement learning on a gridworld model: a 7x7 maze that the agent moves through. There are four directions: up, down, left and right (or N, E, S, W), so there are at most … possible policies. With a direct penalty for running into walls, many of them can be ruled out, and with discounted returns even fewer behaviours are acceptable; many policies differ only in what happens after the goal or are equally good. How can I write SARSA code in R for this gridworld model?

▼ States: the cells of the 7x7 grid, some of which are obstacles
▼ Reward: r = 1 if s = G, r = 0 for any admissible move, and r = -100 otherwise (e.g. running into a wall)
▼ Initialisation: Q0(a, s) ~ N(0, 0.01)
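For what it's worth, that initialisation could be written in R roughly as follows (my sketch, not part of the original post; the 7x7x4 array layout and reading 0.01 as a standard deviation are assumptions):

actions <- c("N", "S", "E", "W")
# One entry per (row, column, action), drawn from N(0, 0.01)
Q0 <- array(rnorm(7 * 7 * length(actions), mean = 0, sd = 0.01),
            dim = c(7, 7, length(actions)),
            dimnames = list(NULL, NULL, actions))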

To solve this model I wrote the R code below, but it does not work correctly.

Layout: 7x7, S: start state, G: terminal (goal) state, O: accessible state, X: wall

[O,O,G,X,O,O,S] 
[O,X,O,X,O,X,X] 
[O,X,O,X,O,O,O] 
[O,X,O,X,O,X,O] 
[O,X,O,O,O,X,O] 
[O,X,O,X,O,X,O] 
[O,O,O,X,O,O,O] 

So I would like to know how to write correct code for this gridworld model (not the attempt below) and how to solve it with SARSA.

actions <- c("N", "S", "E", "W") 

x <- 1:7 
y <- 1:7 

# Reward matrix: +1 at the goal G, -100 at walls X, 0 elsewhere
rewards <- matrix(c(
     0,    0,    1, -100,    0,    0,    0,
     0, -100,    0, -100,    0, -100, -100,
     0, -100,    0, -100,    0,    0,    0,
     0, -100,    0, -100,    0, -100,    0,
     0, -100,    0,    0,    0, -100,    0,
     0, -100,    0, -100,    0, -100,    0,
     0,    0,    0, -100,    0,    0,    0
), nrow = 7, byrow = TRUE)

values <- rewards # initial values 

states <- expand.grid(x=x, y=y) 

# Transition probability 
transition <- list("N" = c("N" = 0.8, "S" = 0, "E" = 0.1, "W" = 0.1), 
     "S"= c("S" = 0.8, "N" = 0, "E" = 0.1, "W" = 0.1), 
     "E"= c("E" = 0.8, "W" = 0, "S" = 0.1, "N" = 0.1), 
     "W"= c("W" = 0.8, "E" = 0, "S" = 0.1, "N" = 0.1)) 

# The value of an action (e.g. move north means y + 1) 
action.values <- list("N" = c("x" = 0, "y" = 1), 
     "S" = c("x" = 0, "y" = -1), 
     "E" = c("x" = 1, "y" = 0), 
     "W" = c("x" = -1, "y" = 0)) 

# act() function serves to move the robot through states based on an action 
act <- function(action, state) { 
    action.value <- action.values[[action]] 
    new.state <- state 
     if(state["x"] == 1 && state["y"] == 7 || (state["x"] == 1 && state["y"] == 3)) 
     return(state) 
    # 
    new.x = state["x"] + action.value["x"] 
    new.y = state["y"] + action.value["y"] 
    # Constrained by edge of grid 
    new.state["x"] <- min(x[length(x)], max(x[1], new.x)) 
    new.state["y"] <- min(y[length(y)], max(y[1], new.y)) 
    # 
    if(is.na(rewards[new.state["y"], new.state["x"]])) 
     new.state <- state 
    # 
    return(new.state) 
} 


rewards 

bellman.update <- function(action, state, values, gamma=1) { 
    state.transition.prob <- transition[[action]] 
    q <- rep(0, length(state.transition.prob)) 
    for(i in 1:length(state.transition.prob)) {   
     new.state <- act(names(state.transition.prob)[i], state) 
     q[i] <- (state.transition.prob[i] * (rewards[state["y"],  state["x"]] + (gamma * values[new.state["y"], new.state["x"]]))) 
    } 
    sum(q) 
} 

value.iteration <- function(states, actions, rewards, values, gamma, niter,  n) { 
    for (j in 1:niter) { 
     for (i in 1:nrow(states)) { 
      state <- unlist(states[i,]) 
      if(i %in% c(7, 15)) next # terminal states 
      q.values <- as.numeric(lapply(actions, bellman.update,  state=state, values=values, gamma=gamma)) 
      values[state["y"], state["x"]] <- max(q.values) 
     } 
    } 
    return(values) 
} 

final.values <- value.iteration(states=states, actions=actions,  rewards=rewards, values=values, gamma=0.99, niter=100, n=10) 

final.values 

Answer


The problem is that your penalty is much larger than your reward. The agent may prefer to throw itself against a wall rather than try to reach the reward. This happens because the state-action values converge to very low numbers, even below -100, depending on the returns collected along the way.
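A rough back-of-the-envelope check makes this concrete (my illustration, not part of the original answer, using the 0.8/0.1/0.1 transition noise and the -100 wall reward from the question's code):

# Expected one-step reward while walking along a corridor with a wall on one side:
# the intended move succeeds with probability 0.8, each lateral slip happens with 0.1,
# and one of the slips lands on a -100 cell.
p <- c(intended = 0.8, slip_left = 0.1, slip_right = 0.1)
r <- c(intended = 0, slip_left = -100, slip_right = 0)
sum(p * r)  # -10 per step, which dwarfs the +1 available at the goal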

Here is the model I simulated with value iteration (which gives the values that SARSA should converge to):

[image: state values obtained by value iteration on the simulated gridworld]

The table in the picture shows the state values of the model, but it is upside down (I have not fixed that).

In this case I set the rewards and penalties to values very similar to your model: -15 for a wall state, 1.0 for the ball (the goal), and -100 for a block. The agent gets 0.0 for every other action, and the transition probabilities are the same.

The agent has to reach the ball, but as you can see the state values converge to very small numbers; the states next to the ball have low values, so the agent would rather never reach the goal.

To fix your problem, try reducing the penalties.
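Since the question also asks how to solve the model with SARSA, here is a minimal tabular SARSA sketch in R (my illustration, not the answerer's code). It keeps the maze and the Q0 ~ N(0, 0.01) initialisation from the question but, following the advice above, uses a mild -1 bump penalty instead of -100, deterministic moves, and an epsilon-greedy policy; these simplifications are my assumptions.

set.seed(1)

# Grid layout copied from the question: "X" = wall, "G" = goal, "S" = start, "O" = free
grid <- matrix(c(
  "O","O","G","X","O","O","S",
  "O","X","O","X","O","X","X",
  "O","X","O","X","O","O","O",
  "O","X","O","X","O","X","O",
  "O","X","O","O","O","X","O",
  "O","X","O","X","O","X","O",
  "O","O","O","X","O","O","O"
), nrow = 7, byrow = TRUE)

actions <- c("N", "S", "E", "W")
moves   <- list(N = c(-1, 0), S = c(1, 0), E = c(0, 1), W = c(0, -1))  # row 1 = top of maze

# Q-table initialised as in the question: Q0(a, s) ~ N(0, 0.01)
Q <- array(rnorm(7 * 7 * 4, mean = 0, sd = 0.01), dim = c(7, 7, 4))

env.step <- function(state, a) {
  # Deterministic move; bumping into a wall or the border keeps the agent in place
  cand <- state + moves[[a]]
  blocked <- any(cand < 1) || any(cand > 7) || grid[cand[1], cand[2]] == "X"
  if (blocked) return(list(state = state, reward = -1, done = FALSE))  # mild penalty instead of -100
  if (grid[cand[1], cand[2]] == "G") return(list(state = cand, reward = 1, done = TRUE))
  list(state = cand, reward = 0, done = FALSE)
}

eps.greedy <- function(state, eps) {
  if (runif(1) < eps) sample(1:4, 1) else which.max(Q[state[1], state[2], ])
}

alpha <- 0.1; gamma <- 0.99; eps <- 0.1
start <- c(1, 7)  # the "S" cell in the top-right corner

for (episode in 1:3000) {
  s <- start
  a <- eps.greedy(s, eps)
  repeat {
    out <- env.step(s, actions[a])
    s2  <- out$state
    a2  <- eps.greedy(s2, eps)
    # SARSA update: Q(s,a) <- Q(s,a) + alpha * (r + gamma * Q(s',a') - Q(s,a))
    target <- out$reward + (if (out$done) 0 else gamma * Q[s2[1], s2[2], a2])
    Q[s[1], s[2], a] <- Q[s[1], s[2], a] + alpha * (target - Q[s[1], s[2], a])
    if (out$done) break
    s <- s2; a <- a2
  }
}

# Greedy state values learned by SARSA (roughly comparable to the value-iteration table)
round(apply(Q, c(1, 2), max), 2)

With the penalty kept small relative to the goal reward, the learned values stay positive along the corridor to G, so the greedy policy actually walks to the goal instead of avoiding it.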