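makeCluster cannot open the connection: error-handling strategy?
An automated system launches an R script that uses makeCluster to open a 35-node cluster on a machine with 36 CPUs (an AWS c4.8xlarge running the latest Ubuntu and R).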
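library(parallel)  # provides makeCluster (the snow package exports it too)
n.nodes <- 35
cl <- makeCluster(n.nodes,
                  outfile = "debug.txt")  # worker output is redirected to this file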
The following error, as written to debug.txt, shows up fairly regularly:
starting worker pid=2017 on localhost:11823 at 21:15:57.390
Error in socketConnection(master, port = port, blocking = TRUE, open = "a+b", :
cannot open the connection
Calls: <Anonymous> ... doTryCatch -> recvData -> makeSOCKmaster -> socketConnection
In addition: Warning message:
In socketConnection(master, port = port, blocking = TRUE, open = "a+b", :
localhost:11823 cannot be opened
Execution halted
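The PID and port number are specific to the session. Once this error occurs, the program cannot continue.
Question 1: Is there an error-handling approach that can detect this situation and try to create the cluster again?
Note: the following does not work: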
library(R.utils)  # provides evalWithTimeout
attempt <- 0
while (nrow(showConnections()) < n.nodes && attempt < 25) {  # 25 chances to create n.nodes connections
  print(attempt)
  closeAllConnections()                       # close any open connections
  portnum <- round(runif(1, 11000, 11998))    # randomly choose a port
  tryCatch({                                  # try to create the cluster
    evalWithTimeout({
      cl <- makeCluster(n.nodes,
                        outfile = "debug.txt",
                        port = portnum)
    }, timeout = 120)                         # give it two minutes, then stop trying
  }, TimeoutException = function(x) {
    print(paste("Failed to Create Cluster", portnum))  # on timeout, print the port that was tried
  })
  attempt <- attempt + 1                      # update the attempt counter
  Sys.sleep(2)                                # take a breather
}
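For reference, here is the kind of retry behaviour Question 1 is asking for, as a minimal sketch. It assumes that makeCluster signals an ordinary R error in the master session when the workers fail to connect. That may not hold here, since the message in debug.txt is written by a worker process rather than by the master. The helper name make_cluster_with_retry and its arguments are hypothetical; the `make` argument exists only so the cluster constructor can be swapped out.

```r
library(parallel)

# Hypothetical helper: call make(n, port = ..., ...) until it succeeds
# or the attempts run out. `make` defaults to parallel::makeCluster.
make_cluster_with_retry <- function(n, max_attempts = 25, wait = 2,
                                    ports = 11000:11998,
                                    make = parallel::makeCluster, ...) {
  for (attempt in seq_len(max_attempts)) {
    port <- sample(ports, 1)  # pick a fresh random port on every attempt
    cl <- tryCatch(
      make(n, port = port, ...),
      error = function(e) {
        # Only errors raised in the master session end up here
        message("Attempt ", attempt, " on port ", port, " failed: ",
                conditionMessage(e))
        NULL
      }
    )
    if (!is.null(cl)) return(cl)
    Sys.sleep(wait)  # short pause before the next attempt
  }
  stop("Could not create a cluster after ", max_attempts, " attempts")
}
```

Intended usage would be cl <- make_cluster_with_retry(35, outfile = "debug.txt"). If makeCluster blocks instead of raising an error, this wrapper never gets a chance to retry, which is the situation the question describes.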
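Question 2: If there is no way to retry cluster creation automatically, is there a way to check whether the port can be opened before running makeCluster?
Note: the system must be fully automated and self-contained. It must detect the error, handle or resolve the problem, and then carry on without manual intervention.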
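For Question 2, one possible way to probe a port from within R is to try binding a listening socket to it and close it again straight away. This is only a sketch: it assumes R 4.0.0 or later, where base R provides serverSocket, and the helper names port_is_free and find_free_port are hypothetical. The check is also inherently racy, because another process could take the port between the probe and the call to makeCluster.

```r
# Hypothetical helper: TRUE if a listening socket can be bound to `port`.
# Assumes R >= 4.0.0, which provides base::serverSocket().
port_is_free <- function(port) {
  con <- tryCatch(
    suppressWarnings(serverSocket(port)),
    error = function(e) NULL  # binding failed, so treat the port as unavailable
  )
  if (is.null(con)) return(FALSE)
  close(con)  # release the port again immediately
  TRUE
}

# Hypothetical helper: first free port among `candidates`, tried in random order.
# Returns NA_integer_ if none of the candidates could be bound.
find_free_port <- function(candidates = 11000:11998) {
  for (port in sample(candidates)) {
    if (port_is_free(port)) return(port)
  }
  NA_integer_
}
```

The result could then be passed on as makeCluster(n.nodes, outfile = "debug.txt", port = find_free_port()). A free port does not guarantee that all 35 workers connect, so a probe like this reduces the failure rate rather than replacing error handling.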