2011-08-08

Quis custodiet ipsos custodes? (Decimus Iunius Iuvenalis): remote nodes keeping processes alive

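I have the following setup:

A server process runs on one node ('[email protected]') and has a watchdog running on another node ('[email protected]'). When the server starts up, it starts its watchdog on the remote node. When the server exits ungracefully, the watchdog starts the server again. When the watchdog exits, the server starts it again.

The server is started as part of the runlevel, after the network is up.

The server also monitors the remote node and starts a watchdog as soon as it (i.e. the node) comes online. A connection loss between server and watchdog can have two causes: first, the network may go down; second, the node may crash or be killed.

My code seems to work, but I have a slight suspicion that the following is happening: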

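  • When the watchdog node is shut down (or killed, or crashes) and then restarted, the server correctly restarts its watchdog.
  • But when the network fails and the watchdog node keeps running, the server starts a new watchdog once the connection is re-established and leaves a zombie watchdog behind.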

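My questions are:

  • (A) Am I creating a zombie?
  • (B) After a network loss, how can the server check whether the watchdog is still alive (and vice versa)?
  • (C) If B is possible, how do I reconnect the old server and the old watchdog?
  • (D) What other major (and minor) flaws do you, esteemed reader, see in my setup?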

Edit: The die and kill_dog messages are only there to fake ungraceful exits and will not make it beyond debugging.

Here goes the code:


-module (watchdog). 
-compile (export_all). 

init() -> 
    io:format ("Watchdog: Starting @ ~p.~n", [node() ]), 
    process_flag (trap_exit, true), 
    loop(). 

loop() -> 
    receive 
     die -> 1/0; 
     {'EXIT', _, normal} -> 
      io:format ("Watchdog: Server shut down.~n"); 
     {'EXIT', _, _} -> 
      io:format ("Watchdog: Restarting server.~n"), 
      spawn ('[email protected]', server, start, []); 
     _ -> loop() 
    end. 

-module (server). 
-compile (export_all). 

start() -> 
    io:format ("Server: Starting up.~n"), 
    register (server, spawn (fun init/0)). 

stop() -> 
    whereis (server) ! stop. 

init() -> 
    process_flag (trap_exit, true), 
    monitor_node ('[email protected]', true), 
    loop (down, none). 

loop (Status, Watchdog) -> 
    {NewStatus, NewWatchdog} = receive 
     die -> 1/0; 
     stop -> {stop, none}; 
     kill_dog -> 
      Watchdog ! die, 
      {Status, Watchdog}; 
     {nodedown, '[email protected]'} -> 
      io:format ("Server: Watchdog node has gone down.~n"), 
      {down, Watchdog}; 
     {'EXIT', Watchdog, noconnection} -> 
      {Status, Watchdog}; 
     {'EXIT', Watchdog, Reason} -> 
      io:format ("Server: Watchdog has died of ~p.~n", [Reason]), 
      {Status, spawn_link ('[email protected]', watchdog, init, []) }; 
     _ -> {Status, Watchdog} 
    after 2000 -> 
     case Status of 
      down -> checkNode(); 
      up -> {up, Watchdog} 
     end 
    end, 
    case NewStatus of 
     stop -> ok; 
     _ -> loop (NewStatus, NewWatchdog) 
    end. 

checkNode() -> 
    net_adm:world(), 
    case lists:any (fun (Node) -> Node =:= '[email protected]' end, nodes()) of 
     false -> 
      io:format ("Server: Watchdog node is still down.~n"), 
      {down, none}; 
     true -> 
      io:format ("Server: Watchdog node has come online.~n"), 
      monitor_node ('[email protected]', true), 
      Watchdog = spawn_link ('[email protected]', watchdog, init, []), 
      {up, Watchdog} 
    end. 

Answers

Registering the watchdog with the global module should address your concern:

watchdog.erl:

-module (watchdog). 
-compile (export_all). 

init() -> 
    io:format ("Watchdog: Starting @ ~p.~n", [node() ]), 
    process_flag (trap_exit, true), 
    global:register_name (watchdog, self()), 
    loop(). 

loop() -> 
    receive 
     die -> 1/0; 
     {'EXIT', _, normal} -> 
      io:format ("Watchdog: Server shut down.~n"); 
     {'EXIT', _, _} -> 
      io:format ("Watchdog: Restarting server.~n"), 
      spawn ('[email protected]', server, start, []); 
     _ -> loop() 
    end. 
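
One detail worth handling here: global:register_name/2 returns yes or no instead of crashing when the name is already taken, so a freshly spawned watchdog can notice that an older one still holds the name and back off rather than become a duplicate. A minimal sketch that runs on a single node; the module name watchdog_register_sketch and the functions claim/0 and demo/0 are made up for illustration and are not part of the code above:

```erlang
-module(watchdog_register_sketch).
-export([claim/0, demo/0]).

%% Try to claim the global name `watchdog` for the calling process.
%% Returns {ok, Self} on success, or {already_running, Pid} if another
%% process already holds the name.
claim() ->
    case global:register_name(watchdog, self()) of
        yes -> {ok, self()};
        no -> {already_running, global:whereis_name(watchdog)}
    end.

%% Hypothetical demo: the first process gets the name, the second one
%% is told who already holds it.
demo() ->
    Parent = self(),
    First = spawn(fun() ->
                      Parent ! {first, claim()},
                      receive stop -> ok end
                  end),
    R1 = receive {first, A} -> A end,
    _ = spawn(fun() -> Parent ! {second, claim()} end),
    R2 = receive {second, B} -> B end,
    %% Clean up so the demo can be run again.
    global:unregister_name(watchdog),
    First ! stop,
    {R1, R2}.
```

In the watchdog, init/0 could use the same check and simply return when the answer is no, leaving the already registered process in charge.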

server.erl:

checkNode() -> 
    net_adm:world(), 
    case lists:any (fun (Node) -> Node =:= '[email protected]' end, nodes()) of 
     false -> 
      io:format ("Server: Watchdog node is still down.~n"), 
      {down, none}; 
     true -> 
      io:format ("Server: Watchdog node has come online.~n"), 
      global:sync(), %% not sure if this is necessary 
      case global:whereis_name (watchdog) of 
       undefined -> 
        io:format ("Watchdog process is dead.~n"), 
        Watchdog = spawn_link ('[email protected]', watchdog, init, []); 
       Watchdog -> 
        io:format ("Watchdog process is still alive.~n") 
      end, 
      {up, Watchdog} 
    end. 
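
On the link question raised in the comment below: as far as I know, a link does not survive a lost connection. Both ends receive an exit signal with reason noconnection and the link is gone afterwards, so a watchdog that survived the outage has to be linked again explicitly. A hedged sketch of that idea; the module watchdog_relink_sketch and ensure_watchdog/1 are hypothetical names, and StartFun is an assumed zero-arity fun that spawns and links a new watchdog:

```erlang
-module(watchdog_relink_sketch).
-export([ensure_watchdog/1]).

%% Look up the globally registered watchdog. If there is none, start a
%% new one via StartFun; otherwise re-establish the link to the old one.
ensure_watchdog(StartFun) when is_function(StartFun, 0) ->
    case global:whereis_name(watchdog) of
        undefined ->
            {started, StartFun()};
        Pid when is_pid(Pid) ->
            %% link/1 is harmless if the link already exists.
            true = link(Pid),
            {relinked, Pid}
    end.
```

In checkNode/0 this could be called as ensure_watchdog (fun () -> spawn_link ('[email protected]', watchdog, init, []) end), with the returned pid used as the new Watchdog. The watchdog side would still need to decide what to do when it receives its own noconnection exit.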

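Thank you very much. Don't I need to call link/1 when the watchdog is still alive? Or are the processes still linked after they have received their mutual {'EXIT', Pid, noconnection}? – Hyperboreus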

Actually, I am not sure (I have not used distributed Erlang so far). –