2017-09-29

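Flink: killing the leader JobManager in HA mode terminates the standby JobManagers

I want to run Flink in HA mode using ZooKeeper, but when I test it by killing the leader JobManager, all of my standby JobManagers get killed as well.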

So instead of one of the standby JobManagers taking over as the new leader, all of them are killed, which should not happen.

My setup: 4 servers, 3 of which run ZooKeeper, but only 1 server hosts all the JobManagers.

ad011.local: Zookeeper + Jobmanagers 
ad012.local: Zookeeper + Taskmanager 
ad013.local: Zookeeper 
ad014.local: nothing interesting 

My masters file looks like this:

ad011.local:8081 
ad011.local:8082 
ad011.local:8083 

My flink-conf.yaml:

jobmanager.rpc.address: ad011.local 

blob.server.port: 6130,6131,6132 

jobmanager.heap.mb: 512 
taskmanager.heap.mb: 128 
taskmanager.numberOfTaskSlots: 4 
parallelism.default: 2 
taskmanager.tmp.dirs: /var/flink/data 

metrics.reporters: jmx 
metrics.reporter.jmx.class: org.apache.flink.metrics.jmx.JMXReporter 
metrics.reporter.jmx.port: 8789,8790,8791 

high-availability: zookeeper 
high-availability.zookeeper.quorum: ad011.local:2181,ad012.local:2181,ad013.local:2181 

high-availability.zookeeper.path.root: /flink 
high-availability.zookeeper.path.cluster-id: /cluster-one 
high-availability.storageDir: /var/flink/recovery 
high-availability.jobmanager.port: 50000,50001,50002 
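Since all three JobManagers share one host, each port option that a JobManager binds presumably needs at least as many candidate ports as there are entries in the masters file. A minimal sketch of that sanity check follows; `count_ports` is a hypothetical helper (not part of Flink) and it only understands comma-separated lists and `low-high` ranges.

```shell
#!/bin/sh
# count_ports: hypothetical helper, not part of Flink.
# Counts the candidate ports in a value such as "50000,50001,50002"
# or "50100-50102".
count_ports() {
    total=0
    # Split the value on commas, then restore the previous IFS.
    old_ifs=$IFS
    IFS=,
    for part in $1; do
        case $part in
            *-*)
                # A range "low-high" contributes high - low + 1 ports.
                low=${part%-*}
                high=${part#*-}
                total=$((total + high - low + 1))
                ;;
            *)
                total=$((total + 1))
                ;;
        esac
    done
    IFS=$old_ifs
    echo "$total"
}

# Values copied from the configuration above; each call prints 3,
# matching the three JobManager entries in the masters file.
count_ports "6130,6131,6132"
count_ports "50000,50001,50002"
```

If any of these counts were lower than the number of JobManagers on the host, a later JobManager could fail to bind a port at startup, which would be a different problem from the one described here.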

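When I start Flink with the start-cluster.sh script, I can see my 3 JobManagers running, and when I open the WebUI they all point to ad011.local:8081, which is the leader. I guess that is fine?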

I then tried to test failover by killing the leader with kill, and all of my other standby JobManagers stopped as well.

This is what I see in the standby JobManager log:

2017-09-29 08:08:41,590 INFO org.apache.flink.runtime.jobmanager.JobManager    - Starting JobManager at akka.tcp://flink@ad011.local:50002/user/jobmanager. 
2017-09-29 08:08:41,590 INFO org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Starting ZooKeeperLeaderElectionService org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService@72d546c8. 
2017-09-29 08:08:41,598 INFO org.apache.flink.runtime.webmonitor.WebRuntimeMonitor   - Starting with JobManager akka.tcp://flink@ad011.local:50002/user/jobmanager on port 8083 
2017-09-29 08:08:41,598 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Starting ZooKeeperLeaderRetrievalService. 
2017-09-29 08:08:41,645 INFO org.apache.flink.runtime.webmonitor.JobManagerRetriever  - New leader reachable under akka.tcp://flink@ad011.local:50000/user/jobmanager:f7dc2c48-dfa5-45a4-a63e-ff27be21363a. 
2017-09-29 08:08:41,651 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Starting ZooKeeperLeaderRetrievalService. 
2017-09-29 08:08:41,722 INFO org.apache.flink.runtime.clusterframework.standalone.StandaloneResourceManager - Received leader address but not running in leader ActorSystem. Cancelling registration. 
2017-09-29 09:26:13,472 WARN akka.remote.ReliableDeliverySupervisor      - Association with remote system [akka.tcp://flink@ad011.local:50000] has failed, address is now gated for [5000] ms. Reason: [Disassociated] 
2017-09-29 09:26:14,274 INFO org.apache.flink.runtime.jobmanager.JobManager    - RECEIVED SIGNAL 15: SIGTERM. Shutting down as requested. 
2017-09-29 09:26:14,284 INFO org.apache.flink.runtime.blob.BlobServer      - Stopped BLOB server at 0.0.0.0:6132 
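
The `RECEIVED SIGNAL 15: SIGTERM` line suggests that the standby did not fail on its own: something outside the process sent it a termination signal about a second after the leader went away. A small sketch for pulling such lines out of the JobManager logs follows; `find_signals` is a hypothetical helper name, and the log path in the usage comment is an assumption about the install layout.

```shell
#!/bin/sh
# find_signals: hypothetical helper, not part of Flink.
# Prints the timestamp, signal number and signal name for every
# "RECEIVED SIGNAL" line found in the given log files.
find_signals() {
    # -h suppresses file names so output looks the same for one or many files.
    grep -h 'RECEIVED SIGNAL' "$@" |
        sed -E 's/^([0-9-]+ [0-9:,]+) .*RECEIVED SIGNAL ([0-9]+): ([A-Z0-9]+).*$/\1 signal=\2 name=\3/'
}

# Usage (the path is an assumption, adjust it to the actual log directory):
#   find_signals /path/to/flink/log/flink-*-jobmanager-*.log
```

Comparing the timestamps across the three JobManager logs shows whether the standbys were signalled at the same moment, which points at an external supervisor rather than at Flink's leader election.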

Any help would be greatly appreciated.

Answer

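I solved this by starting my cluster directly with ./bin/start-cluster.sh instead of through a service file (which calls the same script). The service file apparently kills the other JobManagers.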
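
The thread does not say what kind of service file was involved. Assuming it was a systemd unit (an assumption, not something stated above), one plausible explanation is systemd's default `KillMode=control-group`: all processes started by the unit live in one control group, and when the unit stops, for example because the process systemd tracks as the main one has exited, the remaining processes in that group are sent SIGTERM. A minimal sketch for checking which mode a unit file sets; `unit_kill_mode` is a hypothetical helper and it only reads the one file it is given, ignoring drop-in overrides.

```shell
#!/bin/sh
# unit_kill_mode: hypothetical helper.
# Prints the KillMode configured in a systemd unit file, or
# "control-group" (systemd's default) when the file does not set one.
# Drop-in override files are NOT considered.
unit_kill_mode() {
    # Keep the last assignment, since later settings override earlier ones.
    mode=$(sed -n 's/^[[:space:]]*KillMode[[:space:]]*=[[:space:]]*//p' "$1" | tail -n 1)
    echo "${mode:-control-group}"
}

# Usage (the unit path and name are assumptions):
#   unit_kill_mode /etc/systemd/system/flink.service
```

On a live system, `systemctl show -p KillMode <unit>` reports the effective value including overrides. If the mode turns out to be `control-group`, running one unit per JobManager is one way to keep a dying leader from taking the standbys with it.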