
Destroying a Docker container from Marathon kills the Mesos slave

We have a Mesos cluster and launch tasks via Marathon on the Mesos slaves, running them in Docker containers.

The whole system works well, but from time to time we hit a very strange problem: when we try to destroy/redeploy a task via Marathon, the mesos-slave is killed as the target Docker container exits. Here is the error log:

Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: I0229 19:31:51.465544 4094 docker.cpp:1592] Executor for container 'eadfb756-b653-42eb-977a-c16c78b1a7c5' has exited 
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: I0229 19:31:51.465736 4094 docker.cpp:1390] Destroying container 'eadfb756-b653-42eb-977a-c16c78b1a7c5' 
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: I0229 19:31:51.465812 4094 docker.cpp:1494] Running docker stop on container 'eadfb756-b653-42eb-977a-c16c78b1a7c5' 
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: I0229 19:31:51.466089 4098 slave.cpp:3440] Executor 'prod-xxxxxxx-data-collector-writer.6d832d68-d519-11e5-acca-00505692154c' of framework 8d26b713-c3cd-4e9b-956d-24f63b1320e0-0000 exited with status 0 
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: I0229 19:31:51.466167 4098 slave.cpp:3544] Cleaning up executor 'prod-xxxxxxx-data-collector-writer.6d832d68-d519-11e5-acca-00505692154c' of framework 8d26b713-c3cd-4e9b-956d-24f63b1320e0-0000 
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: F0229 19:31:51.470055 4098 slave.cpp:3570] CHECK_SOME(os::touch(path)): Failed to open file: No such file or directory 
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: *** Check failure stack trace: *** 
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: @  0x7f8c3c2144dd google::LogMessage::Fail() 
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: @  0x7f8c3c21621c google::LogMessage::SendToLog() 
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: I0229 19:31:51.566812 4099 docker.cpp:1592] Executor for container 'e2d9c750-88b7-4247-b696-6589665d6a66' has exited 
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: @  0x7f8c3c2140cc google::LogMessage::Flush() 
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: I0229 19:31:51.569646 4099 docker.cpp:1390] Destroying container 'e2d9c750-88b7-4247-b696-6589665d6a66' 
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: I0229 19:31:51.569757 4099 docker.cpp:1592] Executor for container 'f51c68b8-c64d-47ea-a629-8516dcc90dba' has exited 
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: I0229 19:31:51.569787 4099 docker.cpp:1390] Destroying container 'f51c68b8-c64d-47ea-a629-8516dcc90dba' 
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: I0229 19:31:51.569818 4099 docker.cpp:1494] Running docker stop on container 'e2d9c750-88b7-4247-b696-6589665d6a66' 
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: I0229 19:31:51.569849 4099 docker.cpp:1494] Running docker stop on container 'f51c68b8-c64d-47ea-a629-8516dcc90dba' 
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: @  0x7f8c3c216b19 google::LogMessageFatal::~LogMessageFatal() 
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: @  0x7f8c3bc99f2e mesos::internal::slave::Slave::removeExecutor() 
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: @  0x7f8c3bcaca60 mesos::internal::slave::Slave::executorTerminated() 
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: @  0x7f8c3c1c6541 process::ProcessManager::resume() 
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: @  0x7f8c3c1c683f process::internal::schedule() 
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: @  0x7f8c3ad4a1e0 (unknown) 
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: @  0x7f8c3afa3df5 start_thread 
Feb 29 19:31:51 mesos-slave3.ourcompany.com mesos-slave[4093]: @  0x7f8c3a7b41ad __clone 
Feb 29 19:31:51 mesos-slave3.ourcompany.com systemd[1]: mesos-slave.service: main process exited, code=killed, status=6/ABRT 
Feb 29 19:31:51 mesos-slave3.ourcompany.com systemd[1]: Unit mesos-slave.service entered failed state. 
Feb 29 19:32:11 mesos-slave3.ourcompany.com systemd[1]: mesos-slave.service holdoff time over, scheduling restart. 

The tasks launched in the Docker containers are Akka applications. The environment information for the whole system is:

OS:

CentOS Linux release 7.1.1503 (Core)

Kernel:

3.10.0-229.el7.x86_64

JDK on all machines:

java version "1.7.0_91" 
OpenJDK Runtime Environment (rhel-2.6.2.1.el7_1-x86_64 u91-b00) 
OpenJDK 64-Bit Server VM (build 24.91-b01, mixed mode) 

Mesos:

0.25, installed by yum from mesosphere repo 

Mesos master configuration:

--zk=zk://zk-node1:2181,zk-node2:2181,zk-node3:2181,zk-node4:2181,zk-node5:2181/mesos-cluster --port=5050 --log_dir=/var/log/mesos --cluster=mesos-prod-cluster --hostname=<real hostname> --ip=<real ip> --quorum=3 --registry_fetch_timeout=5mins --work_dir=/var/lib/mesos 

Mesos slave configuration:

--master=zk://zk-node1:2181,zk-node2:2181,zk-node3:2181,zk-node4:2181,zk-node5:2181/mesos-cluster --log_dir=/var/log/mesos --attributes=env:prod --containerizers=docker,mesos --docker_remove_delay=2weeks --executor_registration_timeout=30mins --hostname=<real slave hostname> 

Marathon info:

{ 
"name": "marathon", 
"version": "0.11.1", 
"elected": true, 
"leader": "<leader_ip>:8080", 
"frameworkId": "8d26b713-c3cd-4e9b-956d-24f63b1320e0-0000", 
"marathon_config": { 
    "master": "zk-node1:2181,zk-node2:2181,zk-node3:2181,zk-node4:2181,zk-node5:2181/mesos-cluster", 
    "failover_timeout": 604800, 
    "framework_name": "marathon", 
    "ha": true, 
    "checkpoint": true, 
    "local_port_min": 10000, 
    "local_port_max": 20000, 
    "executor": "//cmd", 
    "hostname": "<hostname>", 
    "webui_url": null, 
    "mesos_role": null, 
    "task_launch_timeout": 600000, 
    "reconciliation_initial_delay": 15000, 
    "reconciliation_interval": 300000, 
    "marathon_store_timeout": 2000, 
    "mesos_user": "root", 
    "leader_proxy_connection_timeout_ms": 5000, 
    "leader_proxy_read_timeout_ms": 10000, 
    "mesos_leader_ui_url": "http://<leader_ip>:5050/" 
}, 
"zookeeper_config": { 
    "zk": "zk-node1:2181,zk-node2:2181,zk-node3:2181,zk-node4:2181,zk-node5:2181/marathon-cluster", 
    "zk_timeout": 10000, 
    "zk_session_timeout": 1800000, 
    "zk_max_versions": 25 
}, 
"event_subscriber": { 
    "type": "http_callback", 
    "http_endpoints": null 
}, 
"http_config": { 
    "assets_path": null, 
    "http_port": 8080, 
    "https_port": 8443 
} 

}

Docker version:

Client: 
Version:  1.9.1 
API version: 1.21 
Go version: go1.4.2 
Git commit: a34a1d5 
Built:  Fri Nov 20 13:25:01 UTC 2015 
OS/Arch:  linux/amd64 

Server: 
Version:  1.9.1 
API version: 1.21 
Go version: go1.4.2 
Git commit: a34a1d5 
Built:  Fri Nov 20 13:25:01 UTC 2015 
OS/Arch:  linux/amd64 

Docker info:

Containers: 330 
Images: 509 
Server Version: 1.9.1 
Storage Driver: devicemapper 
Pool Name: docker-253:0-68977907-pool 
Pool Blocksize: 65.54 kB 
Base Device Size: 107.4 GB 
Backing Filesystem: 
Data file: /dev/loop0 
Metadata file: /dev/loop1 
Data Space Used: 23.68 GB 
Data Space Total: 107.4 GB 
Data Space Available: 27.51 GB 
Metadata Space Used: 63.75 MB 
Metadata Space Total: 2.147 GB 
Metadata Space Available: 2.084 GB 
Udev Sync Supported: true 
Deferred Removal Enabled: false 
Deferred Deletion Enabled: false 
Deferred Deleted Device Count: 0 
Data loop file: /var/lib/docker/devicemapper/devicemapper/data 
Metadata loop file: /var/lib/docker/devicemapper/devicemapper/metadata 
Library Version: 1.02.93-RHEL7 (2015-01-28) 
Execution Driver: native-0.2 
Logging Driver: json-file 
Kernel Version: 3.10.0-229.el7.x86_64 
Operating System: CentOS Linux 7 (Core) 
CPUs: 4 
Total Memory: 15.67 GiB 
Name: mesos-slave3.gz.yougola.com 
ID: QB4G:C2HK:CBPR:G5ID:6OCU:DFEC:USBP:ECLQ:FWOQ:ZGHS:JIU5:JNN4 

All of the services, including Docker, Mesos master, Mesos slave, and Marathon, are managed by systemd.


Not sure if this applies to you, but since they haven't fixed it yet, I've worked around this by using "kill and scale" and then scaling back afterwards. – Blanco
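
For reference, a rough sketch of that workaround against the Marathon REST API (the app id, task id, and instance count are placeholders; check the endpoints against your Marathon version):

# kill one task and let Marathon scale the app down by one 
curl -X DELETE "http://<leader_ip>:8080/v2/apps/<app_id>/tasks/<task_id>?scale=true" 

# scale the app back up afterwards 
curl -X PUT "http://<leader_ip>:8080/v2/apps/<app_id>" -H "Content-Type: application/json" -d '{"instances": 3}' 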


@Blanco, thanks for the information. I think this happened because I put the mesos work directory under /tmp, where some files may have been deleted unexpectedly by other services. After I changed the work directory, the problem never occurred again. – shizhz
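
If that theory applies to your setup too, one thing worth checking is the systemd-tmpfiles configuration; a stock CentOS 7 install usually ships a rule that ages out files under /tmp (the exact file and retention below are an assumption about a default install):

cat /usr/lib/tmpfiles.d/tmp.conf 
# look for a line such as:  v /tmp 1777 root root 10d 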

Answer


That's strange, unfortunately. It looks like it failed this check: https://github.com/apache/mesos/blob/0.25.0/src/slave/slave.cpp#L3570 because it could not find the path for the executor sentinel file.
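
As a minimal standalone illustration (not the actual Mesos code path), the failing operation amounts to touching a file whose parent directory no longer exists; the resulting "No such file or directory" error is what CHECK_SOME turns into the abort seen in the log. The path below is purely illustrative:

$ touch /tmp/mesos/meta/<some-directory-that-was-removed>/sentinel 
touch: cannot touch '/tmp/mesos/meta/<some-directory-that-was-removed>/sentinel': No such file or directory 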

Could you please file a new JIRA at https://issues.apache.org/jira/browse/MESOS so that we can track and fix this issue?


Thanks, I've created a new issue at https://issues.apache.org/jira/browse/MESOS-4827 :-) – shizhz


After moving the mesos work directory out of /tmp/, the problem has never appeared again. – shizhz
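
For anyone hitting the same problem: a hedged sketch of what pinning the agent to a persistent work directory might look like, mirroring the --work_dir flag already used on the masters above (the exact path is an assumption). Without this flag, the slave in this Mesos version falls back to a default work directory under /tmp, which matches the behaviour described in the comments:

--work_dir=/var/lib/mesos 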