Kubernetes node failure (CoreOS/AWS/Kubernetes stack)

Following the instructions here, we have a small test Kubernetes cluster running on AWS with CoreOS. At present it consists of just a master node and a worker node. We have been running this cluster for the past few weeks and have noticed that the worker instance occasionally fails. The first time this happened, the instance was subsequently killed and restarted by the auto-scaling group. Today the same thing happened, but we were able to log in to the instance and retrieve some information before it was shut down; exactly what is causing the problem, however, is still unclear to me.

The node failures seem to occur on an irregular basis, and there is no evidence of anything out of the ordinary (external load, etc.) that would cause them.

When a failure occurs (the Kubernetes node status becomes "NotReady"), the instance is still running, but the kubelet and docker services are inactive (start failed with result 'dependency'). The flanneld service is running, but with a restart time later than when the node failure was observed.
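For reference, the kind of inspection we can run on the instance before it gets terminated looks something like this (a minimal sketch; the unit names are the standard ones from the CoreOS guide, and the `--since` timestamp is just an example):

```
# Show the current state of the three services and why they are inactive
systemctl status kubelet.service docker.service flanneld.service

# Walk the dependency tree to find which unit's failure cascaded
# into the "start failed with result 'dependency'" state
systemctl list-dependencies kubelet.service

# Collect the journal for the window around the failure
journalctl -u kubelet -u docker -u flanneld --since "2016-07-22 07:00:00"
```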

The logs from around the time of the node failure don't seem to show anything that obviously points to the cause. There are a couple of kubelet-wrapper errors around that time:

```
Jul 22 07:25:33 ip-10-0-0-92.ec2.internal kubelet-wrapper[1204]: E0722 07:25:33.121506 1204 kubelet.go:2745] Error updating node status, will retry: nodes "ip-10-0-0-92.ec2.internal" cannot be updated: the object has been modified; please apply your changes to the latest version and try again
```

```
Jul 22 07:25:34 ip-10-0-0-92.ec2.internal kubelet-wrapper[1204]: E0722 07:25:34.557047 1204 event.go:193] Server rejected event '&api.Event{TypeMeta:unversioned.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:api.ObjectMeta{Name:"ip-10-0-0-92.ec2.internal.1462693ef85b56d8", GenerateName:"", Namespace:"default", SelfLink:"", UID:"", ResourceVersion:"4687622", Generation:0, CreationTimestamp:unversioned.Time{Time:time.Time{sec:0, nsec:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*unversioned.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil)}, InvolvedObject:api.ObjectReference{Kind:"Node", Namespace:"", Name:"ip-10-0-0-92.ec2.internal", UID:"ip-10-0-0-92.ec2.internal", APIVersion:"", ResourceVersion:"", FieldPath:""}, Reason:"NodeHasSufficientDisk", Message:"Node ip-10-0-0-92.ec2.internal status is now: NodeHasSufficientDisk", Source:api.EventSource{Component:"kubelet", Host:"ip-10-0-0-92.ec2.internal"}, FirstTimestamp:unversioned.Time{Time:time.Time{sec:63604448947, nsec:0, loc:(*time.Location)(0x3b1a5c0)}}, LastTimestamp:unversioned.Time{Time:time.Time{sec:63604769134, nsec:388015022, loc:(*time.Location)(0x3b1a5c0)}}, Count:2, Type:"Normal"}': 'events "ip-10-0-0-92.ec2.internal.1462693ef85b56d8" not found' (will not retry!)
Jul 22 07:25:34 ip-10-0-0-92.ec2.internal kubelet-wrapper[1204]: E0722 07:25:34.560636 1204 event.go:193] Server rejected event '&api.Event{TypeMeta:unversioned.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:api.ObjectMeta{Name:"ip-10-0-0-92.ec2.internal.14626941554cc358", GenerateName:"", Namespace:"default", SelfLink:"", UID:"", ResourceVersion:"4687645", Generation:0, CreationTimestamp:unversioned.Time{Time:time.Time{sec:0, nsec:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*unversioned.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil)}, InvolvedObject:api.ObjectReference{Kind:"Node", Namespace:"", Name:"ip-10-0-0-92.ec2.internal", UID:"ip-10-0-0-92.ec2.internal", APIVersion:"", ResourceVersion:"", FieldPath:""}, Reason:"NodeReady", Message:"Node ip-10-0-0-92.ec2.internal status is now: NodeReady", Source:api.EventSource{Component:"kubelet", Host:"ip-10-0-0-92.ec2.internal"}, FirstTimestamp:unversioned.Time{Time:time.Time{sec:63604448957, nsec:0, loc:(*time.Location)(0x3b1a5c0)}}, LastTimestamp:unversioned.Time{Time:time.Time{sec:63604769134, nsec:388022975, loc:(*time.Location)(0x3b1a5c0)}}, Count:2, Type:"Normal"}': 'events "ip-10-0-0-92.ec2.internal.14626941554cc358" not found' (will not retry!)
```

These are followed by what look like some etcd errors:

```
Jul 22 07:27:04 ip-10-0-0-92.ec2.internal rkt[1214]: 2016-07-22 07:27:04,721 [WARNING][1305/140149086452400] calico.etcddriver.driver 810: etcd watch returned bad HTTP status to poll on index 5237916: 400
Jul 22 07:27:04 ip-10-0-0-92.ec2.internal rkt[1214]: 2016-07-22 07:27:04,721 [ERROR][1305/140149086452400] calico.etcddriver.driver 852: Error from etcd for index 5237916: {u'errorCode': 401, u'index': 5239005, u'message': u'The event in requested index is outdated and cleared', u'cause': u'the requested history has been cleared [5238006/5237916]'}; triggering a resync.
Jul 22 07:27:04 ip-10-0-0-92.ec2.internal rkt[1214]: 2016-07-22 07:27:04,721 [INFO][1305/140149086452400] calico.etcddriver.driver 916: STAT: Final watcher etcd response time: 0 in 630.6s (0.000/s) min=0.000ms mean=0.000ms max=0.000ms
Jul 22 07:27:04 ip-10-0-0-92.ec2.internal rkt[1214]: 2016-07-22 07:27:04,721 [INFO][1305/140149086452400] calico.etcddriver.driver 916: STAT: Final watcher processing time: 7 in 630.6s (0.011/s) min=90066.312ms mean=90078.569ms max=90092.505ms
Jul 22 07:27:04 ip-10-0-0-92.ec2.internal rkt[1214]: 2016-07-22 07:27:04,721 [INFO][1305/140149086452400] calico.etcddriver.driver 919: Watcher thread finished. Signalled to resync thread. Was at index 5237916. Queue length is 1.
Jul 22 07:27:04 ip-10-0-0-92.ec2.internal rkt[1214]: 2016-07-22 07:27:04,743 [WARNING][1305/140149192694448] calico.etcddriver.driver 291: Watcher died; resyncing.
```

And a few minutes later, a large number of failed connections to the master (10.0.0.50):

```
Jul 22 07:36:41 ip-10-0-0-92.ec2.internal rkt[1214]: 2016-07-22 07:36:37,641 [WARNING][1305/140149086452400] urllib3.connectionpool 647: Retrying (Retry(total=2, connect=None, read=None, redirect=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f7700b85b90>: Failed to establish a new connection: [Errno 113] Host is unreachable',)': http://10.0.0.50:2379/v2/keys/calico/v1?waitIndex=5239006&recursive=true&wait=true
Jul 22 07:36:41 ip-10-0-0-92.ec2.internal rkt[1214]: 2016-07-22 07:36:37,641 [INFO][1305/140149086452400] urllib3.connectionpool 213: Starting new HTTP connection (2): 10.0.0.50
```
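As a quick check (a sketch, with the endpoint taken from the log above; adjust if your etcd listens elsewhere), etcd's reachability from the worker can be probed directly:

```
# Basic reachability and health of the etcd member on the master
curl -sS http://10.0.0.50:2379/health

# etcd v2-era cluster-wide health check, if etcdctl is available on the host
etcdctl --endpoints http://10.0.0.50:2379 cluster-health
```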

While these errors presumably all relate to the node/instance failure, they don't mean much to me, and they certainly don't seem to point to an underlying cause. If anyone here can see anything in this that might cause the node/instance failure (and how we might go about correcting it), it would be greatly appreciated!


What size of instance are you using? I've seen that a cluster can't stay stable when I use instances that are too small, if the cluster is very small (i.e. one worker).

Answer


Something in your description and logs confuses me: you say your logs show the docker runtime, you say you use flannel in your cluster, yet your logs contain Calico...

In any case, judging from the logs you've provided, it looks more like your etcd is down... which would leave the kubelet and Calico unable to update their status, and the apiserver would then consider them failed. There isn't enough information here; I can only suggest that the next time you see this, you back up the etcd logs...
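A minimal sketch of capturing that on a CoreOS host, assuming etcd runs as the stock etcd2.service unit (adjust the unit name if yours differs):

```
# Save the etcd journal from around the failure for later analysis
journalctl -u etcd2.service --no-pager > /tmp/etcd2-$(date +%F).log

# Snapshot the cluster's view of its own health at the same time
etcdctl cluster-health >> /tmp/etcd2-$(date +%F).log
```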

Another suggestion: it's better not to use the same etcd for both the Kubernetes cluster and Calico...
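For illustration only, pointing Calico at a dedicated etcd could look roughly like the drop-in below; the unit name, file path, and the 10.0.0.60 endpoint are all assumptions for the sketch, not taken from the question:

```
# Hypothetical systemd drop-in: point calico/node at its own etcd
# (10.0.0.60 is an assumed dedicated endpoint). ETCD_ENDPOINTS is the
# environment variable calico/node reads for its etcd connection.
sudo mkdir -p /etc/systemd/system/calico-node.service.d
cat <<'EOF' | sudo tee /etc/systemd/system/calico-node.service.d/10-separate-etcd.conf
[Service]
Environment=ETCD_ENDPOINTS=http://10.0.0.60:2379
EOF

# Reload units and restart calico-node to pick up the new endpoint
sudo systemctl daemon-reload
sudo systemctl restart calico-node.service
```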