I am using Spark version 1.6.3 and YARN version 2.7.1.2.3, which ship with HDP-2.3.0.0-2557. Since the Spark version in the HDP release I use is too old, I prefer to use another Spark on YARN (yarn-client mode). Why does my Spark application on YARN fail with FetchFailedException due to Connection refused?
This is how I run spark-shell;
./spark-shell --master yarn-client
Everything seems fine: sparkContext is initialized, sqlContext is initialized. I can even access my Hive tables. But in some cases, it gets into trouble when it tries to connect to the block manager.
I am not an expert, but I think the block managers run on my YARN cluster when I run in yarn mode. It seemed like a network problem to me at first, and I did not want to ask it here. But this happens in some cases that I could not figure out yet, so it makes me think this may not be a network problem.
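One quick way to test the network hypothesis is to probe the block manager address from the machine the driver runs on. This is just a sketch; the IP and port are copied from the FetchFailedException further down and change on every run;
# From the driver machine: can we open a TCP connection to the executor's
# block manager at all? (address taken from the error message below)
nc -zv 172.27.247.204 56093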
Here is the code;
val df = sqlContext.sql("select * from city_table")
The following code works fine;
df.limit(10).count()
But when the size is more than 10 (I do not know the exact threshold; it changes on every run), the following fails;
df.count()
It raises this exception;
16/12/30 07:31:04 INFO MapOutputTrackerMaster: Size of output statuses for shuffle 2 is 157 bytes
16/12/30 07:31:19 WARN TaskSetManager: Lost task 0.0 in stage 5.0 (TID 8, 172.27.247.204): FetchFailed(BlockManagerId(2, 172.27.247.204, 56093), shuffleId=2, mapId=0, reduceId=0, message=
org.apache.spark.shuffle.FetchFailedException: Failed to connect to /172.27.247.204:56093
at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:323)
at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:300)
at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:51)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:504)
at org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.<init>(TungstenAggregationIterator.scala:686)
at org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:95)
at org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:86)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Failed to connect to /172.27.247.204:56093
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:167)
at org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:90)
at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
at org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43)
at org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
... 3 more
Caused by: java.net.ConnectException: Connection refused: /172.27.247.204:56093
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224)
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:289)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
... 1 more
)
I have just realized that this happens when there is more than one task to shuffle.
What is the problem here? Is it a performance issue or another network problem that I cannot see? What is that shuffle? If it is a network problem, is it between my Spark and YARN, or a problem within YARN itself?
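For what it is worth, the shuffle is the stage boundary where reduce tasks fetch map output blocks over the network from the block managers of other executors, and df.count() always has one: a partial count per partition, then an exchange (the shuffle), then a final sum. Here is a sketch of how to see that boundary from spark-shell; the operator names are roughly what Spark 1.6 prints;
// The Exchange operator marks the shuffle; the FetchFailedException above
// is thrown while reading its blocks from a remote block manager.
sqlContext.sql("select count(*) from city_table").explain()
// == Physical Plan == (roughly)
// TungstenAggregate(count)              <- final count, fetches shuffled blocks
// +- TungstenExchange SinglePartition   <- the shuffle boundary
//    +- TungstenAggregate(count)        <- partial counts on each executor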
Thanks.
EDIT:
I just saw something in the logs;
17/01/02 06:45:17 INFO DAGScheduler: Executor lost: 2 (epoch 13)
17/01/02 06:45:17 INFO BlockManagerMasterEndpoint: Trying to remove executor 2 from BlockManagerMaster.
17/01/02 06:45:17 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(2, 172.27.247.204, 51809)
17/01/02 06:45:17 INFO BlockManagerMaster: Removed 2 successfully in removeExecutor
17/01/02 06:45:17 INFO YarnScheduler: Removed TaskSet 3.0, whose tasks have all completed, from pool
17/01/02 06:45:24 INFO BlockManagerMasterEndpoint: Registering block manager 172.27.247.204:51809 with 511.1 MB RAM, BlockManagerId(2, 172.27.247.204, 51809)
Sometimes retrying it on another block manager works, but because the maximum number of allowed attempts, which is 4 by default, is exceeded, it never ends most of the time.
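The limit of 4 is the task-failure limit, and each fetch also retries on its own before the task fails; both can be raised at launch. This is only a sketch of a workaround and does not fix the underlying connectivity problem; in Spark 1.6 the defaults are spark.task.maxFailures=4, spark.shuffle.io.maxRetries=3 and spark.shuffle.io.retryWait=5s;
./spark-shell --master yarn-client \
  --conf spark.task.maxFailures=8 \
  --conf spark.shuffle.io.maxRetries=10 \
  --conf spark.shuffle.io.retryWait=15s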
EDIT 2:
YARN is really, really silent about this, but I think it is a network problem, and I can narrow down where the problem is;
This Spark is deployed outside of the HDP environment. When Spark submits an application to YARN, YARN informs the Spark driver about the block managers and executors. The executors are data nodes in the HDP cluster and have different IPs in its private network. But when YARN informs the Spark driver outside the cluster, it gives the same, always single IP for all executors, because all nodes in the HDP cluster go out through a router and share the same external IP. Say that IP is 150.150.150.150: when the Spark driver needs to connect to an executor and ask for something, it tries it with this IP. But this IP is actually the external IP address of the whole cluster, not an individual data node's IP.
Is there a way to make YARN advertise the executors (block managers) by their private IPs? Their private IPs are also accessible from the machine this Spark driver is running on.
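I have not found a YARN-side setting that changes the advertised address, but two workarounds may be worth trying. This is an unverified sketch for this setup, and the hostname is a placeholder;
# 1) Pin the normally random executor-side ports so NAT/firewall rules can be
#    written for them (both settings exist in Spark 1.6):
./spark-shell --master yarn-client \
  --conf spark.blockManager.port=40000 \
  --conf spark.executor.port=40001

# 2) If the executors register by hostname, map the cluster's hostnames to
#    their private IPs on the driver machine, since those IPs are reachable:
echo "172.27.247.204  datanode1.cluster.internal" | sudo tee -a /etc/hosts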
Nothing different shows up in the YARN logs; the connection-problem logs are the same. I don't know, maybe YARN keeps those BlockManager logs in a different place, or their logging is disabled somehow, because there are no logs from the BlockManager side at all, not even when it succeeds. I cannot access the Spark UI at the moment; that is another problem I am trying to solve, since the proxy does not work. –
Look for ExecutorLost events. –
I cannot see any logs about that. I guess YARN does it silently somehow. –