2017-01-06

We are using TensorFlow Serving to load our model and have implemented a Java gRPC client, but the TensorFlow Serving server closes the connection and the client request fails.

Normally it works for small payloads. But when we request a larger batch, with roughly 1–2 MB of data, the server closes the connection and quickly throws an internal error.

We have also opened an issue at https://github.com/tensorflow/serving/issues/284 to track this problem.

Job aborted due to stage failure: Task 47 in stage 7.0 failed 4 times, most recent failure: Lost task 47.3 in stage 7.0 (TID 5349, xxx) 
io.grpc.StatusRuntimeException: INTERNAL: HTTP/2 error code: INTERNAL_ERROR 
Received Rst Stream 
at io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:230) 
at io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:211) 
at io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:144) 
at tensorflow.serving.PredictionServiceGrpc$PredictionServiceBlockingStub.predict(PredictionServiceGrpc.java:160) 

...... 

at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) 
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:189) 
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:64) 
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) 
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) 
at org.apache.spark.scheduler.Task.run(Task.scala:91) 
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:219) 
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) 
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) 
at java.lang.Thread.run(Thread.java:745) 

Driver stacktrace: 

Answer


As can be seen in the above issue, this is caused by messages exceeding the default maximum message size of 4 MiB. The receiver of a larger message needs to explicitly allow the larger size, or the sender must send smaller messages.

gRPC itself handles large messages fine (even hundreds of megabytes), but applications usually do not. The maximum message size exists so that "large" messages are only allowed into applications that are prepared to accept them.
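In grpc-java, the client-side limit can be raised when building the channel. A minimal sketch follows; the host `"localhost"`, port `9000`, and the 64 MiB limit are placeholder assumptions (it assumes grpc-java 1.11+ for the no-argument `usePlaintext()`; older versions use `usePlaintext(true)`):

```java
import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;

public class LargeMessageChannel {
    // gRPC's default inbound limit is 4 MiB; raise it to 64 MiB here (assumed value).
    static final int MAX_INBOUND_MESSAGE_SIZE = 64 * 1024 * 1024;

    public static void main(String[] args) {
        // "localhost" and 9000 are placeholders for your TensorFlow Serving host/port.
        ManagedChannel channel = ManagedChannelBuilder
                .forAddress("localhost", 9000)
                .usePlaintext()
                .maxInboundMessageSize(MAX_INBOUND_MESSAGE_SIZE)
                .build();
        // Build the blocking stub from this channel as before, e.g.:
        // PredictionServiceGrpc.PredictionServiceBlockingStub stub =
        //         PredictionServiceGrpc.newBlockingStub(channel);
        channel.shutdownNow();
    }
}
```

Note that `maxInboundMessageSize` only governs what this process will *receive*; if the server's responses are the large direction, it is the client that must raise this limit, which matches the comment below.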


We had to raise the client-side limit to allow the larger message size. – tobe
