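tensorflow distributed training w/ estimator + experiment framework
Hi, I've run into a problem when trying to use the Estimator + Experiment classes for distributed training.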
Here is an example: https://gist.github.com/protoget/2cf2b530bc300f209473374cf02ad829
It's a simple example that uses:
- DNNClassifier, from the official TF tutorial
- the Experiment framework
- 1 worker and 1 PS on the same host, on different ports
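For reference, in TF 1.x a cluster like this (1 PS + 1 worker on one host) is typically described to an Estimator/Experiment setup through the `TF_CONFIG` environment variable, which `tf.contrib.learn.RunConfig` reads. The sketch below only builds that JSON with the standard library. It assumes the gist configures the cluster this way, takes the addresses from the PS log below (ps on localhost:9000, worker on 127.0.0.1:9001), and uses a hypothetical helper name `make_tf_config`.

```python
import json
import os


def make_tf_config(task_type, task_index):
    """Hypothetical helper: build a TF_CONFIG string for 1 PS + 1 worker on one host."""
    cluster = {
        # Addresses as printed in the PS log (assumed to match the gist's setup).
        "ps": ["localhost:9000"],
        "worker": ["127.0.0.1:9001"],
    }
    config = {
        "cluster": cluster,
        # Each process shares the same cluster but has its own task entry.
        "task": {"type": task_type, "index": task_index},
    }
    return json.dumps(config)


if __name__ == "__main__":
    ps_env = make_tf_config("ps", 0)
    worker_env = make_tf_config("worker", 0)
    print("PS:     TF_CONFIG='%s'" % ps_env)
    print("Worker: TF_CONFIG='%s'" % worker_env)
    # In the worker process, this must be set before RunConfig is constructed.
    os.environ["TF_CONFIG"] = worker_env
```

Both processes see the same `cluster` dict; only `task` differs, which is how each process knows its own role.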
What happens is:
1) When I start the PS job, it looks fine:
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE3 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:200] Initialize GrpcChannelCache for job ps -> {0 -> localhost:9000}
I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:200] Initialize GrpcChannelCache for job worker -> {0 -> 127.0.0.1:9001}
I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:221] Started server with target: grpc://localhost:9000
2) When I start the worker job, it exits on its own without leaving any log at all.
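Since the worker exits silently, one way to get at least an exit status is to launch it from a small wrapper and inspect the return code and captured output. A minimal standard-library sketch; the worker command itself (e.g. a hypothetical `trainer.py` run with a worker `TF_CONFIG`) is an assumption. On POSIX, a negative return code means the process was killed by a signal (e.g. SIGKILL or SIGSEGV), while 0 means it exited normally on its own.

```python
import signal
import subprocess
import sys


def run_and_report(cmd, env=None):
    """Run cmd, capture stdout/stderr, and describe how the process exited."""
    proc = subprocess.run(cmd, env=env, stdout=subprocess.PIPE,
                          stderr=subprocess.PIPE, universal_newlines=True)
    if proc.returncode < 0:
        # POSIX only: the process was terminated by a signal.
        signum = -proc.returncode
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = str(signum)
        reason = "killed by signal %s" % name
    elif proc.returncode == 0:
        reason = "exited normally (code 0)"
    else:
        reason = "exited with error code %d" % proc.returncode
    return proc.returncode, reason, proc.stdout, proc.stderr


if __name__ == "__main__":
    # Stand-in command; replace with the real (hypothetical) worker invocation.
    code, reason, out, err = run_and_report(
        [sys.executable, "-c", "import sys; sys.exit(3)"])
    print(reason)  # prints: exited with error code 3
```

To pass the worker's cluster config, something like `env=dict(os.environ, TF_CONFIG=...)` keeps the rest of the environment intact.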
Any help would be greatly appreciated.