2017-03-27

TensorFlow distributed training w/ Estimator + Experiment framework

Hi, I'm running into a problem when trying to do distributed training with the Estimator + Experiment classes.

Here is an example: https://gist.github.com/protoget/2cf2b530bc300f209473374cf02ad829

It is a simple example, based on the official TF tutorial, that uses:

  1. DNNClassifier
  2. The Experiment framework
  3. 1 worker and 1 PS on the same host, on different ports.
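For concreteness, the cluster layout in point 3 can be written down as the TF_CONFIG value each of the two processes would see. This is only a sketch using the standard library; the ports are the ones that appear in the logs below, and the `tf_config_for` helper is hypothetical:

```python
import json
import os

# Sketch: the same cluster map is shared by both processes; only the
# "task" entry differs. Ports match the PS logs below.
cluster = {"ps": ["localhost:9000"], "worker": ["127.0.0.1:9001"]}

def tf_config_for(task_type, index=0):
    """Serialize the TF_CONFIG JSON for one process of the cluster."""
    return json.dumps({"cluster": cluster,
                       "task": {"type": task_type, "index": index}})

# The PS process would export this before building its RunConfig.
os.environ["TF_CONFIG"] = tf_config_for("ps")
print(json.loads(os.environ["TF_CONFIG"])["task"]["type"])  # ps
```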

What happens is:

1) When I start the PS job, it looks fine:

W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE3 instructions, but these are available on your machine and could speed up CPU computations. 
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations. 
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations. 
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations. 
I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:200] Initialize GrpcChannelCache for job ps -> {0 -> localhost:9000} 
I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:200] Initialize GrpcChannelCache for job worker -> {0 -> 127.0.0.1:9001} 
I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:221] Started server with target: grpc://localhost:9000 

2) When I start the worker job, the job exits by itself, leaving no log at all.

Eagerly seeking help.

Answers

I had the same problem and I finally found the solution.

The problem is in config._environment:

import json
import os

from tensorflow.contrib.learn.python.learn.estimators import run_config

# Describe the cluster: 1 PS and 1 worker on the same host.
config = {"cluster": {'ps':  ['127.0.0.1:9000'],
         'worker': ['127.0.0.1:9001']}}

# "args" comes from the script's own command-line parsing.
if args.type == "worker":
    config["task"] = {'type': 'worker', 'index': 0}
else:
    config["task"] = {'type': 'ps', 'index': 0}

# RunConfig reads the cluster/task description from TF_CONFIG.
os.environ['TF_CONFIG'] = json.dumps(config)

config = run_config.RunConfig()

# Force a non-local environment so the server actually starts.
config._environment = run_config.Environment.CLOUD

Set config._environment to Environment.CLOUD.

Then you get a working distributed training setup.

I hope it makes you happy :)

I had the same problem. It is due to some internal TensorFlow code, I guess; I have already opened a question on SO about this: TensorFlow: minimalist program fails on distributed mode

I also opened a pull request: https://github.com/tensorflow/tensorflow/issues/8796

There are two ways to solve your problem. Since it is caused by your ClusterSpec having an implicit local environment, you can try to set another one (google or cloud), but I cannot assure you that the rest of your work will not be affected. So it is better to look at the code first and perhaps fix local mode yourself, which is why I explain that below.
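If you want to try the first route (declaring a different environment instead of patching the code), here is a hedged sketch. It assumes that RunConfig's cluster configuration reads an "environment" key from TF_CONFIG, as the contrib.learn sources suggest; verify this against your TensorFlow version. Only the standard library is used:

```python
import json
import os

# Sketch: declare a non-local environment directly in TF_CONFIG.
# The "environment" key is an assumption based on contrib.learn's
# ClusterConfig; check it against your TensorFlow version.
tf_config = {
    "cluster": {"ps": ["127.0.0.1:9000"],
                "worker": ["127.0.0.1:9001"]},
    "task": {"type": "worker", "index": 0},
    "environment": "cloud",
}
os.environ["TF_CONFIG"] = json.dumps(tf_config)

parsed = json.loads(os.environ["TF_CONFIG"])
print(parsed["environment"])  # cloud
```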

You will find a more precise explanation of why it fails in those posts; the fact is that Google has been very silent about it so far. What I did was modify their source code (in tensorflow/contrib/learn/python/learn/experiment.py):

# Start the server, if needed. It's important to start the server before 
# we (optionally) sleep for the case where no device_filters are set. 
# Otherwise, the servers will wait to connect to each other before starting 
# to train. We might as well start as soon as we can. 
config = self._estimator.config 
if (config.environment != run_config.Environment.LOCAL and 
    config.environment != run_config.Environment.GOOGLE and 
    config.cluster_spec and config.master): 
  self._start_server() 

(This part prevents the server from starting in local mode, and local mode is what you get if you did not set anything in your cluster spec, so you should simply comment out config.environment != run_config.Environment.LOCAL and and it should work.)
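The effect of commenting out that comparison can be sketched without TensorFlow at all, using a stub config object whose environment strings mimic the run_config.Environment constants (everything here is hypothetical scaffolding, not the real API):

```python
from types import SimpleNamespace

# Stub values mimicking run_config.Environment constants.
LOCAL, GOOGLE, CLOUD = "local", "google", "cloud"

def should_start_server(config, skip_local_check=False):
    """Mirror the condition guarding self._start_server() above."""
    env_ok = config.environment != GOOGLE
    if not skip_local_check:
        # This is the comparison the answer suggests commenting out.
        env_ok = env_ok and config.environment != LOCAL
    return bool(env_ok and config.cluster_spec and config.master)

cfg = SimpleNamespace(environment=LOCAL,
                      cluster_spec={"ps": ["localhost:9000"]},
                      master="grpc://localhost:9001")

# With the stock check, a LOCAL environment never starts the server...
print(should_start_server(cfg))                         # False
# ...while dropping the LOCAL comparison lets it start.
print(should_start_server(cfg, skip_local_check=True))  # True
```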