2017-03-27

TensorFlow distributed training w/ Estimator + Experiment framework

Hi, I'm running into a problem when trying to do distributed training with the Estimator + Experiment classes.

Here is an example: https://gist.github.com/protoget/2cf2b530bc300f209473374cf02ad829

It is a simple example, based on the official TF tutorial, that uses:

  1. DNNClassifier
  2. The Experiment framework
  3. 1 worker and 1 PS on the same host, on different ports.
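For concreteness, the cluster layout in point 3 can be written down as the TF_CONFIG value each of the two processes would see. This is only a sketch using the standard library; the ports are the ones that appear in the logs below, and the `tf_config_for` helper is hypothetical:

```python
import json
import os

# Sketch: the same cluster map is shared by both processes; only the
# "task" entry differs. Ports match the PS logs below.
cluster = {"ps": ["localhost:9000"], "worker": ["127.0.0.1:9001"]}

def tf_config_for(task_type, index=0):
    """Serialize the TF_CONFIG JSON for one process of the cluster."""
    return json.dumps({"cluster": cluster,
                       "task": {"type": task_type, "index": index}})

# The PS process would export this before building its RunConfig.
os.environ["TF_CONFIG"] = tf_config_for("ps")
print(json.loads(os.environ["TF_CONFIG"])["task"]["type"])  # ps
```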

What happens is:

1) When I start the PS job, it looks fine:

W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE3 instructions, but these are available on your machine and could speed up CPU computations. 
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations. 
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations. 
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations. 
I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:200] Initialize GrpcChannelCache for job ps -> {0 -> localhost:9000} 
I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:200] Initialize GrpcChannelCache for job worker -> {0 -> 127.0.0.1:9001} 
I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:221] Started server with target: grpc://localhost:9000 

2) When I start the worker job, the job exits by itself, leaving no log at all.

Eagerly seeking help.

Answers

I had the same problem and I finally found the solution.

The problem is in config._environment:

import json
import os

from tensorflow.contrib.learn.python.learn.estimators import run_config

# Describe the cluster: 1 PS and 1 worker on the same host.
config = {"cluster": {'ps':  ['127.0.0.1:9000'],
         'worker': ['127.0.0.1:9001']}}

# "args" comes from the script's own command-line parsing.
if args.type == "worker":
    config["task"] = {'type': 'worker', 'index': 0}
else:
    config["task"] = {'type': 'ps', 'index': 0}

# RunConfig reads the cluster/task description from TF_CONFIG.
os.environ['TF_CONFIG'] = json.dumps(config)

config = run_config.RunConfig()

# Force a non-local environment so the server actually starts.
config._environment = run_config.Environment.CLOUD

Set config._environment to Environment.CLOUD.

Then you get a working distributed training setup.

I hope it makes you happy :)

I had the same problem. It is due to some internal TensorFlow code, I guess; I have already opened a question on SO about this: TensorFlow: minimalist program fails on distributed mode

I also opened a pull request: https://github.com/tensorflow/tensorflow/issues/8796

There are two ways to solve your problem. Since it is caused by your ClusterSpec having an implicit local environment, you can try to set another one (google or cloud), but I cannot assure you that the rest of your work will not be affected. So it is better to look at the code first and perhaps fix local mode yourself, which is why I explain that below.
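If you want to try the first route (declaring a different environment instead of patching the code), here is a hedged sketch. It assumes that RunConfig's cluster configuration reads an "environment" key from TF_CONFIG, as the contrib.learn sources suggest; verify this against your TensorFlow version. Only the standard library is used:

```python
import json
import os

# Sketch: declare a non-local environment directly in TF_CONFIG.
# The "environment" key is an assumption based on contrib.learn's
# ClusterConfig; check it against your TensorFlow version.
tf_config = {
    "cluster": {"ps": ["127.0.0.1:9000"],
                "worker": ["127.0.0.1:9001"]},
    "task": {"type": "worker", "index": 0},
    "environment": "cloud",
}
os.environ["TF_CONFIG"] = json.dumps(tf_config)

parsed = json.loads(os.environ["TF_CONFIG"])
print(parsed["environment"])  # cloud
```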

You will find a more precise explanation of why it fails in those posts; the fact is that Google has been very silent about it so far. What I did was modify their source code (in tensorflow/contrib/learn/python/learn/experiment.py):

# Start the server, if needed. It's important to start the server before 
# we (optionally) sleep for the case where no device_filters are set. 
# Otherwise, the servers will wait to connect to each other before starting 
# to train. We might as well start as soon as we can. 
config = self._estimator.config 
if (config.environment != run_config.Environment.LOCAL and 
    config.environment != run_config.Environment.GOOGLE and 
    config.cluster_spec and config.master): 
  self._start_server() 

(This part prevents the server from starting in local mode, and local mode is what you get if you did not set anything in your cluster spec, so you should simply comment out config.environment != run_config.Environment.LOCAL and and it should work.)
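The effect of commenting out that comparison can be sketched without TensorFlow at all, using a stub config object whose environment strings mimic the run_config.Environment constants (everything here is hypothetical scaffolding, not the real API):

```python
from types import SimpleNamespace

# Stub values mimicking run_config.Environment constants.
LOCAL, GOOGLE, CLOUD = "local", "google", "cloud"

def should_start_server(config, skip_local_check=False):
    """Mirror the condition guarding self._start_server() above."""
    env_ok = config.environment != GOOGLE
    if not skip_local_check:
        # This is the comparison the answer suggests commenting out.
        env_ok = env_ok and config.environment != LOCAL
    return bool(env_ok and config.cluster_spec and config.master)

cfg = SimpleNamespace(environment=LOCAL,
                      cluster_spec={"ps": ["localhost:9000"]},
                      master="grpc://localhost:9001")

# With the stock check, a LOCAL environment never starts the server...
print(should_start_server(cfg))                         # False
# ...while dropping the LOCAL comparison lets it start.
print(should_start_server(cfg, skip_local_check=True))  # True
```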