2017-08-14

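Training Google Object Detection API: gRPC error

I am retraining Google's Object Detection API on my own dataset, but I have run into a series of problems.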

One of them is:

"Traceback (most recent call last): 
    File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main 
    "__main__", fname, loader, pkg_name) 
    File "/usr/lib/python2.7/runpy.py", line 72, in _run_code 
    exec code in run_globals 
    File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 198, in <module> 
    tf.app.run() 
    File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 44, in run 
    _sys.exit(main(_sys.argv[:1] + flags_passthrough)) 
    File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 194, in main 
    worker_job_name, is_chief, FLAGS.train_dir) 
    File "/root/.local/lib/python2.7/site-packages/object_detection/trainer.py", line 290, in train 
    saver=saver) 
    File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 776, in train 
    master, start_standard_services=False, config=session_config) as sess: 
    File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__ 
    return self.gen.next() 
    File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 960, in managed_session 
    self.stop(close_summary_writer=close_summary_writer) 
    File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 788, in stop 
    stop_grace_period_secs=self._stop_grace_secs) 
    File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/coordinator.py", line 386, in join 
    six.reraise(*self._exc_info_to_raise) 
    File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 949, in managed_session 
    start_standard_services=start_standard_services) 
    File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 713, in prepare_or_wait_for_session 
    max_wait_secs=max_wait_secs) 
    File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/session_manager.py", line 387, in wait_for_session 
    is_ready, not_ready_msg = self._model_ready(sess) 
    File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/session_manager.py", line 435, in _model_ready 
    return _ready(self._ready_op, sess, "Model not ready") 
    File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/session_manager.py", line 492, in _ready 
    ready_value = sess.run(op) 
    File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 767, in run 
    run_metadata_ptr) 
    File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 965, in _run 
    feed_dict_string, options, run_metadata) 
    File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1015, in _do_run 
    target_list, options, run_metadata) 
    File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1035, in _do_call 
    raise type(e)(node_def, op, message) 
UnavailableError: {"created":"@1502405189.800982817","description":"EOF","file":"external/grpc/src/core/lib/iomgr/tcp_posix.c","file_line":235,"grpc_status":14} 
"  
    pathname: "/var/sitecustomize/sitecustomize.py"  
} 

I'm not very familiar with gRPC, so I'm pretty much stuck on this error. If anyone could help, that would be great! Thanks!


I'm running on Mac OS X Yosemite 10.10.5, if that's any help.


This might be an OOM (out-of-memory) error [1]. Are you using GPUs? [1]: https://stackoverflow.com/questions/45600567/connection-reset-by-peer-on-adapted-standard-ml-engine-object-detection-trainin – rhaertel80

Answer


This may be an out-of-memory error (see this question).
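
To make the error itself a little less opaque, here is a minimal sketch that decodes the JSON payload from the UnavailableError line in the traceback above. The helper name `describe_grpc_error` and the lookup table are illustrative assumptions, not part of TensorFlow or the Object Detection API. The table lists only a few of the standard gRPC status codes.

```python
import json

# Payload copied verbatim from the UnavailableError line in the traceback.
payload = ('{"created":"@1502405189.800982817","description":"EOF",'
           '"file":"external/grpc/src/core/lib/iomgr/tcp_posix.c",'
           '"file_line":235,"grpc_status":14}')

# A subset of the standard gRPC status codes; 14 is UNAVAILABLE.
GRPC_STATUS_NAMES = {
    0: "OK",
    4: "DEADLINE_EXCEEDED",
    8: "RESOURCE_EXHAUSTED",
    13: "INTERNAL",
    14: "UNAVAILABLE",
}


def describe_grpc_error(raw):
    """Hypothetical helper: return (status_name, description) for a payload."""
    info = json.loads(raw)
    code = info.get("grpc_status")
    name = GRPC_STATUS_NAMES.get(code, "UNKNOWN(%s)" % code)
    return name, info.get("description")


status, description = describe_grpc_error(payload)
print("%s: %s" % (status, description))
```

UNAVAILABLE with an "EOF" description only says that the connection to another process was closed. That is consistent with a master or worker being killed for running out of memory, but it does not prove it.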

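You can try using a larger machine type, especially for the master, such as large_model, complex_model_l, or complex_model_l_gpu. You do this by passing a file to gcloud's --config argument with contents similar to the following: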

trainingInput: 
    runtimeVersion: "1.0" 
    scaleTier: CUSTOM 
    masterType: complex_model_l_gpu 
    workerCount: 9 
    workerType: standard_gpu 
    parameterServerCount: 3 
    parameterServerType: standard 
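
If you want to try several master machine types, a small script can generate the config file for each attempt. This is only a sketch: the function `render_training_config`, the file name `cloud.yml`, and the job name `my_object_detection_job` are made-up placeholders. The command list assumes the `gcloud ml-engine jobs submit training` command and leaves out the other flags a real submission needs.

```python
import os
import tempfile


def render_training_config(master_type="complex_model_l_gpu",
                           worker_count=9, worker_type="standard_gpu",
                           ps_count=3, ps_type="standard",
                           runtime_version="1.0"):
    """Hypothetical helper: render the trainingInput config as YAML text."""
    lines = [
        "trainingInput:",
        '    runtimeVersion: "%s"' % runtime_version,
        "    scaleTier: CUSTOM",
        "    masterType: %s" % master_type,
        "    workerCount: %d" % worker_count,
        "    workerType: %s" % worker_type,
        "    parameterServerCount: %d" % ps_count,
        "    parameterServerType: %s" % ps_type,
    ]
    return "\n".join(lines) + "\n"


# Write the config to a temporary directory (the file name is an assumption).
config_text = render_training_config(master_type="complex_model_l_gpu")
config_path = os.path.join(tempfile.mkdtemp(), "cloud.yml")
with open(config_path, "w") as handle:
    handle.write(config_text)

# Only the --config part of the submission is shown; the job name is a
# placeholder, and the module, package, region and job-dir flags are omitted.
command = ["gcloud", "ml-engine", "jobs", "submit", "training",
           "my_object_detection_job", "--config", config_path]
print(" ".join(command))
```

Changing only `master_type` between runs makes it easier to tell whether the master's memory is what the failure depends on.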

I'm using the following --config file: 'trainingInput: runtimeVersion: "1.0" scaleTier: CUSTOM masterType: standard_gpu workerCount: 5 workerType: standard_gpu parameterServerCount: 3 parameterServerType: standard'. Also, I have tensorflow installed, not tensorflow-gpu. Could that be a problem as well?


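The problem is that your masterType: standard has too little RAM. When you submit a job with GPU workers, it runs on machines that have GPU-enabled TensorFlow installed, which should speed up training with the Object Detection API. You're welcome to try without GPUs as well, but I don't think that is the problem. – rhaertel80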