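Permission denied in distributed MXNet training
I am working with MXNet, a deep learning library. I am setting it up both on a single machine and on distributed CPU machines, following the tutorial on the official MXNet website. Running on a single machine works without any problem and I get results.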
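Then I tried distributed training with multiple CPU machines. I created an account on AWS (Amazon's virtual machines) and launched three t2.micro Ubuntu instances. I typed the following command line: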
../../tools/launch.py -n 2 python train_mnist.py --kv-store dist_sync
This command line is supposed to run the distributed version of the training with 2 workers and 1 server.
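Unfortunately, I got an error. I understand that it is a permission-denied error, but I tried to access the other two workers from the server with the following command, and it works: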
ssh -i key.pem [email protected] number.
Here is the error:
Permission denied (publickey).
Exception in thread Thread-3:
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
self.run()
File "/usr/lib/python2.7/threading.py", line 763, in run
self.__target(*self.__args, **self.__kwargs)
File "/home/ubuntu/Research/code/mxnet/tools/../dmlc-core/tracker/dmlc_tracker/ssh.py", line 60, in run
subprocess.check_call(prog, shell = True)
File "/usr/lib/python2.7/subprocess.py", line 540, in check_call
raise CalledProcessError(retcode, cmd)
CalledProcessError: Command 'ssh -o StrictHostKeyChecking=no ip -p 22 'export LD_LIBRARY_PATH=:/usr/local/cuda/lib64; export DMLC_SERVER_ID=0; export DMLC_WORKER_ID=0; export DMLC_PS_ROOT_URI=ip; export DMLC_ROLE=worker; export DMLC_PS_ROOT_PORT=9091; export DMLC_NUM_WORKER=1; export DMLC_NUM_SERVER=1; cd /home/ubuntu/Research/code/mxnet/example/image-classification/; python train_mnist.py --network lenet --kv-store dist_sync'' returned non-zero exit status 255
Permission denied (publickey).
Exception in thread Thread-2:
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
self.run()
File "/usr/lib/python2.7/threading.py", line 763, in run
self.__target(*self.__args, **self.__kwargs)
File "/home/ubuntu/Research/code/mxnet/tools/../dmlc-core/tracker/dmlc_tracker/ssh.py", line 60, in run
subprocess.check_call(prog, shell = True)
File "/usr/lib/python2.7/subprocess.py", line 540, in check_call
raise CalledProcessError(retcode, cmd)
CalledProcessError: Command 'ssh -o StrictHostKeyChecking=no ip -p 22 'export LD_LIBRARY_PATH=:/usr/local/cuda/lib64; export DMLC_SERVER_ID=0; export DMLC_PS_ROOT_URI=ip; export DMLC_ROLE=server; export DMLC_PS_ROOT_PORT=9091; export DMLC_NUM_WORKER=1; export DMLC_NUM_SERVER=1; cd /home/ubuntu/Research/code/mxnet/example/image-classification/; python train_mnist.py --network lenet --kv-store dist_sync'' returned non-zero exit status 255
[03:48:39] /home/ubuntu/mxnet/dmlc-core/include/dmlc/./logging.h:300: [03:48:39] src/kvstore/kvstore.cc:37: compile with USE_DIST_KVSTORE=1 to use dist_sync
Stack trace returned 10 entries:
[bt] (0) /usr/local/lib/python2.7/dist-packages/mxnet-0.9.4-py2.7.egg/mxnet/libmxnet.so(_ZN4dmlc15LogMessageFatalD1Ev+0x3c) [0x7ff5b87a156c]
[bt] (1) /usr/local/lib/python2.7/dist-packages/mxnet-0.9.4-py2.7.egg/mxnet/libmxnet.so(_ZN5mxnet7KVStore6CreateEPKc+0x5f4) [0x7ff5b91323d4]
[bt] (2) /usr/local/lib/python2.7/dist-packages/mxnet-0.9.4-py2.7.egg/mxnet/libmxnet.so(MXKVStoreCreate+0xd) [0x7ff5b905b14d]
[bt] (3) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) [0x7ff5bb86dadc]
[bt] (4) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call+0x1fc) [0x7ff5bb86d40c]
[bt] (5) /usr/lib/python2.7/lib-dynload/_ctypes.x86_64-linux-gnu.so(_ctypes_callproc+0x48e) [0x7ff5bba845fe]
[bt] (6) /usr/lib/python2.7/lib-dynload/_ctypes.x86_64-linux-gnu.so(+0x15f9e) [0x7ff5bba85f9e]
[bt] (7) python(PyEval_EvalFrameEx+0x98d) [0x5244dd]
[bt] (8) python(PyEval_EvalCodeEx+0x2b1) [0x555551]
[bt] (9) python(PyEval_EvalFrameEx+0x7e8) [0x524338]
Traceback (most recent call last):
File "train_mnist.py", line 76, in <module>
fit.fit(args, sym, get_mnist_iter)
File "/home/ubuntu/Research/code/mxnet/example/image-classification/common/fit.py", line 97, in fit
kv = mx.kvstore.create(args.kv_store)
File "/usr/local/lib/python2.7/dist-packages/mxnet-0.9.4-py2.7.egg/mxnet/kvstore.py", line 403, in create
ctypes.byref(handle)))
File "/usr/local/lib/python2.7/dist-packages/mxnet-0.9.4-py2.7.egg/mxnet/base.py", line 77, in check_call
raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [03:48:39] src/kvstore/kvstore.cc:37: compile with USE_DIST_KVSTORE=1 to use dist_sync
Stack trace returned 10 entries:
[bt] (0) /usr/local/lib/python2.7/dist-packages/mxnet-0.9.4-py2.7.egg/mxnet/libmxnet.so(_ZN4dmlc15LogMessageFatalD1Ev+0x3c) [0x7ff5b87a156c]
[bt] (1) /usr/local/lib/python2.7/dist-packages/mxnet-0.9.4-py2.7.egg/mxnet/libmxnet.so(_ZN5mxnet7KVStore6CreateEPKc+0x5f4) [0x7ff5b91323d4]
[bt] (2) /usr/local/lib/python2.7/dist-packages/mxnet-0.9.4-py2.7.egg/mxnet/libmxnet.so(MXKVStoreCreate+0xd) [0x7ff5b905b14d]
[bt] (3) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) [0x7ff5bb86dadc]
[bt] (4) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call+0x1fc) [0x7ff5bb86d40c]
[bt] (5) /usr/lib/python2.7/lib-dynload/_ctypes.x86_64-linux-gnu.so(_ctypes_callproc+0x48e) [0x7ff5bba845fe]
[bt] (6) /usr/lib/python2.7/lib-dynload/_ctypes.x86_64-linux-gnu.so(+0x15f9e) [0x7ff5bba85f9e]
[bt] (7) python(PyEval_EvalFrameEx+0x98d) [0x5244dd]
[bt] (8) python(PyEval_EvalCodeEx+0x2b1) [0x555551]
[bt] (9) python(PyEval_EvalFrameEx+0x7e8) [0x524338]
Exception in thread Thread-1:
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
self.run()
File "/usr/lib/python2.7/threading.py", line 763, in run
self.__target(*self.__args, **self.__kwargs)
File "/home/ubuntu/Research/code/mxnet/tools/../dmlc-core/tracker/dmlc_tracker/tracker.py", line 363, in <lambda>
target=(lambda: subprocess.check_call(self.cmd, env=env, shell=True)), args=())
File "/usr/lib/python2.7/subprocess.py", line 540, in check_call
raise CalledProcessError(retcode, cmd)
CalledProcessError: Command 'python train_mnist.py --network lenet --kv-store dist_sync' returned non-zero exit status 1
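One difference visible in the post: the ssh command in the traceback is run without `-i key.pem`, while the manual login that works passes the key explicitly. The following is a minimal sketch for comparing the two variants per host, not a confirmed fix. The helper names `build_ssh_check` and `can_login` are hypothetical (they are not part of MXNet or dmlc-core), and the `ubuntu` user name is an assumption based on the `/home/ubuntu` paths in the log.

```python
import subprocess


def build_ssh_check(host, user=None, key_path=None, port=22):
    """Build an ssh command that runs `true` remotely without prompting.

    BatchMode=yes makes ssh fail instead of asking for a password,
    which mirrors how a launcher script calls ssh non-interactively.
    """
    cmd = ["ssh", "-o", "StrictHostKeyChecking=no",
           "-o", "BatchMode=yes", "-p", str(port)]
    if key_path is not None:
        # Same explicit identity file as in the manual login that works.
        cmd += ["-i", key_path]
    target = host if user is None else "%s@%s" % (user, host)
    return cmd + [target, "true"]


def can_login(host, **kwargs):
    """Return True if the non-interactive ssh login exits with status 0."""
    return subprocess.call(build_ssh_check(host, **kwargs)) == 0
```

Calling `can_login(host)` and `can_login(host, user="ubuntu", key_path="key.pem")` for each worker address would show whether the login only succeeds when the key is passed explicitly. If so, the `Permission denied (publickey)` lines would be consistent with ssh not finding the key by default. The separate `compile with USE_DIST_KVSTORE=1` message in the log is unrelated to ssh and is not covered by this sketch.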