Google Cloud ML training exits with a non-zero status of 245

When I try to train my model on Google Cloud ML using this sample code:

import keras 
from keras import optimizers 
from keras import losses 
from keras import metrics 
from keras.models import Model, Sequential 
from keras.layers import Dense, Lambda, RepeatVector, TimeDistributed 
import numpy as np 

def test(): 
    model = Sequential() 
    model.add(Dense(2, input_shape=(3,))) 
    model.add(RepeatVector(3)) 
    model.add(TimeDistributed(Dense(3))) 
    model.compile(loss=losses.MSE, 
        optimizer=optimizers.RMSprop(lr=0.0001), 
        metrics=[metrics.categorical_accuracy], 
        sample_weight_mode='temporal') 
    x = np.random.random((1, 3)) 
    y = np.random.random((1, 3, 3)) 
    model.train_on_batch(x, y) 

if __name__ == '__main__': 
    test()

I get this error:

The replica master 0 exited with a non-zero status of 245. Termination reason: Error.

The detailed error output is long, so I pasted it here in pastebin.


2017-04-27 Alex

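In console.google.com, open the hamburger menu, select "ML Engine > Jobs", and click your job. Scroll to the bottom. What does your memory usage look like? Could you have OOMed? – rhaertel80

For this particular job it says 'No data for this chart'. But for my other job, which is more complex and fails with the same error, the memory usage is 0.0359. – Alex

The log output indicates that you are hitting a segmentation fault. Can you say which version of TensorFlow your Cloud ML job is using? –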

Problem solved. All I had to do was use tensorflow 1.1.0 instead of the default 1.0.1.
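To confirm which TensorFlow version a job actually picks up, the trainer could log it at startup. This is a minimal sketch, assuming Python 3.8+ (for `importlib.metadata`); the helper name `installed_version` is hypothetical and not part of any Cloud ML API:

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Print early so the version shows up near the top of the job logs.
    print("tensorflow version:", installed_version("tensorflow"))
```

Reading the version from package metadata avoids importing TensorFlow itself, so the check still runs if the import is what crashes.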


2017-04-28 15:40:31 Alex

How did you change the tensorflow version? –

@BadgerCat Just add tensorflow==1.1.0 to the install requirements in setup.py. – Alex
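A sketch of what such a `setup.py` might look like. The package name `trainer` is taken from the `trainer.test` module that appears in the job's log output; the `keras` entry and the version string `0.1` are assumptions added for illustration, not details from the original post:

```python
# setup.py, placed next to the trainer/ package directory
from setuptools import find_packages, setup

REQUIRED_PACKAGES = [
    "tensorflow==1.1.0",  # pin the version instead of relying on the default
    "keras",              # assumption: the trainer imports keras, so list it too
]

setup(
    name="trainer",        # matches the module path trainer.test
    version="0.1",         # hypothetical version number
    packages=find_packages(),
    install_requires=REQUIRED_PACKAGES,
)
```

Pinning with `==` means the packaged trainer requests exactly that release when its dependencies are installed.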

Note this output:

Module raised an exception for failing to call a subprocess Command '['python', '-m', u'trainer.test', '--job-dir', u'gs://my_test_bucket_keras/s_27_100630']' returned non-zero exit status -11.
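The exit status -11 in this output is consistent with the segmentation fault mentioned in the comments on the question: Python's `subprocess` module reports a child killed by signal N as return code -N, and signal 11 is SIGSEGV. Read as an unsigned byte, -11 is 245, which matches the replica's reported status. Below is a small sketch of that arithmetic. The helper name `describe_returncode` is only for illustration, and the idea that Cloud ML derives 245 this way is an inference from the numbers, not something the logs state:

```python
import signal


def describe_returncode(returncode):
    """Explain a subprocess return code in plain words."""
    if returncode < 0:
        # A negative code means the child was terminated by that signal.
        sig = signal.Signals(-returncode)
        return "killed by signal {} ({})".format(-returncode, sig.name)
    return "exited with status {}".format(returncode)


if __name__ == "__main__":
    print(describe_returncode(-11))
    # The same value truncated to one unsigned byte.
    print(-11 & 0xFF)
```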

I think Google Cloud runs the trainer with an extra argument named --job-dir. So maybe you could try adding the code below to your sample code?

# ... same imports as in the question ...
import argparse


def test():
    model = Sequential()
    model.add(Dense(2, input_shape=(3,)))
    model.add(RepeatVector(3))
    model.add(TimeDistributed(Dense(3)))
    model.compile(loss=losses.MSE,
                  optimizer=optimizers.RMSprop(lr=0.0001),
                  metrics=[metrics.categorical_accuracy],
                  sample_weight_mode='temporal')
    x = np.random.random((1, 3))
    y = np.random.random((1, 3, 3))
    model.train_on_batch(x, y)


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    # Input Arguments
    parser.add_argument(
        '--job-dir',
        help='GCS location to write checkpoints and export models',
        required=True
    )
    args = parser.parse_args()
    arguments = args.__dict__  # {'job_dir': ...}

    test()
    # test(**arguments)  # or, if you want to use this job_dir parameter in
    #                    # your code (test() must then accept job_dir)

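I'm not 100% sure this will work, but I think it's worth a try. I also have a post here that does something similar; maybe you can take a look there as well.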


2017-04-27 14:50:11

Thanks. Actually, when I started using Google ML I followed this tutorial, and it worked. But it looks like the code is not the problem. – Alex
