2016-10-04 29 views
0

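My program hits this error some of the time (not on every run). When it does, I can always reproduce it by loading the last model I saved before the program crashed because of NaN. When I rerun from that checkpoint, the first training step looks fine: the model produces a loss (I printed the loss and it shows no problem), but after the gradients are applied, the values of the embedding variables turn to NaN. The reported error is "Nan in summary histogram".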

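So what is the root cause of the NaN problem? I am confused and do not know how to debug further. With the same data and parameters the program mostly runs fine; only some runs hit this problem.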

Loading existing model from: /home/gezi/temp/image-caption//model.flickr.rnn2.nan/model.ckpt-18000 
Train from restored model: /home/gezi/temp/image-caption//model.flickr.rnn2.nan/model.ckpt-18000 
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:245] PoolAllocator: After 5235 get requests, put_count=4729 evicted_count=1000 eviction_rate=0.211461 and unsatisfied allocation rate=0.306781 
I tensorflow/core/common_runtime/gpu/pool_allocator.cc:257] Raising pool_size_limit_ from 100 to 110 
2016-10-04 21:45:39 epoch:1.87 train_step:18001 duration:0.947 elapsed:0.947 train_avg_metrics:['loss:0.527'] ['loss:0.527'] 
2016-10-04 21:45:39 epoch:1.87 eval_step: 18001 duration:0.001 elapsed:0.948 ratio:0.001 
W tensorflow/core/framework/op_kernel.cc:968] Invalid argument: Nan in summary histogram for: rnn/HistogramSummary_1 
    [[Node: rnn/HistogramSummary_1 = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](rnn/HistogramSummary_1/tag, rnn/image_text_sim/image_mlp/w_h/read/_309)]] 
W tensorflow/core/framework/op_kernel.cc:968] Invalid argument: Nan in summary histogram for: rnn/HistogramSummary_1 
    [[Node: rnn/HistogramSummary_1 = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](rnn/HistogramSummary_1/tag, rnn/image_text_sim/image_mlp/w_h/read/_309)]] 
W tensorflow/core/framework/op_kernel.cc:968] Invalid argument: Nan in summary histogram for: rnn/HistogramSummary_1 
    [[Node: rnn/HistogramSummary_1 = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](rnn/HistogramSummary_1/tag, rnn/image_text_sim/image_mlp/w_h/read/_309)]] 
W tensorflow/core/framework/op_kernel.cc:968] Invalid argument: Nan in summary histogram for: rnn/HistogramSummary_1 
    [[Node: rnn/HistogramSummary_1 = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](rnn/HistogramSummary_1/tag, rnn/image_text_sim/image_mlp/w_h/read/_309)]] 
W tensorflow/core/framework/op_kernel.cc:968] Invalid argument: Nan in summary histogram for: rnn/HistogramSummary_1 
    [[Node: rnn/HistogramSummary_1 = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"](rnn/HistogramSummary_1/tag, rnn/image_text_sim/image_mlp/w_h/read/_309)]] 
Traceback (most recent call last): 
    File "./train.py", line 308, in <module> 
    tf.app.run() 

Answers

4

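NaN is usually a sign of model instability, for example exploding gradients. It can go unnoticed: the loss simply stops decreasing. Logging weight summaries makes the problem explicit. I suggest lowering the learning rate as a first measure. If that does not help, please post your code here; without seeing it, it is hard to suggest anything more specific.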
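One way to make the problem explicit, as suggested above, is to scan the variables for non-finite values after each training step and report which one went bad first. The following is a minimal sketch in plain Python, not the asker's code: the helper `find_non_finite` and the weight snapshots are hypothetical, and the variable names only echo the ones visible in the log.

```python
import math


def find_non_finite(named_weights):
    """Return {variable name: count of NaN/Inf entries} for affected variables."""
    report = {}
    for name, values in named_weights.items():
        bad = sum(1 for v in values if not math.isfinite(v))
        if bad:
            report[name] = bad
    return report


# Hypothetical snapshots of two variables, flattened to plain lists,
# taken before and after one gradient update.
weights_before = {"image_mlp/w_h": [0.12, -0.40, 0.07], "embedding": [0.5, -0.3]}
weights_after = {"image_mlp/w_h": [0.11, float("nan"), 0.07], "embedding": [0.5, -0.3]}

print(find_non_finite(weights_before))  # {}
print(find_non_finite(weights_after))   # {'image_mlp/w_h': 1}
```

In a real TensorFlow graph the same idea would probably be expressed with `tf.check_numerics` on the suspect tensors, so that the run fails at the first op producing NaN rather than later in the histogram summary.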

1

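During the initial iterations of training, the model may sometimes output only a single prediction class. If, by random chance, the class is 0 for all the training examples, the categorical cross-entropy loss can evaluate to NaN.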

Make sure you add a small value when computing the loss, for example tf.log(predictions + 1e-8). This helps overcome the numerical instability.
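The effect of the small constant can be seen without TensorFlow. The sketch below is an illustration under stated assumptions, not the answerer's code: `cross_entropy` is a hypothetical helper, and it imitates the IEEE-754 behaviour of a tensor log (log(0) = -inf) because Python's `math.log(0)` raises an exception instead.

```python
import math


def cross_entropy(labels, predictions, eps=0.0):
    """Sum of -y * log(p + eps) over all entries."""
    total = 0.0
    for y, p in zip(labels, predictions):
        x = p + eps
        # Imitate floating-point tensor math, where log(0) is -inf.
        log_p = math.log(x) if x > 0.0 else float("-inf")
        total += -y * log_p
    return total


labels = [0.0, 0.0]       # a batch in which every label is 0
predictions = [0.0, 0.0]  # the model predicts exactly 0 everywhere

unsafe = cross_entropy(labels, predictions)          # 0 * -inf gives NaN
safe = cross_entropy(labels, predictions, eps=1e-8)  # log stays finite

print(math.isnan(unsafe))  # True
print(safe == 0.0)         # True
```

The NaN comes from multiplying a zero label by an infinite log term; once the argument of the log is bounded away from zero, the product is an ordinary finite number.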

+0

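Works great! Thanks a lot! When you have a dataset with very sparse positive examples, you have to deal with minibatches that contain no positive example, even with shuffling... This solved it!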