Very low GPU usage during training in TensorFlow

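I want to train a simple multilayer perceptron for a 10-class image classification task, which is part of an assignment in the Udacity Deep Learning course. More precisely, the task is to classify letters rendered in various fonts (the dataset is called notMNIST).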

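The code I ended up with looks fairly simple, but no matter what I do, I always get very low GPU usage during training. I measure the load with GPU-Z, and it shows only 25-30%.

Here is my current code: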

import tensorflow as tf
from tensorflow.contrib.data import Dataset

# Defined elsewhere (not shown): train_data, train_labels, test_data, test_labels,
# NUM_CLASSES, logger, and the layer helpers dense_batch_relu_dropout() and dense().

graph = tf.Graph()
with graph.as_default():
    tf.set_random_seed(52)

    # dataset definition
    dataset = Dataset.from_tensor_slices({'x': train_data, 'y': train_labels})
    dataset = dataset.shuffle(buffer_size=20000)
    dataset = dataset.batch(128)
    iterator = dataset.make_initializable_iterator()
    sample = iterator.get_next()
    x = sample['x']
    y = sample['y']

    # actual computation graph
    keep_prob = tf.placeholder(tf.float32)
    is_training = tf.placeholder(tf.bool, name='is_training')

    fc1 = dense_batch_relu_dropout(x, 1024, is_training, keep_prob, 'fc1')
    fc2 = dense_batch_relu_dropout(fc1, 300, is_training, keep_prob, 'fc2')
    fc3 = dense_batch_relu_dropout(fc2, 50, is_training, keep_prob, 'fc3')
    logits = dense(fc3, NUM_CLASSES, 'logits')

    with tf.name_scope('accuracy'):
        accuracy = tf.reduce_mean(
            tf.cast(tf.equal(tf.argmax(y, 1), tf.argmax(logits, 1)), tf.float32),
        )
        accuracy_percent = 100 * accuracy

    with tf.name_scope('loss'):
        loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=y))

    update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
    with tf.control_dependencies(update_ops):
        # ensures that we execute the update_ops before performing the train_op
        # needed for batch normalization (apparently)
        train_op = tf.train.AdamOptimizer(learning_rate=1e-3, epsilon=1e-3).minimize(loss)

with tf.Session(graph=graph) as sess:
    tf.global_variables_initializer().run()
    step = 0
    epoch = 0
    while True:
        # restart the iterator at the beginning of every epoch
        sess.run(iterator.initializer, feed_dict={})
        while True:
            step += 1
            try:
                sess.run(train_op, feed_dict={keep_prob: 0.5, is_training: True})
            except tf.errors.OutOfRangeError:
                # the iterator is exhausted, so the epoch is over
                logger.info('End of epoch #%d', epoch)
                break

        # end of epoch: evaluate on the full train and test sets in one run each,
        # overriding the iterator outputs x and y through feed_dict
        train_l, train_ac = sess.run(
            [loss, accuracy_percent],
            feed_dict={x: train_data, y: train_labels, keep_prob: 1, is_training: False},
        )
        test_l, test_ac = sess.run(
            [loss, accuracy_percent],
            feed_dict={x: test_data, y: test_labels, keep_prob: 1, is_training: False},
        )
        logger.info('Train loss: %f, train accuracy: %.2f%%', train_l, train_ac)
        logger.info('Test loss: %f, test accuracy: %.2f%%', test_l, test_ac)

        epoch += 1

Here is what I have tried so far:

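I changed the input pipeline from a plain feed_dict to tensorflow.contrib.data.Dataset. As far as I understand, it is supposed to take care of input efficiency, for example by loading data in a separate thread. So there should not be any input-related bottleneck.
I collected traces as suggested here: https://github.com/tensorflow/tensorflow/issues/1824#issuecomment-225754659 However, these traces didn't really show anything interesting. More than 90% of the train step is matmul operations.
I changed the batch size. When I change it from 128 to 512, the load increases from ~30% to ~38%, and when I increase it further to 2048, the load rises to ~45%. I have 6 GB of GPU memory, and the dataset consists of single-channel 28x28 images. Am I really supposed to use such a large batch size? Should I increase it further?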

In general, should I worry about the low load? Is it really a sign that I am training inefficiently?

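Below is a GPU-Z screenshot taken with batches of 128 images. You can see low load with occasional spikes to 100%, which occur when I measure accuracy on the entire dataset after each epoch.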

2017-09-11 Alexey Petrenko

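MNIST-sized networks are tiny, and it is hard to achieve high GPU (or CPU) efficiency with them; I think 30% is not unusual for your application. With a larger batch size you get higher computational efficiency, meaning you can process more examples per second, but you also get lower statistical efficiency, meaning you need to process more examples in total to reach the target accuracy. So it is a trade-off. For tiny character models like yours, statistical efficiency drops off very quickly beyond a batch size of about 100, so it is probably not worth growing the batch size for training. For inference, you should use the largest batch size you can.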
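To compare batch sizes by computational efficiency rather than by the GPU load percentage, one option is to time training steps and report examples per second. The following is a minimal sketch using only the standard library; `measure_throughput` and `step_fn` are hypothetical names, and `step_fn` stands for any zero-argument callable that runs one training step, for instance a lambda wrapping the `sess.run(train_op, ...)` call from the question.

```python
import time


def measure_throughput(step_fn, batch_size, num_steps=50, warmup_steps=5):
    """Return the number of examples processed per second by `step_fn`.

    `step_fn` is a hypothetical zero-argument callable that runs one
    training step on a batch of `batch_size` examples.
    """
    # Warm-up steps are excluded so one-off start-up costs do not skew the timing.
    for _ in range(warmup_steps):
        step_fn()

    start = time.perf_counter()
    for _ in range(num_steps):
        step_fn()
    elapsed = time.perf_counter() - start

    return batch_size * num_steps / elapsed


def make_fake_step(batch_size):
    """Stand-in workload (not a real training step) so the sketch runs anywhere."""
    return lambda: sum(i * i for i in range(batch_size * 10))


if __name__ == "__main__":
    for batch_size in (128, 512, 2048):
        rate = measure_throughput(make_fake_step(batch_size), batch_size)
        print("batch size %d: %.0f examples/sec" % (batch_size, rate))
```

Examples per second only captures the computational side of the trade-off; the number of examples needed to reach the target accuracy still has to be observed from the training runs themselves.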

2017-09-11 00:38:30

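Thanks for the quick reply! Yaroslav, could you give a hint as to why this happens? My assumption is as follows: as long as only one training step runs at a time, there is not enough computation to saturate all the GPU cores? So when I feed batches of 128 images, it already runs as parallel as it can, but the GPU could do more.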

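Yes, there is not enough computation to saturate the cores. Also, efficiency will be low if the amount of computation is small relative to the memory bandwidth required or to the kernel launch overhead. It is more important to look at overall efficiency than at GPU occupancy. A TitanX can reach 10 T ops/sec on large matmuls, but in many applications networks run at under 1 T op/sec, so at less than 10% of peak efficiency.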
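A back-of-the-envelope calculation shows how little work one training step of this network contains. The sketch below counts only the dense-layer matmuls, using the layer widths from the question (28x28 inputs, hidden layers of 1024, 300 and 50 units) and assuming 10 output classes. It ignores biases, batch normalization, dropout and the softmax, and it assumes that a training step costs roughly three times the forward matmuls; that factor is a common rule of thumb, not a measurement. The peak rate is the 10 T ops/sec figure quoted above.

```python
# Layer widths: 28x28 inputs, the three hidden layers from the question, 10 classes (assumed).
LAYER_SIZES = [28 * 28, 1024, 300, 50, 10]

# Peak large-matmul throughput quoted in the comment above, in FLOP/s.
PEAK_FLOPS = 10e12

# Assumption: one training step costs about 3x the forward matmuls
# (forward pass, gradients w.r.t. activations, gradients w.r.t. weights).
TRAIN_TO_FORWARD_RATIO = 3


def forward_flops_per_example(layer_sizes):
    """FLOPs of the dense-layer matmuls for one example (a multiply-add counts as 2)."""
    return sum(2 * n_in * n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


def ideal_step_seconds(batch_size):
    """Time one training step would take if the GPU sustained PEAK_FLOPS."""
    flops = TRAIN_TO_FORWARD_RATIO * batch_size * forward_flops_per_example(LAYER_SIZES)
    return flops / PEAK_FLOPS


if __name__ == "__main__":
    for batch_size in (128, 512, 2048):
        micros = ideal_step_seconds(batch_size) * 1e6
        print("batch %4d: about %.0f microseconds per step at peak" % (batch_size, micros))
```

Under these assumptions a step with a batch of 128 contains under 100 microseconds of work at peak throughput, so fixed per-step costs such as kernel launches and memory traffic can plausibly take a large share of each step.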

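On my NVIDIA GTX 1080, if I use a convolutional neural network on the MNIST database, the GPU load is about 68%.

If I switch to a simple non-convolutional network, the GPU load is about 20%.

You can replicate these results by building successively more advanced models in the tutorial Building Autoencoders in Keras by François Chollet.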

2018-01-20 15:42:43 Contango
