Blocking of tf.contrib.StagingArea get() and put() operations

Work environment

TensorFlow release version: 1.3.0-rc2
TensorFlow git version: v1.3.0-rc1-994-gb93fd37
Operating system: CentOS Linux release 7.2.1511 (Core)

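Problem scenario

I am using TensorFlow StagingArea ops to increase the efficiency of my input pipeline. Here is the part of my code that constructs the input pipeline: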

train_put_op_list = []
train_get_op_list = []
val_put_op_list = []
val_get_op_list = []
with tf.variable_scope(tf.get_variable_scope()) as vscope:
    for i in range(4):
        with tf.device('/gpu:%d' % i):
            with tf.name_scope('GPU-Tower-%d' % i) as scope:
                trainstagingarea = tf.contrib.staging.StagingArea(
                    dtypes=[tf.float32, tf.int32],
                    shapes=[[64, 221, 221, 3], [64]],
                    capacity=0)
                valstagingarea = tf.contrib.staging.StagingArea(
                    dtypes=[tf.float32, tf.int32],
                    shapes=[[128, 221, 221, 3], [128]],
                    capacity=0)
                train_put_op_list.append(trainstagingarea.put(train_iterator.get_next()))
                val_put_op_list.append(valstagingarea.put(val_iterator.get_next()))
                train_get_op_list.append(trainstagingarea.get())
                val_get_op_list.append(valstagingarea.get())
                with tf.device('/cpu:0'):
                    worktype = tf.get_variable("wt", [], initializer=tf.zeros_initializer(), trainable=False)
                workcondition = tf.equal(worktype, 1)
                # elem = tf.cond(workcondition, lambda: train_iterator.get_next(), lambda: val_iterator.get_next())
                elem = tf.cond(workcondition, lambda: train_get_op_list[i], lambda: val_get_op_list[i])
                # This is followed by the network construction and optimizer

At execution time, I first run the put() ops a few times and then go on to run the iterations, as shown below:

with tf.Session(config=config) as sess:
    sess.run(init_op)
    sess.run(iterator_training_op)
    sess.run(iterator_validation_op)
    sess.run(tf.assign(worktype, 0))
    # Pre-fill the staging areas before entering the main loop
    for i in range(4):
        sess.run(train_put_op_list)
        sess.run(val_put_op_list)
    writer = tf.summary.FileWriter('.', graph=tf.get_default_graph())
    epoch = 0
    iter = 0
    previous = 0
    while epoch < 10:
        try:
            # Compare strings with ==, not "is"
            if PROCESSINGTYPE == 'validation':
                sess.run(val_put_op_list)
                [val_accu, summaries, numsamp] = sess.run([running_accuracy, validation_summary_op, processed])
                previous += numsamp
                print("Running Accuracy = {} : Number of sample processed = {} ".format(val_accu, previous))
            else:
                sess.run(train_put_op_list)
                [loss_value, _, train_accu, summaries, batch_accu, numsamp] = sess.run(
                    [total_loss, apply_gradient_op, running_accuracy, training_summary_op, batch_accuracy, processed])
        # Remaining part of the code (not important for question)

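Problem description

Using StagingArea improves the speed substantially (almost 3-4 times). However, the code hangs because something blocks. I am not sure whether the block comes from the get() or the put() ops. Here is the actual output: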

# Validation is done first and the following is the output 
Running Accuracy = 0.0 : Number of sample processed = 512 
Running Accuracy = 0.00390625 : Number of sample processed = 1024 
Running Accuracy = 0.0 : Number of sample processed = 1536 
Running Accuracy = 0.001953125 : Number of sample processed = 2048 
# The code hangs here

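Note that at the beginning of the with tf.Session() as sess: block, the get() and put() ops were run 4 times. The output is also limited to 4 lines. This means that sess.run(val_put_op_list) inside the while loop does not do anything. So when get() is invoked through sess.run(running_accuracy)..., it finds the StagingArea empty after 4 lines, and therefore blocks.

Is my analysis of the problem correct?
What is the correct way to use the get() and put() ops here?
If a StagingArea is full and put() is blocked, would that also block the whole code? The TensorFlow documentation does not say anything about it.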
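The bookkeeping behind this reasoning can be sketched in plain Python, without TensorFlow. The helper below is hypothetical (iterations_before_block is not a TensorFlow API) and only counts buffered elements. It assumes that each loop iteration issues one put followed by one get on a single buffer, and that a put may silently add nothing.

```python
def iterations_before_block(prefill, max_iters, put_adds_element):
    """Count loop iterations completed before a get() would find the buffer empty.

    Pure bookkeeping model (an assumption, not TensorFlow code): each iteration
    performs one put followed by one get on a single staging buffer.
    """
    buffered = prefill  # elements staged before the loop starts
    for completed in range(max_iters):
        if put_adds_element:
            buffered += 1  # the in-loop put stages one more element
        if buffered == 0:
            return completed  # get() would block here
        buffered -= 1  # get() consumes one element
    return max_iters


# In-loop put has no effect: the loop stalls once the prefill is used up.
print(iterations_before_block(prefill=4, max_iters=10, put_adds_element=False))
# In-loop put works: the buffer never drains within max_iters.
print(iterations_before_block(prefill=4, max_iters=10, put_adds_element=True))
```

Under these assumptions, a prefill of 4 combined with an ineffective in-loop put gives exactly 4 completed iterations, which matches the 4 lines of output seen before the hang. It does not by itself show why the put would have no effect.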

Source

2017-09-11 Ujjwal

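It works just like a regular queue: calling "get" on an empty staging area or "put" on a full one will hang your session.run. Have you seen this usage example, [tf_cnn_benchmarks.py](https://github.com/tensorflow/benchmarks/blob/master/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py)? Note that it has additional logic to start the queue runners for the dataset.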
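The comparison with a regular queue can be illustrated with Python's standard queue.Queue. This is only an analogy, assuming a StagingArea behaves as the comment above describes; it does not call TensorFlow. The helpers try_get and try_put are hypothetical names, and the timeouts exist only so that the demo returns instead of hanging.

```python
import queue

# Bounded buffer standing in for a staging area with room for 2 elements.
staging = queue.Queue(maxsize=2)


def try_get(q, timeout=0.1):
    """Return ("ok", item), or ("blocked", None) if get() would keep waiting."""
    try:
        return "ok", q.get(timeout=timeout)
    except queue.Empty:
        return "blocked", None


def try_put(q, item, timeout=0.1):
    """Return "ok", or "blocked" if put() would keep waiting."""
    try:
        q.put(item, timeout=timeout)
        return "ok"
    except queue.Full:
        return "blocked"


print(try_get(staging))        # empty buffer: get would block
print(try_put(staging, "a"))   # room available
print(try_put(staging, "b"))   # buffer is now full
print(try_put(staging, "c"))   # full buffer: put would block
print(try_get(staging))        # consumes "a" and frees a slot
```

Without the timeouts, the first get and the third put would wait indefinitely unless another thread consumed or produced an element, which is the kind of hang described for session.run.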

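But that is not what is happening here. The capacity of each staging area is 5. The put() ops are run 4 times at the start, and then one put() op is run inside the loop. Then a get() op is run, followed by another put() op. Also, if you read my question carefully and study the output, you will see my problem. – Ujjwal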

Oh, I did not study the code. This was in reply to your last question, about what happens when the staging area is full.

Take a look at https://github.com/tensorflow/tensorflow/pull/13684. It resolves some deadlocks and may make it into 1.4.0. Disclaimer: I am not a TensorFlower.

Source

2017-10-13 17:49:37 Simon

Unfortunately, this apparently did not make it into the 1.4.1 release. – rerx

It looks like it was added in 1.5.0-rc0. – rerx
