2017-09-11

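Blocking of tf.contrib.StagingArea get() and put() operations

  • TensorFlow release version: 1.3.0-rc2
  • TensorFlow Git version: v1.3.0-rc1-994-gb93fd37
  • Operating system: CentOS Linux release 7.2.1511 (Core)

Problem setting

I am using the TensorFlow StagingArea ops to make my input pipeline more efficient. Here is the part of my code that builds the input pipeline: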

import tensorflow as tf  # TensorFlow 1.x (tf.contrib is required)

train_put_op_list = []
train_get_op_list = []
val_put_op_list = []
val_get_op_list = []
with tf.variable_scope(tf.get_variable_scope()) as vscope:
    for i in range(4):
        with tf.device('/gpu:%d' % i):
            with tf.name_scope('GPU-Tower-%d' % i) as scope:
                # One staging area per tower for training and one for validation
                trainstagingarea = tf.contrib.staging.StagingArea(
                    dtypes=[tf.float32, tf.int32],
                    shapes=[[64, 221, 221, 3], [64]],
                    capacity=0)
                valstagingarea = tf.contrib.staging.StagingArea(
                    dtypes=[tf.float32, tf.int32],
                    shapes=[[128, 221, 221, 3], [128]],
                    capacity=0)
                train_put_op_list.append(trainstagingarea.put(train_iterator.get_next()))
                val_put_op_list.append(valstagingarea.put(val_iterator.get_next()))
                train_get_op_list.append(trainstagingarea.get())
                val_get_op_list.append(valstagingarea.get())
                with tf.device('/cpu:0'):
                    worktype = tf.get_variable("wt", [], initializer=tf.zeros_initializer(), trainable=False)
                workcondition = tf.equal(worktype, 1)
                # elem = tf.cond(workcondition, lambda: train_iterator.get_next(), lambda: val_iterator.get_next())
                elem = tf.cond(workcondition, lambda: train_get_op_list[i], lambda: val_get_op_list[i])
                # This is followed by the network construction and optimizer

At execution time, I first run the put() ops a few times and then go on to run the iterations. It looks like this:

with tf.Session(config=config) as sess:
    sess.run(init_op)
    sess.run(iterator_training_op)
    sess.run(iterator_validation_op)
    sess.run(tf.assign(worktype, 0))
    # Run the put() ops four times before entering the loop
    for i in range(4):
        sess.run(train_put_op_list)
        sess.run(val_put_op_list)
    writer = tf.summary.FileWriter('.', graph=tf.get_default_graph())
    epoch = 0
    iter = 0
    previous = 0
    while epoch < 10:
        try:
            if PROCESSINGTYPE == 'validation':
                sess.run(val_put_op_list)
                [val_accu, summaries, numsamp] = sess.run([running_accuracy, validation_summary_op, processed])
                previous += numsamp
                print("Running Accuracy = {} : Number of sample processed = {} ".format(val_accu, previous))
            else:
                sess.run(train_put_op_list)
                [loss_value, _, train_accu, summaries, batch_accu, numsamp] = sess.run(
                    [total_loss, apply_gradient_op, running_accuracy,
                     training_summary_op, batch_accuracy, processed])
                # Remaining part of the code (not important for question)

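Problem description

Using StagingArea improves the speed substantially (almost 3-4 times). However, the code hangs because something blocks. I am not sure whether the block comes from the get() or the put() operations. Here is the actual output: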

# Validation is done first and the following is the output 
Running Accuracy = 0.0 : Number of sample processed = 512 
Running Accuracy = 0.00390625 : Number of sample processed = 1024 
Running Accuracy = 0.0 : Number of sample processed = 1536 
Running Accuracy = 0.001953125 : Number of sample processed = 2048 
# The code hangs here 

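As you can see, at the beginning of tf.Session() as sess:, the get() and put() ops are run 4 times. The output is also limited to 4 lines. This suggests that sess.run(val_put_op_list) inside the while loop does not do anything. So when get() is invoked through sess.run(running_accuracy)..., the StagingArea is found empty after 4 lines, and the call blocks.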
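
The reasoning above can be checked with simple bookkeeping. The sketch below is plain Python, not TensorFlow: `steps_before_block` is a hypothetical helper that only counts how many batches one staging area holds. It assumes that each successful put() adds one batch, that each get() removes one, and that capacity is never the limiting factor.

```python
def steps_before_block(primed, puts_per_step, gets_per_step, max_steps=100):
    """Return how many loop iterations finish before a get() finds too few batches.

    Hypothetical model of one staging area: a counter only, no TensorFlow.
    Returns max_steps if the loop never starves within that many iterations.
    """
    occupancy = primed
    completed = 0
    for _ in range(max_steps):
        occupancy += puts_per_step      # the put() at the top of the loop body
        if occupancy < gets_per_step:   # a real get() would block here
            break
        occupancy -= gets_per_step      # the get() triggered by the step itself
        completed += 1
    return completed


# Four priming puts, then one put and one get per iteration: never starves.
balanced = steps_before_block(primed=4, puts_per_step=1, gets_per_step=1)

# Four priming puts, but the put inside the loop has no effect: starves on step 5.
starved = steps_before_block(primed=4, puts_per_step=0, gets_per_step=1)

print(balanced, starved)
```

Under this model, four completed steps match the four output lines, but the count only shows that some staging area was drained once per step without being refilled. It cannot tell which one. The same count would result if, for instance, the training staging area were read during validation steps while only val_put_op_list is run. That is a possibility rather than something the post confirms; the tf.cond documentation does note that operations created outside the branch functions are executed regardless of which branch is selected.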

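  • Is my analysis of the problem correct?
  • What is the correct way to use the get() and put() ops here?
  • If the StagingArea is full and put() is blocked, does that also block the whole code? The TensorFlow documentation does not mention anything about this.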

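It behaves like a regular queue: calling 'get' on an empty staging area or 'put' on a full one will hang your session.run. Have you seen the usage example in [tf_cnn_benchmarks.py](https://github.com/tensorflow/benchmarks/blob/master/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py)? Note that it has extra logic to start the queue runners for the dataset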
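
The queue comparison in this comment can be illustrated with Python's standard `queue.Queue`. This is an analogy only: it does not use StagingArea, the capacity of two is chosen for the demo, and the FIFO ordering of `queue.Queue` is not a claim about how StagingArea orders its elements. Timeouts are used so the blocking calls can be observed without hanging the script.

```python
import queue

# Stand-in for a staging area with room for two batches (size chosen for the demo).
area = queue.Queue(maxsize=2)

area.put("batch-0")
area.put("batch-1")

# The buffer is full: a plain put() would wait forever, so use a timeout to observe it.
try:
    area.put("batch-2", timeout=0.1)
    put_blocked = False
except queue.Full:
    put_blocked = True

# Remove both stored batches.
drained = [area.get(), area.get()]

# The buffer is empty: a plain get() would wait forever, so again use a timeout.
try:
    area.get(timeout=0.1)
    get_blocked = False
except queue.Empty:
    get_blocked = True

print(put_blocked, drained, get_blocked)
```

Without the timeouts, both the extra put() and the extra get() would wait indefinitely, which is the kind of hang the comment describes.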


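But that is not what is happening here. The capacity of each staging area is '5'. The 'put()' ops are run 4 times at the start, and then one 'put' op is run inside the loop. Then the 'get()' op is run, followed by another 'put' op. Also, if you read my question carefully and study the output, you will see my problem – Ujjwal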


Oh, I did not study the code. My comment was about your last question, the case where the staging area is full
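
On that last question, the queue analogy suggests that a blocked call stalls only the thread that issued it. The sketch below uses `queue.Queue` and `threading` as stand-ins. It assumes, without confirmation from the post, that a blocked sess.run likewise holds up only its calling thread. The names `area`, `producer` and `worker` are hypothetical.

```python
import queue
import threading

area = queue.Queue(maxsize=1)   # tiny capacity so the producer fills it immediately
produced = []

def producer():
    # After the first item is stored, the next put() blocks until an item is removed.
    for i in range(3):
        area.put(i)
        produced.append(i)

worker = threading.Thread(target=producer, daemon=True)
worker.start()

# Give the producer time to fill the buffer and block on its next put().
worker.join(timeout=0.2)
still_blocked = worker.is_alive()   # the producer is stuck, the main thread is not

# Consuming from the main thread unblocks the producer one item at a time.
consumed = [area.get(timeout=1) for _ in range(3)]
worker.join(timeout=1)

print(still_blocked, consumed, worker.is_alive())
```

In the posted loop every sess.run is issued from the main thread, so a put() or get() that blocks there would hang the whole script, which is consistent with the behaviour described in the question.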

Answers