2017-06-14
0

How can I use TensorFlow's tf.train.string_input_producer to generate multiple epochs of data?

When I wanted to load the data for 2 epochs with tf.train.string_input_producer, I used

filename_queue = tf.train.string_input_producer(filenames=['data.csv'], num_epochs=2, shuffle=True) 

col1_batch, col2_batch, col3_batch = tf.train.shuffle_batch([col1, col2, col3], batch_size=batch_size, capacity=capacity, min_after_dequeue=min_after_dequeue, allow_smaller_final_batch=True) 

But then I found that this op did not produce what I wanted.

It does generate each sample in data.csv exactly twice, but the order of generation is unclear. For example, for 3 lines of data in data.csv

[[1] 
[2] 
[3]] 

it will generate something like (each sample appears exactly twice, but the order is arbitrary)

[1] 
[1] 
[3] 
[2] 
[2] 
[3] 

But what I want is each epoch kept separate, shuffled within each epoch, for example:

[1] 
[3] 
[2] 
[2] 
[1] 
[3] 

Besides, how can I know when one epoch is done? Is there some flag variable? Thanks!

My code is here:

import tensorflow as tf 

def read_my_file_format(filename_queue): 
    reader = tf.TextLineReader() 
    key, value = reader.read(filename_queue) 
    record_defaults = [['1'], ['1'], ['1']] 
    col1, col2, col3 = tf.decode_csv(value, record_defaults=record_defaults, field_delim='-') 
    # col1 = list(map(int, col1.split(','))) 
    # col2 = list(map(int, col2.split(','))) 
    return col1, col2, col3 

def input_pipeline(filenames, batch_size, num_epochs=1): 
    filename_queue = tf.train.string_input_producer(
        filenames, num_epochs=num_epochs, shuffle=True) 
    col1, col2, col3 = read_my_file_format(filename_queue) 

    min_after_dequeue = 10 
    capacity = min_after_dequeue + 3 * batch_size 
    col1_batch, col2_batch, col3_batch = tf.train.shuffle_batch(
        [col1, col2, col3], batch_size=batch_size, capacity=capacity, 
        min_after_dequeue=min_after_dequeue, allow_smaller_final_batch=True) 
    return col1_batch, col2_batch, col3_batch 

filenames=['1.txt'] 
batch_size = 3 
num_epochs = 1 
a1,a2,a3=input_pipeline(filenames, batch_size, num_epochs) 

with tf.Session() as sess: 
    sess.run(tf.local_variables_initializer()) 
    # Start populating the filename queue. 
    coord = tf.train.Coordinator() 
    threads = tf.train.start_queue_runners(coord=coord) 
    try: 
        while not coord.should_stop(): 
            a, b, c = sess.run([a1, a2, a3]) 
            print(a, b, c) 
    except tf.errors.OutOfRangeError: 
        print('Done training, epoch reached') 
    finally: 
        coord.request_stop() 

    coord.join(threads) 

My data looks like:

1,2-3,4-A 
7,8-9,10-B 
12,13-14,15-C 
17,18-19,20-D 
22,23-24,25-E 
27,28-29,30-F 
32,33-34,35-G 
37,38-39,40-H 
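Since `field_delim='-'` is passed to `tf.decode_csv` above, each record splits into three columns at the dashes, not the commas. A minimal plain-Python sketch (not TensorFlow, just an illustration of the split) of how one of the lines above decomposes:

```python
# One record from the data file above; '-' is the field delimiter,
# so the commas stay inside each column.
line = "1,2-3,4-A"
col1, col2, col3 = line.split('-')
print(col1)  # 1,2
print(col2)  # 3,4
print(col3)  # A
```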
+0

Could you add the code that produces the tensors 'col1', 'col2', 'col3'? The way the code is written suggests that you shuffle at the end of the pipeline, so everything will be mixed together. – MZHm

+1

I have added my code and data. @MZHm – danche

+0

You may want to look at this answer to see whether it is a similar problem: https://stackoverflow.com/a/44526962/4282745 – npf

Answers

6

As Nicolas observes, the tf.train.string_input_producer() API does not give you the ability to detect when the end of an epoch is reached; instead, it concatenates all epochs together into one long batch. For this reason, we recently added (in TensorFlow 1.2) the tf.contrib.data API, which makes it possible to express more complicated pipelines, including your use case.

The following code snippet shows how you would write your program using tf.contrib.data:

import tensorflow as tf 

def input_pipeline(filenames, batch_size): 
    # Define a `tf.contrib.data.Dataset` for iterating over one epoch of the data. 
    dataset = (tf.contrib.data.TextLineDataset(filenames) 
       .map(lambda line: tf.decode_csv(
        line, record_defaults=[['1'], ['1'], ['1']], field_delim='-')) 
       .shuffle(buffer_size=10) # Equivalent to min_after_dequeue=10. 
       .batch(batch_size)) 

    # Return an *initializable* iterator over the dataset, which will allow us to 
    # re-initialize it at the beginning of each epoch. 
    return dataset.make_initializable_iterator() 

filenames=['1.txt'] 
batch_size = 3 
num_epochs = 10 
iterator = input_pipeline(filenames, batch_size) 

# `a1`, `a2`, and `a3` represent the next element to be retrieved from the iterator.  
a1, a2, a3 = iterator.get_next() 

with tf.Session() as sess: 
    for _ in range(num_epochs): 
     # Resets the iterator at the beginning of an epoch. 
     sess.run(iterator.initializer) 

     try: 
      while True: 
       a, b, c = sess.run([a1, a2, a3]) 
       print(a, b, c) 
     except tf.errors.OutOfRangeError: 
      # This will be raised when you reach the end of an epoch (i.e. the 
      # iterator has no more elements). 
      pass     

     # Perform any end-of-epoch computation here. 
     print('Done training, epoch reached') 
+1

Why do we use an exception for control flow? (i.e. `except tf.errors.OutOfRangeError`) – MZHm

+2

Exceptions are currently the only mechanism TensorFlow has to signal that a requested value has not been computed. (It is similar to how Python uses the StopIteration exception to indicate the end of an iterator in its own iterator protocol.) It would certainly be possible to wrap this in some library code, and I proposed an approach in [this GitHub comment](https://github.com/tensorflow/tensorflow/issues/7951#issuecomment-303546037). – mrry
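The StopIteration analogy can be seen directly in plain Python: `next()` raises StopIteration when an iterator is exhausted, and `for` loops catch it internally, just as the Dataset loop above catches `tf.errors.OutOfRangeError`:

```python
# Python's own iterator protocol signals exhaustion with an exception.
it = iter([1, 2])
print(next(it))  # 1
print(next(it))  # 2
try:
    next(it)  # No third element: raises StopIteration.
except StopIteration:
    print('iterator exhausted')
```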

+2

Why not simply `while not sess.run(epoch_done): ...`, where `epoch_done` is a variable set by the queue and reset by `iterator.initializer`? – MZHm

2

You may want to look at this answer to a similar question.

The short story is:

 
  • if num_epochs > 1, all the data is enqueued at the same time and shuffled independently of the epochs,

  • so you do not have the ability to monitor which epoch is currently being dequeued.
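The difference between the two behaviors can be sketched in plain Python (an analogy, not the actual TensorFlow queue implementation): the producer shuffles one pool containing every epoch's copies, whereas per-epoch shuffling shuffles each epoch's copies separately.

```python
import random

samples = [1, 2, 3]
num_epochs = 2
random.seed(0)

# Queue-style (num_epochs > 1): one pool with every epoch's copies,
# shuffled together, so copies of a sample can appear back-to-back.
pooled = samples * num_epochs
random.shuffle(pooled)

# Per-epoch (what the questioner wants): shuffle each epoch separately,
# so every block of len(samples) items is a complete epoch.
per_epoch = []
for _ in range(num_epochs):
    epoch = samples[:]
    random.shuffle(epoch)
    per_epoch.extend(epoch)

print(pooled)
print(per_epoch)
```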

What you can do is the first suggestion in the answer cited above: always run with num_epochs == 1 and reinitialize the local queue variables (and obviously not the model variables) at the start of each epoch.

init_queue = tf.variables_initializer(tf.get_collection(tf.GraphKeys.LOCAL_VARIABLES, scope='input_producer')) 
with tf.Session() as sess: 
    sess.run(tf.global_variables_initializer()) 
    sess.run(tf.local_variables_initializer()) 
for e in range(num_epochs): 
    with tf.Session() as sess: 
        sess.run(init_queue)  # Reinitialize the local variables in the input_producer scope. 
        # Start populating the filename queue. 
        coord = tf.train.Coordinator() 
        threads = tf.train.start_queue_runners(coord=coord) 
        try: 
            while not coord.should_stop(): 
                a, b, c = sess.run([a1, a2, a3]) 
                print(a, b, c) 
        except tf.errors.OutOfRangeError: 
            print('Done training, epoch reached') 
        finally: 
            coord.request_stop() 

        coord.join(threads) 
+0

Thanks again. I tried this solution before, but I think it is not elegant enough :P. Maybe it is the most practical way; I think a parameter should be added to solve this. – danche

+0

In this way I need to reinitialize the variables at every epoch, but won't this op cause some other problems for the model? – danche

+0

I agree. In any case, according to this comment: https://github.com/tensorflow/tensorflow/issues/4535#issuecomment-283181862 queues are not how data will be handled in the future. – npf
