2017-06-14
0

How can I use TensorFlow's tf.train.string_input_producer to generate multiple epochs of data?

When I wanted to load the data for 2 epochs with tf.train.string_input_producer, I used

filename_queue = tf.train.string_input_producer(filenames=['data.csv'], num_epochs=2, shuffle=True) 

col1_batch, col2_batch, col3_batch = tf.train.shuffle_batch([col1, col2, col3], batch_size=batch_size, capacity=capacity, min_after_dequeue=min_after_dequeue, allow_smaller_final_batch=True) 

But then I found that this op did not produce what I wanted.

It does generate each sample in data.csv exactly twice, but the order of generation is unclear. For example, for 3 lines of data in data.csv

[[1] 
[2] 
[3]] 

it will generate something like (each sample appears exactly twice, but the order is arbitrary)

[1] 
[1] 
[3] 
[2] 
[2] 
[3] 

But what I want is each epoch kept separate, shuffled within each epoch, for example:

[1] 
[3] 
[2] 
[2] 
[1] 
[3] 

Besides, how can I know when one epoch is done? Is there some flag variable? Thanks!

My code is here:

import tensorflow as tf 

def read_my_file_format(filename_queue): 
    reader = tf.TextLineReader() 
    key, value = reader.read(filename_queue) 
    record_defaults = [['1'], ['1'], ['1']] 
    col1, col2, col3 = tf.decode_csv(value, record_defaults=record_defaults, field_delim='-') 
    # col1 = list(map(int, col1.split(','))) 
    # col2 = list(map(int, col2.split(','))) 
    return col1, col2, col3 

def input_pipeline(filenames, batch_size, num_epochs=1): 
    filename_queue = tf.train.string_input_producer(
        filenames, num_epochs=num_epochs, shuffle=True) 
    col1, col2, col3 = read_my_file_format(filename_queue) 

    min_after_dequeue = 10 
    capacity = min_after_dequeue + 3 * batch_size 
    col1_batch, col2_batch, col3_batch = tf.train.shuffle_batch(
        [col1, col2, col3], batch_size=batch_size, capacity=capacity, 
        min_after_dequeue=min_after_dequeue, allow_smaller_final_batch=True) 
    return col1_batch, col2_batch, col3_batch 

filenames=['1.txt'] 
batch_size = 3 
num_epochs = 1 
a1,a2,a3=input_pipeline(filenames, batch_size, num_epochs) 

with tf.Session() as sess: 
    sess.run(tf.local_variables_initializer()) 
    # Start populating the filename queue. 
    coord = tf.train.Coordinator() 
    threads = tf.train.start_queue_runners(coord=coord) 
    try: 
        while not coord.should_stop(): 
            a, b, c = sess.run([a1, a2, a3]) 
            print(a, b, c) 
    except tf.errors.OutOfRangeError: 
        print('Done training, epoch reached') 
    finally: 
        coord.request_stop() 

    coord.join(threads) 

My data looks like:

1,2-3,4-A 
7,8-9,10-B 
12,13-14,15-C 
17,18-19,20-D 
22,23-24,25-E 
27,28-29,30-F 
32,33-34,35-G 
37,38-39,40-H 
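Since `field_delim='-'` is passed to `tf.decode_csv` above, each record splits into three columns at the dashes, not the commas. A minimal plain-Python sketch (not TensorFlow, just an illustration of the split) of how one of the lines above decomposes:

```python
# One record from the data file above; '-' is the field delimiter,
# so the commas stay inside each column.
line = "1,2-3,4-A"
col1, col2, col3 = line.split('-')
print(col1)  # 1,2
print(col2)  # 3,4
print(col3)  # A
```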
+0

Could you add the code that produces the tensors 'col1', 'col2', 'col3'? The way the code is written suggests that you shuffle at the end of the pipeline, so everything will be mixed together. – MZHm

+1

I have added my code and data. @MZHm – danche

+0

You may want to look at this answer to see whether it is a similar problem: https://stackoverflow.com/a/44526962/4282745 – npf

Answers

6

As Nicolas observes, the tf.train.string_input_producer() API does not give you the ability to detect when the end of an epoch is reached; instead, it concatenates all epochs together into one long batch. For this reason, we recently added (in TensorFlow 1.2) the tf.contrib.data API, which makes it possible to express more complicated pipelines, including your use case.

The following code snippet shows how you would write your program using tf.contrib.data:

import tensorflow as tf 

def input_pipeline(filenames, batch_size): 
    # Define a `tf.contrib.data.Dataset` for iterating over one epoch of the data. 
    dataset = (tf.contrib.data.TextLineDataset(filenames) 
       .map(lambda line: tf.decode_csv(
        line, record_defaults=[['1'], ['1'], ['1']], field_delim='-')) 
       .shuffle(buffer_size=10) # Equivalent to min_after_dequeue=10. 
       .batch(batch_size)) 

    # Return an *initializable* iterator over the dataset, which will allow us to 
    # re-initialize it at the beginning of each epoch. 
    return dataset.make_initializable_iterator() 

filenames=['1.txt'] 
batch_size = 3 
num_epochs = 10 
iterator = input_pipeline(filenames, batch_size) 

# `a1`, `a2`, and `a3` represent the next element to be retrieved from the iterator.  
a1, a2, a3 = iterator.get_next() 

with tf.Session() as sess: 
    for _ in range(num_epochs): 
     # Resets the iterator at the beginning of an epoch. 
     sess.run(iterator.initializer) 

     try: 
      while True: 
       a, b, c = sess.run([a1, a2, a3]) 
       print(a, b, c) 
     except tf.errors.OutOfRangeError: 
      # This will be raised when you reach the end of an epoch (i.e. the 
      # iterator has no more elements). 
      pass     

     # Perform any end-of-epoch computation here. 
     print('Done training, epoch reached') 
+1

Why do we use an exception for control flow? (i.e. `except tf.errors.OutOfRangeError`) – MZHm

+2

Exceptions are currently the only mechanism TensorFlow has to signal that a requested value has not been computed. (It is similar to how Python uses the StopIteration exception to indicate the end of an iterator in its own iterator protocol.) It would certainly be possible to wrap this in some library code, and I proposed an approach in [this GitHub comment](https://github.com/tensorflow/tensorflow/issues/7951#issuecomment-303546037). – mrry
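The StopIteration analogy can be seen directly in plain Python: `next()` raises StopIteration when an iterator is exhausted, and `for` loops catch it internally, just as the Dataset loop above catches `tf.errors.OutOfRangeError`:

```python
# Python's own iterator protocol signals exhaustion with an exception.
it = iter([1, 2])
print(next(it))  # 1
print(next(it))  # 2
try:
    next(it)  # No third element: raises StopIteration.
except StopIteration:
    print('iterator exhausted')
```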

+2

Why not simply `while not sess.run(epoch_done): ...`, where `epoch_done` is a variable set by the queue and reset by `iterator.initializer`? – MZHm

2

You may want to look at this answer to a similar question.

The short story is:

 
  • if num_epochs > 1, all the data is enqueued at the same time and shuffled independently of the epochs,

  • so you do not have the ability to monitor which epoch is currently being dequeued.
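The difference between the two behaviors can be sketched in plain Python (an analogy, not the actual TensorFlow queue implementation): the producer shuffles one pool containing every epoch's copies, whereas per-epoch shuffling shuffles each epoch's copies separately.

```python
import random

samples = [1, 2, 3]
num_epochs = 2
random.seed(0)

# Queue-style (num_epochs > 1): one pool with every epoch's copies,
# shuffled together, so copies of a sample can appear back-to-back.
pooled = samples * num_epochs
random.shuffle(pooled)

# Per-epoch (what the questioner wants): shuffle each epoch separately,
# so every block of len(samples) items is a complete epoch.
per_epoch = []
for _ in range(num_epochs):
    epoch = samples[:]
    random.shuffle(epoch)
    per_epoch.extend(epoch)

print(pooled)
print(per_epoch)
```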

What you can do is the first suggestion in the answer cited above: always run with num_epochs == 1 and reinitialize the local queue variables (and obviously not the model variables) at the start of each epoch.

init_queue = tf.variables_initializer(tf.get_collection(tf.GraphKeys.LOCAL_VARIABLES, scope='input_producer')) 
with tf.Session() as sess: 
    sess.run(tf.global_variables_initializer()) 
    sess.run(tf.local_variables_initializer()) 
for e in range(num_epochs): 
    with tf.Session() as sess: 
        sess.run(init_queue)  # Reinitialize the local variables in the input_producer scope. 
        # Start populating the filename queue. 
        coord = tf.train.Coordinator() 
        threads = tf.train.start_queue_runners(coord=coord) 
        try: 
            while not coord.should_stop(): 
                a, b, c = sess.run([a1, a2, a3]) 
                print(a, b, c) 
        except tf.errors.OutOfRangeError: 
            print('Done training, epoch reached') 
        finally: 
            coord.request_stop() 

        coord.join(threads) 
+0

Thanks again. I tried this solution before, but I think it is not elegant enough :P. Maybe it is the most practical way; I think a parameter should be added to solve this. – danche

+0

In this way I need to reinitialize the variables at every epoch, but won't this op cause some other problems for the model? – danche

+0

I agree. In any case, according to this comment: https://github.com/tensorflow/tensorflow/issues/4535#issuecomment-283181862 queues are not how data will be handled in the future. – npf
