Reading large CSV files and feeding them to TensorFlow

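I want to read my CSV files into Python, split the data into training and test sets (n-fold cross-validation), and then feed it into a deep learning architecture I have already built. However, I have a question after reading the TensorFlow tutorial on reading CSV files, which shows the following: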

import tensorflow as tf

filename_queue = tf.train.string_input_producer(["file0.csv", "file1.csv"])

reader = tf.TextLineReader() 
key, value = reader.read(filename_queue) 

# Default values, in case of empty columns. Also specifies the type of the 
# decoded result. 
record_defaults = [[1], [1], [1], [1], [1]] 
col1, col2, col3, col4, col5 = tf.decode_csv(
    value, record_defaults=record_defaults) 
features = tf.pack([col1, col2, col3, col4]) 

with tf.Session() as sess: 
    # Start populating the filename queue. 
    coord = tf.train.Coordinator() 
    threads = tf.train.start_queue_runners(coord=coord) 

    for i in range(1200):
        # Retrieve a single instance:
        example, label = sess.run([features, col5])

    coord.request_stop() 
    coord.join(threads)

Everything in this code makes sense to me except the part at the end with the for loop.

Question 1: What is the significance of 1200 in the for loop? Is it the number of records in the data?

The next part of the tutorial discusses batching examples, using the following code:

def read_my_file_format(filename_queue): 
    reader = tf.SomeReader() 
    key, record_string = reader.read(filename_queue) 
    example, label = tf.some_decoder(record_string) 
    processed_example = some_processing(example) 
    return processed_example, label 

def input_pipeline(filenames, batch_size, num_epochs=None): 
    filename_queue = tf.train.string_input_producer(
        filenames, num_epochs=num_epochs, shuffle=True)
    example, label = read_my_file_format(filename_queue)
    # min_after_dequeue defines how big a buffer we will randomly sample
    # from -- bigger means better shuffling but slower start up and more
    # memory used.
    # capacity must be larger than min_after_dequeue and the amount larger
    # determines the maximum we will prefetch. Recommendation:
    # min_after_dequeue + (num_threads + a small safety margin) * batch_size
    min_after_dequeue = 10000
    capacity = min_after_dequeue + 3 * batch_size
    example_batch, label_batch = tf.train.shuffle_batch(
        [example, label], batch_size=batch_size, capacity=capacity,
        min_after_dequeue=min_after_dequeue)
    return example_batch, label_batch

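I understand that this is asynchronous code that blocks until it has received everything. When I inspect the values of example and label after the code runs, I see that each one holds the information for only one specific record in the data.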

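Question 2: Is the code under read_my_file_format supposed to be the same as the first code block I posted? And does the input_pipeline function then batch the individual records together into batches of batch_size? If read_my_file_format is the same as the first code block, why doesn't it contain the same for loop (which brings me back to my first question)?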

I would appreciate any clarification, as this is my first time using TensorFlow. Thanks for your help!

2016-06-10 Chandra_Rathnam

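(1) 1200 is arbitrary; we should fix the example to use a named constant there to make that clearer. Thanks for spotting it. :) With the way the CSV reading example is set up, continued reads will go through the two CSV files as many times as you like (the string_input_producer holding the filenames is not given a num_epochs argument, so it defaults to cycling forever). So 1200 is simply the number of records the programmer chose to retrieve in the example.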
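The cycling behaviour can be pictured with a plain-Python analogy. This is a sketch only, not TensorFlow code: the in-memory `FILES` dictionary, the `cycle_records` helper, and the `NUM_RECORDS_TO_READ` constant are all hypothetical names introduced for illustration. It shows that when the source loops over the files forever, the number of reads is chosen by the caller and is unrelated to how many rows the files hold.

```python
import csv
import io
import itertools

# Hypothetical in-memory stand-ins for file0.csv and file1.csv (3 rows each).
FILES = {
    "file0.csv": "1,2,3,4,0\n5,6,7,8,1\n9,10,11,12,0\n",
    "file1.csv": "13,14,15,16,1\n17,18,19,20,0\n21,22,23,24,1\n",
}

# Named constant in place of a bare literal such as 1200.
NUM_RECORDS_TO_READ = 8


def cycle_records(files):
    """Yield (features, label) pairs, looping over the files forever."""
    for name in itertools.cycle(sorted(files)):
        for row in csv.reader(io.StringIO(files[name])):
            values = [int(v) for v in row]
            yield values[:4], values[4]


# Only six distinct rows exist, yet eight reads succeed because the
# source wraps around to the first file again.
records = list(itertools.islice(cycle_records(FILES), NUM_RECORDS_TO_READ))
```

Here the seventh record is the first row of the first file again, which mirrors what happens when the tutorial loop asks for more records than the CSV files contain.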

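If you want to read only as many examples as the files contain, you can either catch the OutOfRangeError that is raised when the input runs out, or read exactly the number of records that exist. There is a new reading op in progress that should make this easier as well, but I don't think it is included in 0.9.

(2) It should set up a very similar set of ops, but it does not actually do the reading. Remember that most of what you write in Python is constructing a graph, which is a sequence of ops that TensorFlow will execute. So the contents of read_my_file_format are pretty much the code that comes before the tf.Session() is created. In the first example, the code inside the for loop is what actually executes the graph to pull examples back into Python. In the second part of the example, you are only setting up the plumbing that reads items into Tensors, and then adding further ops that consume those tensors and do something useful with them. In this case they are placed into a queue to build larger batches, which are themselves probably destined to be consumed by other TensorFlow ops later on.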
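The distinction between setting up the pipeline and executing it can be sketched with Python generators. This is an analogy under stated assumptions rather than TensorFlow code: `CSV_TEXT`, `read_records`, `batch_records`, and the `reads` list are hypothetical names, `StopIteration` stands in for OutOfRangeError, and unlike the default behaviour of tf.train.shuffle_batch this sketch neither shuffles nor drops a short final batch.

```python
import csv
import io

# Hypothetical in-memory CSV: 5 rows, each with 4 features and 1 label.
CSV_TEXT = "1,2,3,4,0\n5,6,7,8,1\n9,10,11,12,0\n13,14,15,16,1\n17,18,19,20,0\n"

reads = []  # records every row that is actually pulled from the "file"


def read_records(text):
    """Describe how to read one record at a time; runs only when iterated."""
    for row in csv.reader(io.StringIO(text)):
        values = [int(v) for v in row]
        reads.append(values)
        yield values[:4], values[4]


def batch_records(records, batch_size):
    """Group single records into batches, keeping a short final batch."""
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


# "Graph construction": wiring the stages together reads nothing yet.
pipeline = batch_records(read_records(CSV_TEXT), batch_size=2)
reads_after_setup = len(reads)

# "Execution": pulling batches drives the reads until the input runs out.
batches = []
while True:
    try:
        batches.append(next(pipeline))
    except StopIteration:  # plays the role of OutOfRangeError in this analogy
        break
```

After the setup line no rows have been read; the loop at the end is the counterpart of calling sess.run repeatedly, and it stops once the single pass over the data is exhausted.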

2016-06-10 20:10:29 dga

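That makes sense! So I have two more questions. Q1: If I have 100 records and my training batch size is 80, can I return 80 records from input_pipeline (using 80 as the batch_size argument, right?) and then keep track of the other 20 records for testing? Basically, do you know a way I can track which 80 I used for training, so that I can test with the rest (after shuffling, of course)? Q2: When I initialize and run the Session in my code (which lives in another file) to train, test, and so on, should I feed in the data after calling input_pipeline? Thanks!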
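One possible way to keep track of which records go to training, sketched outside TensorFlow and not taken from the answer above, is to shuffle the record indices once and split them before any batching happens. The sizes come from the comment; `SEED` and the variable names are assumptions made for illustration.

```python
import random

NUM_RECORDS = 100  # total records, as in the comment above
TRAIN_SIZE = 80    # records used for training, as in the comment above
SEED = 0           # hypothetical; any fixed seed makes the split repeatable

# Shuffle the record indices once, then cut the shuffled list in two.
indices = list(range(NUM_RECORDS))
random.Random(SEED).shuffle(indices)

train_indices = sorted(indices[:TRAIN_SIZE])
test_indices = sorted(indices[TRAIN_SIZE:])
```

Because the split is expressed as two disjoint index lists, the rows could be written to separate training and test CSV files, each read by its own input pipeline. Repeating the split with different cut points would give the folds for n-fold cross-validation.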
