2017-02-09 154 views

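Double batching TensorFlow input data

I am implementing a convnet for token classification of string data. I need to read string data from a TFRecord, shuffle-batch it, perform some processing that expands the data, and then batch it again. Is it possible to chain two shuffle_batch operations?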

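This is what I need to do:

  1. Enqueue the filenames into a file queue.
  2. Put each serialized example into a shuffle_batch.
  3. When I pull each example out of the shuffled batch, I need to replicate it once per token (by sequence length) and concatenate position vectors. This creates multiple examples for each original example from the first batch, so I need to batch again.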
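The three steps above can be pictured without TensorFlow. The following is only a framework-free sketch of the "shuffle, expand, shuffle again, batch" flow; `shuffle_buffer`, `expand`, and `double_batch` are hypothetical names invented for this illustration, and `expand` is a simplified stand-in for the real position-vector preprocessing.

```python
import random


def shuffle_buffer(items, buffer_size, rng):
    """Yield items in a locally shuffled order, similar in spirit to a shuffling queue."""
    buf = []
    for item in items:
        buf.append(item)
        if len(buf) > buffer_size:
            yield buf.pop(rng.randrange(len(buf)))
    # Drain whatever is left once the input is exhausted.
    while buf:
        yield buf.pop(rng.randrange(len(buf)))


def expand(sentence):
    """Hypothetical stand-in for step 3: one (target_index, sentence) row per token."""
    return [(i, sentence) for i in range(len(sentence))]


def double_batch(sentences, batch_size, buffer_size=100, seed=0):
    rng = random.Random(seed)
    # First shuffle: one example (here, one sentence) at a time.
    shuffled = shuffle_buffer(sentences, buffer_size, rng)
    # Expansion: each sentence becomes len(sentence) rows.
    rows = (row for sent in shuffled for row in expand(sent))
    # Second shuffle over the expanded rows, then fixed-size batches.
    batch = []
    for row in shuffle_buffer(rows, buffer_size, rng):
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final partial batch


sentences = [
    ['the', 'quick', 'brown', 'fox'],
    ['then', 'the', 'lazy', 'dog', 'slept', '.'],
]
batches = list(double_batch(sentences, batch_size=4))
```

The point of the sketch is that the second stage never needs to know how many rows a given sentence produces; it only sees a stream of rows that all have the same form.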

Of course, one workaround is to preprocess the data before loading it into TF, but that would take up far more disk space than necessary.

DATA

Here is some sample data. I have two "examples". Each example contains the features of a tokenized sentence and a label for each token:

sentences = [
      ['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog', '.'],
      ['then', 'the', 'lazy', 'dog', 'slept', '.']
      ]
# One label per token; 'lazy' is labelled 'O' and 'dog' is labelled 'ANIMAL', as in the second sentence
sent_labels = [
      ['O', 'O', 'O', 'ANIMAL', 'O', 'O', 'O', 'O', 'ANIMAL', 'O'],
      ['O', 'O', 'O', 'ANIMAL', 'O', 'O']
      ]

Each "example" now has the following features (some reduction for clarity):

features { 
    feature { 
    key: "labels" 
    value { 
     bytes_list { 
     value: "O" 
     value: "O" 
     value: "O" 
     value: "ANIMAL" 
     ... 
     } 
    } 
    } 

    feature { 
    key: "sentence" 
    value { 
     bytes_list { 
     value: "the" 
     value: "quick" 
     value: "brown" 
     value: "fox" 
     ... 
     } 
    } 
    } 
} 

TRANSFORMATION

After batching the sparse data, I receive a sentence as a list of tokens:

['the', 'quick', 'brown', 'fox', ...]

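I need to first PAD the list to a predetermined SEQ_LEN, and then insert position indices into each example, rotating the positions so that the token I want to classify is at position 0 and every other position index is relative to position 0: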

[
['the', 0 , 'quick', 1 , 'brown', 2 , 'fox', 3, 'PAD', 4], # classify 'the'
['the', -1, 'quick', 0 , 'brown', 1 , 'fox', 2, 'PAD', 3], # classify 'quick'
['the', -2, 'quick', -1, 'brown', 0 , 'fox', 1, 'PAD', 2], # classify 'brown'
['the', -3, 'quick', -2, 'brown', -1, 'fox', 0, 'PAD', 1]  # classify 'fox'
]
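As a plain-Python illustration of the transformation just described (not the TensorFlow implementation), the padding and relative-position step could look roughly like this. The helper names `pad` and `replicate_with_positions` are made up for this sketch, and it assumes the sentence is never longer than SEQ_LEN.

```python
def pad(tokens, seq_len, pad_token='PAD'):
    """Right-pad a token list to exactly seq_len entries."""
    if len(tokens) > seq_len:
        raise ValueError('sentence is longer than seq_len')
    return tokens + [pad_token] * (seq_len - len(tokens))


def replicate_with_positions(tokens, seq_len):
    """Return one row per real token.

    Row i interleaves every (padded) token with its offset from token i,
    so the token to classify always carries position 0.
    """
    padded = pad(tokens, seq_len)
    rows = []
    for target in range(len(tokens)):  # no row is produced for PAD targets
        row = []
        for pos, tok in enumerate(padded):
            row.extend([tok, pos - target])
        rows.append(row)
    return rows


rows = replicate_with_positions(['the', 'quick', 'brown', 'fox'], seq_len=5)
for row in rows:
    print(row)
```

Each sentence of length sent_len yields sent_len rows of width SEQ_LEN * 2, which is why the number of rows is unknown ahead of time while the row width is fixed.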

BATCHING AND REBATCHING DATA

Here is a simplified version of what I am trying to do:

# Enqueue the Filenames and serialize 
filenames =[outfilepath] 
fq = tf.train.string_input_producer(filenames, num_epochs=num_epochs, shuffle=True, name='FQ') 
reader = tf.TFRecordReader() 
key, serialized_example = reader.read(fq) 

# Dequeue Examples of batch_size == 1. Because all examples are Sparse Tensors, do 1 at a time 
initial_batch = tf.train.shuffle_batch([serialized_example], batch_size=1, capacity=capacity, min_after_dequeue=min_after_dequeue) 


# Parse Sparse Tensors, make into single dense Tensor 
# ['the', 'quick', 'brown', 'fox'] 
parsed = tf.parse_example(initial_batch, features=feature_mapping) 
dense_tensor_sentence = tf.sparse_tensor_to_dense(parsed['sentence'], default_value='<PAD>') 
sent_len = tf.shape(dense_tensor_sentence)[1] 

SEQ_LEN = 5 
NUM_PADS = SEQ_LEN - sent_len 
#['the', 'quick', 'brown', 'fox', 'PAD'] 
padded_sentence = pad(dense_tensor_sentence, NUM_PADS) 

# make sent_len X SEQ_LEN copy of sentence, position vectors 
#[
# ['the', 0 , 'quick', 1 , 'brown', 2 , 'fox', 3, 'PAD', 4 ],
# ['the', -1, 'quick', 0 , 'brown', 1 , 'fox', 2, 'PAD', 3 ],
# ['the', -2, 'quick', -1, 'brown', 0 , 'fox', 1, 'PAD', 2 ],
# ['the', -3, 'quick', -2, 'brown', -1, 'fox', 0, 'PAD', 1 ]
# NOTE: There is no row where PAD is at position 0, because I don't
# want to classify the PAD token
#]
examples_with_positions = replicate_and_insert_positions(padded_sentence) 

# While my SEQ_LEN will be constant, the sent_len will not. Therefore, 
#I don't know the number of rows, but I can guarantee the number of 
# columns. shape = (?,SEQ_LEN) 

dynamic_input = final_reshape(examples_with_positions) # shape = (?, SEQ_LEN) 

# Try Random Shuffle Queue: 

# Rebatch <-- This is where the problem is 
#reshape_concat.set_shape((None, SEQ_LEN)) 

random_queue = tf.RandomShuffleQueue(10000, 50, [tf.int64], shapes=(SEQ_LEN,)) 
random_queue.enqueue_many(dynamic_input) 
batch = random_queue.dequeue_many(4) 


init_op = tf.group(tf.global_variables_initializer(), tf.local_variables_initializer(), tf.initialize_all_tables()) 

sess = create_session() 
sess.run(init_op) 

#tf.get_default_graph().finalize() 
coord = tf.train.Coordinator() 
threads = tf.train.start_queue_runners(sess=sess, coord=coord) 

try:
    i = 0
    while True:
        print(sess.run(batch))
        i += 1
except tf.errors.OutOfRangeError as e:
    print("No more inputs.")

EDIT

I am now trying to use a RandomShuffleQueue. On each enqueue, I would like to enqueue a batch with shape (None, SEQ_LEN). I have modified the code above to reflect this.

I no longer get complaints about the input shapes, but the queue hangs at sess.run(batch).


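Just trying to understand. When you batch the second time, you want to batch these position matrices across multiple sentences, right? Won't those have different lengths, in which case batching them into a single dense tensor would be impossible?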


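Sorry, I forgot to mention that I PAD each input to a constant SEQ_LEN. I rewrote the code sample, which hopefully clarifies this. I take a sentence, pad it, then tile and reshape it so that each token is concatenated with a position vector. The input to the second batch will have shape = (sent_len, SEQ_LEN). But because I don't know sent_len, I can't use QueueRunners – Neal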


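In that case, is 'enqueue_many' what you want? You would then be batching (sent_len_1 + sent_len_2 + ..., SEQ_LEN). The batch dimension for 'enqueue_many' should not need static shape information (just make sure the remaining dimensions have static shape information).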

ANSWER


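I was approaching the whole problem incorrectly. I mistakenly thought that I had to define the complete shape of the batch when feeding tf.train.shuffle_batch, but in fact I only need to define the shape of each element I am feeding in, and set enqueue_many=True.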

Here is the correct code:

import tensorflow as tf  # TensorFlow 1.x queue-based input pipeline

single_batch = 1
input_batch_size = 64
min_after_dequeue = 10
capacity = min_after_dequeue + 3 * input_batch_size
num_epochs = 2
SEQ_LEN = 10
filenames = [outfilepath]

fq = tf.train.string_input_producer(filenames, num_epochs=num_epochs, shuffle=True) 
reader = tf.TFRecordReader() 
key, serialized_example = reader.read(fq) 

# Dequeue examples of batch_size == 1. Because all examples are Sparse Tensors, do 1 at a time 
first_batch = tf.train.shuffle_batch([serialized_example], single_batch, capacity, min_after_dequeue) 

# Get a single sentence and preprocess it shape=(sent_len) 
single_sentence = tf.parse_example(first_batch, features=feature_mapping) 

# Preprocess Sentence. shape=(sent_len, SEQ_LEN * 2). Each row is example 
processed_inputs = preprocess(single_sentence) 

# Re batch 
input_batch = tf.train.shuffle_batch([processed_inputs], 
       batch_size=input_batch_size, 
       capacity=capacity, min_after_dequeue=min_after_dequeue, 
       shapes=[SEQ_LEN * 2], enqueue_many=True) #<- This is the fix 


init_op = tf.group(tf.global_variables_initializer(), tf.local_variables_initializer(), tf.initialize_all_tables()) 

sess = create_session() 
sess.run(init_op) 

#tf.get_default_graph().finalize() 
coord = tf.train.Coordinator() 
threads = tf.train.start_queue_runners(sess=sess, coord=coord) 

try:
    i = 0
    while True:
        print(i)
        print(sess.run(input_batch))
        i += 1
except tf.errors.OutOfRangeError as e:
    print("No more inputs.")
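To make the role of enqueue_many=True more concrete, here is a toy, TensorFlow-free model of the idea: the queue only fixes the shape of a single element, while each enqueue call may contribute any number of rows. `FixedShapeQueue` is a hypothetical class written for this illustration and is not part of any library; unlike a real TensorFlow queue, it raises instead of blocking when too few elements are available.

```python
from collections import deque


class FixedShapeQueue:
    """Toy queue: every stored element must have length element_len."""

    def __init__(self, element_len):
        self.element_len = element_len
        self.items = deque()

    def enqueue(self, element):
        if len(element) != self.element_len:
            raise ValueError('expected length %d, got %d'
                             % (self.element_len, len(element)))
        self.items.append(element)

    def enqueue_many(self, rows):
        # The number of rows is free; only each row's shape is checked.
        for row in rows:
            self.enqueue(row)

    def dequeue_many(self, n):
        if len(self.items) < n:
            raise IndexError('not enough elements queued')
        return [self.items.popleft() for _ in range(n)]


SEQ_LEN = 3
queue = FixedShapeQueue(element_len=SEQ_LEN * 2)
queue.enqueue_many([[0] * 6 for _ in range(4)])  # a 4-token sentence contributes 4 rows
queue.enqueue_many([[1] * 6 for _ in range(6)])  # a 6-token sentence contributes 6 rows
batch = queue.dequeue_many(8)                    # one batch spans both sentences
```

This mirrors why shapes=[SEQ_LEN * 2] is enough in the call above: the leading (row count) dimension of processed_inputs varies per sentence, but every row has the same known width.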