2017-02-19

I am currently developing an audio classifier with TensorFlow's Python API, using the UrbanSound8K dataset, taking exactly 176400 data points from each file and trying to distinguish between 10 mutually exclusive classes. What causes the logits to end up with this unexpected shape?

I adapted the convolutional neural network example code from: https://www.tensorflow.org/get_started/mnist/pros

Unfortunately, I am getting the following error:

Traceback (most recent call last): 
    ... 
tensorflow.python.framework.errors_impl.InvalidArgumentError: logits and labels must have the same first dimension, got logits shape [7000,10] and labels shape [10] 
    [[Node: xent/SparseSoftmaxCrossEntropyWithLogits/SparseSoftmaxCrossEntropyWithLogits = SparseSoftmaxCrossEntropyWithLogits[T=DT_FLOAT, Tlabels=DT_INT64, _device="/job:localhost/replica:0/task:0/gpu:0"](read/add, _recv_y_0/_9)]] 

During handling of the above exception, another exception occurred: 

Traceback (most recent call last): 
    File "urban-cnn.py", line 124, in <module> 
    sess.run(optimizer, feed_dict={x: batch_x, y: batch_y, keep_prob: .5}) 
    ... 
tensorflow.python.framework.errors_impl.InvalidArgumentError: logits and labels must have the same first dimension, got logits shape [7000,10] and labels shape [10] 
    [[Node: xent/SparseSoftmaxCrossEntropyWithLogits/SparseSoftmaxCrossEntropyWithLogits = SparseSoftmaxCrossEntropyWithLogits[T=DT_FLOAT, Tlabels=DT_INT64, _device="/job:localhost/replica:0/task:0/gpu:0"](read/add, _recv_y_0/_9)]] 

Caused by op 'xent/SparseSoftmaxCrossEntropyWithLogits/SparseSoftmaxCrossEntropyWithLogits', defined at: 
    File "urban-cnn.py", line 102, in <module> 
    xent = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=y_conv), name="xent") 
    ... 

InvalidArgumentError (see above for traceback): logits and labels must have the same first dimension, got logits shape [7000,10] and labels shape [10] 
    [[Node: xent/SparseSoftmaxCrossEntropyWithLogits/SparseSoftmaxCrossEntropyWithLogits = SparseSoftmaxCrossEntropyWithLogits[T=DT_FLOAT, Tlabels=DT_INT64, _device="/job:localhost/replica:0/task:0/gpu:0"](read/add, _recv_y_0/_9)]] 

Here is a slightly modified version of the code:

import tensorflow as tf 
import soundfile as sfx 
import numpy as np 
import math 
import glob 

batch_size = 10 
n_epochs = 10 

input_width = 176400 

n_labels = 10 

widths = [5, 5, 7] 
channels = [1, 8, 64, 512, n_labels] 

learning_rate = 1e-4 

def load_data(): 
    data_x = [] 
    data_y = [] 

    for path in glob.glob("./UrbanSound8K/audio/fold1/*.wav"): 
        name = path.split("/")[-1].split(".")[0] 
        x, sample_rate = sfx.read(path, frames=input_width, fill_value=0.) 
        y = int(name.split("-")[1]) 

        if x.ndim > 1: 
            x = x.take(0, axis=1) 

        data_x.append(x) 
        data_y.append(y) 

    return data_x, data_y 

data_x, data_y = load_data() 
data_split = int(len(data_x) * .9) 

train_x = data_x[:data_split] 
train_y = data_y[:data_split] 

test_x = data_x[data_split:] 
test_y = data_y[data_split:] 

x = tf.placeholder(tf.float32, [None, input_width], name="x") 
y = tf.placeholder(tf.int64, [None], name="y") 

x_reshaped = tf.reshape(x, [-1, 1, input_width, channels[0]], name="x_reshaped") 

def weights_x(shape, name): 
    w = tf.Variable(tf.truncated_normal(shape, stddev=0.1), name=name) 
    tf.summary.histogram("weights", w) 
    return w 

def weights(layer, name): 
    return weights_x([1, widths[layer], channels[layer], channels[layer+1]], name) 

def biases(layer, name): 
    b = tf.Variable(tf.constant(0.1, shape=[channels[layer+1]]), name=name) 
    tf.summary.histogram("biases", b) 
    return b 

def convolution(p, w, b, name): 
    c = tf.nn.relu(tf.nn.conv2d(p, w, strides=[1, 1, 1, 1], padding="SAME") + b, name=name) 
    tf.summary.histogram("convolution", c) 
    return c 

def pooling(c, name): 
    p = tf.nn.max_pool(c, ksize=[1, 1, 6, 1], strides=[1, 1, 6, 1], padding="SAME", name=name) 
    tf.summary.histogram("pooling", p) 
    return p 

with tf.name_scope("conv1"): 
    w1 = weights(0, "w1") 
    b1 = biases(0, "b1") 
    c1 = convolution(x_reshaped, w1, b1, "c1") 
    p1 = pooling(c1, "p1") 

with tf.name_scope("conv2"): 
    w2 = weights(1, "w2") 
    b2 = biases(1, "b2") 
    c2 = convolution(p1, w2, b2, "c2") 
    p2 = pooling(c2, "p2") 

with tf.name_scope("dens"): 
    n_edges = widths[2] * channels[2] 
    wf1 = weights_x([n_edges, channels[3]], "wf1") 
    bf1 = biases(2, "bf1") 
    pf1 = tf.reshape(p2, [-1, n_edges], name="pf1") 
    f1 = tf.nn.relu(tf.matmul(pf1, wf1) + bf1, name="f1") 

with tf.name_scope("drop"): 
    keep_prob = tf.placeholder(tf.float32, name="keep_prob") 
    dropout = tf.nn.dropout(f1, keep_prob) 

with tf.name_scope("read"): 
    wf2 = weights_x([channels[3], channels[4]], "wf2") 
    bf2 = biases(3, "bf2") 
    y_conv = tf.matmul(dropout, wf2) + bf2 

with tf.name_scope("xent"): 
    xent = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=y_conv), name="xent") 
    tf.summary.scalar("xent", xent) 

with tf.name_scope("optimizer"): 
    optimizer = tf.train.AdamOptimizer(learning_rate).minimize(xent) 

with tf.name_scope("accuracy"): 
    correct_prediction = tf.equal(tf.argmax(y_conv, 1), y, name="correct_prediction") 
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32), name="accuracy") 
    tf.summary.scalar("accuracy", accuracy) 

with tf.Session() as sess: 
    sess.run(tf.global_variables_initializer()) 
    print("Initialized Global Variables") 

    for epoch in range(n_epochs): 
        n_itr = len(train_x)//batch_size 

        for itr in range(n_itr): 
            left, right = itr*batch_size, (itr+1)*batch_size 
            batch_x, batch_y = train_x[left:right], train_y[left:right] 

            sess.run(optimizer, feed_dict={x: batch_x, y: batch_y, keep_prob: .5}) 
        print("epoch: ", epoch + 1) 

    print("accuracy: ", sess.run(accuracy, feed_dict={x: test_x, y: test_y, keep_prob: 1.})) 

Everything looks as expected when I inspect the tensor shapes before calling sess.run(...).

So why does logits have shape [7000, n_labels] instead of [batch_size, n_labels]?
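For reference, the shape check I mean is along these lines (a minimal sketch, not part of the script above):

# Print the static shapes TensorFlow infers for the main tensors before
# ever calling sess.run(...); the batch dimension shows up as None.
for tensor in (x_reshaped, p1, p2, pf1, f1, y_conv):
    print(tensor.name, tensor.get_shape().as_list())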


It is not [7000, n_classes] but rather [7000, batch_size] (or [7000, last_channel_size]), since in your code there is **one** class and 10 is the batch size, not n_classes. In general, you seem to be storing your data in a very odd way. – lejlot


@lejlot Sorry, in the code it is actually called n_labels (and it is 10) – Androbin


@lejlot It is not one-hot encoded, if that is what you mean; I am using **sparse_**softmax_cross_entropy_with_logits – Androbin

Answer


Your network is incorrectly structured; the key problem is here:

with tf.name_scope("dens"): 
    n_edges = widths[2] * channels[2] 
    wf1 = weights_x([n_edges, channels[3]], "wf1") 
    bf1 = biases(2, "bf1") 
    pf1 = tf.reshape(p2, [-1, n_edges], name="pf1") 
    f1 = tf.nn.relu(tf.matmul(pf1, wf1) + bf1, name="f1") 

p2 has shape [10, 1, 4900, 64], and n_edges is not 4900 × 64 = 313600 but 448 (far too small a layer!). If you set n_edges = 313600 everything is fine, though whether that is the architecture you had in mind is up to you. It looks like you have merged two incompatible things: you use the shape of the convolution kernel to compute the size of the layer when flattening it. That is not how convolution works, however; the shape of a layer depends on the size of the input, the kernel and the padding. It is therefore generally much larger, and in this case the fully connected layer should actually have over 300k input neurons, not just 448 as in your code. The crucial point is that this fully connected layer operates on the output of the convolution, not on its parameters.
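To see where the 4900 × 64 comes from, here is the shape arithmetic spelled out (a sketch using the hyper-parameters from the question; plain Python, no TensorFlow needed):

# "SAME" convolutions with stride 1 keep the width, and each 1x6 max-pool
# with stride 6 divides it by 6.
width_after_pool1 = 176400 // 6              # 29400
width_after_pool2 = width_after_pool1 // 6   # 4900

# p2 therefore has shape [batch_size, 1, 4900, 64], so flattening one
# example yields
flat_per_example = width_after_pool2 * 64    # 313600

# whereas the question computes n_edges from the kernel shape instead:
n_edges_in_question = 7 * 64                 # 448 (widths[2] * channels[2])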

The 7000 is simply a result of the reshape in pf1: batch_size * (4900 * 64) / n_edges = 10 * 313600 / 448 = 7000.
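This reshape behaviour is easy to reproduce in isolation (a small NumPy sketch, only to illustrate where the 7000 rows come from; it is not part of the model):

import numpy as np

# Dummy array with the same shape as p2 for a batch of 10 examples.
p2_like = np.zeros((10, 1, 4900, 64), dtype=np.float32)

# Reshaping with -1 and the (too small) n_edges = 448 does not fail;
# it silently spreads every example over 700 rows instead.
flattened = p2_like.reshape(-1, 448)
print(flattened.shape)   # (7000, 448) -> logits of shape [7000, 10]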

A more generic fix is to do

p2s = p2.get_shape() 
n_edges = int(p2s[1] * p2s[2] * p2s[3]) 

since at this point all of p2's dimensions (except the 0th, the batch dimension) are known, so they can be read and used to build the rest of the network.
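Applied to the code in the question, the "dens" block would then look roughly like this (a sketch; note that wf1 becomes correspondingly large as a result):

with tf.name_scope("dens"): 
    # Derive the flattened size from p2's static shape rather than from the
    # kernel shape; only the batch (0th) dimension is unknown here.
    p2s = p2.get_shape() 
    n_edges = int(p2s[1] * p2s[2] * p2s[3])   # 1 * 4900 * 64 = 313600 

    wf1 = weights_x([n_edges, channels[3]], "wf1") 
    bf1 = biases(2, "bf1") 
    pf1 = tf.reshape(p2, [-1, n_edges], name="pf1") 
    f1 = tf.nn.relu(tf.matmul(pf1, wf1) + bf1, name="f1") 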


Follow-up: do you have any advice on how to shrink an unnecessarily large CNN? (I am asking because of memory errors...) – Androbin


CNNs usually do not need that much memory (apart from storing the activations). Are you sure it is not caused by the feed-forward layer? It has 313600 * 512 ≈ 160M parameters. If it really is about the CNN part, increase the pooling stride to reduce the data resolution (or reduce the number of kernels; the 512 in particular looks like a lot). – lejlot
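For illustration, acting on that suggestion could look like the sketch below (the 1x12 pool size and the 128 dense units are arbitrary example values, not taken from the thread):

import tensorflow as tf

n_labels = 10

# Pool more aggressively (1x12 instead of 1x6), so the width after the second
# pooling layer drops from 4900 to 1225 and the flattened size from 313600 to
# 78400, and use fewer units in the fully connected layer (128 instead of 512).
def pooling(c, name):
    return tf.nn.max_pool(c, ksize=[1, 1, 12, 1], strides=[1, 1, 12, 1],
                          padding="SAME", name=name)

channels = [1, 8, 64, 128, n_labels]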
