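TensorFlow execution freezes on a small CNN

I am trying to run a simple convolutional neural network in TensorFlow (CPU version) on Windows 10, on a machine with 40 GB of memory. So far, however, I keep running into execution problems: the program hangs either right after the variables are initialized or after a few training iterations. Below is a summary of my code and of what I am trying to achieve.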
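Each image contains a sequence of five letters, and I want to train a CNN to recognize the sequence in each image. To do this, I use two convolutional layers (height/width/channels: 4/4/5 and 4/4/10), each followed by a ReLU layer, then two fully connected ReLU layers, with a cross-entropy loss.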
import numpy as np
import tensorflow as tf

num_image = 5
image_size = (28, 150)
out_channel = 5
shape_conv1 = [4, 4, 1, out_channel]  # height, width, in_channel, out_channel
stride_conv1 = [1, 2, 2, 1]
shape_conv2 = [4, 4, out_channel, out_channel*2]  # height, width, in_channel, out_channel
stride_conv2 = [1, 2, 2, 1]
num_layer1 = 100
num_layer2 = 100
num_output = 10
num_batch = 200
# Shape after the two "SAME" convolutions: each one yields ceil(size / stride)
size_intermediate = [1,
                     np.ceil(np.ceil(image_size[0]/stride_conv1[1])/stride_conv2[1]),
                     np.ceil(np.ceil(image_size[1]/stride_conv1[2])/stride_conv2[2]),
                     out_channel*2]
size_trans = [int(i) for i in size_intermediate]
graph = tf.Graph()  # assumed: the post does not show how graph is created
with graph.as_default():
    input_data = tf.placeholder(tf.float32, [num_batch-num_image+1, image_size[0], image_size[1], 1])
    input_labels = tf.placeholder(tf.float32, [num_image, num_batch-num_image+1, num_output])
    reg_coeff = tf.placeholder(tf.float32)

    # Convolutional layers
    weights_conv1 = tf.Variable(tf.truncated_normal(shape_conv1, 0.0, 0.1))
    bias_relu1 = tf.Variable(tf.zeros([out_channel]))
    weights_conv2 = tf.Variable(tf.truncated_normal(shape_conv2, 0.0, 0.1))
    bias_relu2 = tf.Variable(tf.zeros([out_channel*2]))

    # Fully connected layers, one set of weights per letter position.
    # Note: the dense biases are built with tf.zeros, not tf.Variable,
    # so they are constants and are never trained.
    weights_layer1 = tf.Variable(tf.truncated_normal(
        [num_image, size_trans[1]*size_trans[2]*size_trans[3], num_layer1],
        0.0, (num_layer1)**-0.5))
    bias_layer1 = tf.zeros([num_image, 1, num_layer1])
    weights_layer2 = tf.Variable(tf.truncated_normal(
        [num_image, num_layer1, num_layer2], 0.0, (num_layer2)**-0.5))
    bias_layer2 = tf.zeros([num_image, 1, num_layer2])
    weights_output = tf.Variable(tf.truncated_normal(
        [num_image, num_layer2, num_output], 0.0, num_output**-0.5))
    bias_output = tf.zeros([num_image, 1, num_output])

    # Forward pass through the convolutional layers
    output_conv1 = tf.nn.conv2d(input_data, weights_conv1, stride_conv1, "SAME")
    output_relu1 = tf.nn.relu(output_conv1 + bias_relu1)
    output_conv2 = tf.nn.conv2d(output_relu1, weights_conv2, stride_conv2, "SAME")
    output_relu2 = tf.nn.relu(output_conv2 + bias_relu2)

    # Flatten each example to one feature vector
    shape_inter = output_relu2.get_shape().as_list()
    input_inter = tf.reshape(output_relu2, [1, shape_inter[0], shape_inter[1]*shape_inter[2]*shape_inter[3]])
    ## One copy for each letter recognizer
    input_mid = tf.tile(input_inter, [num_image, 1, 1])
    input_layer1 = tf.matmul(input_mid, weights_layer1) + bias_layer1
    output_layer1 = tf.nn.relu(input_layer1)
    input_layer2 = tf.matmul(output_layer1, weights_layer2) + bias_layer2
    output_layer2 = tf.nn.relu(input_layer2)
    logits = tf.matmul(output_layer2, weights_output) + bias_output

    # Training prediction
    train_prediction = tf.nn.softmax(logits)
    # Keyword arguments keep the call valid on releases that reject positional logits/labels
    loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=input_labels, logits=logits))
    # Loss term for regularization
    loss_reg = reg_coeff*(tf.nn.l2_loss(weights_layer1)+tf.nn.l2_loss(bias_layer1)
                          +tf.nn.l2_loss(weights_layer2)+tf.nn.l2_loss(bias_layer2)
                          +tf.nn.l2_loss(weights_output)+tf.nn.l2_loss(bias_output))
    learning_rate = 0.1
    optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss+loss_reg)
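The CNN is quite simple and quite small, so I was surprised to see it freeze right after initializing all the variables or, at best, after a few training steps. There is no output at all, and Ctrl+C does not interrupt execution. I wonder whether this has anything to do with the Windows build of TensorFlow, but at this point I have no leads to follow.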
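To put a number on "quite small", the sketch below recomputes the flattened feature size and the trainable parameter count from the constants in the listing above. It is plain Python and does not touch TensorFlow; the helper name `same_out` is made up for this illustration, and it assumes the usual rule that a "SAME"-padded convolution produces ceil(size / stride) outputs along each spatial dimension.

```python
import math

def same_out(size, stride):
    """Output length of a 'SAME'-padded convolution along one dimension."""
    return math.ceil(size / stride)

image_size = (28, 150)
strides = (2, 2)  # spatial stride of conv1, then conv2
out_channel = 5
num_image, num_layer1, num_layer2, num_output = 5, 100, 100, 10

# Spatial size after both convolutions
h, w = image_size
for s in strides:
    h, w = same_out(h, s), same_out(w, s)
flat = h * w * out_channel * 2  # features fed to the first dense layer

# Only tensors wrapped in tf.Variable are counted; the dense biases are constants
param_counts = {
    "weights_conv1": 4 * 4 * 1 * out_channel,
    "bias_relu1": out_channel,
    "weights_conv2": 4 * 4 * out_channel * out_channel * 2,
    "bias_relu2": out_channel * 2,
    "weights_layer1": num_image * flat * num_layer1,
    "weights_layer2": num_image * num_layer1 * num_layer2,
    "weights_output": num_image * num_layer2 * num_output,
}
total = sum(param_counts.values())
print((h, w), flat, total)  # (7, 38) 2660 1385895
```

At four bytes per float32 value, that is roughly 5.5 MB of parameters, nearly all of it in `weights_layer1`, so memory pressure on a 40 GB machine looks like an unlikely explanation for the hang.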
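Can anyone share ideas or suggestions about what might be causing this? Thanks!
Edit: As pointed out in the comments, the problem may lie in how I feed data to the model, so I am also posting that part of the code below.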
num_steps = 20000
# Random stand-in data: pixel values in [0, 255] and one-hot labels repeated for the 5 letters
fixed_input = np.random.randint(0, 256, [num_batch-num_image+1, 28, 150, 1])
fixed_label = np.tile((np.random.choice(10, [num_batch-num_image+1, 1])==np.arange(10)).astype(np.float32), (5, 1, 1))
with tf.Session(graph=graph) as session:
    tf.global_variables_initializer().run()
    print("Initialized")
    loss1 = 0.0
    loss2 = 0.0
    for i in range(1, num_steps+1):
        feed_dict = {input_data : fixed_input, input_labels : fixed_label, reg_coeff : 2e-4}
        _, l1, l2, predictions = session.run([optimizer, loss, loss_reg, train_prediction], feed_dict=feed_dict)
        loss1 += l1
        loss2 += l2
        if i % 500 == 0:
            print("Batch/reg loss at step %d: %f, %f" % (i, loss1/500, loss2/500))
            loss1 = 0.0
            loss2 = 0.0
            # accuracy() is a helper that is not shown in this post
            print("Minibatch accuracy: %.1f%%" % accuracy(predictions, fixed_label))
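I am only using random inputs and labels here to test whether the code runs. Unfortunately, execution again freezes within the first few training steps.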
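The model itself looks fine. There are two possibilities: (a) there is a problem in the code that feeds input to the graph and runs it, which you have not shown, or (b) there is a bug in TensorFlow on Windows. One way to narrow this down is to run the model with random or constant input, without any input-reading code. Does it still hang after a few steps? If so, you should file a GitHub issue. If not, the problem is in the input-reading code; can you show it? Hope that helps! –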
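@PeterHawkins Thanks for the suggestion. I have posted the code I use for training. Unfortunately, training still hangs even with constant data. Unless I am missing something basic, it seems I may indeed have run into a problem in TensorFlow... – bagend2001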
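This sounds like a Windows-specific bug. Your best option is to file an issue on the TensorFlow GitHub with a minimal, self-contained reproduction. (The smaller and simpler your reproduction code, the more likely the problem is to get resolved.) –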
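Since the process ignores Ctrl+C, a report may be easier to act on if it shows where the script is stuck. A minimal sketch, assuming Python 3 and using only the standard library: `faulthandler.dump_traceback_later` arms a watchdog that prints the stack of every thread after a timeout and can then exit the process. Here a `time.sleep` call stands in for the hanging `session.run` loop, and `run_watched` and `CHILD` are names invented for this illustration. The dump contains Python frames only, so a hang inside native TensorFlow code would show up as the Python line that called into it.

```python
import subprocess
import sys

# Stand-in for the training script: arm the watchdog, then block.
CHILD = """
import faulthandler, time
faulthandler.dump_traceback_later(1, exit=True)  # after 1 s: dump all thread stacks, then exit
time.sleep(30)                                   # placeholder for the call that hangs
"""

def run_watched(code, timeout_s):
    """Run Python source in a child process and capture its stderr as text."""
    return subprocess.run([sys.executable, "-c", code],
                          stderr=subprocess.PIPE, universal_newlines=True,
                          timeout=timeout_s)  # outer timeout as a second safety net

result = run_watched(CHILD, timeout_s=20)
print(result.returncode)  # non-zero, because the watchdog ended the process
print(result.stderr)      # stack dump pointing at the blocked line
```

In the real training script, only the two `faulthandler` lines would be needed near the top, with a timeout long enough that a healthy run finishes first, or with `faulthandler.cancel_dump_traceback_later()` called once training completes.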