2015-11-27 38 views
1

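Tensorflow multi-class ML model issues

I have been trying to get TensorFlow to work on a multi-class Kaggle problem. The data consists of 6 features, all of which I have converted to numeric observations. The goal is to use these 6 features to predict the trip type, of which there are 38. The code below is what I have so far, including the part I use to format the csv file. The code runs, but the accuracy it prints starts out at a plausible value on the first run and then drops to the same poor value on every remaining run. Here is an example of the output: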

Run 0,0.268728911877 
Run 1,0.0108088823035 
Run 2,0.0108088823035 
Run 3,0.0108088823035 
Run 4,0.0108088823035 
Run 5,0.0108088823035 
Run 6,0.0108088823035 
Run 7,0.0108088823035 
Run 8,0.0108088823035 
Run 9,0.0108088823035 
Run 10,0.0108088823035 
Run 11,0.0108088823035 
Run 12,0.0108088823035 
Run 13,0.0108088823035 
Run 14,0.0108088823035 

The code:

import tensorflow as tf 
import numpy as np 
from numpy import genfromtxt 
import sklearn
import pandas as pd
from sklearn.cross_validation import train_test_split
# function buildWalmartData takes in a csv file, converts to numpy array, splits into training
# and testing, then saves the file to specified target directory
def buildWalmartData(): 
    df = pd.read_csv('/Users/analyticsmachine/Desktop/Kaggle/WallMart_Kaggle/Data/full_train_complete.csv') 
    df = df.drop('Unnamed: 0', 1) # 1 specifies axis to remove 
    df_data = np.array(df.drop('TripType', 1).values) # convert to numpy array 
    df_label = np.array(df['TripType'].values) # convert to numpy array 
    X_train, X_test, y_train, y_test = train_test_split(df_data, df_label, test_size=0.25, random_state=50) 
    f = open('/Users/analyticsmachine/Desktop/Kaggle/WallMart_Kaggle/Data/wm-training.csv', 'w') 
    for i,j in enumerate(X_train):
        k = np.append(np.array(y_train[i]), j)
        f.write(','.join([str(s) for s in k]) + '\n')
    f.close()
    f = open('/Users/analyticsmachine/Desktop/Kaggle/WallMart_Kaggle/Data/wm-testing.csv', 'w')
    for i,j in enumerate(X_test):
        k = np.append(np.array(y_test[i]), j)
        f.write(','.join([str(s) for s in k]) + '\n')
    f.close() 
buildWalmartData() 
# function convertOnehot takes in data and converts to tensorflow oneHot 
# The corresponding labels in Walmart TripType are numbers between 1 and 38, describing
# which trip is taken. We have already converted the labels to a one-hot vector, which is a 
# vector that is 0 in most dimensions, and 1 in a single dimension. In this case, the nth triptype 
# will be represented as a vector which is 1 in the nth dimensions. 
def convertOneHot(data): 
    y = np.array([int(i[0]) for i in data]) 
    y_onehot = [0]*len(y) 
    for i,j in enumerate(y):
        y_onehot[i] = [0]*(y.max()+1)
        y_onehot[i][j] = 1
    return (y, y_onehot) 

# import training data 
data = genfromtxt('/Users/analyticsmachine/Desktop/Kaggle/WallMart_Kaggle/Data/wm-training.csv', delimiter=',') 

# import testing data 
test_data = genfromtxt('/Users/analyticsmachine/Desktop/Kaggle/WallMart_Kaggle/Data/wm-testing.csv', delimiter=',') 

x_train = np.array([i[1::] for i in data]) 

# example output for x_train: 
#array([[ 7.06940000e+04, 5.00000000e+00, 7.91005185e+09, 
#   1.00000000e+00, 8.00000000e+00, 2.15000000e+02], 
#  [ 1.54653000e+05, 4.00000000e+00, 5.20001225e+09, 
#   1.00000000e+00, 5.00000000e+00, 4.60700000e+03], 
#  [ 1.86178000e+05, 3.00000000e+00, 4.32136106e+09, 
#   -1.00000000e+00, 5.00000000e+01, 1.90000000e+03], 

y_train, y_train_onehot = convertOneHot(data) 

x_test = np.array([ i[1::] for i in test_data]) 
y_test, y_test_onehot = convertOneHot(test_data) 
# example y_test output
#array([ 5, 32, 24, ..., 31, 28, 5]) 

# and example y_test_onehot: 
#[0,... 
# 0, 
# 0, 
# 0, 
# 0, 
# 0, 
# 0, 
# 1, 
# 0, 
# 0, 
# 0, 
# 0, 
# 0] 


# A is the number of features, 6 in the Walmart data
# B=38, which is the number of trip types 
A = data.shape[1]-1 
B = len(y_train_onehot[0]) 
tf_in = tf.placeholder('float', [None, A]) # features 
tf_weight = tf.Variable(tf.zeros([A,B])) 
tf_bias = tf.Variable(tf.zeros([B])) 
tf_softmax = tf.nn.softmax(tf.matmul(tf_in, tf_weight) + tf_bias) 

# training via backpropagation
tf_softmax_correct = tf.placeholder('float', [None, B]) 
tf_cross_entropy = - tf.reduce_sum(tf_softmax_correct*tf.log(tf_softmax)) 

# training using tf.train.GradientDescentOptimizer 
tf_train_step = tf.train.GradientDescentOptimizer(0.01).minimize(tf_cross_entropy) 

# add accuracy nodes 
tf_correct_prediction = tf.equal(tf.argmax(tf_softmax,1),  tf.argmax(tf_softmax_correct, 1)) 
tf_accuracy = tf.reduce_mean(tf.cast(tf_correct_prediction, 'float')) 


# initialize and run 
init = tf.initialize_all_variables() 
sess = tf.Session() 
sess.run(init) 


# running the training 
for i in range(20): 
    sess.run(tf_train_step, feed_dict={tf_in: x_train, tf_softmax_correct: y_train_onehot}) 
    # print accuracy 
    result = sess.run(tf_accuracy, feed_dict={tf_in: x_test, tf_softmax_correct: y_test_onehot}) 
    print("run {},{}".format(i, result))

Any ideas about what might be going wrong here, and why the runs degrade like this, would be greatly appreciated. Thanks!
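One possible explanation, offered as an assumption since nothing in this thread confirms it: the raw features span very different scales (the sample `x_train` rows contain values near 1e9), and the loss applies `tf.log` to a softmax output, which can evaluate log(0) once a probability underflows. The following is a minimal NumPy sketch, not the asker's code, of two common precautions: standardizing each feature column, and computing the cross entropy through a max-shifted log-softmax averaged over the batch. The data is synthetic, and `one_hot`, `log_softmax`, and `train_softmax_regression` are hypothetical helper names.

```python
import numpy as np

def one_hot(labels, num_classes):
    """One-hot encode integer labels with a fixed width, so that train and
    test matrices always have the same number of columns."""
    encoded = np.zeros((len(labels), num_classes))
    encoded[np.arange(len(labels)), labels] = 1.0
    return encoded

def log_softmax(logits):
    """Row-wise log-softmax, shifted by the row maximum so exp() cannot overflow."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def train_softmax_regression(x, y_onehot, learning_rate=0.5, steps=200):
    """Full-batch gradient descent on the mean cross entropy."""
    n_samples, n_features = x.shape
    weights = np.zeros((n_features, y_onehot.shape[1]))
    bias = np.zeros(y_onehot.shape[1])
    losses = []
    for _ in range(steps):
        log_probs = log_softmax(x.dot(weights) + bias)
        losses.append(-(y_onehot * log_probs).sum() / n_samples)
        # Gradient of the mean cross entropy with respect to the logits.
        grad_logits = (np.exp(log_probs) - y_onehot) / n_samples
        weights -= learning_rate * x.T.dot(grad_logits)
        bias -= learning_rate * grad_logits.sum(axis=0)
    return weights, bias, losses

# Synthetic stand-in data (hypothetical): 3 classes, 2 features on very
# different scales, loosely imitating the raw columns in the question.
rng = np.random.RandomState(0)
labels = rng.randint(0, 3, size=300)
centers = np.array([[1e5, 1.0], [2e5, 5.0], [3e5, 9.0]])
x_raw = centers[labels] + rng.randn(300, 2) * np.array([2e4, 1.0])

# Standardize each column to zero mean and unit variance. For held-out data,
# reuse the mean and std computed here rather than recomputing them.
mean, std = x_raw.mean(axis=0), x_raw.std(axis=0)
x_scaled = (x_raw - mean) / std

weights, bias, losses = train_softmax_regression(x_scaled, one_hot(labels, 3))
predictions = (x_scaled.dot(weights) + bias).argmax(axis=1)
accuracy = (predictions == labels).mean()
print("final loss {:.4f}, accuracy {:.3f}".format(losses[-1], accuracy))
```

Passing the class count to `one_hot` explicitly, instead of deriving it from `y.max()+1` on each data set as `convertOneHot` does, keeps the training and test label matrices the same width even when the largest class is missing from one split.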

+0

This question seems really broad; I would be surprised if anyone is able to help you. – Ross

+0

See whether colah's answer and mine at http://stackoverflow.com/questions/33641799/why-does-tensorflow-example-fail-when-increasing-batch-size help you. – dga
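The linked question concerns an example that fails as the batch size grows. One effect that may be relevant here (an assumption; this thread does not confirm it) is that the question's loss uses `tf.reduce_sum` over the whole training set, so the gradient magnitude grows with the number of rows fed in, while the learning rate stays at 0.01. A small NumPy sketch with random stand-in data illustrates the scaling; `softmax` and `weight_gradient` are hypothetical helper names.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def weight_gradient(x, y_onehot, weights, reduction):
    """Gradient of the softmax cross entropy with respect to the weights."""
    grad_logits = softmax(x.dot(weights)) - y_onehot
    grad = x.T.dot(grad_logits)          # gradient of the summed loss
    if reduction == "mean":
        grad = grad / x.shape[0]         # gradient of the averaged loss
    return grad

# Random stand-in data (hypothetical): 100 rows, 6 features, 4 classes.
rng = np.random.RandomState(1)
x_small = rng.randn(100, 6)
y_small = np.eye(4)[rng.randint(0, 4, size=100)]
weights = np.zeros((6, 4))

# Repeating every row ten times changes the batch size but not the data distribution.
x_large = np.tile(x_small, (10, 1))
y_large = np.tile(y_small, (10, 1))

for reduction in ("sum", "mean"):
    small = np.linalg.norm(weight_gradient(x_small, y_small, weights, reduction))
    large = np.linalg.norm(weight_gradient(x_large, y_large, weights, reduction))
    print("{:>4}: gradient norm ratio large/small = {:.1f}".format(reduction, large / small))
```

With the summed loss, a batch ten times larger gives a gradient ten times larger, so a fixed learning rate takes a step ten times as long. Averaging the loss (for example with `tf.reduce_mean`) or lowering the learning rate removes that dependence on batch size.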

Answers

1

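If you just want to get something up and running quickly for a Kaggle competition, I suggest you first try the examples in TFLearn. There are examples for embedding_ops, one-hot encoding, early stopping, custom decay and, more importantly, the multi-class classification/regression you are running into. Once you are more familiar with TensorFlow, you can easily plug in TensorFlow code to build the custom model you want (there are examples of this as well).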

+1

Although this does not answer his question; it should have been a comment. – Eliethesaiyan