
So far I have been trying to implement fit_generator for sentiment analysis, because I only have a small GPU and a large dataset. However, I keep getting this error: Keras fit_generator() - input arrays should have the same number of samples as target arrays.

Using Theano backend. 
Can not use cuDNN on context None: cannot compile with cuDNN. We got this error: 
b'In file included from C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v8.0\\include/driver_types.h:53:0,\r\n     from C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v8.0\\include/cudnn.h:63,\r\n     from C:\\Users\\Def\\AppData\\Local\\Temp\\try_flags_p2iwer2o.c:4:\r\nC:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v8.0\\include/host_defines.h:84:0: warning: "__cdecl" redefined\r\n #define __cdecl\r\n ^\r\n<built-in>: note: this is the location of the previous definition\r\nd000029.o:(.idata$5+0x0): multiple definition of `__imp___C_specific_handler\'\r\nd000026.o:(.idata$5+0x0): first defined here\r\nC:/Users/Def/Anaconda3/envs/Final/Library/mingw-w64/bin/../lib/gcc/x86_64-w64-mingw32/5.3.0/../../../../x86_64-w64-mingw32/lib/../lib/crt2.o: In function `__tmainCRTStartup\':\r\nC:/repo/mingw-w64-crt-git/src/mingw-w64/mingw-w64-crt/crt/crtexe.c:285: undefined reference to `_set_invalid_parameter_handler\'\r\ncollect2.exe: error: ld returned 1 exit status\r\n' 
Mapped name None to device cuda: GeForce GTX 960M (0000:01:00.0) 
Epoch 1/10 
Traceback (most recent call last): 
    File "C:/Users/Def/PycharmProjects/KerasUkExpenditure/TweetParsing.py", line 136, in <module> 
    epochs=10) 
    File "C:\Users\Def\Anaconda3\envs\Final\lib\site-packages\keras\legacy\interfaces.py", line 88, in wrapper 
    return func(*args, **kwargs) 
    File "C:\Users\Def\Anaconda3\envs\Final\lib\site-packages\keras\models.py", line 1097, in fit_generator 
    initial_epoch=initial_epoch) 
    File "C:\Users\Def\Anaconda3\envs\Final\lib\site-packages\keras\legacy\interfaces.py", line 88, in wrapper 
    return func(*args, **kwargs) 
    File "C:\Users\Def\Anaconda3\envs\Final\lib\site-packages\keras\engine\training.py", line 1876, in fit_generator 
    class_weight=class_weight) 
    File "C:\Users\Def\Anaconda3\envs\Final\lib\site-packages\keras\engine\training.py", line 1614, in train_on_batch 
    check_batch_axis=True) 
    File "C:\Users\Def\Anaconda3\envs\Final\lib\site-packages\keras\engine\training.py", line 1307, in _standardize_user_data 
    _check_array_lengths(x, y, sample_weights) 
    File "C:\Users\Def\Anaconda3\envs\Final\lib\site-packages\keras\engine\training.py", line 229, in _check_array_lengths 
    'and ' + str(list(set_y)[0]) + ' target samples.') 
ValueError: Input arrays should have the same number of samples as target arrays. Found 1000 input samples and 1 target samples. 

I have a matrix that is 1000 elements long, since I only have a maximum corpus of 1000 words, which is specified in the Tokenizer().

Then I have the sentiment, which is 0 for negative or 1 for positive.

My question is: why am I getting this error? I have tried transposing both the data and the labels, but I still receive the same error. Here is my code.

from keras.models import Sequential 
from keras.layers import Dense, Dropout 
from keras.preprocessing.text import Tokenizer 
import numpy as np 
import pandas as pd 
import pickle 
import matplotlib.pyplot as plt 
import re 

""" 
the amount of samples out to the 1 million to use, my 960m 2GB can only handle 
about 30,000ish at the moment depending on a number of neurons in the 
deep layer and a number of layers. 
""" 
maxSamples = 3000 

#Load the CSV and get the correct columns 
data = pd.read_csv("C:\\Users\\Def\\Desktop\\Sentiment Analysis Dataset1.csv") 
dx = pd.DataFrame() 
dy = pd.DataFrame() 
dy[['Sentiment']] = data[['Sentiment']] 
dx[['SentimentText']] = data[['SentimentText']] 

dataY = dy.iloc[0:maxSamples] 
dataX = dx.iloc[0:maxSamples] 

testY = dy.iloc[maxSamples: maxSamples + 1000] 
testX = dx.iloc[maxSamples: maxSamples + 1000] 


""" 
here I filter the data and clean it up by removing @ tags, hyperlinks and 
also any characters that are not alpha-numeric. 
""" 
def removeTagsAndLinks(dataframe): 
    for x in dataframe.iterrows(): 
     #Removes Hyperlinks 
     x[1].values[0] = re.sub("(http|ftp|https)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\[email protected]?^=%&/~+#-])?", "", str(x[1].values[0])) 
     #Removes @ tags 
     x[1].values[0] = re.sub("@\\w+", '', str(x[1].values[0])) 
     #keeps only alpha-numeric chars 
     x[1].values[0] = re.sub("\W+", ' ', str(x[1].values[0])) 
    return dataframe 

xData = removeTagsAndLinks(dataX) 
xTest = removeTagsAndLinks(testX) 

""" 
This loop looks for any Tweets with characters shorter than 2 and once found write the 
index of that Tweet to an array so I can remove from the Dataframe of sentiment and the 
list of Tweets later 
""" 
indexOfBlankStrings = [] 
for index, string in enumerate(xData): 
    if len(string) < 2: 
     indexOfBlankStrings.append(index) 

for row in indexOfBlankStrings: 
    dataY.drop(row, axis=0, inplace=True) 

""" 
This makes a BOW model out of all the tweets then creates a 
vector for each of the tweets containing all the words from 
the BOW model, each vector is the same size becuase the 
network expects it 
""" 
def vectorise(tokenizer, list): 
    return tokenizer.fit_on_texts(list) 

#Make BOW model and vectorise it 
t = Tokenizer(lower=False, num_words=1000) 
t.fit_on_texts(dataX.iloc[:,0].tolist()) 
t.fit_on_texts(dataX.iloc[:,0].tolist()) 

""" 
Here im experimenting with multiple layers of the total 
amount of words in the syllabus divided by ^2 - This 
has given me quite accurate results compared to random guess's 
of amount of neron's. 
""" 
l1 = int(xData.shape[0]/4) #Too big for my GPU 
l2 = int(xData.shape[0]/8) #Too big for my GPU 
l3 = int(xData.shape[0]/16) 
l4 = int(xData.shape[0]/32) 
l5 = int(xData.shape[0]/64) 
l6 = int(xData.shape[0]/128) 


#Make the model 
model = Sequential() 
model.add(Dense(l1, input_dim=xData.shape[1])) 
model.add(Dropout(0.15)) 
model.add(Dense(l2)) 
model.add(Dropout(0.2)) 
model.add(Dense(l3)) 
model.add(Dropout(0.2)) 
model.add(Dense(l4)) 
model.add(Dense(1, activation='relu')) 

#Compile the model 
model.compile(optimizer='RMSProp', loss='binary_crossentropy', metrics=['acc']) 

""" 
This here will use multiple batches to train the model. 
    startIndex: 
     This is the starting index of the array for which you want to 
     start training the network from. 
    dataRange: 
     The number of elements use to train the network in each batch so 
     since dataRange = 1000 this mean it goes from 
     startIndex...dataRange OR 0...1000 
    amountOfEpochs: 
     This is kinda self explanitory, the more Epochs the more it 
     is supposed to learn AKA updates the optimisation algo numbers 
""" 
amountOfEpochs = 1 
dataRange = 1000 
startIndex = 0 

def generator(tokenizer, data, labels, totalSize=maxSamples, startIndex=0): 
    l = labels.as_matrix() 
    while True: 
     for i in range(startIndex, totalSize): 
      batch_features = tokenizer.texts_to_matrix(xData.iloc[i]) 
      batch_labels = l[i] 
      yield batch_features, batch_labels 

derp = generator(t, data=xData, labels=dataY) 
##This runs the model batch by batch, AKA load a little, process it, then load a little more 
for amountOfData in range(1000, maxSamples, 1000): 
    #(loss, acc) = model.train_on_batch(x=dim[startIndex:amountOfData], y=np.asarray(dataY.iloc[startIndex:amountOfData])) 
    history = model.fit_generator(generator=generator(tokenizer=t, 
              data=xData, 
              labels=dataY), 
              steps_per_epoch=1, 
              epochs=10) 

Thanks

The problem is that you have 1000 samples in your X input matrix and 1 in your output Y matrix – DJK

But the 1 in the Y matrix is the sentiment. Each Tweet should only have a 1 or a 0 – Definity

Answer


The problem you are running into is that the number of samples in your input array is not equal to the number of samples in your target array. In other words, the number of rows in the two matrices does not match. The problem comes from your generator function. You index your data as

batch_labels = l[i] 

which returns only one sample (a single matrix row), when it should be something like ...

batch_labels = l[i:i+1000] 
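
Putting that together, a generator along these lines would keep the feature and label batches aligned and could be passed to a single fit_generator call. This is only a rough sketch of the idea, not a drop-in fix: the batch_generator name, the batch_size argument and the steps_per_epoch value are mine, and your model's input_dim would still need to match the 1000-column matrix that texts_to_matrix produces.

def batch_generator(tokenizer, texts, labels, batch_size=1000): 
    y = labels.values                       # (n, 1) array of 0/1 sentiments 
    n = len(texts) 
    while True: 
        for start in range(0, n, batch_size): 
            end = min(start + batch_size, n) 
            # texts_to_matrix on a slice of texts gives a (batch, num_words) matrix 
            batch_features = tokenizer.texts_to_matrix(texts[start:end]) 
            batch_labels = y[start:end]     # same number of rows as batch_features 
            yield batch_features, batch_labels 

#Called once, not inside a loop 
model.fit_generator(generator=batch_generator(t, xData.iloc[:, 0].tolist(), dataY), 
                    steps_per_epoch=maxSamples // 1000, 
                    epochs=10) 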

However, there are other problems with how you are using fit_generator. You should not be calling it inside a loop. I don't see how that benefits the program, and calling fit_generator in a loop defeats the purpose of using a generator. The function you would use to train on a single batch of data is

train_on_batch() 

as seen in the docs.
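
A rough sketch of that pattern, reusing your existing variables (illustrative only; the slicing and the loss/acc unpacking are my assumptions, not taken from your script):

texts = xData.iloc[:, 0].tolist() 
y = dataY.values 
for epoch in range(amountOfEpochs): 
    for start in range(0, maxSamples, dataRange): 
        end = min(start + dataRange, maxSamples) 
        x_batch = t.texts_to_matrix(texts[start:end])   # (batch, num_words) 
        y_batch = y[start:end]                          # (batch, 1), matching row count 
        loss, acc = model.train_on_batch(x_batch, y_batch) 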
