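Keras fit_generator() & input arrays should be the same as target samples
So far I have been trying to implement fit_generator() for sentiment analysis, because I only have a small GPU and a large dataset. However, I keep receiving this error: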
Using Theano backend.
Can not use cuDNN on context None: cannot compile with cuDNN. We got this error:
b'In file included from C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v8.0\\include/driver_types.h:53:0,\r\n from C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v8.0\\include/cudnn.h:63,\r\n from C:\\Users\\Def\\AppData\\Local\\Temp\\try_flags_p2iwer2o.c:4:\r\nC:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v8.0\\include/host_defines.h:84:0: warning: "__cdecl" redefined\r\n #define __cdecl\r\n ^\r\n<built-in>: note: this is the location of the previous definition\r\nd000029.o:(.idata$5+0x0): multiple definition of `__imp___C_specific_handler\'\r\nd000026.o:(.idata$5+0x0): first defined here\r\nC:/Users/Def/Anaconda3/envs/Final/Library/mingw-w64/bin/../lib/gcc/x86_64-w64-mingw32/5.3.0/../../../../x86_64-w64-mingw32/lib/../lib/crt2.o: In function `__tmainCRTStartup\':\r\nC:/repo/mingw-w64-crt-git/src/mingw-w64/mingw-w64-crt/crt/crtexe.c:285: undefined reference to `_set_invalid_parameter_handler\'\r\ncollect2.exe: error: ld returned 1 exit status\r\n'
Mapped name None to device cuda: GeForce GTX 960M (0000:01:00.0)
Epoch 1/10
Traceback (most recent call last):
File "C:/Users/Def/PycharmProjects/KerasUkExpenditure/TweetParsing.py", line 136, in <module>
epochs=10)
File "C:\Users\Def\Anaconda3\envs\Final\lib\site-packages\keras\legacy\interfaces.py", line 88, in wrapper
return func(*args, **kwargs)
File "C:\Users\Def\Anaconda3\envs\Final\lib\site-packages\keras\models.py", line 1097, in fit_generator
initial_epoch=initial_epoch)
File "C:\Users\Def\Anaconda3\envs\Final\lib\site-packages\keras\legacy\interfaces.py", line 88, in wrapper
return func(*args, **kwargs)
File "C:\Users\Def\Anaconda3\envs\Final\lib\site-packages\keras\engine\training.py", line 1876, in fit_generator
class_weight=class_weight)
File "C:\Users\Def\Anaconda3\envs\Final\lib\site-packages\keras\engine\training.py", line 1614, in train_on_batch
check_batch_axis=True)
File "C:\Users\Def\Anaconda3\envs\Final\lib\site-packages\keras\engine\training.py", line 1307, in _standardize_user_data
_check_array_lengths(x, y, sample_weights)
File "C:\Users\Def\Anaconda3\envs\Final\lib\site-packages\keras\engine\training.py", line 229, in _check_array_lengths
'and ' + str(list(set_y)[0]) + ' target samples.')
ValueError: Input arrays should have the same number of samples as target arrays. Found 1000 input samples and 1 target samples.
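I have a matrix that is 1000 elements long, since I only have a maximum corpus of 1000 words, which is specified in Tokenizer().
Then I have the sentiment, which is 0 for negative or 1 for positive.
My question is: why am I receiving this error? I have tried transposing both the data and the labels, but I still get the same error. Here is my code.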
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.preprocessing.text import Tokenizer
import numpy as np
import pandas as pd
import pickle
import matplotlib.pyplot as plt
import re
"""
The number of samples, out of the 1 million available, to use. My 960M (2 GB) can
only handle about 30,000 at the moment, depending on the number of neurons in
the deep layers and the number of layers.
"""
maxSamples = 3000
#Load the CSV and get the correct columns
data = pd.read_csv("C:\\Users\\Def\\Desktop\\Sentiment Analysis Dataset1.csv")
dx = pd.DataFrame()
dy = pd.DataFrame()
dy[['Sentiment']] = data[['Sentiment']]
dx[['SentimentText']] = data[['SentimentText']]
dataY = dy.iloc[0:maxSamples]
dataX = dx.iloc[0:maxSamples]
testY = dy.iloc[maxSamples: maxSamples + 1000]
testX = dx.iloc[maxSamples: maxSamples + 1000]
"""
here I filter the data and clean it up by removing @ tags, hyperlinks and
also any characters that are not alpha-numeric.
"""
def removeTagsAndLinks(dataframe):
    for x in dataframe.iterrows():
        #Removes Hyperlinks
        x[1].values[0] = re.sub(r"(http|ftp|https)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?", "", str(x[1].values[0]))
        #Removes @ tags
        x[1].values[0] = re.sub(r"@\w+", '', str(x[1].values[0]))
        #keeps only alpha-numeric chars
        x[1].values[0] = re.sub(r"\W+", ' ', str(x[1].values[0]))
    return dataframe
xData = removeTagsAndLinks(dataX)
xTest = removeTagsAndLinks(testX)
"""
This loop looks for any Tweets shorter than 2 characters and, once found, writes the
index of that Tweet to an array so I can remove it from the Dataframe of sentiment and the
list of Tweets later
"""
indexOfBlankStrings = []
for index, string in enumerate(xData):
    if len(string) < 2:
        indexOfBlankStrings.append(index)
for row in indexOfBlankStrings:
    dataY.drop(row, axis=0, inplace=True)
"""
This makes a BOW model out of all the tweets then creates a
vector for each of the tweets containing all the words from
the BOW model, each vector is the same size because the
network expects it
"""
def vectorise(tokenizer, list):
    return tokenizer.fit_on_texts(list)
#Make BOW model and vectorise it
t = Tokenizer(lower=False, num_words=1000)
t.fit_on_texts(dataX.iloc[:,0].tolist())
t.fit_on_texts(dataX.iloc[:,0].tolist())
"""
Here I'm experimenting with multiple layers sized as the total
amount of words in the syllabus divided by powers of 2 - This
has given me quite accurate results compared to random guesses
of the number of neurons.
"""
l1 = int(xData.shape[0]/4) #Too big for my GPU
l2 = int(xData.shape[0]/8) #Too big for my GPU
l3 = int(xData.shape[0]/16)
l4 = int(xData.shape[0]/32)
l5 = int(xData.shape[0]/64)
l6 = int(xData.shape[0]/128)
#Make the model
model = Sequential()
model.add(Dense(l1, input_dim=xData.shape[1]))
model.add(Dropout(0.15))
model.add(Dense(l2))
model.add(Dropout(0.2))
model.add(Dense(l3))
model.add(Dropout(0.2))
model.add(Dense(l4))
model.add(Dense(1, activation='relu'))
#Compile the model
model.compile(optimizer='RMSProp', loss='binary_crossentropy', metrics=['acc'])
"""
This here will use multiple batches to train the model.
    startIndex:
        This is the starting index of the array from which you want to
        start training the network.
    dataRange:
        The number of elements used to train the network in each batch, so
        since dataRange = 1000 this means it goes from
        startIndex...dataRange OR 0...1000
    amountOfEpochs:
        This is kind of self-explanatory, the more Epochs the more it
        is supposed to learn AKA updates the optimisation algo numbers
"""
amountOfEpochs = 1
dataRange = 1000
startIndex = 0
def generator(tokenizer, data, labels, totalSize=maxSamples, startIndex=0):
    l = labels.as_matrix()
    while True:
        for i in range(startIndex, totalSize):
            batch_features = tokenizer.texts_to_matrix(xData.iloc[i])
            batch_labels = l[i]
            yield batch_features, batch_labels
derp = generator(t, data=xData, labels=dataY)
##This runs the model per batch AKA load a little, then process, then load a little more
for amountOfData in range(1000, maxSamples, 1000):
    #(loss, acc) = model.train_on_batch(x=dim[startIndex:amountOfData], y=np.asarray(dataY.iloc[startIndex:amountOfData]))
    history = model.fit_generator(generator=generator(tokenizer=t,
                                                      data=xData,
                                                      labels=dataY),
                                  steps_per_epoch=1,
                                  epochs=10)
Thanks
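The problem is that you have 1000 samples in your X input matrix and 1 in your output Y matrix – DJK
But the 1 in the Y matrix is the sentiment. There should only be a 1 or 0 per Tweet – Definity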
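To make the mismatch discussed in the comments concrete, here is a minimal sketch of a generator whose features and labels always share the same first dimension, which is what the ValueError in the traceback checks. It is an illustration under assumptions, not the asker's code and not a Keras API: `batch_generator` and `toy_vectorise` are hypothetical names, and `toy_vectorise` only stands in for a tokenizer call that returns one row per text.

```python
import numpy as np


def batch_generator(texts, labels, vectorise, batch_size=32):
    """Yield (features, labels) pairs with matching first dimensions.

    `vectorise` is an assumed callable that maps a list of strings to an
    array with one row per string.
    """
    labels = np.asarray(labels).reshape(-1)  # flatten to shape (n,)
    n = len(texts)
    while True:  # generators used for training are expected to loop forever
        for start in range(0, n, batch_size):
            chunk = list(texts[start:start + batch_size])
            x = np.asarray(vectorise(chunk))      # (len(chunk), num_words)
            y = labels[start:start + len(chunk)]  # (len(chunk),)
            # Same condition the error message complains about.
            assert x.shape[0] == y.shape[0]
            yield x, y


def toy_vectorise(chunk, num_words=1000):
    # Hypothetical stand-in for a bag-of-words encoder: one row per text,
    # one column per (toy-hashed) word.
    out = np.zeros((len(chunk), num_words))
    for row, text in enumerate(chunk):
        for word in text.split():
            out[row, sum(map(ord, word)) % num_words] = 1.0
    return out


texts = ["good day", "bad day", "great", "awful film", "fine"]
labels = [1, 0, 1, 0, 1]
gen = batch_generator(texts, labels, toy_vectorise, batch_size=2)
x, y = next(gen)
print(x.shape, y.shape)  # (2, 1000) (2,)
```

With the real tokenizer, one would presumably pass `t.texts_to_matrix` as `vectorise` and a plain list of strings such as `xData.iloc[:, 0].tolist()` as `texts`, so that each yielded batch holds one row and one label per Tweet. In that setup `steps_per_epoch` would be the number of batches per pass over the data rather than 1.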