
So far I have come up with this hacky code; it runs and produces the following output from Keras fit_generator(). Is this the correct usage?

Epoch 10/10 
     1/3000 [..............................] - ETA: 27s - loss: 0.3075 - acc: 0.7270 
     6/3000 [..............................] - ETA: 54s - loss: 0.3075 - acc: 0.7355 
..... 
    2996/3000 [============================>.] - ETA: 0s - loss: 0.3076 - acc: 0.7337 
    2998/3000 [============================>.] - ETA: 0s - loss: 0.3076 - acc: 0.7337 
    3000/3000 [==============================] - 59s - loss: 0.3076 - acc: 0.7337  
    Traceback (most recent call last): 
     File "C:/Users/Def/PycharmProjects/KerasUkExpenditure/TweetParsing.py", line 140, in <module> 
     (loss, acc) = model.fit_generator(generator(tokenizer=t, startIndex=startIndex,batchSize=amountOfData), 
    TypeError: 'History' object is not iterable 

    Process finished with exit code 1 

I'm confused by "'History' object is not iterable" — what does it mean?

This is my first attempt at batch training and testing, and I'm not sure I've implemented it correctly, since most of the examples I've seen online are for images. Here is the code:

from keras.models import Sequential 
from keras.layers import Dense, Dropout 
from keras.preprocessing.text import Tokenizer 
import numpy as np 
import pandas as pd 
import pickle 
import matplotlib.pyplot as plt 

import re 

""" 
amount of samples out of the 1 million to use; my 960m 2GB can only handle 
about 30,000 or so at the moment, depending on the number of neurons in the 
deep layer and the number of layers. 
""" 
maxSamples = 3000 

#Load the CSV and get the correct columns 
data = pd.read_csv("C:\\Users\\Def\\Desktop\\Sentiment Analysis Dataset1.csv") 
dataX = pd.DataFrame() 
dataY = pd.DataFrame() 
dataY[['Sentiment']] = data[['Sentiment']] 
dataX[['SentimentText']] = data[['SentimentText']] 

dataY = dataY.iloc[0:maxSamples] 
dataX = dataX.iloc[0:maxSamples] 

testY = dataY.iloc[-1: -maxSamples] 
testX = dataX.iloc[-1: -maxSamples] 


""" 
here I filter the data and clean it up by removing @ tags and hyperlinks and 
also any characters that are not alphanumeric; I then add it to the vec list 
""" 
def removeTagsAndLinks(dataframe): 
    vec = [] 
    for x in dataframe.iterrows(): 
     #Removes Hyperlinks 
     zero = re.sub("(http|ftp|https)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?", "", x[1].values[0]) 
     #Removes @ tags 
     one = re.sub("@\\w+", '', zero) 
     #keeps only alpha-numeric chars 
     two = re.sub("\W+", ' ', one) 
     vec.append(two) 
    return vec 

vec = removeTagsAndLinks(dataX) 
xTest = removeTagsAndLinks(testX) 
yTest = removeTagsAndLinks(testY) 
""" 
This loop looks for any tweets shorter than 2 characters and, once one is found, writes the 
index of that tweet to a list so I can remove it from the DataFrame of sentiment and the 
list of tweets later 
""" 
indexOfBlankStrings = [] 
for index, string in enumerate(vec): 
    if len(string) < 2: 
     del vec[index] 
     indexOfBlankStrings.append(index) 

for row in indexOfBlankStrings: 
    dataY.drop(row, axis=0, inplace=True) 


""" 
This makes a BOW model out of all the tweets then creates a 
vector for each of the tweets containing all the words from 
the BOW model; each vector is the same size because the 
network expects it 
""" 

def vectorise(tokenizer, list): 
    tokenizer.fit_on_texts(list) 
    return tokenizer.texts_to_matrix(list) 

#Make BOW model and vectorise it 
t = Tokenizer(lower=False, num_words=1000) 
dim = vectorise(t, vec) 

xTest = vectorise(t, xTest) 

""" 
Here I'm experimenting with multiple layers sized as the total 
number of words in the vocabulary divided by successive powers of 2 - this 
has given me quite accurate results compared to random guesses 
at the number of neurons and number of layers. 
""" 
l1 = int(len(dim[0])/4) #Too big for my GPU 
l2 = int(len(dim[0])/8) #Too big for my GPU 
l3 = int(len(dim[0])/16) 
l4 = int(len(dim[0])/32) 
l5 = int(len(dim[0])/64) 
l6 = int(len(dim[0])/128) 


#Make the model 
model = Sequential() 
model.add(Dense(l1, input_dim=dim.shape[1])) 
model.add(Dropout(0.15)) 
model.add(Dense(l2)) 
model.add(Dense(l1)) 
model.add(Dense(l3)) 
model.add(Dropout(0.2)) 
model.add(Dense(l4)) 
model.add(Dense(1, activation='relu')) 

#Compile the model 
model.compile(optimizer='RMSProp', loss='binary_crossentropy', metrics=['acc']) 

""" 
This here will use multiple batches to train the model. 
    startIndex: 
     This is the starting index of the array from which you want to 
     start training the network. 
    dataRange: 
     The number of elements used to train the network in each batch; 
     since dataRange = 1000 this means it goes from 
     startIndex...dataRange OR 0...1000 
    amountOfEpochs: 
     This is kind of self-explanatory: the more epochs, the more it 
     is supposed to learn, i.e. the more the optimisation algorithm updates its numbers 
""" 
amountOfEpochs = 10 
dataRange = 1000 
startIndex = 0 

def generator(tokenizer, batchSize, totalSize=maxSamples, startIndex=0): 
    f = tokenizer.texts_to_sequences(vec[startIndex:totalSize]) 
    l = np.asarray(dataY.iloc[startIndex:totalSize]) 
    while True: 
     for i in range(1000, totalSize, batchSize): 
      batch_features = tokenizer.sequences_to_matrix(f[startIndex: batchSize]) 
      batch_labels = l[startIndex: batchSize] 
      yield batch_features, batch_labels 

##This runs the model in batches, i.e. load a little, process it, then load a little more 
for amountOfData in range(1000, maxSamples, 1000): 
    #(loss, acc) = model.train_on_batch(x=dim[startIndex:amountOfData], y=np.asarray(dataY.iloc[startIndex:amountOfData])) 
    (loss, acc) = model.fit_generator(generator(tokenizer=t, startIndex=startIndex,batchSize=amountOfData), 
             steps_per_epoch=maxSamples, epochs=amountOfEpochs, 
             validation_data=(np.array(xTest), np.array(yTest))) 
    startIndex += 1000 

The section at the bottom is what I've been struggling with: implementing fit_generator() with my own generator. I want to load, say, 75,000 maxSamples and then train the network 1,000 samples at a time until it reaches the maxSample variable, which is why I set the range to (0, maxSample, 1000). Is the way I'm using generator() correct?

I ask because my network doesn't seem to use the validation data and it appears to fit the data very quickly, which suggests either overfitting or that only a very small dataset is being used. Am I iterating over all maxSamples correctly? Or am I just looping over the first iteration several times?
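For reference, here is a minimal sketch of the usual generator pattern I am trying to follow (the names features, labels, batch_size, x_val and y_val are placeholders rather than variables from the code above): the generator yields successive batch-sized slices forever, and a single fit_generator call with steps_per_epoch set to the number of batches per epoch covers the whole dataset, so no outer Python loop should be needed.

def batch_generator(features, labels, batch_size): 
    num_samples = len(features) 
    while True:  # fit_generator expects the generator to loop indefinitely 
     for i in range(0, num_samples, batch_size): 
      # yield one batch of inputs and the matching labels 
      yield features[i:i + batch_size], labels[i:i + batch_size] 

# A single call then covers the whole dataset: 
# history = model.fit_generator(batch_generator(features, labels, 1000), 
#          steps_per_epoch=len(features) // 1000, 
#          epochs=10, 
#          validation_data=(x_val, y_val)) 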

Thanks

Answer


The problem lies in this line:

(loss, acc) = model.fit_generator(...) 

fit_generator returns a single object of the keras.callbacks.History class. That is why you get this error: a single object is not iterable. To obtain the list of losses you need to retrieve it from this callback's history field, which is a dictionary that records the losses:

history = model.fit_generator(...) 

loss = history.history["loss"]
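The same dictionary also holds the other metrics recorded during training; the exact keys depend on the metrics the model was compiled with and on whether validation data was passed, for example:

acc = history.history["acc"]  # per-epoch training accuracy, from metrics=['acc'] 
val_loss = history.history.get("val_loss")  # only present when validation data is supplied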