2016-06-14 169 views
6

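I am running a BLSTM based on the IMDB example, but my version does sequence prediction over labels rather than classification. For simplicity, you can treat it as a POS tagging model: the inputs are sentences of words and the outputs are tags. The syntax used in that example differs slightly from most other Keras examples because it does not use model.add; it builds the model from an Input tensor instead. I can't figure out how to add a masking layer in this slightly different syntax.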

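I have run the model and tested it, and it works fine, but it predicts the 0s, which are my padding, and counts them when evaluating accuracy. Here is the code: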

from __future__ import print_function 
import numpy as np 
from keras.preprocessing import sequence 
from keras.models import Model 
from keras.layers.core import Masking 
from keras.layers import TimeDistributed, Dense 
from keras.layers import Dropout, Embedding, LSTM, Input, merge 
from prep_nn import prep_scan 
from keras.utils import np_utils, generic_utils 


np.random.seed(1337) # for reproducibility 
nb_words = 20000 # max. size of vocab 
nb_classes = 10 # number of labels 
hidden = 500 # 500 gives best results so far 
batch_size = 10 # create and update net after 10 lines 
val_split = .1 
epochs = 15 

# input for X is multi-dimensional numpy array with IDs, 
# one line per array. input y is multi-dimensional numpy array with 
# binary arrays for each value of each label. 
# maxlen is length of longest line 
print('Loading data...') 
(X_train, y_train), (X_test, y_test) = prep_scan(
    nb_words=nb_words, test_len=75) 

print(len(X_train), 'train sequences') 
print(int(len(X_train)*val_split), 'validation sequences') 
print(len(X_test), 'heldout sequences') 

# this is the placeholder tensor for the input sequences 
sequence = Input(shape=(maxlen,), dtype='int32') 

# this embedding layer will transform the sequences of integers 
# into vectors 
embedded = Embedding(nb_words, output_dim=hidden, 
        input_length=maxlen)(sequence) 

# apply forwards LSTM 
forwards = LSTM(output_dim=hidden, return_sequences=True)(embedded) 
# apply backwards LSTM 
backwards = LSTM(output_dim=hidden, return_sequences=True, 
       go_backwards=True)(embedded) 

# concatenate the outputs of the 2 LSTMs 
merged = merge([forwards, backwards], mode='concat', concat_axis=-1) 
after_dp = Dropout(0.15)(merged) 

# TimeDistributed for sequence 
# change activation to sigmoid? 
output = TimeDistributed(
    Dense(output_dim=nb_classes, 
      activation='softmax'))(after_dp) 

model = Model(input=sequence, output=output) 

# try using different optimizers and different optimizer configs 
# loss=binary_crossentropy, optimizer=rmsprop 
model.compile(loss='categorical_crossentropy', 
       metrics=['accuracy'], optimizer='adam') 

print('Train...') 
model.fit(X_train, y_train, 
      batch_size=batch_size, 
      nb_epoch=epochs, 
      shuffle=True, 
      validation_split=val_split) 

UPDATE:

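I merged this PR and got it working with mask_zero=True in the Embedding layer. But after seeing how badly the model performs, I now realize that I also need masking in the output. Others have suggested using sample_weight in the model.fit line instead. How can I do that so the 0s are ignored?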

UPDATE 2:

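So I read this and came up with sample_weight as a matrix of 1s and 0s. I thought it might have been working, but my accuracy stalls at around 50%, and I just found out that the model is still trying to predict the padded positions. It no longer predicts them as 0, though, which was the problem before I used sample_weight.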
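A minimal sketch of how such a matrix of 1s and 0s could be built, assuming the padding ID is 0 and the inputs are already padded to a common length. The helper name `make_temporal_weights` and the `pad_id` parameter are hypothetical; how `prep_scan` actually produces `weights` is not shown in the question. With `sample_weight_mode='temporal'`, the weights are expected to have shape (samples, timesteps), matching the padded input.

```python
import numpy as np


def make_temporal_weights(X, pad_id=0):
    """Return a (n_samples, maxlen) float array.

    Real tokens get weight 1.0, padded positions get weight 0.0.
    Hypothetical helper: assumes pad_id never occurs as a real token ID.
    """
    X = np.asarray(X)
    return (X != pad_id).astype("float32")


# Two padded sequences of length 5; 0 is assumed to be the padding ID.
X_demo = np.array([[4, 7, 2, 0, 0],
                   [9, 3, 0, 0, 0]])
weights_demo = make_temporal_weights(X_demo)
print(weights_demo)
```

Weights like these only remove the padded timesteps from the loss. They do not stop the network from emitting a prediction at those positions, which would be consistent with the behavior described above.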

Current code:

from __future__ import print_function 
import numpy as np 
from keras.preprocessing import sequence 
from keras.models import Model 
from keras.layers.core import Masking 
from keras.layers import TimeDistributed, Dense 
from keras.layers import Dropout, Embedding, LSTM, Input, merge 
from prep_nn import prep_scan 
from keras.utils import np_utils, generic_utils 
import itertools 
from itertools import chain 
from sklearn.preprocessing import LabelBinarizer 
import sklearn 
import pandas as pd 


np.random.seed(1337) # for reproducibility 
nb_words = 20000 # max. size of vocab 
nb_classes = 10 # number of labels 
hidden = 500 # 500 gives best results so far 
batch_size = 10 # create and update net after 10 lines 
val_split = .1 
epochs = 10 

# input for X is multi-dimensional numpy array with syll IDs, 
# one line per array. input y is multi-dimensional numpy array with 
# binary arrays for each value of each label. 
# maxlen is length of longest line 
print('Loading data...') 
(X_train, y_train), (X_test, y_test), maxlen, sylls_ids, tags_ids, weights = prep_scan(nb_words=nb_words, test_len=75) 

print(len(X_train), 'train sequences') 
print(int(len(X_train) * val_split), 'validation sequences') 
print(len(X_test), 'heldout sequences') 

# this is the placeholder tensor for the input sequences 
sequence = Input(shape=(maxlen,), dtype='int32') 

# this embedding layer will transform the sequences of integers 
# into vectors of size `hidden` 
embedded = Embedding(nb_words, output_dim=hidden, 
        input_length=maxlen, mask_zero=True)(sequence) 

# apply forwards LSTM 
forwards = LSTM(output_dim=hidden, return_sequences=True)(embedded) 
# apply backwards LSTM 
backwards = LSTM(output_dim=hidden, return_sequences=True, 
       go_backwards=True)(embedded) 

# concatenate the outputs of the 2 LSTMs 
merged = merge([forwards, backwards], mode='concat', concat_axis=-1) 
# after_dp = Dropout(0.)(merged) 

# TimeDistributed for sequence 
# change activation to sigmoid? 
output = TimeDistributed(
    Dense(output_dim=nb_classes, 
      activation='softmax'))(merged) 

model = Model(input=sequence, output=output) 

# try using different optimizers and different optimizer configs 
# loss=binary_crossentropy, optimizer=rmsprop 
model.compile(loss='categorical_crossentropy', 
       metrics=['accuracy'], optimizer='adam', 
       sample_weight_mode='temporal') 

print('Train...') 
model.fit(X_train, y_train, 
      batch_size=batch_size, 
      nb_epoch=epochs, 
      shuffle=True, 
      validation_split=val_split, 
      sample_weight=weights) 

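This is an old question, but did you ever solve it? I'm at the same stage... I found that [accuracy does not take `sample_weight` into account](https://github.com/fchollet/keras/issues/1642) and, according to my tests, it does not take masking into account either (in fact, using masking produces different accuracy values that I haven't been able to explain yet). I may end up using the functional API to build a second output just for the accuracy. – jdehesa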
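One way to sidestep the built-in metric is to compute accuracy outside Keras, over the non-padded positions only. The following is a rough NumPy sketch under stated assumptions: padding ID 0, one-hot `y_true`, and per-timestep class probabilities in `y_pred` (for example from `model.predict`). The function name `masked_accuracy` is hypothetical and not part of Keras.

```python
import numpy as np


def masked_accuracy(y_true, y_pred, X, pad_id=0):
    """Accuracy over real (non-padded) timesteps only.

    y_true, y_pred: arrays of shape (n_samples, maxlen, n_classes).
    X: integer array of shape (n_samples, maxlen) with pad_id at padded steps.
    """
    true_ids = np.asarray(y_true).argmax(axis=-1)
    pred_ids = np.asarray(y_pred).argmax(axis=-1)
    mask = np.asarray(X) != pad_id
    n_real = int(mask.sum())
    if n_real == 0:
        return 0.0  # nothing to score
    n_correct = int(((true_ids == pred_ids) & mask).sum())
    return n_correct / float(n_real)


# Toy data: one sequence with two real tokens and one padded step.
X_demo = np.array([[5, 8, 0]])
y_true_demo = np.eye(3)[[1, 2, 0]][None]  # true classes 1, 2, then padding class 0
y_pred_demo = np.eye(3)[[1, 0, 0]][None]  # predicted classes 1, 0, 0

acc_masked = masked_accuracy(y_true_demo, y_pred_demo, X_demo)
acc_plain = float((y_true_demo.argmax(-1) == y_pred_demo.argmax(-1)).mean())
print(acc_masked, acc_plain)
```

On this toy input the masked value is 0.5, while the plain per-timestep accuracy is about 0.67 because the padded step counts as a correct prediction.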


Revisiting this question and simplifying it with respect to current Keras code would be greatly appreciated. – Seanny123

Answer

1

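Did you solve this problem? It is not clear to me how your code handles padded values and word indices. What about letting the word indices start from 1 and, as described at https://keras.io/layers/embeddings/ (with mask_zero=True, index 0 cannot be used in the vocabulary, so input_dim should equal the vocabulary size + 1), defining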

embedded = Embedding(nb_words + 1, output_dim=hidden, 
       input_length=maxlen, mask_zero=True)(sequence) 

instead of

embedded = Embedding(nb_words, output_dim=hidden, 
       input_length=maxlen, mask_zero=True)(sequence)
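The indexing convention suggested here can be sketched without Keras. This is only an illustration under assumptions: the helpers `build_vocab` and `encode_and_pad` are hypothetical, and padding is appended at the end of each sequence, whereas Keras' `pad_sequences` pads at the front by default. Word IDs start at 1, ID 0 is kept for padding, and the embedding's input dimension is therefore the vocabulary size plus one.

```python
import numpy as np


def build_vocab(sentences):
    """Map each distinct word to an ID starting at 1 (0 is reserved for padding)."""
    vocab = {}
    for sent in sentences:
        for word in sent:
            if word not in vocab:
                vocab[word] = len(vocab) + 1
    return vocab


def encode_and_pad(sentences, vocab, maxlen, pad_id=0):
    """Encode sentences as IDs and pad (or truncate) them to maxlen at the end."""
    X = np.full((len(sentences), maxlen), pad_id, dtype="int32")
    for i, sent in enumerate(sentences):
        ids = [vocab[word] for word in sent][:maxlen]
        X[i, :len(ids)] = ids
    return X


sentences = [["the", "cat", "sat"], ["the", "dog"]]
vocab = build_vocab(sentences)
X_demo = encode_and_pad(sentences, vocab, maxlen=4)

# The largest ID equals len(vocab), so the embedding needs len(vocab) + 1 rows.
input_dim = len(vocab) + 1
print(X_demo)
print(input_dim)
```

With this layout every real token has a nonzero ID, so a 0 in the input can only mean padding, which is what `mask_zero=True` relies on.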