
Is there any way to implement skip-grams in scikit-learn? I have manually generated a list of k-skip-n-grams and passed it to the CountVectorizer() method as the vocabulary.

Unfortunately, it performs very poorly at prediction: only 63% accuracy. In contrast, I get 77-80% accuracy from the default CountVectorizer with ngram_range(min, max).

Is there a better way to implement skip-grams in scikit-learn?

Here is part of my code:

corpus = GetCorpus()  # reads the text from a file as a list of documents

vocabulary = list(GetVocabulary(corpus, k, n))  # returns the k-skip-n-grams

vec = CountVectorizer(
    tokenizer=lambda x: x.split(),
    ngram_range=(2, 2),
    stop_words=stopWords,
    vocabulary=vocabulary)
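
GetCorpus and GetVocabulary are not shown in the question. For reference, a minimal sketch of how such a k-skip-n-gram vocabulary might be built (this implementation is my guess, assuming whitespace-tokenized documents; it mirrors the combinations-based approach in the second answer below):

from itertools import combinations

def GetVocabulary(corpus, k, n):
    """Hypothetical sketch: collect every k-skip-n-gram in the corpus.

    Each n-gram keeps its leading token and draws the remaining n-1
    tokens, in order, from the following n-1+k positions, so that at
    most k tokens are skipped in total.
    """
    vocab = set()
    for doc in corpus:
        tokens = doc.split()
        for i, head in enumerate(tokens):
            for tail in combinations(tokens[i + 1:i + n + k], n - 1):
                vocab.add(' '.join((head,) + tail))
    return vocab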

Answers


To vectorize text with skip-grams in scikit-learn, simply passing the skip-gram tokens as the vocabulary to CountVectorizer will not work. You need to modify the way tokens are processed, which can be done with a custom analyzer. Below is an example vectorizer that produces 1-skip-2-grams,

from toolz import compose
from toolz.curried import map as cmap, sliding_window, pluck
from sklearn.feature_extraction.text import CountVectorizer

class SkipGramVectorizer(CountVectorizer):
    def build_analyzer(self):
        preprocess = self.build_preprocessor()
        stop_words = self.get_stop_words()
        tokenize = self.build_tokenizer()
        return lambda doc: self._word_skip_grams(
            compose(tokenize, preprocess, self.decode)(doc),
            stop_words)

    def _word_skip_grams(self, tokens, stop_words=None):
        # handle stop words
        if stop_words is not None:
            tokens = [w for w in tokens if w not in stop_words]

        # slide a window of 3 tokens over the sentence and keep the
        # first and third of each window, i.e. the pairs of words
        # separated by exactly one token
        return compose(cmap(' '.join), pluck([0, 2]), sliding_window(3))(tokens)

For example, on this Wikipedia example,

text = ['the rain in Spain falls mainly on the plain'] 

vect = SkipGramVectorizer() 
vect.fit(text) 
vect.get_feature_names() 

this vectorizer will produce the following tokens,

['falls on', 'in falls', 'mainly the', 'on plain', 
'rain spain', 'spain mainly', 'the in'] 
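
Note that the analyzer above emits only the pairs separated by exactly one token; the contiguous bigrams ('the rain', 'rain in', ...) are not generated. Under the usual k-skip-bigram definition, which also includes the adjacent pairs (compare the second answer's output), a plain-Python variant could look like this (the skip_bigrams name is mine, not part of the answer):

def skip_bigrams(tokens, k=1):
    # every ordered pair of words at most k tokens apart;
    # includes the contiguous bigrams as the distance-1 case
    return [tokens[i] + ' ' + tokens[j]
            for i in range(len(tokens))
            for j in range(i + 1, min(i + k + 2, len(tokens)))]

# skip_bigrams('the rain in spain'.split())
# -> ['the rain', 'the in', 'rain in', 'rain spain', 'in spain']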

Thanks for your reply, bro. I'll try it soon and let you know. –


I came up with my own implementation of a skip-gram vectorizer. It is inspired by this post. To limit the feature space, I also do not let skip-grams cross sentence boundaries (using nltk.sent_tokenize). Here is my code:

import nltk
from itertools import combinations
from toolz import compose
from sklearn.feature_extraction.text import CountVectorizer

class SkipGramVectorizer(CountVectorizer):

    def __init__(self, k=1, **kwds):
        super(SkipGramVectorizer, self).__init__(**kwds)
        self.k = k

    def build_sent_analyzer(self, preprocess, stop_words, tokenize):
        return lambda sent: self._word_skip_grams(
            compose(tokenize, preprocess, self.decode)(sent),
            stop_words)

    def build_analyzer(self):
        preprocess = self.build_preprocessor()
        stop_words = self.get_stop_words()
        tokenize = self.build_tokenizer()
        sent_analyze = self.build_sent_analyzer(preprocess, stop_words, tokenize)

        return lambda doc: self._sent_skip_grams(doc, sent_analyze)

    def _sent_skip_grams(self, doc, sent_analyze):
        # split the document into sentences first, so that
        # skip-grams never cross a sentence boundary
        skip_grams = []
        for sent in nltk.sent_tokenize(doc):
            skip_grams.extend(sent_analyze(sent))
        return skip_grams

    def _word_skip_grams(self, tokens, stop_words=None):
        """Turn tokens into a sequence of k-skip-n-grams after stop words filtering"""
        # handle stop words
        if stop_words is not None:
            tokens = [w for w in tokens if w not in stop_words]

        # handle token n-grams
        min_n, max_n = self.ngram_range
        k = self.k
        if max_n != 1:
            original_tokens = tokens
            if min_n == 1:
                # no need to do any slicing for unigrams
                # just iterate through the original tokens
                tokens = list(original_tokens)
                min_n += 1
            else:
                tokens = []

            n_original_tokens = len(original_tokens)

            # bind method outside of loop to reduce overhead
            tokens_append = tokens.append
            space_join = " ".join

            for n in range(min_n,
                           min(max_n + 1, n_original_tokens + 1)):
                for i in range(n_original_tokens - n + 1):
                    # k-skip-n-grams: keep the leading token and draw the
                    # remaining n-1 tokens from the next n-1+k positions
                    head = [original_tokens[i]]
                    for skip_tail in combinations(original_tokens[i+1:i+n+k], n-1):
                        tokens_append(space_join(head + list(skip_tail)))
        return tokens

def test(text, ngram_range, k):
    vectorizer = SkipGramVectorizer(ngram_range=ngram_range, k=k)
    vectorizer.fit_transform(text)
    print(vectorizer.get_feature_names())

def main():
    text = ['Insurgents killed in ongoing fighting.']

    # 2-skip-bi-grams
    test(text, (2, 2), 2)
    # 2-skip-tri-grams
    test(text, (3, 3), 2)

if __name__ == '__main__':
    main()

This produces the following feature names:

[u'in fighting', u'in ongoing', u'insurgents in', u'insurgents killed', u'insurgents ongoing', u'killed fighting', u'killed in', u'killed ongoing', u'ongoing fighting'] 
[u'in ongoing fighting', u'insurgents in fighting', u'insurgents in ongoing', u'insurgents killed fighting', u'insurgents killed in', u'insurgents killed ongoing', u'insurgents ongoing fighting', u'killed in fighting', u'killed in ongoing', u'killed ongoing fighting'] 

Note that I essentially took the _word_ngrams function from the VectorizerMixin class and replaced the line

tokens_append(space_join(original_tokens[i: i + n])) 

with the following:

head = [original_tokens[i]]      
for skip_tail in combinations(original_tokens[i+1:i+n+k], n-1): 
    tokens_append(space_join(head + list(skip_tail))) 
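
Since SkipGramVectorizer subclasses CountVectorizer, it drops straight into a standard scikit-learn pipeline. A minimal sketch for a classification setup like the one in the question (the classifier choice and the train_texts/train_labels names are placeholders, not from the post):

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import LogisticRegression

clf = Pipeline([
    ('vect', SkipGramVectorizer(ngram_range=(2, 2), k=2)),
    ('tfidf', TfidfTransformer()),
    ('clf', LogisticRegression()),
])
# clf.fit(train_texts, train_labels)
# clf.predict(test_texts)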