有什麼方法可以通過scikit-learn來實現skip gram嗎？

有沒有什麼辦法可以在scikit學習庫上實現skip gram？我已經用n-skim克手動生成了一個列表，並將其作爲CountVectorizer()方法的詞彙表傳遞給跳過列表。有什麼方法可以通過scikit-learn來實現skip gram嗎？

不幸的是，它在預測方面的表現非常差：只有63％的準確性。但是，我使用ngram_range(min,max)從默認代碼獲得了012-上77-80％的準確性。

scikit有沒有更好的方法來實現skip-grams？

這裏是我的部分代碼：

corpus = GetCorpus() # This one get text from file as a list 

vocabulary = list(GetVocabulary(corpus,k,n)) 
# this one returns a k-skip n-gram 

vec = CountVectorizer(
      tokenizer=lambda x: x.split(), 
      ngram_range=(2,2), 
      stop_words=stopWords, 
      vocabulary=vocabulary)

來源

2016-09-27 Md. Sulayman

向量化與跳躍克文scikit學習只是傳遞跳過克令牌的詞彙CountVectorizer將無法正常工作。您需要修改可以使用自定義分析器完成的令牌處理方式。下面是一個例子矢量器產生1-跳過-2-克，

from toolz import itertoolz, compose 
from toolz.curried import map as cmap, sliding_window, pluck 
from sklearn.feature_extraction.text import CountVectorizer 

class SkipGramVectorizer(CountVectorizer): 
    def build_analyzer(self):  
     preprocess = self.build_preprocessor() 
     stop_words = self.get_stop_words() 
     tokenize = self.build_tokenizer() 
     return lambda doc: self._word_skip_grams(
       compose(tokenize, preprocess, self.decode)(doc), 
       stop_words) 

    def _word_skip_grams(self, tokens, stop_words=None): 
     # handle stop words 
     if stop_words is not None: 
      tokens = [w for w in tokens if w not in stop_words] 

     return compose(cmap(' '.join), pluck([0, 2]), sliding_window(3))(tokens)

例如，在this Wikipedia example，

text = ['the rain in Spain falls mainly on the plain'] 

vect = SkipGramVectorizer() 
vect.fit(text) 
vect.get_feature_names()

這將矢量化將產生下列標記，

['falls on', 'in falls', 'mainly the', 'on plain', 
'rain spain', 'spain mainly', 'the in']

來源

2017-09-01 10:08:13 rth

感謝您的回覆，兄弟。我會很快嘗試並讓你知道它。 –

我想出了我自己的跳躍向量化器的實現。它的靈感來自於this的帖子。爲了限制特徵空間，我還限制了不跳過句子的邊界（使用nltk.sent_tokenize）。這裏是我的代碼：

import nltk 
from itertools import combinations 
from toolz import compose 
from sklearn.feature_extraction.text import CountVectorizer 

class SkipGramVectorizer(CountVectorizer): 

    def __init__(self, k=1, **kwds): 
     super(SkipGramVectorizer, self).__init__(**kwds) 
     self.k=k 

    def build_sent_analyzer(self, preprocess, stop_words, tokenize): 
     return lambda sent : self._word_skip_grams(
       compose(tokenize, preprocess, self.decode)(sent), 
       stop_words) 

    def build_analyzer(self):  
     preprocess = self.build_preprocessor() 
     stop_words = self.get_stop_words() 
     tokenize = self.build_tokenizer() 
     sent_analyze = self.build_sent_analyzer(preprocess, stop_words, tokenize) 

     return lambda doc : self._sent_skip_grams(doc, sent_analyze) 

    def _sent_skip_grams(self, doc, sent_analyze): 
     skip_grams = [] 
     for sent in nltk.sent_tokenize(doc): 
      skip_grams.extend(sent_analyze(sent)) 
     return skip_grams 

    def _word_skip_grams(self, tokens, stop_words=None): 
     """Turn tokens into a sequence of n-grams after stop words filtering""" 
     # handle stop words 
     if stop_words is not None: 
      tokens = [w for w in tokens if w not in stop_words] 

     # handle token n-grams 
     min_n, max_n = self.ngram_range 
     k = self.k 
     if max_n != 1: 
      original_tokens = tokens 
      if min_n == 1: 
       # no need to do any slicing for unigrams 
       # just iterate through the original tokens 
       tokens = list(original_tokens) 
       min_n += 1 
      else: 
       tokens = [] 

      n_original_tokens = len(original_tokens) 

      # bind method outside of loop to reduce overhead 
      tokens_append = tokens.append 
      space_join = " ".join 

      for n in xrange(min_n, 
          min(max_n + 1, n_original_tokens + 1)): 
       for i in xrange(n_original_tokens - n + 1): 
        # k-skip-n-grams 
        head = [original_tokens[i]]      
        for skip_tail in combinations(original_tokens[i+1:i+n+k], n-1): 
         tokens_append(space_join(head + list(skip_tail))) 
     return tokens 

def test(text, ngram_range, k): 
    vectorizer = SkipGramVectorizer(ngram_range=ngram_range, k=k) 
    vectorizer.fit_transform(text) 
    print(vectorizer.get_feature_names()) 

def main(): 
    text = ['Insurgents killed in ongoing fighting.'] 

    # 2-skip-bi-grams 
    test(text, (2,2), 2) 
    # 2-skip-tri-grams 
    test(text, (3,3), 2) 
############################################################################################### 
if __name__ == '__main__': 
    main()

這將產生以下功能名稱：

[u'in fighting', u'in ongoing', u'insurgents in', u'insurgents killed', u'insurgents ongoing', u'killed fighting', u'killed in', u'killed ongoing', u'ongoing fighting'] 
[u'in ongoing fighting', u'insurgents in fighting', u'insurgents in ongoing', u'insurgents killed fighting', u'insurgents killed in', u'insurgents killed ongoing', u'insurgents ongoing fighting', u'killed in fighting', u'killed in ongoing', u'killed ongoing fighting']

請注意，我基本上是從VectorizerMixin類拿了_word_ngrams功能，取代了線

tokens_append(space_join(original_tokens[i: i + n]))

與以下內容：

head = [original_tokens[i]]      
for skip_tail in combinations(original_tokens[i+1:i+n+k], n-1): 
    tokens_append(space_join(head + list(skip_tail)))

來源

2017-12-19 20:27:44 Characeae

有什麼方法可以通過scikit-learn來實現skip gram嗎？

回答

相關問題