2016-06-20 66 views
0

Get the frequency of each n-gram term in sklearn

I am using the method below to extract n-grams from a pandas DataFrame:

import sklearn.feature_extraction.text

def extractNGrams(df, ngram_size, min_freq):
    """Extract NGrams from a list of Strings
    Keyword arguments:
    df -- the pandas dataframe containing the sentences
    ngram_size -- defining the n for ngrams
    min_freq -- the minimum document frequency (passed as min_df) for the ngram to be part of the set
    """
    vect = sklearn.feature_extraction.text.CountVectorizer(ngram_range=(ngram_size, ngram_size), min_df=min_freq)
    lstSentences = df['Text'].values.tolist()
    X_train_counts = vect.fit_transform(lstSentences)
    vocab = vect.get_feature_names()  # named get_feature_names_out() in scikit-learn >= 1.0
    #print (vocab)
    print(X_train_counts.shape)
    return vocab

How can I get the frequency of each of the n-gram terms?

+1

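In your code, the vocab variable defines the mapping between your terms and the feature indices, for example {"word1": 0, "word2": 1}. The frequencies you need are given by the non-zero entries of the variable X_train_counts. That is, if the value in the first column is 2, then "word1" occurs twice. Does this help? – geompalik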
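To make the comment above concrete, here is a minimal sketch on a made-up two-document corpus. It assumes the term-to-column mapping is read from the fitted vectorizer's `vocabulary_` attribute; the `docs` list and the variable names are illustrative only and are not part of the question's code.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, purely for illustration
docs = ["red apple red apple", "green apple"]

vect = CountVectorizer(ngram_range=(1, 1))
X = vect.fit_transform(docs)  # sparse matrix: one row per document, one column per term

# vocabulary_ maps each term to the index of its column in X
col = vect.vocabulary_["red"]

# Count of "red" in the first document
red_in_first_doc = X[0, col]

# Total count of "apple" across all documents: sum of its column
apple_total = X[:, vect.vocabulary_["apple"]].sum()

print(red_in_first_doc, apple_total)
```

Each entry of the matrix is the count for one document, so the overall frequency of a term is the sum of its column.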

+0

@geompalik Got it! That helps, thanks! – Bonson

Answer

0

Posting the code I used to get the counts:

import numpy as np

# X_train_counts and vect are the objects created inside extractNGrams
train_data_features = X_train_counts.toarray()
vocab = vect.get_feature_names()
dist = np.sum(train_data_features, axis=0)
ngram_freq = {}

# For each n-gram, store the vocabulary term and its total frequency
for tag, count in zip(vocab, dist):
    #print(tag, count)
    ngram_freq[tag] = count
+0

Don't use '.toarray()', because that converts the sparse matrix into a dense one. Just leave it as it is. –
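Following that suggestion, a possible variant sums the columns of the sparse matrix directly instead of densifying it first. This is only a sketch: the function name `ngram_frequencies` and the sample sentences are hypothetical, and the `hasattr` check is there because the method for listing feature names was renamed in scikit-learn 1.0.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def ngram_frequencies(sentences, ngram_size, min_freq=1):
    """Return {ngram: total count} without converting the sparse matrix to dense."""
    vect = CountVectorizer(ngram_range=(ngram_size, ngram_size), min_df=min_freq)
    counts = vect.fit_transform(sentences)  # sparse, shape (n_documents, n_ngrams)

    # Column sums computed on the sparse matrix; flatten the result to a 1-D array
    totals = np.asarray(counts.sum(axis=0)).ravel()

    # scikit-learn >= 1.0 provides get_feature_names_out(); older releases use get_feature_names()
    if hasattr(vect, "get_feature_names_out"):
        names = vect.get_feature_names_out()
    else:
        names = vect.get_feature_names()

    return {str(name): int(total) for name, total in zip(names, totals)}

# Hypothetical sample data
freqs = ngram_frequencies(["the cat sat", "the cat ran", "a dog sat"], ngram_size=2)
print(freqs)
```

Note that with the default tokenizer single-character tokens such as "a" are dropped, so the last sentence contributes only the bigram "dog sat".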