Getting the frequency of each n-gram term in sklearn

I am using the following method to extract n-grams from a pandas DataFrame:

from sklearn.feature_extraction.text import CountVectorizer

def extractNGrams(df, ngram_size, min_freq):
    """Extract n-grams from the 'Text' column of a DataFrame.

    Keyword arguments:
    df -- the pandas dataframe containing the sentences
    ngram_size -- defining the n for ngrams
    min_freq -- the minimum number of sentences an ngram must appear in
                (passed to min_df, which is a document frequency)
    """
    vect = CountVectorizer(ngram_range=(ngram_size, ngram_size), min_df=min_freq)
    lstSentences = df['Text'].values.tolist()
    X_train_counts = vect.fit_transform(lstSentences)
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    vocab = vect.get_feature_names_out()
    print(X_train_counts.shape)  # (number of sentences, number of n-grams)
    return vocab

How can I get the frequency of each n-gram term?


2016-06-20 Bonson

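In your code, the vocab variable defines the mapping between your terms and the feature indices, e.g. {"word1": 0, "word2": 1}. The frequencies you need are given by the non-zero entries of X_train_counts. That is, if a row has the value 2 in the first column, then "word1" occurs twice in that sentence. Does this help? – geompalik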
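To make the comment above concrete, here is a minimal sketch of how a term maps to a column of the count matrix. The three-sentence corpus and the variable names `term_to_col`, `per_sentence`, and `total_the_cat` are made up for illustration and are not from the question. It reads the mapping from the fitted vectorizer's `vocabulary_` attribute.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, not from the original question
sentences = ["the cat sat", "the cat ran", "the dog ran"]

vect = CountVectorizer(ngram_range=(2, 2))  # bigrams only
X = vect.fit_transform(sentences)           # sparse matrix: sentences x bigrams

# vocabulary_ maps each n-gram to its column index in X
term_to_col = vect.vocabulary_
col = term_to_col["the cat"]

# Entry (i, col) is the number of times "the cat" occurs in sentence i
per_sentence = X[:, col].toarray().ravel().tolist()

# Summing the column gives the total frequency across all sentences
total_the_cat = int(X[:, col].sum())
```

Only a single column is densified here, so the memory cost stays small even when the vocabulary is large.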

@geompalik Got it! That helps, thanks! – Bonson

Posting the code I used to get the counts:

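import numpy as np

# Assumes vect and X_train_counts (locals of extractNGrams) are available here
train_data_features = X_train_counts.toarray()
vocab = vect.get_feature_names_out()  # get_feature_names() on scikit-learn < 1.0
dist = np.sum(train_data_features, axis=0)  # column sums: total count per n-gram
ngram_freq = {}

# For each n-gram, store its total frequency
for tag, count in zip(vocab, dist):
    #print(tag, count)
    ngram_freq[tag] = count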


2016-06-22 07:49:56 Bonson

Don't use '.toarray()', because it converts the sparse matrix into a dense one. Just leave it out. –
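One way to follow this advice, as a sketch rather than the commenter's own code: sum the columns of the sparse matrix directly and flatten the result. The toy corpus is hypothetical. The term order is rebuilt from `vocabulary_` so the snippet does not depend on which feature-name method the installed scikit-learn version provides.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, not from the original question
sentences = ["the cat sat", "the cat ran", "the dog ran"]

vect = CountVectorizer(ngram_range=(2, 2))
X = vect.fit_transform(sentences)  # stays sparse throughout

# Column sums of the sparse matrix come back as a 1 x n_features result;
# np.asarray(...).ravel() flattens that into a plain 1-D array
counts = np.asarray(X.sum(axis=0)).ravel()

# Order the terms by their column index so they line up with counts
vocab = sorted(vect.vocabulary_, key=vect.vocabulary_.get)

ngram_freq = {term: int(c) for term, c in zip(vocab, counts)}
```

This never materializes the full sentences-by-n-grams dense array, which matters when the corpus or the vocabulary is large.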
