Python NLTK classification with a large feature set (replicating Go et al. 2009)

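I am trying to replicate the Twitter sentiment analysis of Go et al., which can be found here: http://help.sentiment140.com/for-students. The problem I am having is that the number of features is 364464. I currently use nltk and nltk.NaiveBayesClassifier to do this, where tweets holds a copy of the 1,600,000 tweets and their polarity: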

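import nltk

# Each item in tweets is [tokens, polarity]; replace the tokens with a feature dict.
for tweet in tweets:
    tweet[0] = extract_features(tweet[0], features)

# training_set and testdata are built elsewhere (not shown in the question).
classifier = nltk.NaiveBayesClassifier.train(training_set)
classifier.show_most_informative_features()
print(nltk.classify.util.accuracy(classifier, testdata))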

Nothing takes very long apart from the extract_features function:

def extract_features(tweet, featureList):
    tweet_words = set(tweet)
    features = {}
    # One entry per known feature word, so every dict has len(featureList) keys.
    for word in featureList:
        features['contains(%s)' % word] = (word in tweet_words)
    return features

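This is because, for each tweet, it creates a dictionary of size 364464 to represent whether each feature word is present or not.

Is there a way to make this faster or more efficient without reducing the number of features, as is done in the paper?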
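
To put the cost in perspective, here is a rough back-of-the-envelope count. It uses only the two figures quoted above; the helper name dense_entry_count is made up for illustration, and the count ignores the per-entry memory cost, which depends on the Python build.

```python
# Figures quoted in the question.
NUM_TWEETS = 1_600_000
NUM_FEATURES = 364_464


def dense_entry_count(num_tweets, num_features):
    """Total dictionary entries if every tweet stores every feature."""
    return num_tweets * num_features


total_entries = dense_entry_count(NUM_TWEETS, NUM_FEATURES)
print(f"{total_entries:,} dictionary entries")  # 583,142,400,000 dictionary entries
```

With hundreds of billions of entries, building every dictionary up front is unlikely to fit in memory on an ordinary machine, whatever the exact per-entry cost.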


2016-04-20 Adam

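I don't know why you don't want to use the same technique as in the paper. Anyway, basic NLP steps you could take include removing stop words, doing a tf-idf vectorization, and removing infrequent or very common words. These also remove features, but in a different way. As I said, I'm not sure what you are trying to do. – lrnzcig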
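
As a rough illustration of the pruning steps mentioned in this comment, the sketch below drops stop words and words that are too rare or too common. It uses plain document-frequency thresholds rather than a full tf-idf vectorization, and the function name, the thresholds, and the toy data are all assumptions made for the example.

```python
from collections import Counter


def prune_vocabulary(tokenized_tweets, stop_words, min_df=2, max_df_ratio=0.5):
    """Keep words that are not stop words, occur in at least `min_df` tweets,
    and occur in at most `max_df_ratio` of all tweets."""
    doc_freq = Counter()
    for tokens in tokenized_tweets:
        doc_freq.update(set(tokens))  # count each word once per tweet
    max_df = max_df_ratio * len(tokenized_tweets)
    return sorted(
        word
        for word, df in doc_freq.items()
        if word not in stop_words and min_df <= df <= max_df
    )


# Toy data, purely illustrative.
toy_tweets = [
    ["i", "love", "this", "phone"],
    ["i", "hate", "this", "phone"],
    ["i", "love", "sunny", "days"],
    ["i", "hate", "rainy", "days"],
]
toy_stop_words = {"this"}

vocabulary = prune_vocabulary(toy_tweets, toy_stop_words)
print(vocabulary)  # ['days', 'hate', 'love', 'phone']
```

Here "i" is dropped for being too common, "sunny" and "rainy" for being too rare, and "this" as a stop word.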

I was running into memory problems, but I managed to solve it. Thanks for the reply. – Adam

It turns out there is a wonderful function for this, nltk.classify.util.apply_features(), which is documented here: http://www.nltk.org/api/nltk.classify.html

training_set = nltk.classify.apply_features(extract_features, tweets)

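I had to change my extract_features function, but it now works with the huge feature set without memory problems.

Here is the relevant part of the function's description: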

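"The primary purpose of this function is to avoid the memory overhead involved in storing all the featuresets for every token in a corpus. Instead, these featuresets are constructed lazily, as-needed. The reduction in memory overhead can be especially significant when the underlying list of tokens is itself lazy (as is the case with many corpus readers)."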
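
The idea of building feature sets on demand can be sketched in plain Python. This is only an illustration of the lazy pattern described above, not NLTK's actual implementation; the class LazyFeatureSets and the toy data are hypothetical, and the sketch supports integer indexing only.

```python
class LazyFeatureSets:
    """Sequence-like view that builds each (features, label) pair on access."""

    def __init__(self, feature_func, labeled_tokens):
        self._feature_func = feature_func
        self._labeled_tokens = labeled_tokens

    def __len__(self):
        return len(self._labeled_tokens)

    def __getitem__(self, index):
        # Integer indices only; slices are not handled in this sketch.
        tokens, label = self._labeled_tokens[index]
        return self._feature_func(tokens), label

    def __iter__(self):
        for tokens, label in self._labeled_tokens:
            yield self._feature_func(tokens), label


calls = []


def toy_features(tokens):
    calls.append(tuple(tokens))  # record when the work actually happens
    return {word: True for word in tokens}


data = [(["good", "day"], "pos"), (["bad", "day"], "neg")]
lazy = LazyFeatureSets(toy_features, data)

print(len(calls))  # 0: nothing has been computed yet
first = lazy[0]
print(first)       # ({'good': True, 'day': True}, 'pos')
print(len(calls))  # 1: only the requested item was built
```

Only one feature dictionary exists at a time unless the caller keeps references to earlier ones, which is where the memory saving comes from.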

And my changed function:

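def extract_features(tweet):
    # apply_features passes only the tweet, so the feature list is read
    # from the module-level featureList.
    global featureList
    tweet_words = set(tweet)
    features = {}
    # Start with every feature absent...
    for word in featureList:
        features[word] = False
    # ...then mark the feature words that occur in this tweet.
    for word in tweet_words:
        if word in featureList:
            features[word] = True
    return features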
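
If featureList is a plain list (the post does not say), each `word in featureList` test scans the whole list. A possible variant, sketched below under that assumption, precomputes a frozenset for membership tests and an all-False template dictionary that is copied per tweet. The factory name make_extractor is hypothetical; the returned function keeps the same one-argument signature that apply_features expects.

```python
def make_extractor(feature_list):
    """Return an extract_features(tweet) function bound to `feature_list`."""
    feature_set = frozenset(feature_list)          # constant-time membership tests
    template = dict.fromkeys(feature_list, False)  # built once, copied per tweet

    def extract_features(tweet):
        features = template.copy()
        # Only words that are both in the tweet and in the feature list flip to True.
        for word in feature_set.intersection(tweet):
            features[word] = True
        return features

    return extract_features


extract = make_extractor(["good", "bad", "day"])
print(extract(["good", "day", "today"]))  # {'good': True, 'bad': False, 'day': True}
```

This produces the same dictionaries as the function above and avoids the global statement; if featureList is already a set, the membership tests were fast to begin with and the gain is smaller.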


2016-04-21 14:37:57 Adam
