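Classification using movie review corpus in NLTK/Python

I'm looking to do some classification in the vein of NLTK Chapter 6. The book seems to skip a step in creating the categories, and I'm not sure what I'm doing wrong. My script is below, followed by the response it produces. My issues stem primarily from the first part: category creation based on directory names. Some other questions here have used file names (i.e. pos_1.txt and neg_1.txt), but I would prefer to create directories that I can dump files into.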
from nltk.corpus import movie_reviews

reviews = CategorizedPlaintextCorpusReader('./nltk_data/corpora/movie_reviews', r'(\w+)/*.txt', cat_pattern=r'/(\w+)/.txt')
reviews.categories()
['pos', 'neg']

documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

all_words = nltk.FreqDist(
    w.lower()
    for w in movie_reviews.words()
    if w.lower() not in nltk.corpus.stopwords.words('english') and w.lower() not in string.punctuation)
word_features = all_words.keys()[:100]

def document_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains(%s)' % word] = (word in document_words)
    return features
print document_features(movie_reviews.words('pos/11.txt'))

featuresets = [(document_features(d), c) for (d,c) in documents]
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)

print nltk.classify.accuracy(classifier, test_set)
classifier.show_most_informative_features(5)
This returns:
  File "test.py", line 38, in <module>
    for w in movie_reviews.words()
  File "/usr/local/lib/python2.6/dist-packages/nltk/corpus/reader/plaintext.py", line 184, in words
    self, self._resolve(fileids, categories))
  File "/usr/local/lib/python2.6/dist-packages/nltk/corpus/reader/plaintext.py", line 91, in words
    in self.abspaths(fileids, True, True)])
  File "/usr/local/lib/python2.6/dist-packages/nltk/corpus/reader/util.py", line 421, in concat
    raise ValueError('concat() expects at least one object!')
ValueError: concat() expects at least one object!
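
One possible reading of this error, offered as a sketch rather than a confirmed diagnosis: if the corpus reader selects files by matching the file pattern against each path relative to the corpus root (which is my understanding of NLTK, but treat it as an assumption), then r'(\w+)/*.txt' matches none of the pos/... and neg/... paths, leaving the reader with no files to concatenate. The file names, `folder_pattern`, and `category_pattern` below are hypothetical and only illustrate how the regular expressions behave; only the `re` module is used.

```python
import re

# Hypothetical file ids, relative to the corpus root (pos/ and neg/ subfolders).
fileids = ["pos/cv000_29590.txt", "neg/cv001_19502.txt"]

original_pattern = r'(\w+)/*.txt'   # file pattern from the question
folder_pattern = r'\w+/.*\.txt'     # assumed alternative: <folder>/<anything>.txt
category_pattern = r'(\w+)/.*'      # captures the folder name as the category

def matching(pattern, names):
    """Return the names that the pattern matches from start to end."""
    return [n for n in names if re.match(pattern + '$', n)]

original_matches = matching(original_pattern, fileids)
folder_matches = matching(folder_pattern, fileids)

# Take group 1 (the folder name) of each matched path as its category.
categories = sorted(set(re.match(category_pattern, n).group(1)
                        for n in folder_matches))

print(original_matches)
print(folder_matches)
print(categories)
```

If that reading holds, passing something like `folder_pattern` as the file pattern and `category_pattern` as `cat_pattern` to CategorizedPlaintextCorpusReader would derive the categories from the folder names; I have not verified this against the NLTK version shown in the traceback.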
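--------- UPDATE ---------
Thanks alvas for your detailed answer! However, I have two questions.
- Is it possible to grab the category from the file path, as I am attempting to do? I was hoping to do it the same way as the review_pos.txt method, only taking pos from the folder name rather than from the file name.
- I ran your code and am getting a syntax error on

train_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag in documents[:numtrain]]
test_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag in documents[numtrain:]]

  with the caret under the first for. I'm a beginning Python user and I'm not familiar enough with that bit of syntax to troubleshoot it.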
---- UPDATE 2 ----
The error is:
  File "review.py", line 17
    for i in word_features}, tag)
      ^
SyntaxError: invalid syntax
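I would rather use my way of extracting the category of each file, but you could eat your own dog food (http://en.wikipedia.org/wiki/Eating_your_own_dog_food). As for the syntax error, could you post the error shown on the console? – alvas
Deleted - added to the original question – user3128184
Are you using py2.7 or above? The syntax seems to fail because of the dict comprehension – alvas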
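
Dict comprehensions were added in Python 2.7, and the traceback in the question points at a python2.6 installation, so the last comment looks plausible. A minimal sketch of a workaround for older interpreters is to build the same dictionary with dict() over a generator of (key, value) pairs. The sample values of `word_features`, `documents`, and `numtrain`, and the helper name `features_for`, are hypothetical stand-ins, not the answer's actual data.

```python
# Hypothetical stand-ins for the variables used in the answer's code.
word_features = ["good", "bad", "plot"]
documents = [(["good", "plot"], "pos"), (["bad", "plot"], "neg")]
numtrain = 1

def features_for(tokens):
    # dict() over a generator of (key, value) pairs gives the same result as
    # {i: (i in tokens) for i in word_features}, and is also valid on Python 2.6.
    return dict((i, (i in tokens)) for i in word_features)

train_set = [(features_for(tokens), tag) for tokens, tag in documents[:numtrain]]
test_set = [(features_for(tokens), tag) for tokens, tag in documents[numtrain:]]

print(train_set)
print(test_set)
```

Upgrading to Python 2.7 or later would make the original comprehension valid without any change.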