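First, I think this is a typo in the tutorial at http://www.nltk.org/book/ch06.html.
The names corpus cannot be accessed like a list; call names.words() to get the list of names: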
>>> from nltk.corpus import names
>>> names[:5]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: 'LazyCorpusLoader' object has no attribute '__getitem__'
>>> names.words()[:5]
[u'Abagael', u'Abagail', u'Abbe', u'Abbey', u'Abbi']
Next, take a look at what apply_features does
(https://github.com/nltk/nltk/blob/develop/nltk/classify/util.py#L28).
Basically, given a list of tuples [('input_1', 'label_1'), ..., ('input_N', 'label_N')],
it returns [(feature_func(tok), label) for (tok, label) in toks].
For example:
# To get the input list of tuples for apply_features, we do this:
>>> [(word,'female') for word in names.words('female.txt')[:10]]
[(u'Abagael', 'female'), (u'Abagail', 'female'), (u'Abbe', 'female'), (u'Abbey', 'female'), (u'Abbi', 'female'), (u'Abbie', 'female'), (u'Abby', 'female'), (u'Abigael', 'female'), (u'Abigail', 'female'), (u'Abigale', 'female')]
# Let's get 250 from female and 250 from male names.
>>> train_female = [(word,'female') for word in names.words('female.txt')[:250]]
>>> train_male = [(word,'male') for word in names.words('male.txt')[:250]]
>>> train_data = train_female + train_male
>>> apply_features(gender_features, train_data)
[({'last_letter': u'l'}, 'female'), ({'last_letter': u'l'}, 'female'), ...]
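To make that mapping concrete, here is a minimal pure-Python sketch of what apply_features computes. The helper name apply_features_eager and the three sample names are made up for illustration; as far as I know the real NLTK function returns a lazy list-like object rather than a plain list, so treat this as an approximation of its result, not its implementation.

```python
def gender_features(word):
    # Same feature extractor as in the answer: the last letter of the name.
    return {'last_letter': word[-1]}

def apply_features_eager(feature_func, labeled_tokens):
    # Hypothetical eager stand-in for nltk.classify.apply_features:
    # apply feature_func to each input and keep its label unchanged.
    return [(feature_func(tok), label) for (tok, label) in labeled_tokens]

# Hand-written sample instead of the names corpus (assumption for illustration).
train_data = [('Abagael', 'female'), ('Abbey', 'female'), ('Aaron', 'male')]
train_set = apply_features_eager(gender_features, train_data)
print(train_set)
# [({'last_letter': 'l'}, 'female'), ({'last_letter': 'y'}, 'female'), ({'last_letter': 'n'}, 'male')]
```

Each element is a (featureset, label) pair, which is the shape NaiveBayesClassifier.train expects.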
Full code to get Naive Bayes working on the names corpus in NLTK:
from nltk.corpus import names
from nltk.classify import apply_features, NaiveBayesClassifier
def gender_features(word):
    # Use the last letter of the name as the only feature.
    return {'last_letter': word[-1]}
train_female = [(word,'female') for word in names.words('female.txt')[:250]]
train_male = [(word,'male') for word in names.words('male.txt')[:250]]
train_data = train_female + train_male
train_set = apply_features(gender_features, train_data)
# Do likewise for the test set.
'''
test_female = [(word,'female') for word in names.words('female.txt')[250:]]
test_male = [(word,'male') for word in names.words('male.txt')[250:]]
test_data = test_female + test_male
test_set = apply_features(gender_features, test_data)
'''
classifier = NaiveBayesClassifier.train(train_set)
print(classifier.classify(gender_features('Neo')))
[out]:
'male'
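The commented-out test set above can be used to check how well the classifier generalizes; NLTK offers nltk.classify.accuracy(classifier, test_set) for that. The sketch below does the same computation by hand so it runs without the corpus. The rule_classify function and the four test names are hypothetical stand-ins for a trained classifier and the real test data.

```python
def gender_features(word):
    # Same feature extractor as in the answer.
    return {'last_letter': word[-1]}

def accuracy(classify, gold):
    # Fraction of (featureset, label) pairs whose predicted label matches.
    correct = [classify(fs) == label for (fs, label) in gold]
    return sum(correct) / float(len(correct))

def rule_classify(featureset):
    # Hypothetical stand-in for classifier.classify:
    # names ending in a vowel are labeled 'female', all others 'male'.
    return 'female' if featureset['last_letter'] in 'aeiou' else 'male'

# Hand-written test data (assumption for illustration).
test_data = [('Anna', 'female'), ('Neo', 'male'),
             ('John', 'male'), ('Abigail', 'female')]
test_set = [(gender_features(w), label) for (w, label) in test_data]
score = accuracy(rule_classify, test_set)
print(score)  # 0.5
```

With the real classifier, passing classifier.classify in place of rule_classify should give the accuracy on the held-out names.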
Thanks. Your explanation was very helpful. I realized the error was due to 'names[500:]'. It should be 'labeled_names[500:]'. – CSK 2015-02-13 00:01:57