How to use metadata with an NLTK classifier

As far as I can tell from the examples of NLTK classifiers I have seen:

They seem to work only with features derived from the sentence itself. So, you have...

corpus = [
    ("This is a sentence"),
    ("This is another sentence")
]

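...and you apply some feature function, such as count_words_ending_in_a_vowel(), to the sentence itself.

Instead, I would like to attach an outside piece of data to each sentence: not something derived from the text itself, but an external label, such as: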

corpus = [
    ("This is a sentence", "awesome"),
    ("This is another sentence", "not awesome")
]

or

corpus = [
    {"text": "This is a sentence", "label": "awesome"},
    {"text": "This is another sentence", "label": "not awesome"}
]

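(in case I have multiple external labels)

My question is: given that my dataset contains these external labels, how do I reformat the corpus into the format that NaiveBayesClassifier.train() expects? I understand that I also need to apply a tokenizer to the "text" field above, but what is the overall format that I should pass to the NaiveBayesClassifier.train function?

To apply it: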

classifier = nltk.NaiveBayesClassifier.train(goods) 
print(classifier.show_most_informative_features(32))

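My broader goal: I would like to see how word frequencies differentially predict the label, and which set of words is most informative in separating the labels from one another. This has something of a k-means feel to it, but I have been told I should be able to do it entirely within NLTK, and I am just having trouble expressing it in the appropriate data input format.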


2013-12-18 Mittenchops

I have had success with the following approach:

import nltk

# Each entry is (token dictionary, label); True marks a token as present.
train = [({'some': True, 'tokens': True}, 'label'),
         ({'other': True, 'word': True}, 'different label'),
         ({'cool': True, 'document': True}, 'label')]
classifier = nltk.NaiveBayesClassifier.train(train)

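So train is a list of documents, each one a tuple. The first element of each tuple is a dictionary of tokens (each token is a key whose value is True, indicating that the token is present), and the second element is the label associated with that document.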
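To connect this format back to the dictionary-style corpus in the question, here is a minimal sketch of the conversion. The helper names bag_of_words and to_training_set and the label_key parameter are hypothetical, not part of NLTK, and a plain lowercase whitespace split stands in for a real tokenizer.

```python
# Corpus in the dictionary form shown in the question.
corpus = [
    {"text": "This is a sentence", "label": "awesome"},
    {"text": "This is another sentence", "label": "not awesome"},
]


def bag_of_words(text):
    """Map each lowercased token to True, marking its presence.

    Assumption: whitespace splitting is good enough for illustration;
    a proper tokenizer could be swapped in here.
    """
    return {token: True for token in text.lower().split()}


def to_training_set(records, label_key="label"):
    """Pair each record's token dictionary with one of its external labels.

    label_key (hypothetical parameter) selects which external label to use
    when a record carries more than one.
    """
    return [(bag_of_words(record["text"]), record[label_key]) for record in records]


train = to_training_set(corpus)
print(train[0])
```

The resulting train list has the same (token dictionary, label) shape as the list passed to nltk.NaiveBayesClassifier.train above. If a record carries several external labels, one option under this sketch is to build a separate training list per label key and train one classifier for each.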


2013-12-18 23:10:42 ChrisP

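Hmm, my data is in the format you describe, and my classifier returns '>>> print classifier.show_most_informative_features(4) Most Informative Features None'. I took that to mean I had a syntax error, but it seems to mean there is something wrong with my data or model? – Mittenchops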
