NLTK: Text classification using a custom feature set

2013-09-30

featureDict = {identifier1: [[first 3-gram], [second 3-gram], ... [last 3-gram]], 
       ... 
       identifierN: [[first 3-gram], [second 3-gram], ... [last 3-gram]]}

In addition, I have a dictionary of labels for the same set of documents:

labelDict = {identifier1: label1, 
      ... 
      identifierN: labelN}

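I want to find the most suitable nltk container, one where I can store all of this information in a single place and apply the nltk classifiers to it seamlessly.

Also, before running any classifier on this dataset, I would like to apply a tf-idf filter to this feature space.

References and documentation would be helpful.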

Source

2013-09-30 asb

Answers

You only need a plain dictionary. Take a look at the snippet in "NLTK classify interface using trained classifier".
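To make the "plain dictionary" point concrete: NLTK classifiers train on a list of (feature dict, label) pairs, so the two dictionaries from the question only need to be zipped together. Below is a minimal sketch under a few assumptions: the toy data and the helper names `to_featuresets` and `train_naive_bayes` are hypothetical, each 3-gram is joined into one string so it can serve as a feature name, and the feature value is simply the 3-gram's count in the document.

```python
from collections import Counter

# Toy stand-ins for the question's featureDict and labelDict (hypothetical data).
feature_dict = {
    "doc1": [["the", "cat", "sat"], ["cat", "sat", "down"]],
    "doc2": [["stocks", "fell", "sharply"], ["fell", "sharply", "today"]],
}
label_dict = {"doc1": "pets", "doc2": "finance"}


def to_featuresets(feature_dict, label_dict):
    """Return [(feature dict, label), ...], the shape NLTK classifiers train on."""
    featuresets = []
    for identifier, trigrams in feature_dict.items():
        # Lists are unhashable, so each 3-gram is joined into a string feature name;
        # the value is how often that 3-gram occurs in the document.
        counts = Counter(" ".join(gram) for gram in trigrams)
        featuresets.append((dict(counts), label_dict[identifier]))
    return featuresets


def train_naive_bayes(featuresets):
    """Train NLTK's Naive Bayes classifier (nltk is imported only when called)."""
    import nltk
    return nltk.NaiveBayesClassifier.train(featuresets)


featuresets = to_featuresets(feature_dict, label_dict)
```

With nltk installed, `train_naive_bayes(featuresets)` returns a classifier whose `classify()` method takes a feature dict of the same shape. Note that NLTK's Naive Bayes treats feature values as discrete categories, so counts are not interpreted numerically.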

The reference documentation for this case is still the NLTK book: http://nltk.org/book/ch06.html and the API specification: http://nltk.org/api/nltk.classify.html

Here are some pages that may help you: http://snipperize.todayclose.com/snippet/py/Use-NLTK-Toolkit-to-Classify-Documents--5671027/, http://streamhacker.com/tag/feature-extraction/, http://web2dot5.wordpress.com/2012/03/21/text-classification-in-python/.

Also, keep in mind that nltk is limited in the classifier algorithms it provides. For more advanced exploration, you are better off using scikit-learn.
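For the tf-idf part of the question, one possible scikit-learn route is sketched below. It is not the only way to do it: the helper names `trigram_counts`, `build_tfidf_classifier` and `fit_on_dicts` are hypothetical, and the choice of `MultinomialNB` as the final estimator is just an assumption for illustration. The pipeline turns per-document count dicts into a sparse matrix with `DictVectorizer`, re-weights it with `TfidfTransformer`, and fits the classifier.

```python
from collections import Counter


def trigram_counts(trigrams):
    """Map one document's list of 3-grams to a {3-gram string: count} dict."""
    return dict(Counter(" ".join(gram) for gram in trigrams))


def build_tfidf_classifier():
    """DictVectorizer -> TfidfTransformer -> MultinomialNB (sklearn imported lazily)."""
    from sklearn.feature_extraction import DictVectorizer
    from sklearn.feature_extraction.text import TfidfTransformer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline
    return make_pipeline(DictVectorizer(), TfidfTransformer(), MultinomialNB())


def fit_on_dicts(feature_dict, label_dict):
    """Fit the pipeline on dictionaries shaped like featureDict and labelDict."""
    identifiers = sorted(feature_dict)  # fixed order keeps X and y aligned
    X = [trigram_counts(feature_dict[i]) for i in identifiers]
    y = [label_dict[i] for i in identifiers]
    return build_tfidf_classifier().fit(X, y)
```

Here tf-idf re-weights the 3-gram counts rather than dropping features, so using it as an actual filter would need an extra selection step. If you would rather stay inside the NLTK interface, `nltk.classify.SklearnClassifier` wraps a scikit-learn estimator so it can be trained on (feature dict, label) pairs.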

Source

2013-10-01 15:41:55
