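I am trying to do text classification with a Naive Bayes text classifier. My data is in the format shown below, and I have to decide the topic of each question based on its question text and excerpt. The training data has more than 20K records. I know an SVM would do better here, but I want to use Naive Bayes from the sklearn library. How do I use sklearn for Naive Bayes text classification?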

{[{"topic":"electronics","question":"What is the effective differencial effective of this circuit","excerpt":"I'm trying to work out, in general terms, the effective capacitance of this circuit (see diagram: http://i.stack.imgur.com/BS85b.png). \n\nWhat is the effective capacitance of this circuit and will the ...\r\n  "}, 
{"topic":"electronics","question":"Outlet Installation--more wires than my new outlet can use [on hold]","excerpt":"I am replacing a wall outlet with a Cooper Wiring USB outlet (TR7745). The new outlet has 3 wires coming out of it--a black, a white, and a green. Each one needs to be attached with a wire nut to ...\r\n  "}]} 

This is what I have tried so far:

import numpy as np 
import json 
from sklearn.naive_bayes import BernoulliNB

topic = [] 
question = [] 
excerpt = [] 

with open('training.json') as f:
    for line in f:
        data = json.loads(line)
        topic.append(data["topic"])
        question.append(data["question"])
        excerpt.append(data["excerpt"])

unique_topics = list(set(topic)) 
new_topic = [x.encode('UTF8') for x in topic] 
numeric_topics = [name.replace('gis', '1').replace('security', '2').replace('photo', '3').replace('mathematica', '4').replace('unix', '5').replace('wordpress', '6').replace('scifi', '7').replace('electronics', '8').replace('android', '9').replace('apple', '10') for name in new_topic] 
numeric_topics = [float(i) for i in numeric_topics] 

x1 = np.array(question) 
x2 = np.array(excerpt) 
X = zip(*[x1,x2]) 
Y = np.array(numeric_topics) 
print X[0] 
clf = BernoulliNB() 
clf.fit(X, Y) 
print "Prediction:", clf.predict(['hello']) 

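As expected, though, I get a ValueError: could not convert string to float. My question is: how do I create a simple classifier that assigns questions and excerpts to their related topics?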

Answers

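All classifiers in sklearn require the input to be represented as vectors of some fixed dimensionality. For text there are CountVectorizer, HashingVectorizer, and TfidfVectorizer, which can transform your strings into vectors of floating-point numbers.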

from sklearn.feature_extraction.text import TfidfVectorizer

vect = TfidfVectorizer()
X = vect.fit_transform(X)
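
The vectorizers expect each sample to be a single string, so the question and excerpt of a record have to be merged before vectorizing. A minimal sketch, assuming the two fields are simply joined with a space; the `question` and `excerpt` lists here are short made-up stand-ins for the ones built in the question, and `texts` is a name introduced only for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up stand-ins for the lists loaded from training.json in the question.
question = [
    "What is the effective capacitance of this circuit",
    "Outlet installation, more wires than my new outlet can use",
]
excerpt = [
    "I'm trying to work out the effective capacitance of this circuit",
    "I am replacing a wall outlet with a USB outlet",
]

# One plain string per record (assumption: concatenation is good enough here).
texts = [q + " " + e for q, e in zip(question, excerpt)]

vect = TfidfVectorizer()
X = vect.fit_transform(texts)  # sparse matrix with one row per record
print(X.shape[0])
```

Joining the fields treats both as one bag of words; vectorizing them separately and stacking the results would be another option.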

Obviously, you need to vectorize the test set in the same way:

clf.predict(vect.transform(['hello'])) 
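
To illustrate what "the same way" means, here is a small sketch with made-up sentences (`train_texts` and `test_texts` are illustrative names, not from the question). The vectorizer is fitted once on the training text and only `transform` is called on new text, so both matrices have the same columns:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up example sentences.
train_texts = ["effective capacitance of this circuit", "replacing a wall outlet"]
test_texts = ["hello circuit"]

vect = TfidfVectorizer()
X_train = vect.fit_transform(train_texts)  # learns the vocabulary and IDF weights
X_test = vect.transform(test_texts)        # reuses them; nothing is refitted

# Same number of columns, so a classifier fitted on X_train can accept X_test.
print(X_train.shape[1] == X_test.shape[1])
# "hello" was never seen during fitting, so it has no column of its own.
print("hello" in vect.vocabulary_)
```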

See the tutorial on using sklearn with textual data.
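
Putting the pieces together, an end-to-end version might look like the sketch below. It is written for Python 3 and makes a few assumptions that are not in the question: the four training sentences are invented, MultinomialNB is swapped in for the question's BernoulliNB as a common choice for TF-IDF features, and the topic names are passed as plain strings, which sklearn classifiers accept, so the manual mapping to numbers is not needed.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny invented training set; the real data would come from training.json.
train_texts = [
    "What is the effective capacitance of this circuit",
    "Replacing a wall outlet with more wires than the new outlet can use",
    "Which lens aperture gives a shallow depth of field",
    "How to reduce noise in long exposure night shots",
]
train_topics = ["electronics", "electronics", "photo", "photo"]

# The pipeline vectorizes the text and then fits the classifier on the result.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_texts, train_topics)  # string labels are used directly

# New text goes through the same fitted vectorizer before prediction.
pred = model.predict(["capacitance of a circuit with a new outlet"])
print("Prediction:", pred[0])
```

Wrapping both steps in a pipeline keeps the vectorizer and the classifier together, so the test data cannot accidentally be vectorized differently from the training data.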

I get the error AttributeError: 'tuple' object has no attribute 'lower' when using X = vect.fit_transform(X), where X is an iterable list.

It was a numpy array problem. I fixed it. Thank you very much for your help.