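I think I more or less understand naive Bayes, but I have a few questions about implementing it for a simple binary text classification task. Basic concepts: classification with the naive Bayes algorithm.
Suppose a document D_i is some subset of the vocabulary x_1, x_2, ..., x_n.
There are two classes c_i that any document can fall into, and I want to compute P(c_i|D) for some input document D, which is proportional to P(D|c_i)P(c_i).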
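Written out, and assuming (as naive Bayes does) that the words are conditionally independent given the class, the quantity being compared across classes can be sketched as:

```latex
P(c_i \mid D) \;\propto\; P(c_i)\, P(D \mid c_i) \;=\; P(c_i) \prod_{j=1}^{n} P(x_j \mid c_i)
```

The last equality holds only under that independence assumption; the product runs over the words x_j that occur in D.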
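I have three questions:
- Is P(c_i) equal to #docs in c_i / #total docs, or to #words in c_i / #total words?
- Should P(x_j|c_i) be #times x_j appears in D / #times x_j appears in c_i?
- Suppose some x_j does not exist in the training set. Do I give it a probability of 1 so that it does not alter the calculation?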
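A minimal sketch of how these quantities are commonly estimated in the multinomial variant of naive Bayes, assuming the prior comes from document counts and that additive (Laplace) smoothing is used for words missing from a class. The function names `class_priors` and `word_likelihood` and the `alpha` parameter are illustrative and not from the original post.

```python
from collections import Counter


def class_priors(labels):
    """P(c) estimated as (#docs labelled c) / (#docs in total)."""
    counts = Counter(labels)
    total = len(labels)
    return {c: n / total for c, n in counts.items()}


def word_likelihood(word, class_word_counts, vocab_size, alpha=1.0):
    """P(word | c) with additive smoothing (an assumption here; alpha=1 is Laplace)."""
    total_words_in_class = sum(class_word_counts.values())
    numerator = class_word_counts.get(word, 0) + alpha
    return numerator / (total_words_in_class + alpha * vocab_size)


# Toy counts: one "good" document containing "hello world", three vocabulary words overall.
priors = class_priors(["good", "bad"])
seen = word_likelihood("hello", {"hello": 1, "world": 1}, vocab_size=3)
unseen = word_likelihood("again", {"hello": 1, "world": 1}, vocab_size=3)
```

Giving a word that never occurs anywhere in training a factor of 1 for every class is equivalent to skipping it, so it does not change which class wins. Smoothing matters for words that occur in one class but not the other, where an unsmoothed estimate would be zero.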
For example, let's say I have a training set:
training = [("hello world", "good"),
("bye world", "bad")]
so the classes must be
good_class = {"hello": 1, "world": 1}
bad_class = {"bye": 1, "world": 1}
all = {"hello": 1, "world": 2, "bye": 1}
So now, if I want to compute the probability that a test string is good:
test1 = ["hello", "again"]
p_good = sum(good_class.values())/sum(all.values())
p_hello_good = good_class["hello"]/all["hello"]
p_again_good = 1 # because "again" doesn't exist in our training set
p_test1_good = p_good * p_hello_good * p_again_good
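For comparison, here is a hedged end-to-end sketch of scoring the same test string with the usual multinomial estimates. It assumes add-one smoothing, sums log-probabilities instead of multiplying raw probabilities, and skips words outside the training vocabulary. The helper `log_score` and the `alpha` parameter are hypothetical names introduced for illustration.

```python
import math
from collections import Counter

training = [("hello world", "good"), ("bye world", "bad")]

# Per-class word counts and per-class document counts.
word_counts = {}
doc_counts = Counter()
for text, label in training:
    doc_counts[label] += 1
    word_counts.setdefault(label, Counter()).update(text.split())

# Vocabulary = every word seen in any training document.
vocab = {w for counts in word_counts.values() for w in counts}


def log_score(words, label, alpha=1.0):
    """log P(c) + sum of log P(x_j | c), skipping out-of-vocabulary words."""
    counts = word_counts[label]
    denom = sum(counts.values()) + alpha * len(vocab)
    score = math.log(doc_counts[label] / sum(doc_counts.values()))
    for w in words:
        if w in vocab:  # "again" is skipped: it carries no class information
            score += math.log((counts[w] + alpha) / denom)
    return score


test1 = ["hello", "again"]
scores = {label: log_score(test1, label) for label in word_counts}
predicted = max(scores, key=scores.get)
```

Working in log space avoids numerical underflow when documents are long, and the class with the largest score is the prediction.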
No, P(xⱼ|cᵢ) is the frequency of xⱼ in class cᵢ, divided by the total number of terms in all documents of the class. – 2014-09-21 14:06:22
@larsmans Sorry, I didn't notice.... – Devavrata 2014-09-22 17:20:29
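The estimate described in the first comment can be written out as follows, where count(x, c) is notation introduced here for the number of times word x occurs in the training documents of class c:

```latex
P(x_j \mid c_i) = \frac{\operatorname{count}(x_j, c_i)}{\sum_{k} \operatorname{count}(x_k, c_i)},
\qquad\text{e.g.}\qquad
P(\text{hello} \mid \text{good}) = \frac{1}{1 + 1} = \frac{1}{2}
```

By contrast, the ratio in the question, count of "hello" in the good class divided by its count across all classes, is closer to an estimate of P(good | hello) than of P(hello | good).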