2014-09-21

I think I more or less understand Naive Bayes, but I have a few questions about implementing it for a simple binary text-classification task. Basic concept: classification with the Naive Bayes algorithm.

Suppose a document D_i is some subset of the vocabulary x_1, x_2, ..., x_n.

There are two classes c_i that any document can fall into, and I want to compute P(c_i|D) for some input document D, which is proportional to P(D|c_i)P(c_i); under the naive independence assumption this factorizes as P(c_i) * prod_j P(x_j|c_i).

I have three questions:

  1. Should P(c_i) be #docs in c_i / #total docs, or #words in c_i / #total words?
  2. Should P(x_j|c_i) be #times x_j appears in D / #times x_j appears in c_i?
  3. Suppose an x_j does not appear in the training set at all; do I give it a probability of 1, so that it does not change the calculation?

For example, let's say I have the training set:

training = [("hello world", "good"),
            ("bye world", "bad")]

so that the classes would be

good_class = {"hello": 1, "world": 1}
bad_class = {"bye": 1, "world": 1}
all = {"hello": 1, "world": 2, "bye": 1}

So now, if I want to compute the probability that a test string is good:

test1 = ["hello", "again"] 
p_good = sum(good_class.values())/sum(all.values()) 
p_hello_good = good_class["hello"]/all["hello"] 
p_again_good = 1 # because "again" doesn't exist in our training set 

p_test1_good = p_good * p_hello_good * p_again_good 

Answers


Since this question is rather broad, I can only answer it in a limited way:

1: Should P(c_i) be #docs in c_i / #total docs, or #words in c_i / #total words?

P(c_i) = #docs in c_i / #total docs
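
For the toy training set above, a minimal sketch of that prior in Python (counting whole documents, not words) could be:

from collections import Counter

training = [("hello world", "good"),
            ("bye world", "bad")]

# Prior: fraction of *documents* labelled with each class.
doc_counts = Counter(label for _, label in training)
p_good = doc_counts["good"] / len(training)  # 1/2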

2: Is P(x_j|c_i) equal to #times x_j appears in D / #times x_j appears in c_i?
As @larsmans noted:

It is exactly the number of occurrences of the word in class c_i,
divided by the total number of words in that class in the whole dataset.
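
So the denominator is the total word count of the class, not the count of x_j in the whole corpus. A minimal, self-contained sketch for the same toy data:

from collections import Counter

training = [("hello world", "good"),
            ("bye world", "bad")]

# All word occurrences in documents of the "good" class.
good_words = Counter()
for text, label in training:
    if label == "good":
        good_words.update(text.split())

total_good = sum(good_words.values())            # 2 words in the "good" class
p_hello_good = good_words["hello"] / total_good  # 1/2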

3: Suppose an x_j does not appear in the training set at all; do I give it a probability of 1, so that it does not change the calculation?

No. For that we have the Laplace correction (additive smoothing). It is applied as

P(x_j|c_i) = (#times x_j appears in c_i + 1) / (#total words in c_i + |V|)

where |V| is the size of the vocabulary; this neutralizes the effect of features that never occur in the training set.
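
Putting the three pieces together, here is a minimal sketch of the whole smoothed classifier on the toy data; the helper name log_score is mine, not from the post:

from collections import Counter
import math

training = [("hello world", "good"),
            ("bye world", "bad")]

# Document counts per class, and word counts per class.
doc_counts = Counter(label for _, label in training)
word_counts = {label: Counter() for label in doc_counts}
for text, label in training:
    word_counts[label].update(text.split())

vocab = {w for counts in word_counts.values() for w in counts}  # |V| = 3 here

def log_score(words, label):
    # log P(c_i) + sum_j log P(x_j|c_i), with add-one (Laplace) smoothing:
    # P(x_j|c_i) = (count of x_j in c_i + 1) / (total words in c_i + |V|)
    total = sum(word_counts[label].values())
    score = math.log(doc_counts[label] / len(training))
    for w in words:
        score += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
    return score

test1 = ["hello", "again"]
print(max(doc_counts, key=lambda c: log_score(test1, c)))  # prints "good"

Working in log space avoids floating-point underflow once documents are longer than a few words.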

No, P(x_j|c_i) is the frequency of x_j in class c_i, divided by the total number of terms over all documents in that class. – larsmans 2014-09-21 14:06:22


@larsmans Sorry, I hadn't noticed.... – Devavrata 2014-09-22 17:20:29