2012-12-18

Naive Bayes calculation in Perl/Moose

Below is some code I wrote to calculate the probability of a label given some observed features, using a Naive Bayes classifier. It computes the Naive Bayes formula without smoothing, and is intended to calculate the actual probabilities, so it uses the usually-omitted denominator. The problem I'm running into is that for the example below, the "probability" of the 'good' label comes out greater than 1 (1.30612245). Can anyone help me understand what that means? Is it a by-product of the naive independence assumption?

package NaiveBayes; 

use Moose; 

has class_counts => (is => 'ro', isa => 'HashRef[Int]', default => sub {{}}); 
has class_feature_counts => (is => 'ro', isa => 'HashRef[HashRef[HashRef[Num]]]', default => sub {{}}); 
has feature_counts => (is => 'ro', isa => 'HashRef[HashRef[Num]]', default => sub {{}}); 
has total_observations => (is => 'rw', isa => 'Num'); 

sub insert { 
    my ($self, $class, $data) = @_;
    $self->class_counts->{$class}++;
    $self->total_observations(($self->total_observations || 0) + 1);
    for (keys %$data) {
        $self->feature_counts->{$_}{ $data->{$_} }++;
        $self->class_feature_counts->{$_}{$class}{ $data->{$_} }++;
    }
    return $self;
}

sub classify { 
    my ($self, $data) = @_;
    my %probabilities;
    for my $class (keys %{ $self->class_counts }) {
        my $class_count       = $self->class_counts->{$class};
        my $class_probability = $class_count / $self->total_observations;
        my ($feature_probability, $conditional_probability) = (1) x 2;
        for (keys %$data) {
            my $feature_count       = $self->feature_counts->{$_}{ $data->{$_} };
            my $class_feature_count = $self->class_feature_counts->{$_}{$class}{ $data->{$_} } || 0;
            next unless $feature_count;
            $feature_probability     *= $feature_count / $self->total_observations;
            $conditional_probability *= $class_feature_count / $class_count;
        }
        $probabilities{$class} = $class_probability * $conditional_probability / $feature_probability;
    }
    return %probabilities;
}

__PACKAGE__->meta->make_immutable; 
1; 

Example:

#!/usr/bin/env perl 

use strict; 
use warnings; 
use NaiveBayes; 

my $nb = NaiveBayes->new; 

$nb->insert('good' , {browser => 'chrome' ,host => 'yahoo' ,country => 'us'}); 
$nb->insert('bad' , {browser => 'chrome' ,host => 'slashdot' ,country => 'us'}); 
$nb->insert('good' , {browser => 'chrome' ,host => 'slashdot' ,country => 'uk'}); 
$nb->insert('good' , {browser => 'explorer' ,host => 'google' ,country => 'us'}); 
$nb->insert('good' , {browser => 'explorer' ,host => 'slashdot' ,country => 'ca'}); 
$nb->insert('good' , {browser => 'opera' ,host => 'google' ,country => 'ca'}); 
$nb->insert('good' , {browser => 'firefox' ,host => '4chan' ,country => 'us'}); 
$nb->insert('good' , {browser => 'opera' ,host => '4chan' ,country => 'ca'}); 

my %classes = $nb->classify({browser => 'opera', host => '4chan', country =>'uk'}); 

my @classes = sort { $classes{$a} <=> $classes{$b} } keys %classes; 

for(@classes){ 
    printf("%-20s : %5.8f\n", $_, $classes{$_}); 
} 

Prints:

bad     : 0.00000000 
good     : 1.30612245 

I'm less worried about the probability of 0, and more about the "probability" of 'good' being greater than 1. I believe this implements the classic Naive Bayes definition:

p(C|F_1 ... F_n) = ( p(C) p(F_1|C) ... p(F_n|C) ) / ( p(F_1) ... p(F_n) ) 

How can this be greater than 1?
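For reference, the 1.30612245 score can be reproduced by hand from the example's counts in a standalone sketch (this is not part of the classifier, just the same arithmetic spelled out): 7 of the 8 observations are 'good', the query features opera/4chan/uk occur 2, 2 and 1 times overall, and every one of those occurrences happens to fall in a 'good' observation.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hand-computed 'good' score for the query {opera, 4chan, uk},
# using the counts accumulated from the 8 example observations.
my $prior       = 7 / 8;                   # p(good)
my $likelihood  = (2/7) * (2/7) * (1/7);   # p(opera|good) p(4chan|good) p(uk|good)
my $denominator = (2/8) * (2/8) * (1/8);   # p(opera) p(4chan) p(uk), treated as independent

printf "good : %.8f\n", $prior * $likelihood / $denominator;   # prints 1.30612245
```

The denominator here is a product of per-feature marginals, which is what lets the ratio exceed 1.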

Answer


It has been too long since I used Perl seriously for me to debug this properly, but I think I can see where the problem is. The marginal probability of the feature vector, p(f_1 ... f_n), is not computed the way you have done it, as an independent calculation with separate parameters. Instead, if you have classes c_1 and c_2, with priors p(c_1) and p(c_2) and likelihood terms p(f|c_1) and p(f|c_2), then the marginal probability of f is:

p(f) = p(c_1) p(f|c_1) + p(c_2) p(f|c_2)

This is why the denominator is usually dropped: it involves only a sum of quantities you are already computing. Any relative probabilities you care about can be computed as ratios of the unnormalized scores, so the normalizing constant only needs to be computed if you explicitly want a number between 0 and 1.
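As a concrete sketch of that fix, using the question's example counts (a standalone script, not a patch to the class): summing the unnormalized prior-times-likelihood scores gives the evidence p(f), and dividing by it yields posteriors that are proper probabilities.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Normalize by the evidence p(f) = sum over classes of p(c) p(f|c),
# i.e. the sum of the unnormalized scores, rather than by an
# independence-based product of per-feature marginals.
my %score = (
    good => (7/8) * (2/7) * (2/7) * (1/7),   # p(good) p(f|good)
    bad  => (1/8) * 0,                       # a query feature never co-occurs with 'bad'
);

my $evidence = 0;
$evidence += $_ for values %score;

for my $class (sort { $score{$a} <=> $score{$b} } keys %score) {
    printf "%-20s : %5.8f\n", $class, $score{$class} / $evidence;
}
```

With these counts the posteriors come out as bad = 0.00000000 and good = 1.00000000, which is what a normalized classifier should report for this query.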