Comparing every document against every other document in Perl's Text::DocumentCollection

Given a collection of documents in a Perl Text::DocumentCollection, I want to use Text::Document to compute the cosine similarity between any two documents in the collection.

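I think this can be done with EnumerateV and a callback, but I can't work out the details. (This SO question was helpful, but I'm still stuck.)

Specifically, suppose the collection is stored in test.db as follows: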

#!/usr/bin/perl
use strict;
use warnings;
use Text::DocumentCollection;
use Text::Document;

# Create a persistent collection backed by test.db.
my $c = Text::DocumentCollection->new(file => 'test.db');

my $text = 'Stack Overflow is a programming | Q & A site that’s free. Free to ask | questions, free to answer questions|, free to read, free to index';

# Each |-separated chunk becomes one document, keyed 1, 2, 3, ...
my @strings = split /\|/, $text;
my $i = 0;

foreach my $string (@strings) {
    my $doc = Text::Document->new();
    $doc->AddContent($string);
    $c->Add(++$i, $doc);
}

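Now suppose I need to read test.db and compute the cosine similarity for every pair of documents. (I have no access to the documents created in the code above other than through the stored database file.)

I think the answer is to write a subroutine that is passed to EnumerateV as the callback, and I'm guessing that subroutine also calls EnumerateV itself, but I haven't been able to work it out.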


2011-12-08 itzy

You might want to start with something like this:

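my %dist;   # pair key "k1,k2" => cosine similarity

$c->EnumerateV(sub {
    my ($c, $k1, $d1) = @_;
    $c->EnumerateV(sub {
        my ($c, $k2, $d2) = @_;
        # Skip pairs already computed in the other order.
        return if exists $dist{"$k1,$k2"};
        # The comma keeps keys such as (1, 12) and (11, 2) distinct.
        $dist{"$k1,$k2"} = $dist{"$k2,$k1"} = $d1->CosineSimilarity($d2);
        return;
    });
    return;
});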
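
To show what the nested enumeration computes, here is a minimal self-contained sketch that does not use Text::Document at all. Plain term-frequency hashes stand in for the documents, and the helper names term_freq, cosine, and pairwise are hypothetical, chosen only for this illustration. It visits each unordered pair once and stores the result under a comma-separated key.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: build a term-frequency hash from a string.
sub term_freq {
    my ($text) = @_;
    my %tf;
    $tf{lc $_}++ for $text =~ /\w+/g;
    return \%tf;
}

# Cosine similarity of two term-frequency hashes:
# dot(x, y) / (|x| * |y|), or 0 if either vector is empty.
sub cosine {
    my ($x, $y) = @_;
    my ($dot, $nx, $ny) = (0, 0, 0);
    for my $t (keys %$x) {
        $nx  += $x->{$t} ** 2;
        $dot += $x->{$t} * $y->{$t} if exists $y->{$t};
    }
    $ny += $_ ** 2 for values %$y;
    return 0 unless $nx && $ny;
    return $dot / sqrt($nx * $ny);
}

# Compare every unordered pair exactly once.
# Takes a hashref (key => term-frequency hashref) and
# returns a hashref of "k1,k2" => similarity.
sub pairwise {
    my ($docs) = @_;
    my @keys = sort keys %$docs;
    my %sim;
    for my $i (0 .. $#keys - 1) {
        for my $j ($i + 1 .. $#keys) {
            my ($k1, $k2) = @keys[$i, $j];
            $sim{"$k1,$k2"} = cosine($docs->{$k1}, $docs->{$k2});
        }
    }
    return \%sim;
}

my %docs = (
    1 => term_freq('free to ask questions'),
    2 => term_freq('free to answer questions'),
    3 => term_freq('programming site'),
);
my $sim = pairwise(\%docs);
printf "%s => %.3f\n", $_, $sim->{$_} for sort keys %$sim;
```

Documents 1 and 2 share three of their four terms, so the pair scores 0.750, while both pairs involving document 3 score 0.000. With a real collection, the two index loops correspond to the outer and inner EnumerateV callbacks, and the keys are the ones passed to Add.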


2012-02-07 12:21:54 seano
