2012-10-18 70 views

MarkLogic: score calculation

I have the following XML file:

<?xml version="1.0" encoding="UTF-8"?> 
<data> 
<text>We are a doing nothing here you can say it time pass. what are you doing doing doing doing doing time?</text> 
<text>We are a doing nothing here you can say it time pass. what are you doing doing doing doing doing time?</text> 
</data> 

Now I run the following query:

let $hits :=
  let $terms :=
    let $node := xdmp:document-filter(doc("/content/C/Documents and Settings/vimleshm/Desktop/abc.xml"))
    return
      (cts:distinctive-terms($node,
        <options xmlns="cts:distinctive-terms"
                 xmlns:db="http://marklogic.com/xdmp/database">
          <use-db-config>false</use-db-config>
          <score>logtf</score>
          <max-terms>100</max-terms>
          <db:word-searches>true</db:word-searches>
          <db:stemmed-searches>off</db:stemmed-searches>
          <db:fast-phrase-searches>false</db:fast-phrase-searches>
          <db:fast-element-word-searches>false</db:fast-element-word-searches>
          <db:fast-element-phrase-searches>false</db:fast-element-phrase-searches>
        </options>)//cts:term)
  for $wq in $terms
  where $wq/cts:word-query
  return element word {
    attribute score { $wq/@score },
    $wq/cts:word-query/cts:text/string() }
return
  let $x :=
    for $hit in $hits
    return $hit
  return $x

It gives me the following response:

<?xml version="1.0" encoding="UTF-8"?> 
<results warning="more than one root item"> 
    <word score="36864">doing</word> 
    <word score="26624">text</word> 
    <word score="26624">you</word> 
    <word score="26624">time</word> 
    <word score="26624">are</word> 
    <word score="22528">a</word> 
    <word score="22528">we</word> 
    <word score="22528">it</word> 
    <word score="22528">data</word> 
    <word score="22528">can</word> 
    <word score="22528">pass</word> 
    <word score="22528">here</word> 
    <word score="22528">nothing</word> 
    <word score="22528">what</word> 
    <word score="22528">say</word> 
</results> 

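Can someone tell me how this score [log(term frequency)] is calculated? In the example above, for instance, "doing" occurs 12 times out of 42 words in total.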

Here are all the terms and their frequencies [given in brackets] for the document above:

doing [12] 
you [4] 
time [4] 
are [4] 
a [2] 
We [2] 
nothing [2] 
here [2] 
can [2] 
say [2] 
it [2] 
pass [2] 
what [2] 
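
The counts above can be double-checked with a short script. This is only a sketch in plain Python, not MarkLogic's tokenizer: it assumes words are runs of ASCII letters and lowercases them (so "We" is counted as "we"), and it only looks at the text content of the two <text> elements, not at element names. The name term_frequencies is made up for this illustration.

```python
import re
from collections import Counter

# Text content of one <text> element from the sample document.
TEXT = ("We are a doing nothing here you can say it time pass. "
        "what are you doing doing doing doing doing time?")


def term_frequencies(texts):
    """Count lowercase word occurrences across all given strings.

    Assumption: a "word" is a run of ASCII letters. This is a simple
    stand-in for illustration, not MarkLogic's tokenization.
    """
    counts = Counter()
    for t in texts:
        counts.update(re.findall(r"[a-z]+", t.lower()))
    return counts


# The sample document contains the same <text> element twice.
freqs = term_frequencies([TEXT, TEXT])

for word, n in freqs.most_common():
    print(f"{word} [{n}]")
print("total words:", sum(freqs.values()))
```

Under these assumptions the script reports 13 distinct words and 42 words in total, with "doing" counted 12 times, which matches the list in the question.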

Answer


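http://docs.marklogic.com/guide/search-dev/relevance#chapter is of course the best place to start. There is more to the score than logTF. There is also:

  • IDF: how common are these terms in the database?
  • Document length normalization: longer documents tend to match a word more often than shorter ones, so the score is scaled down by document length.
  • And logTF is actually a stepped function of the natural log of TF (for speed).

All of these work together to make scoring accurate but fast.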
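
To see that the reported numbers are not simply proportional to ln(TF), the frequencies and scores from the question can be compared directly. This is only a sketch: it does not reproduce MarkLogic's formula (which is not given here), and the choice of "we" as the baseline term is arbitrary.

```python
import math

# term -> (frequency counted in the question, score reported in the response)
observed = {
    "doing": (12, 36864),
    "you": (4, 26624),
    "we": (2, 22528),
}

# Use the least frequent term as the baseline for both ratios.
base_tf, base_score = observed["we"]

ratios = {}
for term, (tf, score) in observed.items():
    ln_ratio = math.log(tf) / math.log(base_tf)  # growth if score were k * ln(tf)
    score_ratio = score / base_score             # growth actually reported
    ratios[term] = (ln_ratio, score_ratio)
    print(f"{term}: ln(tf) ratio = {ln_ratio:.2f}, score ratio = {score_ratio:.2f}")
```

If the score were a plain multiple of ln(TF), the two ratios would be equal for each term. They are not: the reported scores grow much more slowly than ln(TF) does, so ln(TF) alone does not account for the numbers.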