2011-11-08 40 views
4

Solr scoring - fieldNorm

I have the following records, shown with the scores they get when I search for "iphone":

Record 1: FieldName - DisplayName: "Iphone" FieldName - Name: "Iphone"

11.654595 = (MATCH) sum of:
  11.654595 = (MATCH) max plus 0.01 times others of:
    7.718274 = (MATCH) weight(DisplayName:iphone^10.0 in 915195), product of:
      0.6654692 = queryWeight(DisplayName:iphone^10.0), product of:
        10.0 = boost
        11.598244 = idf(docFreq=484, maxDocs=19431244)
        0.0057376726 = queryNorm
      11.598244 = (MATCH) fieldWeight(DisplayName:iphone in 915195), product of:
        1.0 = tf(termFreq(DisplayName:iphone)=1)
        11.598244 = idf(docFreq=484, maxDocs=19431244)
        1.0 = fieldNorm(field=DisplayName, doc=915195)
    11.577413 = (MATCH) weight(Name:iphone^15.0 in 915195), product of:
      0.99820393 = queryWeight(Name:iphone^15.0), product of:
        15.0 = boost
        11.598244 = idf(docFreq=484, maxDocs=19431244)
        0.0057376726 = queryNorm
      11.598244 = (MATCH) fieldWeight(Name:iphone in 915195), product of:
        1.0 = tf(termFreq(Name:iphone)=1)
        11.598244 = idf(docFreq=484, maxDocs=19431244)
        1.0 = fieldNorm(field=Name, doc=915195)

Record 2: FieldName - DisplayName: "The Iphone Book" FieldName - Name: "The Iphone Book"

7.284122 = (MATCH) sum of:
  7.284122 = (MATCH) max plus 0.01 times others of:
    4.823921 = (MATCH) weight(DisplayName:iphone^10.0 in 453681), product of:
      0.6654692 = queryWeight(DisplayName:iphone^10.0), product of:
        10.0 = boost
        11.598244 = idf(docFreq=484, maxDocs=19431244)
        0.0057376726 = queryNorm
      7.2489023 = (MATCH) fieldWeight(DisplayName:iphone in 453681), product of:
        1.0 = tf(termFreq(DisplayName:iphone)=1)
        11.598244 = idf(docFreq=484, maxDocs=19431244)
        0.625 = fieldNorm(field=DisplayName, doc=453681)
    7.2358828 = (MATCH) weight(Name:iphone^15.0 in 453681), product of:
      0.99820393 = queryWeight(Name:iphone^15.0), product of:
        15.0 = boost
        11.598244 = idf(docFreq=484, maxDocs=19431244)
        0.0057376726 = queryNorm
      7.2489023 = (MATCH) fieldWeight(Name:iphone in 453681), product of:
        1.0 = tf(termFreq(Name:iphone)=1)
        11.598244 = idf(docFreq=484, maxDocs=19431244)
        0.625 = fieldNorm(field=Name, doc=453681)

Record 3: FieldName - DisplayName: "iPhone" FieldName - Name: "iPhone"

7.284122 = (MATCH) sum of:
  7.284122 = (MATCH) max plus 0.01 times others of:
    4.823921 = (MATCH) weight(DisplayName:iphone^10.0 in 5737775), product of:
      0.6654692 = queryWeight(DisplayName:iphone^10.0), product of:
        10.0 = boost
        11.598244 = idf(docFreq=484, maxDocs=19431244)
        0.0057376726 = queryNorm
      7.2489023 = (MATCH) fieldWeight(DisplayName:iphone in 5737775), product of:
        1.0 = tf(termFreq(DisplayName:iphone)=1)
        11.598244 = idf(docFreq=484, maxDocs=19431244)
        0.625 = fieldNorm(field=DisplayName, doc=5737775)
    7.2358828 = (MATCH) weight(Name:iphone^15.0 in 5737775), product of:
      0.99820393 = queryWeight(Name:iphone^15.0), product of:
        15.0 = boost
        11.598244 = idf(docFreq=484, maxDocs=19431244)
        0.0057376726 = queryNorm
      7.2489023 = (MATCH) fieldWeight(Name:iphone in 5737775), product of:
        1.0 = tf(termFreq(Name:iphone)=1)
        11.598244 = idf(docFreq=484, maxDocs=19431244)
        0.625 = fieldNorm(field=Name, doc=5737775)

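Why do Record 2 and Record 3 have the same score when Record 2 has 3 words and Record 3 has only one? Record 3 should be more relevant than Record 2. Why is the fieldNorm the same for Record 2 and Record 3?

Query parser: dismax. Field type: the default "text" field type, as in solrconfig.xml.

Adding the data feed:

Record 1: Iphone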

{ 
     "ListPrice":1184.526, 
     "ShipsTo":1, 
     "OID":"190502", 
     "EAN":"9780596804299", 
     "ISBN":"0596804296", 
     "Author":"Pogue, David", 
     "product_type_fq":"Books", 
     "ShipmentDurationDays":"21", 
     "CurrencyValue":"24.9900", 
     "ShipmentDurationText":"NORMALLY SHIPS IN 21 BUSINESS DAYS", 
     "Availability":0, 
     "COD":0, 
     "PublicationDate":"2009-08-07 00:00:00.0", 
     "Discount":"25", 
     "SubCategory_fq":"Hardware", 
     "Binding":"Paperback", 
     "Category_fq":"Non Classifiable", 
     "ShippingCharges":"0", 
     "OIDType":8, 
     "Pages":"397", 
     "CallOrder":"0", 
     "TrackInventory":"Ingram", 
     "Author_fq":"Pogue, David", 
     "DisplayName":"Iphone", 
     "url":"/iphone-pogue-david/books/9780596804299.htm", 
     "CurrencyType":"USD", 
     "SubSubCategory":"Handheld Devices", 
     "Mask":0, 
     "Publisher":"Oreilly & Associates Inc", 
     "Name":"Iphone", 
     "Language":"English", 
     "DisplayPriority":"999", 
     "rowid":"books_9780596804299" 
     } 

Record 2: The Iphone Book

{ 
     "ListPrice":1184.526, 
     "ShipsTo":1, 
     "OID":"94694", 
     "EAN":"9780321534101", 
     "ISBN":"0321534107", 
     "Author":"Kelby, Scott/ White, Terry", 
     "product_type_fq":"Books", 
     "ShipmentDurationDays":"21", 
     "CurrencyValue":"24.9900", 
     "ShipmentDurationText":"NORMALLY SHIPS IN 21 BUSINESS DAYS", 
     "Availability":1, 
     "COD":0, 
     "PublicationDate":"2007-08-13 00:00:00.0", 
     "Discount":"25", 
     "SubCategory_fq":"Handheld Devices", 
     "Binding":"Paperback", 
     "BAMcategory_src":"Computers", 
     "Category_fq":"Computers", 
     "ShippingCharges":"0", 
     "OIDType":8, 
     "Pages":"219", 
     "CallOrder":"0", 
     "TrackInventory":"Ingram", 
     "Author_fq":"Kelby, Scott/ White, Terry", 
     "DisplayName":"The Iphone Book", 
     "url":"/iphone-book-kelby-scott-white-terry/books/9780321534101.htm", 
     "CurrencyType":"USD", 
     "SubSubCategory":" Handheld Devices", 
     "BAMcategory_fq":"Computers", 
     "Mask":0, 
     "Publisher":"Pearson P T R", 
     "Name":"The Iphone Book", 
     "Language":"English",   
     "DisplayPriority":"999", 
     "rowid":"books_9780321534101" 
     } 

Record 3: iPhone

{ 
     "ListPrice":278.46, 
     "ShipsTo":1, 
     "OID":"694715", 
     "EAN":"9781411423527", 
     "ISBN":"1411423526", 
     "Author":"Quamut (COR)", 
     "product_type_fq":"Books", 
     "ShipmentDurationDays":"21", 
     "CurrencyValue":"5.9500", 
     "ShipmentDurationText":"NORMALLY SHIPS IN 21 BUSINESS DAYS", 
     "Availability":0, 
     "COD":0, 
     "PublicationDate":"2010-08-03 00:00:00.0", 
     "Discount":"25", 
     "SubCategory_fq":"Hardware", 
     "Binding":"Paperback", 
     "Category_fq":"Non Classifiable", 
     "ShippingCharges":"0", 
     "OIDType":8, 
     "CallOrder":"0",   
     "TrackInventory":"BNT", 
     "Author_fq":"Quamut (COR)", 
     "DisplayName":"iPhone", 
     "url":"/iphone-quamut-cor/books/9781411423527.htm", 
     "CurrencyType":"USD", 
     "SubSubCategory":"Handheld Devices", 
     "Mask":0, 
     "Publisher":"Sterling Pub Co Inc", 
     "Name":"iPhone", 
     "Language":"English", 
     "DisplayPriority":"999", 
     "rowid":"books_9781411423527" 
     }   
+0

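I don't see any difference between Record 1 and Record 3, so I can't see why the results would differ. Can you confirm that the score belongs to Record 3, and provide more information on the query parser, field type, and data? – Jayendra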

+0

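Hi, I am using the dismax query parser, and the score does belong to Record 3. I have cross-checked, and the scores for all 3 records are as I mentioned. The field type is the default "text" field type defined in solrconfig.xml. – user1021590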

+0

Can you provide the data feed as well? – Jayendra

Answers

5

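fieldNorm takes the field length into account, i.e. the number of terms.
The field type used for the DisplayName and Name fields is text, which includes the stop word and word delimiter filters.

Record 1 - Iphone
Would generate a single token - Iphone

Record 2 - The Iphone Book
Would generate 2 tokens - Iphone, Book
"The" would be removed by the stop word filter.

Record 3 - iPhone
Would also generate 2 tokens - i, phone
Because iPhone contains a case change, the word delimiter filter with splitOnCaseChange splits it into 2 tokens, i and phone, which produces the same field norm as Record 2.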
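
To see how much the fieldNorm alone moves the score, the explain output in the question can be reproduced by hand. This is a minimal sketch that assumes only the numbers printed there (idf, queryNorm, the ^10.0 and ^15.0 boosts, and the 0.01 tie-breaker). The class and method names are made up for illustration and are not part of Solr.

```java
final class ExplainScoreSketch {
    // Values copied from the explain output in the question.
    static final double IDF = 11.598244;
    static final double QUERY_NORM = 0.0057376726;
    static final double TIE = 0.01; // "max plus 0.01 times others"

    // weight = queryWeight * fieldWeight, as in the explain tree.
    static double weight(double boost, double tf, double fieldNorm) {
        double queryWeight = boost * IDF * QUERY_NORM;
        double fieldWeight = tf * IDF * fieldNorm;
        return queryWeight * fieldWeight;
    }

    // dismax over two fields: the best field plus TIE times the other one.
    static double dismax(double a, double b) {
        return Math.max(a, b) + TIE * Math.min(a, b);
    }

    // Both fields hold the same text here, so they share one fieldNorm.
    static double score(double fieldNorm) {
        double displayName = weight(10.0, 1.0, fieldNorm);
        double name = weight(15.0, 1.0, fieldNorm);
        return dismax(displayName, name);
    }

    public static void main(String[] args) {
        System.out.printf("fieldNorm 1.0   -> %.4f%n", score(1.0));
        System.out.printf("fieldNorm 0.625 -> %.4f%n", score(0.625));
    }
}
```

With fieldNorm 1.0 this comes out at roughly 11.65 (Record 1), and with 0.625 at roughly 7.28 (Records 2 and 3), so the token count is the only thing separating the three scores.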

+0

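Hi Jayendra, thanks, that makes sense. But for another data set I am getting the wrong relevance. Record 1: "Real Da Vinci Code" and Record 2: "Da Vinci Code". When I search for "da vinci code", I get the same score for both records because the fieldNorm is the same. Both use the same field type, "text". Using Solr Analysis I see that Record 1 is changed to "real da vinci code" and Record 2 to "da vinci code" (at index time). So why are the two scores the same? – user1021590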

+0

Can you provide the data fields and the debug scores? – Jayendra

+0

Hi Jayendra, the data field is "text". I have added the debug scores as part of an answer. – user1021590

3

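This is an answer to user1021590's follow-up question/answer about the "da vinci code" search example.

The reason all the documents get the same score is a subtle implementation detail of lengthNorm. The Lucene TFIDFSimilarity documentation states the following about norm(t, d):

The resulted norm value is encoded as a single byte before being stored. At search time, the norm byte value is read from the index directory and decoded back to a float norm value. This encoding/decoding, while reducing index size, comes with the price of precision loss - it is not guaranteed that decode(encode(x)) = x. For instance, decode(encode(0.89)) = 0.75.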

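If you dig into the code, you will see that the float-to-byte encoding is implemented as follows:

public static byte floatToByte315(float f)
{
    int bits = Float.floatToRawIntBits(f);
    // Drop the low 21 bits, keeping the sign, the exponent and the top mantissa bits.
    int smallfloat = bits >> (24 - 3);
    if (smallfloat <= ((63 - 15) << 3))
    {
        // Underflow: 0 for zero or negative input, otherwise the smallest positive byte.
        return (bits <= 0) ? (byte) 0 : (byte) 1;
    }
    if (smallfloat >= ((63 - 15) << 3) + 0x100)
    {
        // Overflow: clamp to the largest byte value (0xFF).
        return -1;
    }
    return (byte) (smallfloat - ((63 - 15) << 3));
}

and the byte-to-float decoding is done as:

public static float byte315ToFloat(byte b)
{
    if (b == 0)
        return 0.0f;
    // Shift the byte back into float bit positions and restore the exponent offset.
    int bits = (b & 0xff) << (24 - 3);
    bits += (63 - 15) << 24;
    return Float.intBitsToFloat(bits);
}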

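lengthNorm is calculated as 1/sqrt(number of terms in field). It is then encoded for storage using floatToByte315. For a field with 3 terms, we get:

floatToByte315(1/sqrt(3.0)) = 120

and for a field with 4 terms, we get:

floatToByte315(1/sqrt(4.0)) = 120

so both of them get decoded to: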

byte315ToFloat(120) = 0.5
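
The collision can be checked without a Solr installation. The sketch below copies the two functions quoted in this answer into a standalone class and prints the norm that would be stored for fields of 1 to 10 terms. It assumes no field boost and the plain 1/sqrt(n) length norm described here. `NormEncodingSketch` and `storedNorm` are illustrative names, not Lucene API.

```java
final class NormEncodingSketch {
    // Decoder, copied from the Lucene snippet quoted in this answer.
    static float byte315ToFloat(byte b) {
        if (b == 0) {
            return 0.0f;
        }
        int bits = (b & 0xff) << (24 - 3);
        bits += (63 - 15) << 24;
        return Float.intBitsToFloat(bits);
    }

    // Encoder, copied from the Lucene snippet quoted in this answer.
    static byte floatToByte315(float f) {
        int bits = Float.floatToRawIntBits(f);
        int smallfloat = bits >> (24 - 3);
        if (smallfloat <= ((63 - 15) << 3)) {
            return (bits <= 0) ? (byte) 0 : (byte) 1;
        }
        if (smallfloat >= ((63 - 15) << 3) + 0x100) {
            return -1;
        }
        return (byte) (smallfloat - ((63 - 15) << 3));
    }

    // Assumed length norm: 1/sqrt(numTerms), then a round trip through one byte.
    static float storedNorm(int numTerms) {
        float norm = (float) (1.0 / Math.sqrt(numTerms));
        return byte315ToFloat(floatToByte315(norm));
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 10; n++) {
            System.out.println(n + " terms -> stored fieldNorm " + storedNorm(n));
        }
    }
}
```

Fields with 3 and 4 terms both come back as 0.5, and a 2-term field comes back as 0.625, which is the fieldNorm shown in the explain output for Records 2 and 3 of the original question.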

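The documentation also states:

The rationale supporting such lossy compression of norm values is that given the difficulty (and inaccuracy) of users to express their true information need by a query, only big differences matter.

UPDATE: As of Solr 4.10, this implementation and the corresponding statements are part of DefaultSimilarity.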

+0

Hey, I am facing the same issue: for documents with 3 and 4 terms the fieldNorm is the same, which affects the search results. Is there any solution? –

+0

I will override the lengthNorm method of the class and try to handle documents of length 3 and 4 as special cases. –
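
One way such special-casing might look, purely as a sketch: instead of 1/sqrt(n), return for each length a value that is exactly representable in the one-byte encoding, so that neighboring lengths never share a byte. `distinctNorm` and the cap of 100 terms are hypothetical choices. Hooking this into Solr would mean overriding the length-norm method of a custom Similarity, whose signature differs between Lucene versions and is not shown here.

```java
final class DistinctNormSketch {
    // Decoder, copied from the Lucene snippet quoted in the answer above.
    static float byte315ToFloat(byte b) {
        if (b == 0) {
            return 0.0f;
        }
        int bits = (b & 0xff) << (24 - 3);
        bits += (63 - 15) << 24;
        return Float.intBitsToFloat(bits);
    }

    // Encoder, copied from the Lucene snippet quoted in the answer above.
    static byte floatToByte315(float f) {
        int bits = Float.floatToRawIntBits(f);
        int smallfloat = bits >> (24 - 3);
        if (smallfloat <= ((63 - 15) << 3)) {
            return (bits <= 0) ? (byte) 0 : (byte) 1;
        }
        if (smallfloat >= ((63 - 15) << 3) + 0x100) {
            return -1;
        }
        return (byte) (smallfloat - ((63 - 15) << 3));
    }

    // Hypothetical replacement for 1/sqrt(numTerms): byte 124 decodes to 1.0,
    // and every additional term steps down to the next lower byte value.
    static float distinctNorm(int numTerms) {
        int n = Math.max(1, Math.min(numTerms, 100)); // the cap is an arbitrary assumption
        return byte315ToFloat((byte) (124 - (n - 1)));
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 6; n++) {
            float norm = distinctNorm(n);
            float stored = byte315ToFloat(floatToByte315(norm));
            System.out.println(n + " terms -> norm " + norm + ", stored as " + stored);
        }
    }
}
```

The trade-off is a much steeper length penalty than 1/sqrt(n), since the norm halves every four terms. Norms are written at index time, so any change like this would only take effect after reindexing.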