2012-11-13 39 views
9

Solr 4.0: trouble sorting alphabetically

I am having a hard time solving a problem with my Solr address database.

I built it from the example files. I am basically running the example configuration with a modified schema.

schema.xml:

<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" /> 
<field name="_version_" type="long" indexed="true" stored="true" required="false" multiValued="false" /> 

<field name="givenname_s" type="text_de" indexed="true" stored="true" required="true" multiValued="false" /> 
<field name="middleinitial_s" type="text_de" indexed="false" stored="true" required="false" multiValued="false" /> 
<field name="surname_s" type="text_de" indexed="true" stored="true" required="true" multiValued="false" /> 
<field name="gender_s" type="string" indexed="true" stored="true" required="true" multiValued="false" /> 
<field name="pictureuri_s" type="string" indexed="false" stored="true" required="false" multiValued="false" /> 
<field name="function_s" type="text_de" indexed="true" stored="true" required="false" multiValued="false" /> 
<field name="organizationalunit_s" type="text_general" indexed="true" stored="true" required="false" multiValued="false" /> 
<field name="organizationalunitdescription_s" type="text_de" indexed="false" stored="true" required="false" multiValued="false" /> 
<field name="company_s" type="text_de" indexed="true" stored="true" required="false" multiValued="false" /> 
<field name="street_s" type="text_de" indexed="true" stored="true" required="false" multiValued="false" /> 
<field name="streetnumber_s" type="int" indexed="true" stored="true" required="false" multiValued="false" /> 
<field name="postcode_s" type="int" indexed="true" stored="true" required="false" multiValued="false" /> 
<field name="city_s" type="text_de" indexed="true" stored="true" required="false" multiValued="false" /> 
<field name="building_s" type="text_de" indexed="true" stored="true" required="false" multiValued="false" /> 
<field name="roomnumber_s" type="int" indexed="true" stored="true" required="false" multiValued="false" /> 
<field name="country_s" type="text_en" indexed="true" stored="true" required="true" multiValued="false" /> 
<field name="countrycode_s" type="string" indexed="true" stored="true" required="true" multiValued="false" /> 
<field name="emailaddress_s" type="string" indexed="true" stored="true" required="false" multiValued="false" /> 
<field name="phone1_s" type="string" indexed="true" stored="true" required="false" multiValued="false" /> 
<field name="phone2_s" type="string" indexed="true" stored="true" required="false" multiValued="false" /> 
<field name="mobile_s" type="string" indexed="true" stored="true" required="false" multiValued="false" /> 
<field name="fax_s" type="string" indexed="true" stored="true" required="false" multiValued="false" /> 

I populate the database by pushing about 20,000 random test records like the one below with post.jar:

<?xml version="1.0" encoding="UTF-8" standalone="no"?> 
<add> 
    <doc> 
     <field name="id">1352498443_1</field> 
     <field name="givenname_s">Aynur</field> 
     <field name="middleinitial_s"/> 
     <field name="surname_s">Lehnen</field> 
     <field name="gender_s">F</field> 
     <field name="pictureuri_s">dummy_assets/female.jpg</field> 
     <field name="function_s">Zugschaffner/in</field> 
     <field name="organizationalunit_s">P 07</field> 
     <field name="organizationalunitdescription_s">Lorem Ipsum sadipscing voluptua ipsum invidunt dolor et dolore invidunt sed consetetur accusam dolore Lorem tempor.</field> 
     <field name="company_s">Lorem Lagna Epsum Emet</field> 
     <field name="street_s">Erlenweg</field> 
     <field name="streetnumber_s">82</field> 
     <field name="postcode_s">76297</field> 
     <field name="city_s">Lübeck</field> 
     <field name="building_s"/> 
     <field name="roomnumber_s">242</field> 
     <field name="country_s">GERMANY</field> 
     <field name="countrycode_s">DE</field> 
     <field name="emailaddress_s">[email protected]</field> 
     <field name="phone1_s">0392984823</field> 
     <field name="phone2_s">0124111417</field> 
     <field name="mobile_s">0325117132</field> 
     <field name="fax_s">0171459177</field> 
    </doc> 
</add> 
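For readers who want to produce similar test files, here is a minimal sketch of building such an `<add>` document in Python. The helper name `build_add_document` is hypothetical and not part of Solr or post.jar; the sample values are taken from the record above, and only a subset of the fields is shown.

```python
import xml.etree.ElementTree as ET


def build_add_document(records):
    """Build a Solr <add> XML string from a list of field-name -> value dicts."""
    add = ET.Element("add")
    for record in records:
        doc = ET.SubElement(add, "doc")
        for name, value in record.items():
            field = ET.SubElement(doc, "field", name=name)
            field.text = str(value)
    return ET.tostring(add, encoding="unicode")


# Only a few fields are shown; the schema above also marks gender_s,
# country_s and countrycode_s as required, so a real record needs them too.
sample_records = [
    {
        "id": "1352498443_1",
        "givenname_s": "Aynur",
        "surname_s": "Lehnen",
        "city_s": "Lübeck",
    }
]

add_xml = build_add_document(sample_records)
print(add_xml)
```

The resulting string can be written to a file and posted with post.jar in the same way as the hand-written example.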

However, when retrieving data I seem to have a problem with alphabetical sorting. Consider the following query:

{ 
    "responseHeader": { 
     "status": 0, 
      "QTime": 5, 
      "params": { 
      "sort": "surname_s asc", 
       "fl": "surname_s", 
       "indent": "true", 
       "wt": "json", 
       "q": "city_s:berlin" 
     } 
    }, 
     "response": { 
     "numFound": 1094, 
     "start": 0, 
     "docs": [{ 
      "surname_s": "Weil" 
     }, { 
      "surname_s": "Abel" 
     }, { 
      "surname_s": "Adam" 
     }, { 
      "surname_s": "Ade" 
     }, { 
      "surname_s": "Adrian" 
     }, { 
      "surname_s": "Aigner" 
     }, { 
      "surname_s": "Aigner" 
     }, { 
      "surname_s": "Alber" 
     }, { 
      "surname_s": "Alber" 
     }, { 
      "surname_s": "Albers" 
     }] 
    } 
} 

Why is "Weil" in first position, while the rest of the data appears to be sorted correctly?

Answers

14

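I believe that some of the additional analyzers applied in the text_de field type are causing this sorting behavior. In my experience, the best results when sorting on strings come from the alphaOnlySort fieldType that ships with the example schema.xml, shown below.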

<fieldType name="alphaOnlySort" class="solr.TextField" sortMissingLast="true" omitNorms="true"> 
    <analyzer> 
    <!-- KeywordTokenizer does no actual tokenizing, so the entire 
     input string is preserved as a single token 
     --> 
    <tokenizer class="solr.KeywordTokenizerFactory"/> 
    <!-- The LowerCase TokenFilter does what you expect, which can be
     useful when you want your sorting to be case insensitive
     --> 
    <filter class="solr.LowerCaseFilterFactory" /> 
    <!-- The TrimFilter removes any leading or trailing whitespace --> 
    <filter class="solr.TrimFilterFactory" /> 
    <!-- The PatternReplaceFilter gives you the flexibility to use 
     Java Regular expression to replace any sequence of characters 
     matching a pattern with an arbitrary replacement string, 
     which may include back references to portions of the original 
     string matched by the pattern. 

     See the Java Regular Expression documentation for more 
     information on pattern and replacement string syntax. 

     http://java.sun.com/j2se/1.6.0/docs/api/java/util/regex/package-summary.html 
     --> 
    <filter class="solr.PatternReplaceFilterFactory" 
      pattern="([^a-z])" replacement="" replace="all" 
    /> 
    </analyzer> 
</fieldType> 

I suggest creating a new field and then copying the value from surname_s into it via a copyField, like this:

<field name="surname_s_sort" type="alphaOnlySort" indexed="true" stored="false" required="false" multiValued="false" /> 

<copyField source="surname_s" dest="surname_s_sort"/> 

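Note: there is no need to store the value of the surname_s_sort field, hence the stored="false" attribute, unless you want to display it to the user.

Then you can change your query to sort on surname_s_sort.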
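As an illustration, the question's query with the sort switched to the new field might be built like this. The host, port, and core name in `SOLR_SELECT_URL` are assumptions (they are not given in the question); the parameters are the ones from the question's response header, with only `sort` changed.

```python
from urllib.parse import urlencode

# Assumption: a local Solr instance with a core named "collection1".
SOLR_SELECT_URL = "http://localhost:8983/solr/collection1/select"

params = {
    "q": "city_s:berlin",
    "sort": "surname_s_sort asc",  # sort on the copy, not on the analyzed field
    "fl": "surname_s",             # still return the original stored value
    "wt": "json",
    "indent": "true",
}

query_url = SOLR_SELECT_URL + "?" + urlencode(params)
print(query_url)
```

The documents have to be re-indexed after the schema change before the new field contains any values to sort on.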

+1

For anyone else with this problem, note that the copyField happens when the document is indexed. –

+2

Your assumption is absolutely right. "weil" is a stop word in the GermanAnalyzer. –

+0

Perfect! Thanks, Paige Cook. It worked. – atpatil11
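The stop-word explanation in the comments can be imitated with a toy model. The sketch below is not Solr or Lucene code: `STOP_WORDS` is a one-word stand-in for the German stop list, stemming is ignored, and treating a value with no surviving term as an empty sort key is an assumption about why such a document ends up first.

```python
# Assumption: a tiny stand-in for the German stop-word list.
STOP_WORDS = {"weil"}


def toy_text_de_terms(value):
    """Lowercase, split on whitespace, and drop stop words (no stemming)."""
    return [term for term in value.lower().split() if term not in STOP_WORDS]


def toy_sort_key(value):
    """Use the first surviving term as the sort key, or '' if none is left."""
    terms = toy_text_de_terms(value)
    return terms[0] if terms else ""


surnames = ["Abel", "Weil", "Adam", "Ade"]
ordered = sorted(surnames, key=toy_sort_key)
print(ordered)
```

In this model "Weil" is reduced to nothing by the analyzer, so it has nothing to be ordered by and lands ahead of every name that still has a term.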

4

Sorting does not work on multivalued or tokenized fields.

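Documentation -
Sorting can be done on the "score" of the document, or on any multiValued="false" indexed="true" field, provided that field is either non-tokenized (i.e. has no analyzer) or uses an analyzer that only produces a single term (i.e. uses the KeywordTokenizer).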

Use string as the field type and copy the field you want to sort on (surname_s here) into a new field.

<field name="surname_s_sort" type="string" indexed="true" stored="false"/> 

<copyField source="surname_s" dest="surname_s_sort" /> 

As in @Paige's answer, you can also use the keyword tokenizer with a lowercase filter, which does not tokenize the field.

0

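I had a similar problem and tried alphaOnlySort. It partly worked, but it started messing up the sort results when the field contained values with characters such as -, /, or spaces.

So the results looked like this: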

  1. /ABC
  2. AA
  3. /ABC2

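So I ended up using the lowercase field type. It was already there, so I assume it is a default type. I still used the copyField construct, so my final configuration is: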

<schema> 
    <fieldType name="lowercase" class="solr.TextField" positionIncrementGap="100"> 
     <analyzer> 
     <tokenizer class="solr.KeywordTokenizerFactory"/> 
     <filter class="solr.LowerCaseFilterFactory" /> 
     </analyzer> 
    </fieldType> 
    <fields> 
     <field name="job_name_sort" type="lowercase" indexed="true" stored="false" required="false"/> 
    </fields> 
    <copyField source="job_name" dest="job_name_sort"/> 
</schema>
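One way to see why alphaOnlySort loses information for values like these is to imitate the two filter chains in Python. This is only a sketch of the chains as configured above (lowercase, trim, then the `[^a-z]` replacement, versus lowercase alone); the function names are hypothetical, and it does not attempt to reproduce the exact order Solr returned.

```python
import re


def alpha_only_sort_key(value):
    """Imitate alphaOnlySort: lowercase, trim, then strip everything but a-z."""
    return re.sub(r"[^a-z]", "", value.lower().strip())


def lowercase_key(value):
    """Imitate the lowercase type: whole value kept, only lowercased."""
    return value.lower()


values = ["/ABC", "AA", "/ABC2"]

alpha_keys = {v: alpha_only_sort_key(v) for v in values}
lower_keys = {v: lowercase_key(v) for v in values}
print(alpha_keys)
print(lower_keys)

# With the lowercase type every value keeps a distinct key.
ordered_by_lowercase = sorted(values, key=lowercase_key)
print(ordered_by_lowercase)
```

Under alphaOnlySort, "/ABC" and "/ABC2" collapse to the same key, so their relative order is no longer determined by the field; the lowercase type keeps the slash and the digit, at the cost of sorting punctuation ahead of letters.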