2013-09-24 72 views
1

langid UpdateRequestProcessor only maps the first field

I want to use Solr's langid UpdateRequestProcessor. Here is my configuration:

<updateRequestProcessorChain name="languages">
    <processor class="solr.LangDetectLanguageIdentifierUpdateProcessorFactory">
        <lst name="invariants">
            <str name="langid.fl">focus, expertise, platforms, partners, participation, additional</str>
            <str name="langid.whitelist">en,fr</str>
            <str name="langid.fallback">en</str>
            <str name="langid.langField">detectedlang</str>
            <bool name="langid.map">true</bool>
            <bool name="langid.map.keepOrig">false</bool>
        </lst>
    </processor>
    <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

My fields look like this:

<fields> 
    <field name="_root_" type="string" indexed="true" stored="false"/> 
    <field name="_version_" type="long" indexed="true" stored="true" multiValued="false"/> 

    <field name="id" type="string" indexed="true" stored="true" required="true" /> 

    <!-- raw fields from sql db --> 
    <field name="expertise_id" type="int" indexed="true" stored="true" /> 
    <field name="person_id" type="int" indexed="true" stored="true" /> 
    <field name="mod_date" type="date" indexed="true" stored="true" /> 
    <field name="lang" type="string" indexed="true" stored="true" /> 
    <field name="focus" type="text_general" indexed="true" stored="true" /> 
    <field name="expertise" type="text_general" indexed="true" stored="true" /> 
    <field name="platforms" type="text_general" indexed="true" stored="true" /> 
    <field name="partners" type="text_general" indexed="true" stored="true" /> 
    <field name="participation" type="text_general" indexed="true" stored="true" /> 
    <field name="additional" type="text_general" indexed="true" stored="true" /> 
    <field name="tag" type="text_general" termVectors="true" multiValued="true" />  
    <field name="facet_tag" type="string" stored="false" indexed="false" docValues="true" multiValued="true" default=""/> 

    <!-- language detected by solr --> 
    <field name="detectedlang" type="string" indexed="true" stored="true" /> 

    <!-- defined locale fields --> 
    <dynamicField name="*_en" type="text_en" indexed="true" stored="true" /> 
    <dynamicField name="*_fr" type="text_fr" indexed="true" stored="true" /> 

    <copyField source="tag" target="facet_tag"/> 

</fields> 

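When I run an update or a dataimport, I know the "languages" update chain is being used, because focus is mapped to focus_en and detectedlang is set. However, none of the other fields listed in langid.fl are mapped. Why?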

An example update request:

{ 
    "additional": "here is some other information about me.", 
    "expertise_id": "10000", 
    "id": "foo_10000", 
    "focus": "this is my new focus. It is very exciting. When I am done I expect to be super experienced." 
} 

Here is the result of a query for expertise_id=10000. Note that additional was not moved to additional_en:

"response":{"numFound":1,"start":0,"docs":[ 
     { 
     "additional":"here is some other information about me.", 
     "expertise_id":10000, 
     "id":"foo_10000", 
     "detectedlang":"en", 
     "focus_en":"this is my new focus. It is very exciting. When I am done I expect to be super experienced.", 
     "_version_":1447088846110982144}] 
    } 
+0

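See https://wiki.apache.org/solr/LanguageDetection#Caveats. "Because these implementations use an n-gram based approach for detection, they are prone to failing on particularly short input." Have you tried longer text? – arun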

+0

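@arun: To test the idea that length might be the problem, I just added a document in which all of the mapped fields contain the same 200-word English text. 'focus' was mapped to 'focus_en'. None of the others were mapped. – dnagirl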

+0

@dnagirl, did you find a solution? – forguta

Answers

1

It turns out the problem was a syntax error. This line:

<str name="langid.fl">focus, expertise, platforms, partners, participation, additional</str> 

must be

<str name="langid.fl">focus,expertise,platforms,partners,participation,additional</str> 

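The docs state that the field list should be comma- or space-delimited. Evidently, a comma followed by a space messes things up (even though that works fine in other Solr contexts, e.g. in the requestHandler that langid.fl is supposedly modeled on). I also tried the space-delimited syntax, but it did not fix my problem.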
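To illustrate why only the first field would match, here is a minimal Python sketch. It assumes the field list is split on commas without trimming whitespace; I have not verified this against the Solr source, and the helper names `split_on_commas_only` and `split_and_trim` are hypothetical. Under that assumption, every name after the first keeps a leading space and no longer matches a schema field.

```python
# Hypothetical illustration only: this is NOT Solr code. It assumes the
# langid.fl value is split on "," and the pieces are not trimmed.

def split_on_commas_only(field_list):
    """Split on commas and keep any surrounding whitespace."""
    return field_list.split(",")


def split_and_trim(field_list):
    """Split on commas, trim each name, and drop empty entries."""
    return [name.strip() for name in field_list.split(",") if name.strip()]


# Field names taken from the schema in the question.
schema_fields = {
    "focus", "expertise", "platforms",
    "partners", "participation", "additional",
}

# The langid.fl value from the original (failing) configuration.
raw = "focus, expertise, platforms, partners, participation, additional"

naive = split_on_commas_only(raw)
matched_naive = [name for name in naive if name in schema_fields]
matched_trimmed = [name for name in split_and_trim(raw) if name in schema_fields]

print(naive)            # every name after the first starts with a space
print(matched_naive)    # only "focus" matches a schema field
print(matched_trimmed)  # all six names match once whitespace is trimmed
```

If the assumption holds, this matches the observed behavior: `focus` is mapped, and the remaining names are silently ignored because fields such as `" expertise"` do not exist in the document.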

I hope this helps someone.

+1

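Hmm, yesterday I started writing a comment suggesting that as the next thing for you to try, but I thought it was too silly, so I didn't post it :). – arun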