2017-08-27 36 views
0

Solr: ignore the casing of strings when calculating facet counts

I have these values for the title field in my database:

"I Am A String" 
"I am A string" 

I want the title field to be available as a facet in my search results.

Current result:

<lst name="title"> 
    <int name="I Am A String">4</int> 
    <int name="I am A string">3</int> 
</lst> 

Desired result:

<lst name="title"> 
    <int name="I Am A String">7</int> 
</lst> 

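I don't really care which of the two available strings is chosen for the final result, as long as identical strings (ignoring case) are counted under the same facet.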

I tried the following field type definitions for the title field. I have also noted the facet behaviour each one produced.

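string = treats differently cased strings as different strings
string_exact = treats differently cased strings as different strings
text_ws = breaks the value into separate words, casing intact
text = breaks the value into separate words
textTight = breaks the value into separate words
textTrue = breaks the value into words, casing intact
string_exacttest = breaks the value into words, casing intact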

Here is my schema.xml:

<field name="title" type="string" indexed="true" stored="true"/> 


<fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true" /> 

<fieldType name="string_exact" class="solr.TextField" 
    sortMissingLast="true" omitNorms="true"> 
    <analyzer> 
     <tokenizer class="solr.KeywordTokenizerFactory"/>   
    </analyzer> 
</fieldType>  

<fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100"> 
    <analyzer> 
    <tokenizer class="solr.WhitespaceTokenizerFactory"/> 
    </analyzer> 
</fieldType> 

<!-- A text field that uses WordDelimiterFilter to enable splitting and matching of words on case-change, alpha numeric boundaries, and non-alphanumeric chars, so that a query of "wifi" or "wi fi" could match a document containing "Wi-Fi". 
    Synonyms and stopwords are customized by external files, and stemming is enabled. Duplicate tokens at the same position (which may result from Stemmed Synonyms or WordDelim parts) are removed.--> 
<fieldType name="text" class="solr.TextField" positionIncrementGap="100"> 
    <analyzer type="index"> 
    <tokenizer class="solr.WhitespaceTokenizerFactory"/> 
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_dutch.txt"/> 
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/> 
    <filter class="solr.LowerCaseFilterFactory"/> 
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/> 
    </analyzer> 
    <analyzer type="query"> 
    <tokenizer class="solr.WhitespaceTokenizerFactory"/> 
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/> 
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_dutch.txt"/> 
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/> 
    <filter class="solr.LowerCaseFilterFactory"/> 
    <!--<filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>--> 
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/> 
    </analyzer> 
</fieldType> 


<!-- Less flexible matching, but fewer false matches. Probably not ideal for product names, but may be good for SKUs. Can insert dashes in the wrong place and still match. --> 
<fieldType name="textTight" class="solr.TextField" positionIncrementGap="100" > 
    <analyzer> 
    <tokenizer class="solr.WhitespaceTokenizerFactory"/> 
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/> 
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_dutch.txt" /> 
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="0" catenateWords="1" catenateNumbers="1" catenateAll="0"/> 
    <filter class="solr.ASCIIFoldingFilterFactory"/> 
    <filter class="solr.LowerCaseFilterFactory"/> 
    <filter class="solr.SnowballPorterFilterFactory" language="Dutch" protected="protwords.txt"/> 
    <!-- 
     This filter can remove any duplicate tokens that appear at the same position - sometimes possible with WordDelimiterFilter in conjunction with 
     stemming. 
    --> 
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/> 
    </analyzer> 
</fieldType> 


<fieldType name="textTrue" class="solr.TextField" positionIncrementGap="100" > 
    <analyzer> 
    <tokenizer class="solr.WhitespaceTokenizerFactory"/> 
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/> 
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_dutch.txt" /> 
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="0" catenateWords="1" catenateNumbers="1" catenateAll="0"/> 
    <filter class="solr.ASCIIFoldingFilterFactory"/> 
    <filter class="solr.SnowballPorterFilterFactory" language="Dutch" protected="protwords.txt"/> 
    </analyzer> 
</fieldType>  

How can I make sure that identical strings (ignoring case) are grouped together when the facets are calculated?

Answers

1

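The string_exact definition is almost what you need, but you also have to apply a LowerCaseFilter so that each value is lowercased. The KeywordTokenizer keeps the whole value as a single token (so you won't see it split into separate terms on whitespace). A string field doesn't allow any further processing, whereas a TextField with a KeywordTokenizer behaves the same way, except that you can add filters to process the token afterwards.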

<fieldType name="string_facet" class="solr.TextField" sortMissingLast="true" omitNorms="true"> 
    <analyzer> 
     <tokenizer class="solr.KeywordTokenizerFactory"/>  
     <filter class="solr.LowerCaseFilterFactory"/>  
    </analyzer> 
</fieldType>  
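To see why the extra filter changes the counts, here is a rough Python simulation of term-based facet counting. It is only a sketch, not Solr's implementation: the helper names (`keyword_tokenize`, `lowercase_filter`, `facet_counts`) are made up for illustration, and the sample data is the 4 + 3 split from the question.

```python
from collections import Counter

def keyword_tokenize(value):
    # Mimics a keyword tokenizer: the whole value becomes one token.
    return [value]

def lowercase_filter(tokens):
    # Mimics a lowercase filter applied after tokenization.
    return [t.lower() for t in tokens]

def facet_counts(values, analyze):
    # A facet count is the number of documents containing each indexed term,
    # so every distinct term is counted once per document.
    counts = Counter()
    for value in values:
        counts.update(set(analyze(value)))
    return counts

# Sample data taken from the question: 4 documents with one casing, 3 with the other.
titles = ["I Am A String"] * 4 + ["I am A string"] * 3

# Keyword tokenizer only (like string_exact): two separate buckets.
exact = facet_counts(titles, keyword_tokenize)

# Keyword tokenizer plus lowercasing (like string_facet): one bucket.
folded = facet_counts(titles, lambda v: lowercase_filter(keyword_tokenize(v)))

print(dict(exact))
print(dict(folded))
```

Note that the merged bucket is labelled with the lowercased term, because the facet label is the indexed term rather than the stored value.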
+0

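Hi, thanks. But I don't want it lowercased. I want to preserve the casing. See the post itself: since I know these strings are entered with different casing, I want to pick any one of the casing variants and use that. How would I do that? – Flo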

+1

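But that is effectively what you are asking for: you want the counts to be unaffected by casing, which means you have to index those tokens with the same casing. If you don't, they will be different tokens and will therefore be counted separately. I would either normalize the casing when indexing the facet field, according to a rule for that field (e.g. regular sentence case, "I am a .."), or I think you will have to use the first version, iterate over the results and merge the entries manually (.. or fetch both facets and look up the cased version from the second one; the first option is more efficient) – MatsLindh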
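The manual merge mentioned above could look roughly like this on the client side. This is a minimal sketch that assumes the facet response has already been parsed into (label, count) pairs; `merge_facets_ignore_case` is a hypothetical helper, not part of Solr or any client library. It keeps the first casing variant it encounters as the display label, which matches the requirement that either variant is acceptable.

```python
def merge_facets_ignore_case(pairs):
    """Merge facet entries whose labels differ only by case.

    pairs: iterable of (label, count) tuples, assumed to be parsed
    from the facet response beforehand.
    Returns a list of (label, total) tuples in first-seen order.
    """
    merged = {}  # case-folded label -> [display label, running total]
    for label, count in pairs:
        key = label.casefold()
        if key in merged:
            merged[key][1] += count
        else:
            # The first variant seen becomes the display label.
            merged[key] = [label, count]
    return [(label, total) for label, total in merged.values()]

# The "current result" from the question, as (label, count) pairs.
current = [("I Am A String", 4), ("I am A string", 3)]
merged = merge_facets_ignore_case(current)
print(merged)
```

If the facet entries arrive sorted by count, the first variant seen is also the most frequent one. Keep in mind that any facet limit is applied before this merge, so variants cut off by the limit will not be included in the totals.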