2016-07-25

Handling dots in ElasticSearch

I have a string property named summary whose analyzer is set to trigrams and whose search_analyzer is set to words.

"filter": { 
    "words_splitter": { 
     "type": "word_delimiter", 
     "preserve_original": "true" 
    }, 
    "english_words_filter": { 
     "type": "stop", 
     "stop_words": "_english_" 
    }, 
    "trigrams_filter": { 
     "type": "ngram", 
     "min_gram": "2", 
     "max_gram": "20" 
    } 
}, 
"analyzer": { 
    "words": { 
     "filter": [ 
      "lowercase", 
      "words_splitter", 
      "english_words_filter" 
     ], 
     "type": "custom", 
     "tokenizer": "whitespace" 
    }, 
    "trigrams": { 
     "filter": [ 
      "lowercase", 
      "words_splitter", 
      "trigrams_filter", 
      "english_words_filter" 
     ], 
     "type": "custom", 
     "tokenizer": "whitespace" 
    } 
} 

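Given an input like React and HTML (or React, html), I need the query string to match documents whose summary contains the words React, reactjs, react.js, html or html5. The more matching keywords a document has, the higher its score should be (ideally, documents that match only some of the words, or match them less than 100%, should simply get lower scores).

The thing is, I guess that at the moment react.js is being split into react and js, because I also get every document that contains js. On the other hand, Reactjs returns nothing. I also think words_splitter is needed in order to ignore the comma.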

Answers

0

I found a solution.

Basically, I define the word_delimiter filter with catenate_all enabled

"words_splitter": { 
    "catenate_all": "true", 
    "type": "word_delimiter", 
    "preserve_original": "true" 
} 

and give it to the words analyzer together with the keyword tokenizer

"words": { 
    "filter": [ 
     "words_splitter" 
    ], 
    "type": "custom", 
    "tokenizer": "keyword" 
} 
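
For orientation, here is a minimal sketch of how the two fragments above might be nested when creating the index. The filter and analyzer definitions are taken from this answer. The surrounding "settings" / "analysis" wrapper is the usual Elasticsearch layout. Building the body in Python and printing it as JSON is only an illustrative choice.

```python
import json

# Index-creation body assembled from the two fragments in this answer.
# The "settings" -> "analysis" nesting is the standard place for custom
# filters and analyzers; nothing else is added here.
index_body = {
    "settings": {
        "analysis": {
            "filter": {
                "words_splitter": {
                    "type": "word_delimiter",
                    "catenate_all": True,
                    "preserve_original": True,
                }
            },
            "analyzer": {
                "words": {
                    "type": "custom",
                    "tokenizer": "keyword",
                    "filter": ["words_splitter"],
                }
            },
        }
    }
}

# Print the JSON that would be sent as the request body.
print(json.dumps(index_body, indent=2))
```

Note that, as written in this answer, the words analyzer no longer lists the lowercase filter from the question. Whether to add it back depends on how case should be handled.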

With the word_delimiter filter defined this way in the words analyzer, calling http://localhost:9200/sample_index/_analyze?analyzer=words&pretty=true&text=react.js gives me the following tokens:

{ 
"tokens": [ 
    { 
     "token": "react.js", 
     "start_offset": 0, 
     "end_offset": 8, 
     "type": "word", 
     "position": 0 
    }, 
    { 
     "token": "react", 
     "start_offset": 0, 
     "end_offset": 5, 
     "type": "word", 
     "position": 0 
    }, 
    { 
     "token": "reactjs", 
     "start_offset": 0, 
     "end_offset": 8, 
     "type": "word", 
     "position": 0 
    }, 
    { 
     "token": "js", 
     "start_offset": 6, 
     "end_offset": 8, 
     "type": "word", 
     "position": 1 
    } 
    ] 
} 
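
To see why this helps, the sketch below roughly emulates what preserve_original and catenate_all contribute for a single token. It is a simplification, not Elasticsearch's implementation: the helper word_delimiter_sketch is hypothetical, it splits only on non-alphanumeric characters, and it ignores the case-change and letter/digit splitting that the real filter also does by default.

```python
import re


def word_delimiter_sketch(token, preserve_original=True, catenate_all=True):
    """Rough, simplified emulation of the word_delimiter options used above."""
    # Split on any run of non-alphanumeric characters (e.g. the dot).
    parts = [p for p in re.split(r"[^0-9A-Za-z]+", token) if p]
    out = []
    if preserve_original:
        out.append(token)          # keep "react.js" itself
    out.extend(parts)              # "react", "js"
    if catenate_all and len(parts) > 1:
        out.append("".join(parts))  # "reactjs"
    # Drop duplicates while keeping first-seen order.
    return list(dict.fromkeys(out))


indexed = word_delimiter_sketch("react.js")
searched = word_delimiter_sketch("reactjs")
print(indexed)
print(searched)
# The two forms now share a token, so one can match the other.
print(set(indexed) & set(searched))
```

In this sketch both react.js and reactjs produce the token reactjs, which is the overlap that lets either spelling match the other.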
1

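You can solve the problem for names like react.js by using a keyword marker token filter and defining the analyzer so that it uses that filter. This will prevent react.js from being split into the tokens react and js.

Here is a sample configuration for the filter: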

 "filter": { 
     "keywords": { 
      "type": "keyword_marker", 
      "keywords": [ 
       "react.js"
      ] 
     } 
    } 

And the analyzer:

 "analyzer": { 
     "main_analyzer": { 
      "type": "custom", 
      "tokenizer": "standard", 
      "filter": [ 
       "lowercase", 
       "keywords", 
       "synonym_filter", 
       "german_stop", 
       "german_stemmer" 
      ] 
     } 
    } 

You can check whether your analyzer behaves as required by using the analyze command:

GET /<index_name>/_analyze?analyzer=main_analyzer&text="react.js is a nice library" 

This should return the following tokens, where react.js is not split:

{ 
    "tokens": [ 
     { 
     "token": "react.js", 
     "start_offset": 1, 
     "end_offset": 9, 
     "type": "<ALPHANUM>", 
     "position": 0 
     }, 
     { 
     "token": "is", 
     "start_offset": 10, 
     "end_offset": 12, 
     "type": "<ALPHANUM>", 
     "position": 1 
     }, 
     { 
     "token": "a", 
     "start_offset": 13, 
     "end_offset": 14, 
     "type": "<ALPHANUM>", 
     "position": 2 
     }, 
     { 
     "token": "nice", 
     "start_offset": 15, 
     "end_offset": 19, 
     "type": "<ALPHANUM>", 
     "position": 3 
     }, 
     { 
     "token": "library", 
     "start_offset": 20, 
     "end_offset": 27, 
     "type": "<ALPHANUM>", 
     "position": 4 
     } 
    ] 
} 
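
If you script this check, a small sketch like the following can build the _analyze URL with the text properly URL-encoded. The helper name analyze_url and the host http://localhost:9200 with index sample_index are assumptions for illustration. The query-parameter form of _analyze is the one used in this thread. Passing the text without surrounding quotes also avoids the leading quote character, which is presumably why the first token above starts at offset 1.

```python
from urllib.parse import urlencode


def analyze_url(base_url, index, analyzer, text):
    """Build an _analyze URL using query parameters, as in this thread."""
    # urlencode escapes spaces and other unsafe characters in the text.
    query = urlencode({"analyzer": analyzer, "text": text})
    return f"{base_url}/{index}/_analyze?{query}"


url = analyze_url(
    "http://localhost:9200",  # assumed local node
    "sample_index",           # assumed index name
    "main_analyzer",
    "react.js is a nice library",
)
print(url)
```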

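For words that are similar but not exactly the same, such as React.js and Reactjs, you can use a synonym filter. Do you have a fixed set of keywords that you want to match?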

+0

Neither the documents nor the search queries are predefined. There is nothing I can hardcode. I'm working on a search engine.

+0

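What I have in mind is basically a filter that, for words like "react.js", creates a synonym without the dot. That way both variants are accepted. Unfortunately, I can't find any way to do this in the documentation.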