2015-09-16 100 views

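Position gaps at line breaks or punctuation marks in Elasticsearch

In Elasticsearch, is there a way to configure an analyzer so that it produces position gaps between tokens when it encounters a line break or a punctuation mark?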

Say I index an object that has the following nonsensical string (containing a line break) as one of its fields:

The quick brown fox runs after the rabbit. 
Then comes the jumpy frog. 

The standard analyzer would produce the following tokens, with their corresponding positions:

0 the 
1 quick 
2 brown 
3 fox 
4 runs 
5 after 
6 the 
7 rabbit 
8 then 
9 comes 
10 the 
11 jumpy 
12 frog 

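This means that a match_phrase query for "the rabbit then comes" would match this document as a hit. Is there a way to introduce a position gap between "rabbit" and "then", so that the query does not match unless slop is introduced?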
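To make the problem concrete, here is a minimal Python sketch that mimics the positions listed above and checks an exact (slop 0) phrase match. The `analyze` and `phrase_matches` helpers are hypothetical stand-ins written for this illustration; they only approximate what the standard analyzer and match_phrase do.

```python
import re

def analyze(text):
    """Rough stand-in for the standard analyzer: lowercased word tokens with 0-based positions."""
    return [(pos, tok.lower()) for pos, tok in enumerate(re.findall(r"\w+", text))]

def phrase_matches(tokens, phrase):
    """True if the phrase terms occur at consecutive positions (slop 0)."""
    terms = [t.lower() for t in re.findall(r"\w+", phrase)]
    by_pos = dict(tokens)
    return any(
        all(by_pos.get(start + i) == term for i, term in enumerate(terms))
        for start in by_pos
    )

doc = "The quick brown fox runs after the rabbit.\nThen comes the jumpy frog."
tokens = analyze(doc)

# "rabbit" and "then" sit at adjacent positions, so the phrase spans the sentence break.
print(tokens[7], tokens[8])
print(phrase_matches(tokens, "the rabbit then comes"))
```

Because nothing separates position 7 from position 8, the phrase is found even though it crosses both the period and the line break.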

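Of course, a workaround could be to convert the single string into an array (one entry per line) and use position_offset_gap in the field mapping, but I would really rather keep a single string with line breaks (and an ultimate solution would involve a larger position gap for line breaks than for, say, punctuation marks).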
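For comparison, the array workaround can be sketched like this. The `index_values` helper is hypothetical and the gap of 100 is an arbitrary value chosen for the example; the position arithmetic is my understanding of how a gap between array entries behaves, not something taken from the Elasticsearch source. (As far as I know, position_offset_gap was renamed position_increment_gap in Elasticsearch 2.0.)

```python
import re

def index_values(values, gap):
    """Assign positions across an array of values, skipping `gap` positions between entries."""
    positions = []
    pos = -1
    for n, value in enumerate(values):
        if n > 0:
            pos += gap  # assumed behaviour of the gap setting between array entries
        for tok in re.findall(r"\w+", value.lower()):
            pos += 1
            positions.append((pos, tok))
    return positions

doc = "The quick brown fox runs after the rabbit.\nThen comes the jumpy frog."
lines = doc.splitlines()  # one array entry per line
tokens = index_values(lines, gap=100)

# "rabbit" ends the first entry; "then" starts the second one, 101 positions later.
print([t for t in tokens if t[1] in ("rabbit", "then")])
```

This separates lines, but it cannot distinguish a period from a line break inside one entry, which is why a single-string solution is still attractive.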

Answer

I finally figured out a solution that uses a char_filter to introduce extra tokens at line breaks and punctuation marks:

PUT /index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_mapping": {
          "type": "mapping",
          "mappings": [ ".=>\\n_PERIOD_\\n", "\\n=>\\n_NEWLINE_\\n" ]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "char_filter": ["my_mapping"],
          "filter": ["lowercase"]
        }
      }
    }
  }
}
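The effect of this analyzer can be approximated outside Elasticsearch. The sketch below is an assumption-laden simulation, not the real implementation: `char_filter` and `analyze` are hypothetical helpers, the `\w+` regex only roughly imitates the standard tokenizer (it keeps `_PERIOD_` together because underscores count as word characters), and positions are numbered from 1 to mirror the _analyze output of the Elasticsearch version used here.

```python
import re

# Same two rules as the "my_mapping" char filter above.
MAPPINGS = {".": "\n_PERIOD_\n", "\n": "\n_NEWLINE_\n"}

def char_filter(text):
    """Apply the mappings in a single pass, so newlines inserted by a rule are not remapped."""
    pattern = "|".join(re.escape(key) for key in MAPPINGS)
    return re.sub(pattern, lambda m: MAPPINGS[m.group(0)], text)

def analyze(text):
    """char_filter, then word tokenization, then lowercase; positions start at 1."""
    words = re.findall(r"\w+", char_filter(text))
    return [(pos, tok.lower()) for pos, tok in enumerate(words, start=1)]

doc = "The quick brown fox runs after the rabbit.\nThen comes the jumpy frog."
tokens = analyze(doc)

# Two marker tokens now sit between "rabbit" and "then".
print(tokens[7:11])
```

The marker tokens occupy positions of their own, which is what creates the gap; the trade-off is that they are indexed as searchable terms too.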

Testing it with the example string

POST /index/_analyze?analyzer=my_analyzer&pretty 
The quick brown fox runs after the rabbit. 
Then comes the jumpy frog. 

produces the following result:

{
  "tokens" : [ {
    "token" : "the",
    "start_offset" : 0,
    "end_offset" : 3,
    "type" : "<ALPHANUM>",
    "position" : 1
  }, {
... snip ...
    "token" : "rabbit",
    "start_offset" : 35,
    "end_offset" : 41,
    "type" : "<ALPHANUM>",
    "position" : 8
  }, {
    "token" : "_period_",
    "start_offset" : 41,
    "end_offset" : 41,
    "type" : "<ALPHANUM>",
    "position" : 9
  }, {
    "token" : "_newline_",
    "start_offset" : 42,
    "end_offset" : 42,
    "type" : "<ALPHANUM>",
    "position" : 10
  }, {
    "token" : "then",
    "start_offset" : 43,
    "end_offset" : 47,
    "type" : "<ALPHANUM>",
    "position" : 11
... snip ...
  } ]
}
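With "_period_" and "_newline_" between "rabbit" and "then", the phrase from the question should no longer match at slop 0. The sketch below estimates the smallest slop that would be needed, using a simplified model (the spread of "token position minus term offset in the query") that I believe approximates how sloppy phrase matching works for phrases without repeated terms. The `analyze` and `min_slop` helpers are hypothetical and exist only for this illustration.

```python
import re
from itertools import product

def analyze(text):
    """Simplified version of my_analyzer: char mapping, word tokens, lowercase."""
    mappings = {".": "\n_PERIOD_\n", "\n": "\n_NEWLINE_\n"}
    filtered = re.sub(r"[.\n]", lambda m: mappings[m.group(0)], text)
    return [tok.lower() for tok in re.findall(r"\w+", filtered)]

def min_slop(tokens, phrase_terms):
    """Smallest slop under the simplified model, or None if a term is missing."""
    candidates = [[p for p, t in enumerate(tokens) if t == term] for term in phrase_terms]
    if not all(candidates):
        return None
    best = None
    for combo in product(*candidates):
        # A perfect phrase has the same shift for every term; the spread is the slop needed.
        shifts = [p - i for i, p in enumerate(combo)]
        width = max(shifts) - min(shifts)
        best = width if best is None else min(best, width)
    return best

doc = "The quick brown fox runs after the rabbit.\nThen comes the jumpy frog."
tokens = analyze(doc)

print(min_slop(tokens, ["the", "rabbit"]))                   # stays an exact phrase
print(min_slop(tokens, ["the", "rabbit", "then", "comes"]))  # needs slop for the two markers
```

Under this model, a match_phrase query for "the rabbit then comes" would need a slop of at least 2, one for each marker token.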