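Position gaps for newlines or punctuation in Elasticsearch

In Elasticsearch, is there a way to set up an analyzer that produces a position gap between tokens when it encounters a newline or a punctuation mark?

Say I index an object that has the following nonsense string (with a newline) as one of its fields: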

The quick brown fox runs after the rabbit. 
Then comes the jumpy frog.

The standard analyzer yields the following tokens with their respective positions:

0 the 
1 quick 
2 brown 
3 fox 
4 runs 
5 after 
6 the 
7 rabbit 
8 then 
9 comes 
10 the 
11 jumpy 
12 frog

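This means that a match_phrase query for the rabbit then comes will match this document as a hit. Is there a way to introduce a position gap between rabbit and then, so that the query does not match unless a slop is introduced?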
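To make the problem concrete, here is a minimal Python sketch of why the phrase matches. It is only a rough model, not Elasticsearch code: `tokenize` and `phrase_matches` are hypothetical helpers, the regex split merely approximates the standard analyzer, and a phrase is taken to match when its terms sit at consecutive positions.

```python
import re

def tokenize(text):
    """Lowercase word tokens with 0-based positions (rough stand-in for the standard analyzer)."""
    return list(enumerate(re.findall(r"\w+", text.lower())))

def phrase_matches(tokens, phrase):
    """Return True if the phrase terms occur at consecutive positions."""
    terms = phrase.lower().split()
    index = {}
    for pos, tok in tokens:
        index.setdefault(tok, set()).add(pos)
    # Try every occurrence of the first term as a possible start of the phrase.
    return any(
        all(start + i in index.get(term, ()) for i, term in enumerate(terms))
        for start in index.get(terms[0], ())
    )

text = "The quick brown fox runs after the rabbit.\nThen comes the jumpy frog."
tokens = tokenize(text)
# rabbit and then end up at adjacent positions, so the phrase spans the sentence break.
print(phrase_matches(tokens, "the rabbit then comes"))
```

Because the period and the newline are discarded without consuming a position, nothing separates rabbit from then in this model.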

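Of course, a workaround could be to convert the single string into an array (one line per entry) and use position_offset_gap in the field mapping, but I would really rather keep a single string with newlines (and the ultimate solution would involve a larger position gap for newlines than for punctuation marks).

Source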
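The array workaround can be pictured with a small Python sketch. This is an assumption-laden model rather than Lucene's actual bookkeeping: `positions_for_values` is a hypothetical helper, and it simply pushes the first token of each later array entry `gap` positions further along, so the exact numbers may be off by one compared with a real index. Note also that position_offset_gap is the setting name used in this question; Elasticsearch 2.0 and later call it position_increment_gap.

```python
import re

def positions_for_values(values, gap):
    """Assign positions across the entries of a multi-valued field.

    Assumption: each entry after the first starts `gap` positions
    later than it would if the entries were simply concatenated.
    """
    out = []
    next_pos = 0
    for i, value in enumerate(values):
        if i > 0:
            next_pos += gap  # the artificial gap between array entries
        for tok in re.findall(r"\w+", value.lower()):
            out.append((next_pos, tok))
            next_pos += 1
    return out

lines = ["The quick brown fox runs after the rabbit.", "Then comes the jumpy frog."]
tokens = positions_for_values(lines, gap=100)
by_term = {tok: pos for pos, tok in tokens}  # last position seen for each term
print(by_term["rabbit"], by_term["then"])
```

With a gap of 100, rabbit and then are far enough apart that a phrase query would need a very large slop to bridge the two lines.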

2015-09-16 Shadocko

I finally figured out a solution that uses a char_filter to introduce extra tokens for newlines and punctuation marks:

PUT /index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_mapping": {
          "type": "mapping",
          "mappings": [ ".=>\\n_PERIOD_\\n", "\\n=>\\n_NEWLINE_\\n" ]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "char_filter": ["my_mapping"],
          "filter": ["lowercase"]
        }
      }
    }
  }
}

Testing with the example string:

POST /index/_analyze?analyzer=my_analyzer&pretty 
The quick brown fox runs after the rabbit. 
Then comes the jumpy frog.

yields the following result:

{
  "tokens" : [ {
    "token" : "the",
    "start_offset" : 0,
    "end_offset" : 3,
    "type" : "<ALPHANUM>",
    "position" : 1
  }, {
... snip ...
    "token" : "rabbit",
    "start_offset" : 35,
    "end_offset" : 41,
    "type" : "<ALPHANUM>",
    "position" : 8
  }, {
    "token" : "_period_",
    "start_offset" : 41,
    "end_offset" : 41,
    "type" : "<ALPHANUM>",
    "position" : 9
  }, {
    "token" : "_newline_",
    "start_offset" : 42,
    "end_offset" : 42,
    "type" : "<ALPHANUM>",
    "position" : 10
  }, {
    "token" : "then",
    "start_offset" : 43,
    "end_offset" : 47,
    "type" : "<ALPHANUM>",
    "position" : 11
... snip ...
  } ]
}
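The effect of the marker tokens on phrase queries can be sketched in Python. This is a simplified model under stated assumptions, not Elasticsearch itself: `analyze` and `phrase_matches` are hypothetical helpers, the regex split only approximates the standard tokenizer, positions are 0-based (so one lower than in the output above), and slop is modelled as the maximum drift allowed between the terms' relative positions.

```python
import re
from itertools import product

# Same replacements as the "my_mapping" char filter above.
CHAR_MAP = {".": "\n_PERIOD_\n", "\n": "\n_NEWLINE_\n"}

def analyze(text):
    """Apply the character mapping, then split into lowercase tokens with 0-based positions."""
    mapped = "".join(CHAR_MAP.get(ch, ch) for ch in text)
    return list(enumerate(re.findall(r"\w+", mapped.lower())))

def phrase_matches(tokens, phrase, slop=0):
    """Simplified sloppy-phrase check: terms may drift apart by at most `slop` positions."""
    terms = phrase.lower().split()
    index = {}
    for pos, tok in tokens:
        index.setdefault(tok, []).append(pos)
    candidates = [index.get(term, []) for term in terms]
    for combo in product(*candidates):
        # In an exact match every term has the same (position - offset in phrase).
        shifts = [pos - i for i, pos in enumerate(combo)]
        if max(shifts) - min(shifts) <= slop:
            return True
    return False

text = "The quick brown fox runs after the rabbit.\nThen comes the jumpy frog."
tokens = analyze(text)
phrase = "the rabbit then comes"
print(phrase_matches(tokens, phrase))          # exact phrase
print(phrase_matches(tokens, phrase, slop=2))  # allow two positions of drift
```

In this model the _period_ and _newline_ tokens each occupy one position between rabbit and then, so the exact phrase no longer matches and a slop of 2 is needed to bridge the sentence break. Mapping a newline to several marker tokens instead of one would be one way to give newlines a larger gap than punctuation.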

Source

2015-09-23 13:37:37 Shadocko
