Elasticsearch analyzer tokens for alphanumeric values with dots

2015-02-10 9 views
2

I have a text field containing the value below:

term1-term2-term3-term4-term5-RWHPSA951000155.2013-05-27.log 

When I check it with the analyze API (default analyzer), I get this:

{ 
    "tokens": [ 
     { 
      "token": "text", 
      "start_offset": 2, 
      "end_offset": 6, 
      "type": "<ALPHANUM>", 
      "position": 1 
     }, 
     { 
      "token": "term1", 
      "start_offset": 9, 
      "end_offset": 14, 
      "type": "<ALPHANUM>", 
      "position": 2 
     }, 
     { 
      "token": "term2", 
      "start_offset": 15, 
      "end_offset": 20, 
      "type": "<ALPHANUM>", 
      "position": 3 
     }, 
     { 
      "token": "term3", 
      "start_offset": 21, 
      "end_offset": 26, 
      "type": "<ALPHANUM>", 
      "position": 4 
     }, 
     { 
      "token": "term4", 
      "start_offset": 27, 
      "end_offset": 32, 
      "type": "<ALPHANUM>", 
      "position": 5 
     }, 
     { 
      "token": "term5", 
      "start_offset": 33, 
      "end_offset": 38, 
      "type": "<ALPHANUM>", 
      "position": 6 
     }, 
     { 
      "token": "rwhpsa951000155.2013", 
      "start_offset": 39, 
      "end_offset": 59, 
      "type": "<ALPHANUM>", 
      "position": 7 
     }, 
     { 
      "token": "05", 
      "start_offset": 60, 
      "end_offset": 62, 
      "type": "<NUM>", 
      "position": 8 
     }, 
     { 
      "token": "27", 
      "start_offset": 63, 
      "end_offset": 65, 
      "type": "<NUM>", 
      "position": 9 
     }, 
     { 
      "token": "log", 
      "start_offset": 66, 
      "end_offset": 69, 
      "type": "<ALPHANUM>", 
      "position": 10 
     } 
    ] 
} 

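I am particularly curious about this token: rwhpsa951000155.2013. How does this happen? Because of it, my search for RWHPSA951000155 currently fails. How can I get RWHPSA951000155 and 2013 recognized as separate tokens?

Note that if the value is term1-term2-term3-term4-term5-RWHPSA.2013-05-27.log, then RWHPSA and 2013 are split into separate tokens. So this has something to do with 951000155.

Thanks,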

Answer

7

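The Standard Analyzer tokenizes rwhpsa951000155.2013 as a product number. The Lucene documentation describes the rule like this:

"Splits words at hyphens, unless there's a number in the token, in which case the whole token is interpreted as a product number and is not split."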
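The behaviour seen in the question (155.2013 stays joined, while RWHPSA.2013 is split) can be mimicked with a rough rule: a period stays inside a token only when it has a digit on both sides. The sketch below is a regex approximation written for this one kind of input. The function name `approx_standard_tokens` and the pattern are assumptions made for illustration; this is not Lucene's tokenizer grammar.

```python
import re

# Runs of letters/digits, optionally continued across a period that has
# a digit immediately before it and a digit immediately after it.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+(?:(?<=[0-9])\.(?=[0-9])[A-Za-z0-9]+)*")


def approx_standard_tokens(text):
    """Very rough imitation of the default analyzer for ASCII input:
    split on separators, keep digit.digit together, then lowercase."""
    return [match.lower() for match in TOKEN_RE.findall(text)]


with_digits = "term1-term2-term3-term4-term5-RWHPSA951000155.2013-05-27.log"
without_digits = "term1-term2-term3-term4-term5-RWHPSA.2013-05-27.log"

print(approx_standard_tokens(with_digits))
print(approx_standard_tokens(without_digits))
```

With the first value the sixth token comes out as rwhpsa951000155.2013, and with the second value rwhpsa and 2013 come out separately, which matches what the question reports.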

You can add a pattern_replace char filter that replaces '.' with a space. The standard tokenizer will then split the term the way you want.

POST /test 
{ 
    "settings": { 
     "index": { 
      "analysis": { 
       "char_filter": { 
        "my_pattern": { 
         "type": "pattern_replace", 
         "pattern": "\\.", 
         "replacement": " " 
        } 
       }, 
       "analyzer": { 
        "my_analyzer": { 
         "tokenizer": "standard", 
         "char_filter": [ 
          "my_pattern" 
         ] 
        } 
       } 
      } 
     } 
    }, 
    "mappings": { 
     "my_type": { 
      "properties": { 
       "test": { 
        "type": "string", 
        "analyzer": "my_analyzer" 
       } 
      } 
     } 
    } 
} 
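To see what the my_pattern char filter in the settings above does before the tokenizer runs, the same substitution can be tried locally. This is only a sketch: the helper name `apply_pattern_replace` is made up for illustration, and Python's `re` module stands in for the Java regex engine that Elasticsearch uses (the pattern `\.` means the same thing in both).

```python
import re


def apply_pattern_replace(text, pattern=r"\.", replacement=" "):
    """Imitate a pattern_replace char filter: substitute every match
    of `pattern` in the raw text before it reaches the tokenizer."""
    return re.sub(pattern, replacement, text)


value = "term1-term2-term3-term4-term5-RWHPSA951000155.2013-05-27.log"
filtered = apply_pattern_replace(value)
print(filtered)
```

Because each period is replaced by exactly one space, the text keeps its length, so the character offsets reported by the analyze API still line up with the original value.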

Calling the analyze API:

curl -XGET 'localhost:9200/test/_analyze?analyzer=my_analyzer&pretty=true' -d 'term1-term2-term3-term4-term5-RWHPSA951000155.2013-05-27.log' 

returns:

{ 
    "tokens" : [ { 
    "token" : "term1", 
    "start_offset" : 0, 
    "end_offset" : 5, 
    "type" : "<ALPHANUM>", 
    "position" : 1 
    }, { 
    "token" : "term2", 
    "start_offset" : 6, 
    "end_offset" : 11, 
    "type" : "<ALPHANUM>", 
    "position" : 2 
    }, { 
    "token" : "term3", 
    "start_offset" : 12, 
    "end_offset" : 17, 
    "type" : "<ALPHANUM>", 
    "position" : 3 
    }, { 
    "token" : "term4", 
    "start_offset" : 18, 
    "end_offset" : 23, 
    "type" : "<ALPHANUM>", 
    "position" : 4 
    }, { 
    "token" : "term5", 
    "start_offset" : 24, 
    "end_offset" : 29, 
    "type" : "<ALPHANUM>", 
    "position" : 5 
    }, { 
    "token" : "RWHPSA951000155", 
    "start_offset" : 30, 
    "end_offset" : 45, 
    "type" : "<ALPHANUM>", 
    "position" : 6 
    }, { 
    "token" : "2013", 
    "start_offset" : 46, 
    "end_offset" : 50, 
    "type" : "<NUM>", 
    "position" : 7 
    }, { 
    "token" : "05", 
    "start_offset" : 51, 
    "end_offset" : 53, 
    "type" : "<NUM>", 
    "position" : 8 
    }, { 
    "token" : "27", 
    "start_offset" : 54, 
    "end_offset" : 56, 
    "type" : "<NUM>", 
    "position" : 9 
    }, { 
    "token" : "log", 
    "start_offset" : 57, 
    "end_offset" : 60, 
    "type" : "<ALPHANUM>", 
    "position" : 10 
    } ] 
} 
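Note that the tokens above keep their original case (RWHPSA951000155), whereas the default analyzer in the question lowercased them (rwhpsa951000155.2013). The analyzer defined above has no token filters, while the standard analyzer also applies a lowercase filter. If case-insensitive matching is wanted, one option is to add the built-in lowercase token filter. The sketch below builds such a request body. The function name `build_index_settings` is hypothetical, and adding the filter is an assumption about the goal, not part of the original answer.

```python
import json


def build_index_settings(analyzer_name="my_analyzer"):
    """Build an index-creation body like the one above, plus a
    lowercase token filter so tokens match lowercased query terms."""
    return {
        "settings": {
            "index": {
                "analysis": {
                    "char_filter": {
                        "my_pattern": {
                            "type": "pattern_replace",
                            "pattern": "\\.",  # the regex \. (a literal dot)
                            "replacement": " ",
                        }
                    },
                    "analyzer": {
                        analyzer_name: {
                            "tokenizer": "standard",
                            "char_filter": ["my_pattern"],
                            "filter": ["lowercase"],  # assumed addition
                        }
                    },
                }
            }
        }
    }


body = build_index_settings()
print(json.dumps(body, indent=2))
```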
+0

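Thanks! I had been searching the ES docs to understand how the standard analyzer works and never thought of looking at Lucene's docs. By the way, I know I can assign an analyzer to a field in the mapping. How do I make an analyzer the default for the index? The ES docs say: "the 'default_index' logical name can be used to configure the default analyzer that will be used at index time". Will that work? I already have the mapping defined and data in the index. How can I update just the default analyzer? – ksrini 2015-02-10 10:13:16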

+0

Yes, it will work. In the settings example above, change 'my_analyzer' to 'default_index'. For updating the analyzer, take a look at this link: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-update-settings.html#update-settings-analysis – 2015-02-10 10:27:38
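The update-settings documentation linked above describes changing analysis settings on an existing index as a three-step sequence: close the index, put the new analysis settings, reopen the index. The sketch below only builds those three requests as curl command strings and sends nothing. The function name `plan_analyzer_update` is hypothetical, 'myindex' is the index name used in this thread, and the analyzer body is the one from the answer renamed to 'default_index'.

```python
import json


def plan_analyzer_update(index, analysis, host="localhost:9200"):
    """Return the curl commands for close -> update analysis -> open.
    Nothing is executed; the caller decides whether to run them."""
    body = json.dumps({"analysis": analysis})
    return [
        f"curl -XPOST '{host}/{index}/_close'",
        f"curl -XPUT '{host}/{index}/_settings' -d '{body}'",
        f"curl -XPOST '{host}/{index}/_open'",
    ]


analysis = {
    "char_filter": {
        "my_pattern": {"type": "pattern_replace", "pattern": "\\.", "replacement": " "}
    },
    "analyzer": {
        "default_index": {"tokenizer": "standard", "char_filter": ["my_pattern"]}
    },
}

commands = plan_analyzer_update("myindex", analysis)
for command in commands:
    print(command)
```

Documents that are already indexed keep the tokens produced when they were indexed; the new analyzer only applies to documents indexed after the change, so existing data has to be reindexed to pick it up.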

+0

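I tried it, and it looks like something went wrong. When I use the analyze API with the text mentioned above, I now get '{"tokens":[]}'. When I run 'curl -XGET 'http://localhost:9200/myindex/_settings?pretty'', I can see the new analyzer. How do I remove all the analyzers and filters and go back to the old state? I had never used the analysis {} section in the settings before. – ksrini 2015-02-10 11:00:12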