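What do you think of the Pattern tokenizer? I wrote a regular expression that splits the string into tokens:
(?<=(^\\w{4}))|(?<=^\\w{4}(\\w{3}))|(?<=^\\w{4}\\w{3}(\\w{1}))|(?<=^\\w{4}\\w{3}\\w{1}(\\w{2}))
Each alternative is a zero-width lookbehind, so the string is split after the 4th, 7th, 8th, and 10th characters. With that pattern, I created an analyzer like this: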
PUT /myindex
{
  "settings": {
    "analysis": {
      "analyzer": {
        "codeanalyzer": {
          "type": "pattern",
          "pattern": "(?<=(^\\w{4}))|(?<=^\\w{4}(\\w{3}))|(?<=^\\w{4}\\w{3}(\\w{1}))|(?<=^\\w{4}\\w{3}\\w{1}(\\w{2}))"
        }
      }
    }
  }
}
POST /myindex/_analyze?analyzer=codeanalyzer&text=ABCD1E2F34
The result is the following tokenized data:
{
  "tokens": [
    {
      "token": "abcd",
      "start_offset": 0,
      "end_offset": 4,
      "type": "word",
      "position": 0
    },
    {
      "token": "1e2",
      "start_offset": 4,
      "end_offset": 7,
      "type": "word",
      "position": 1
    },
    {
      "token": "f",
      "start_offset": 7,
      "end_offset": 8,
      "type": "word",
      "position": 2
    },
    {
      "token": "34",
      "start_offset": 8,
      "end_offset": 10,
      "type": "word",
      "position": 3
    }
  ]
}
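If you want to check where the pattern splits before creating an index, here is a minimal Python sketch. It only approximates what the analyzer does: Elasticsearch evaluates the pattern with Java regular expressions, not Python's `re`. The helper name `tokenize` is hypothetical, the capturing groups from the original pattern are dropped so that `re.split` does not return them as extra items, and the `.lower()` call mimics the pattern analyzer's default lowercasing. It assumes Python 3.7 or later, where `re.split` splits on empty matches.

```python
import re

# Zero-width split points after the 4th, 7th, 8th, and 10th word characters.
# Assumption: same lookbehinds as the analyzer pattern, minus the capturing groups.
SPLIT_POINTS = re.compile(
    r"(?<=^\w{4})|(?<=^\w{4}\w{3})|(?<=^\w{4}\w{3}\w{1})|(?<=^\w{4}\w{3}\w{1}\w{2})"
)


def tokenize(code):
    """Split a code at the fixed offsets and lowercase each piece."""
    # Drop empty pieces, e.g. the one produced by a split point at the end of the string.
    return [piece.lower() for piece in SPLIT_POINTS.split(code) if piece]


if __name__ == "__main__":
    print(tokenize("ABCD1E2F34"))
```

For the sample input `ABCD1E2F34` this gives the same four tokens as the `_analyze` response above. An input shorter than four characters comes back as a single token, since no lookbehind can match.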
You can also check the documentation: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-pattern-tokenizer.html