Elasticsearch: searching with terms taken from another field

I need to match content against a list of words (for obscene-word matching). A simple example of what I need:

{
    "bool": {
        "should": [
            { "term": { "content": "word1" }},
            { "term": { "content": "word2" }}
            :
            { "term": { "content": "word1001" }}
        ]
    }
}

The words I am looking for ("word1", "word2", ... "word1001") are listed in a field of another type.

What I need to achieve is something like:

{
    "bool": {
        "should": [
            { "term": { "content": banned_words.word }}
        ]
    }
}

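The number of words I need to match may run into the thousands, and the bool query above does not seem to be the most efficient way to do it. However, I cannot find an alternative.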

2015-09-25 crafter

I think you will have to write a custom matcher for this. In any case, a vanilla bool query with 1000 elements will not be efficient. – Ashalynd

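The initial request will be slow, but if you can use a filter rather than a query for the list of banned words, that filter will be cached (making subsequent executions very cheap!) –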
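
The caching comment above can be made concrete with a `terms` clause in filter context. The following is a minimal Python sketch that only builds request bodies and does not talk to a cluster. The helper names, the index `my_index`, the type `banned_words`, and the document id `1` are assumptions for illustration. The `bool`/`filter` wrapper assumes Elasticsearch 2.0 or later; on 1.x a `filtered` query would play the same role. The second helper uses the terms lookup mechanism, which reads the word list from a field of another document, and is probably the closest match to the `banned_words.word` idea in the question.

```python
import json


def inline_terms_filter(field, words):
    """One terms clause holding every word, instead of one term clause per word."""
    # terms are not analyzed, so the words must match the indexed tokens exactly
    unique_words = sorted(set(words))
    return {"query": {"bool": {"filter": [{"terms": {field: unique_words}}]}}}


def lookup_terms_filter(field, index, doc_type, doc_id, path):
    """Terms lookup: the word list is read from `path` in another document."""
    # hypothetical location of the banned-word document; adjust to your data
    lookup = {"index": index, "type": doc_type, "id": doc_id, "path": path}
    return {"query": {"bool": {"filter": [{"terms": {field: lookup}}]}}}


if __name__ == "__main__":
    print(json.dumps(inline_terms_filter("content", ["word1", "word2"]), indent=2))
    print(json.dumps(
        lookup_terms_filter("content", "my_index", "banned_words", "1", "word"),
        indent=2,
    ))
```

The lookup variant assumes the banned words are stored as an array in the `word` field of a single document; if they are spread over many documents, they would need to be collected first and passed inline.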

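Another approach, which avoids matching all the bad words at query time, is to use a synonym token filter at index time to match those words and flag the documents that contain them.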

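All you need to do is store your bad words in a file on the file system (the path is resolved relative to the Elasticsearch config directory):

analysis/badwords.txt: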

word1 => BADWORD  <--- pick whatever you want the badword to be replaced with 
word2 => BADWORD 
... 
word1000 => BADWORD
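
With thousands of words, the file above is easier to generate than to write by hand. Here is a minimal Python sketch, assuming the banned words are available as a plain list of strings; the function names `synonym_rules` and `write_rules` are hypothetical, and the output follows the `word => REPLACEMENT` layout shown above.

```python
def synonym_rules(words, replacement="BADWORD"):
    """Return one 'word => REPLACEMENT' rule per non-empty word."""
    rules = []
    for word in words:
        word = word.strip()
        if word:  # skip blank entries
            rules.append(f"{word} => {replacement}")
    return rules


def write_rules(path, words, replacement="BADWORD"):
    """Write the rules to `path`, one per line (e.g. analysis/badwords.txt)."""
    with open(path, "w", encoding="utf-8") as fh:
        fh.write("\n".join(synonym_rules(words, replacement)) + "\n")


if __name__ == "__main__":
    for rule in synonym_rules(["word1", "word2"]):
        print(rule)
```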

Then your index settings need to use the synonym token filter:

curl -XPUT localhost:9200/my_index -d '{ 
    "settings" : { 
     "analysis" : { 
      "analyzer" : { 
       "badwords" : { 
        "tokenizer" : "whitespace", 
        "filter" : ["synonym"] 
       } 
      }, 
      "filter" : { 
       "synonym" : { 
        "type" : "synonym", 
        "synonyms_path" : "analysis/badwords.txt" 
       } 
      } 
     } 
    }, 
    "mappings": { 
     "my_type": { 
      "properties": { 
       "content": { 
        "type": "string", 
        "index_analyzer": "badwords" 
       } 
      } 
     } 
    } 
}'

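Then, when you index a document whose content field contains a bad word matching one of the entries in your badwords.txt file, that word is replaced by the replacement word you chose in the synonyms file.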

curl -XPOST 'localhost:9200/my_index/_analyze?analyzer=badwords&pretty' -d 'you are a word2' 
{ 
    "tokens" : [ { 
    "token" : "you", 
    "start_offset" : 0, 
    "end_offset" : 3, 
    "type" : "word", 
    "position" : 1 
    }, { 
    "token" : "are", 
    "start_offset" : 4, 
    "end_offset" : 7, 
    "type" : "word", 
    "position" : 2 
    }, { 
    "token" : "a", 
    "start_offset" : 8, 
    "end_offset" : 9, 
    "type" : "word", 
    "position" : 3 
    }, { 
    "token" : "BADWORD", 
    "start_offset" : 10, 
    "end_offset" : 15, 
    "type" : "SYNONYM", 
    "position" : 4 
    } ] 
}
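
To get a feel for what this analyzer will and will not catch, the token stream can be approximated in a few lines of Python. This is only a rough simulation under stated assumptions, not Elasticsearch's implementation: it splits on whitespace like the `whitespace` tokenizer and applies `word => REPLACEMENT` rules by exact match. The function names `parse_rules` and `analyze` are hypothetical.

```python
def parse_rules(lines):
    """Parse 'word => A, B' lines into {word: [A, B]}, skipping blanks and # comments."""
    rules = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#") or "=>" not in line:
            continue
        left, right = line.split("=>", 1)
        rules[left.strip()] = [t.strip() for t in right.split(",") if t.strip()]
    return rules


def analyze(text, rules):
    """Split on whitespace and replace tokens that exactly match a rule."""
    tokens = []
    for word in text.split():
        tokens.extend(rules.get(word, [word]))
    return tokens


if __name__ == "__main__":
    rules = parse_rules(["word1 => BADWORD", "word2 => BADWORD"])
    print(analyze("you are a word2", rules))
    print(analyze("You are a Word2!", rules))
```

The second call leaves `Word2!` untouched, because this setup neither lowercases tokens nor strips punctuation. If that matters, a `lowercase` filter placed before the synonym filter, or a different tokenizer, is worth considering.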

2015-09-27 05:24:03 Val

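Thanks @Val. Give me some time to see how this solution fits my implementation. I don't necessarily want to replace words, but rather to flag documents (on a classifieds site, someone might be giving away a "cute kitty cat"), which would raise all sorts of flags. – crafter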

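OK, I see. It is also possible not to replace the bad word but simply to tag it with a synonym mapping such as 'word1 => word1, BADWORD'. That keeps the potentially bad word and also adds a "BADWORD" token alongside it. – Val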
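
The tagging variant from this comment could be sketched as follows. This is a minimal Python sketch that only builds the rule lines and a request body; the function names `tagging_rules` and `flagged_documents_query` are hypothetical. It assumes the marker token is indexed verbatim in the `content` field, which holds for the whitespace-based analyzer shown in the answer.

```python
def tagging_rules(words, marker="BADWORD"):
    """Keep each word and add the marker token: 'word => word, MARKER'."""
    return [f"{word} => {word}, {marker}" for word in words]


def flagged_documents_query(field="content", marker="BADWORD"):
    """Request body for a term query that finds documents carrying the marker."""
    return {"query": {"term": {field: marker}}}


if __name__ == "__main__":
    for rule in tagging_rules(["word1", "word2"]):
        print(rule)
    print(flagged_documents_query())
```

Since the marker is added at index time, finding flagged documents takes a single term lookup, however long the word list is.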
