
ElasticSearch 5.5.0: finding related documents

In ElasticSearch 5.5.0 I am looking at the "more_like_this" clause, but I cannot find relevant documentation for my case. I have data like the document below in ElasticSearch, where the "description" field holds more than a million bytes of unindexed text, and I have about ten thousand such documents. How can I find the groups of documents whose content matches each other by at least 80%?

{ 
    "_index": "school", 
    "_type": "book", 
    "_id": "1", 
    "_source": { 
     "title": "How to drive safely", 
     "description": "LOTS OF WORDS...The book is written to help readers about giving driving safety guidelines. Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum. Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum. LONG...." 
    } 
} 

In the end, I am looking for lists of document IDs with at least 80% matching content. The expected result contains the matching document IDs (any format is fine):

[ [1,30, 500, 8000], [2, 40, 199], .... ] 

Do I need to write a batch job that compares every document against all the others and builds the output sets?

Please help.


Can someone help? –


Matching on what content? All fields, or a few selected ones? Are you trying to build some de-duplication logic? I am afraid you will have to handle this in your own code or logic by iterating over all the documents. –


I am looking to compare the "description" field of each of the 10,000 available books against the "description" field of every other book, and I need to find the books that match at least 80%. –

Answer


The more_like_this query has a parameter called minimum_should_match, which can be set to 80%. But you also need to take the max_query_terms parameter into account here.
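For illustration, a query along these lines could work (a minimal sketch using the official Python client; the document ID, the max_query_terms value of 500, and the localhost cluster are assumptions, not part of the answer):

from elasticsearch import Elasticsearch

es = Elasticsearch("localhost:9200")  # assumed local cluster

# Find books whose description is at least 80% similar to book "1".
resp = es.search(index="school", doc_type="book", body={
    "query": {
        "more_like_this": {
            "fields": ["description"],
            # "like" may reference an already-indexed document by ID
            "like": [{"_index": "school", "_type": "book", "_id": "1"}],
            "min_term_freq": 1,
            "max_query_terms": 500,  # the default of 25 is far too low for huge texts
            "minimum_should_match": "80%"
        }
    }
})
print([hit["_id"] for hit in resp["hits"]["hits"]])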

On top of that, this only works if you index the contents of those fields.
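Since the question says the "description" field is unindexed, the mapping would have to be changed first. A minimal sketch of such a mapping (the term_vector setting is an optional extra that speeds up more_like_this; index and type names are taken from the question):

from elasticsearch import Elasticsearch

es = Elasticsearch("localhost:9200")

es.indices.create(index="school", body={
    "mappings": {
        "book": {
            "properties": {
                "title": {"type": "text"},
                "description": {
                    "type": "text",  # text fields are indexed (analyzed) by default
                    "term_vector": "with_positions_offsets"  # optional; speeds up MLT
                }
            }
        }
    }
})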

Also, doing this at query time sounds like a very slow operation. You may want to rethink your strategy and cluster/group the documents at index time instead (something you would have to implement yourself, since it is a very custom requirement), so that searching stays fast.
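As a rough sketch of the batch approach the asker mentioned (this is not code from the answer: it greedily groups each document with its more_like_this hits above the 80% cutoff, which is approximate since the similarity is not symmetric):

from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan

es = Elasticsearch("localhost:9200")

groups, seen = [], set()
# Iterate over every book once and group it with its >= 80% matches.
for hit in scan(es, index="school", doc_type="book", _source=False):
    doc_id = hit["_id"]
    if doc_id in seen:
        continue  # already placed in an earlier group
    resp = es.search(index="school", doc_type="book", body={
        "size": 100,
        "_source": False,
        "query": {
            "more_like_this": {
                "fields": ["description"],
                "like": [{"_index": "school", "_type": "book", "_id": doc_id}],
                "min_term_freq": 1,
                "max_query_terms": 500,
                "minimum_should_match": "80%"
            }
        }
    })
    group = [doc_id] + [h["_id"] for h in resp["hits"]["hits"]]
    seen.update(group)
    if len(group) > 1:
        groups.append(group)

print(groups)  # e.g. [["1", "30", "500", "8000"], ["2", "40", "199"], ...]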


Thanks @alr, I appreciate the reply. I have applied minimum_should_match at 80%, but since the strings are very large the matching does not seem right. For 10,000 documents I think one machine can handle the load. Do you have any other suggestions? –


@NikhilJoshi - the answer also mentions that you need to look at 'max_query_terms' - have you tried increasing this parameter? Keep in mind that 'max_query_terms' is relative to the number of terms that are _unique_ in the document (not to the document's length). (It has nothing to do with the unindexed size of the data; what matters is how many unique terms there are, which, if you are lucky, may be lower than you think.) – dshockley
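One way to check that unique-term count is the _termvectors API, which computes term vectors on the fly when they are not stored (a sketch; the document ID "1" is just an example):

from elasticsearch import Elasticsearch

es = Elasticsearch("localhost:9200")

tv = es.termvectors(index="school", doc_type="book", id="1", fields=["description"])
# Number of unique terms in this book's description.
print(len(tv["term_vectors"]["description"]["terms"]))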


I supplied a maxQueryTerms value of 100. Does this mean that if a document only has 75 matching words, it will be ignored? –