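ElasticSearch 5.5.0: Finding related documents
I am on ElasticSearch 5.5.0 and have been looking at the "more_like_this" clause, but I have not been able to find related documents with it. I have data like the example below in ElasticSearch; the "description" field holds non-indexed data that is more than one million bytes in size. I have ten thousand documents like this one. How can I find the sets of documents that match each other by at least 80%?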
{
"_index": "school",
"_type": "book",
"_id": "1",
"_source": {
"title": "How to drive safely",
"description": "LOTS OF WORDS...The book is written to help readers about giving driving safety guidelines. Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum. Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum. LONG...."
}
}
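In the end, I am looking for a list of the document IDs whose content matches by at least 80%. The expected result would contain the matching document IDs (any format is fine):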
[ [1,30, 500, 8000], [2, 40, 199], .... ]
Do I need to write a batch process that compares each document with all the others and builds the output set?
Please help.
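For reference, the kind of per-document "more_like_this" request being considered might look like the sketch below. It only builds the request body as a Python dict; `build_mlt_query` is a hypothetical helper name, and the `min_term_freq` and `max_query_terms` values are illustrative assumptions. Note that `minimum_should_match` applies to the terms the query selects, which is not the same thing as "80% of the content matches", and that the query generally needs the field to be indexed, which may not hold for the "description" field described above.

```python
def build_mlt_query(index, doc_type, doc_id, field="description",
                    minimum_should_match="80%"):
    """Build a more_like_this search body for one source document.

    Hypothetical helper: it only assembles the JSON body and does not
    talk to a cluster. The body could be sent to /<index>/_search.
    """
    return {
        "query": {
            "more_like_this": {
                # Field(s) whose terms are used for the comparison.
                "fields": [field],
                # Refer to an already stored document instead of raw text.
                "like": [
                    {"_index": index, "_type": doc_type, "_id": doc_id}
                ],
                # Illustrative values, to be tuned for real data.
                "min_term_freq": 1,
                "max_query_terms": 50,
                # Share of the selected terms that must match.
                "minimum_should_match": minimum_should_match,
            }
        }
    }


body = build_mlt_query("school", "book", "1")
```

Running one such query per document and collecting the returned IDs would be one way to approximate the grouped output shown above, under the assumptions already mentioned.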
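Can someone help? –
Matching on what content? All fields, or a few selected fields? Are you trying to build some kind of deduplication logic? I'm afraid you will have to handle this in your own code or logic by iterating over all the documents. –
I am looking to compare the "description" field of each book against the "description" field of all the other 10,000 available books. I need to find the books that match by 80%. –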
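The "iterate over all documents" approach suggested in the comments could be sketched roughly as follows. This is only an illustration under stated assumptions: "80% match" is interpreted here as a Jaccard similarity of at least 0.8 over 3-word shingles, which is one possible definition and not something the thread specifies. The names `shingles`, `jaccard`, and `group_similar` are hypothetical, and the descriptions are assumed to have already been fetched into a dict of ID to text.

```python
from itertools import combinations


def shingles(text, k=3):
    """Return the set of k-word shingles of a text (lowercased)."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def group_similar(docs, threshold=0.8, k=3):
    """Group document IDs whose descriptions are at least `threshold` similar.

    docs maps document ID -> description text. Documents are merged
    transitively (union-find), so a group is a connected component.
    """
    sets = {doc_id: shingles(text, k) for doc_id, text in docs.items()}
    parent = {doc_id: doc_id for doc_id in docs}

    def find(x):
        # Follow parent links to the root, compressing the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(docs, 2):
        if jaccard(sets[a], sets[b]) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for doc_id in docs:
        groups.setdefault(find(doc_id), []).append(doc_id)
    # Keep only groups with at least two members.
    return [sorted(g) for g in groups.values() if len(g) > 1]


sample = {
    1: "the book is written to help readers with driving safety guidelines",
    2: "the book is written to help readers with driving safety guidelines",
    3: "a completely unrelated text about cooking pasta at home tonight",
}
print(group_similar(sample))
```

With 10,000 documents this compares roughly 50 million pairs, and each description is large, so a plain pairwise loop may be slow; an approximate technique such as MinHash with locality-sensitive hashing might be worth considering instead.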