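How do I index a PDF file with the Elasticsearch ingest-attachment plugin?

I have to implement full-text search over PDF documents using the Elasticsearch ingest-attachment plugin. When I search for the word someword, which appears in the PDF document, I get an empty hits array. How do I index a PDF file with the ingest-attachment plugin?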

//Code for creating pipeline 

PUT _ingest/pipeline/attachment
{
    "description": "Extract attachment information",
    "processors": [
        {
            "attachment": {
                "field": "data",
                "indexed_chars": -1
            }
        }
    ]
}

//Code for indexing the document through the pipeline

PUT my_index/my_type/my_id?pipeline=attachment 
{ 
    "filename" : "C:\\Users\\myname\\Desktop\\bh1.pdf", 
    "title" : "Quick", 
    "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=" 

} 

//Code for searching the word in pdf 

GET /my_index/my_type/_search
{
    "query": {
        "match": {
            "data": {
                "query": "someword"
            }
        }
    }
}

2017-02-08 Ashley

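If you open the PDF in a PDF viewer and search for "someword" there, do you find a match? – Alcanzar

@Alcanzar Yes, it finds the word. – Ashley

This looks like a duplicate of http://stackoverflow.com/questions/37861279/how-to-index-a-pdf-file-in-elasticsearch-5-0-0-with-ingest-attachment-plugin. Note that your PUT statement sends one specific, hard-coded "data" value for the document. You need to use curl or something similar to pass the actual file contents. The "data" you entered decodes to "Lorem ipsum dolor sit amet", so if you search for Lorem you will get a result. – Alcanzar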
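To make the comment above concrete, here is a minimal Python sketch that decodes the "data" value from the question. It only uses the standard library; the variable names are illustrative and not part of the original thread.

```python
import base64

# The "data" value copied verbatim from the indexing request in the question.
SAMPLE_DATA = "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="

# Decode the Base64 text back into the raw bytes that Elasticsearch receives.
decoded = base64.b64decode(SAMPLE_DATA)
print(decoded)

# The payload is a tiny RTF document, not the PDF named in "filename".
print(b"someword" in decoded)
```

The decoded bytes are an RTF snippet containing "Lorem ipsum dolor sit amet", which is why a search for "someword" cannot match this document regardless of the field queried.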

When you index the document with the second command, passing the Base64-encoded content, the stored document looks like this:

 { 
      "filename": "C:\\Users\\myname\\Desktop\\bh1.pdf", 
      "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=", 
      "attachment": { 
       "content_type": "application/rtf", 
       "language": "ro", 
       "content": "Lorem ipsum dolor sit amet", 
       "content_length": 28 
      }, 
      "title": "Quick" 
     }

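So your query needs to target the attachment.content field, not the data field (which only serves to carry the raw content into the pipeline at indexing time).

Change your query to the following and it will work: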

POST /my_index/my_type/_search 
{ 
    "query": { 
     "match": { 
     "attachment.content": {   <---- change this 
      "query": "lorem" 
     } 
     } 
    } 
}

PS: use POST instead of GET when sending a payload.
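As a rough illustration of sending that query as a POST with a JSON body, the sketch below builds the request with Python's standard library without sending it. The host `http://localhost:9200` and the helper name `build_search_request` are assumptions made for this example; the index, type, and field names come from the thread.

```python
import json
import urllib.request

# Assumption: a local single-node cluster; adjust the host for your setup.
ES_URL = "http://localhost:9200"


def build_search_request(term):
    """Build (but do not send) a POST search against attachment.content."""
    body = {"query": {"match": {"attachment.content": {"query": term}}}}
    return urllib.request.Request(
        url=ES_URL + "/my_index/my_type/_search",
        data=json.dumps(body).encode("utf-8"),  # request body must be bytes
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_search_request("lorem")
print(request.get_method(), request.full_url)
```

Passing `request` to `urllib.request.urlopen` would actually send it, which requires a running cluster with the document indexed.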

2017-02-11 14:38:29 Val

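Glad this worked. Is there anything else you need? – Val

Any idea how to convert a PDF file to a Base64-encoded file using Elasticsearch? – Ashley

I think this should be a new question, since it is unrelated to this one. – Val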
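For readers with the same follow-up question: the attachment processor expects the `data` field to already contain Base64 text, so the encoding is done by the client before indexing. A minimal Python sketch, assuming a hypothetical helper named `build_attachment_doc` and the field names used in the question:

```python
import base64
from pathlib import Path


def build_attachment_doc(pdf_path, title):
    """Return a document body whose "data" field holds the file as Base64 text."""
    raw = Path(pdf_path).read_bytes()  # read the binary file contents
    return {
        "filename": str(pdf_path),
        "title": title,
        # b64encode returns bytes; decode to str so the dict is JSON-serializable
        "data": base64.b64encode(raw).decode("ascii"),
    }
```

The returned dictionary, serialized with `json.dumps`, could then be sent as the body of `PUT my_index/my_type/my_id?pipeline=attachment` in place of the hard-coded sample payload.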
