Indexing Arabic PDF files with Solr

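I am trying to search text documents using Solr and Tika. Everything works fine for .docx, .pptx, .csv, .xlsx, and so on, but for .pdf files it returns empty content. I can't figure out what the problem is!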

2016-11-16 LHAD

How are you indexing the files? – vinod

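I configured the ExtractingRequestHandler in the solrconfig file and then used a curl command to index the PDF file. It picks up all the metadata correctly, but the content comes back like this: "attr_filecontent": ["\n \n \n \n"] – LHAD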
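One way to narrow this down might be to ask the extract handler to return what Tika extracted without indexing it, and then check whether the content is whitespace only. The sketch below builds such a request URL and tests a content value. The host, port, core name `collection_name`, and the `/update/extract` handler path are assumptions about the setup, and both helper names are hypothetical.

```python
from urllib.parse import urlencode


def build_extract_only_url(base_url, core):
    """Build a URL asking the extract handler to return Tika's output without indexing.

    Assumes the handler is registered at /update/extract (a common default).
    """
    params = {"extractOnly": "true", "extractFormat": "text", "wt": "json"}
    return f"{base_url}/{core}/update/extract?{urlencode(params)}"


def is_blank_content(values):
    """Return True if every extracted string is empty or whitespace only."""
    return all(not v.strip() for v in values)


# Assumed local Solr instance and core name, for illustration only.
url = build_extract_only_url("http://localhost:8983/solr", "collection_name")
print(url)
print(is_blank_content(["\n \n \n \n"]))
```

The PDF itself would still be sent as the request body, for example with curl's `-F` option. If the extract-only response is also blank, the problem is likely in the extraction step rather than in the schema or field mapping.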

If you index files with post.jar, use the -Dauto option.

Example:

java -Dauto -Dc=collection_name -jar post.jar pdf_file.pdf

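With -Dauto we can index every document format that Tika supports, e.g. TXT, DOC, DOCX, PDF, XML, HTML, etc.

Add these Arabic filter classes to the field type definition: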

<fieldType name="text_general_arabic" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="arabic_stopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.ArabicNormalizationFilterFactory"/>
        <filter class="solr.ArabicStemFilterFactory"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="arabic_stopwords.txt" enablePositionIncrements="true"/>
        <filter class="solr.ArabicNormalizationFilterFactory"/>
        <filter class="solr.ArabicStemFilterFactory"/>
    </analyzer>
</fieldType>


2016-11-16 10:27:41 vinod

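I tried that, but I get the same result with PDF files. – LHAD

I forgot to mention: you need to include the Arabic filters in the field definition in your schema file. – vinod

I have included the Arabic filters in the schema file. I even get the same problem with English PDF files. – LHAD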

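Parsing PDFs correctly is hard, because a PDF can contain text, images, or both. We built a tool for easily searching the contents of any file. In our experience:

1. Parse the PDF file with PDFBox first.
2. If step 1 returns nothing, run OCR.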
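
The two steps above can be sketched as a simple fallback. The extractor functions are passed in as parameters because the actual PDFBox and OCR calls depend on your setup; `extract_with_fallback`, `parse_text`, and `run_ocr` are hypothetical names, not part of PDFBox or any OCR library.

```python
def extract_with_fallback(pdf_bytes, parse_text, run_ocr):
    """Return (text, method): try the PDF text layer first, fall back to OCR.

    parse_text and run_ocr are caller-supplied callables (hypothetical here)
    that take the PDF bytes and return a string.
    """
    text = parse_text(pdf_bytes)
    # A PDF without a text layer can yield only whitespace, so strip before testing.
    if text and text.strip():
        return text, "text"
    return run_ocr(pdf_bytes), "ocr"


# Stand-in extractors for demonstration only; real ones would wrap a PDF parser and an OCR engine.
result = extract_with_fallback(
    b"%PDF-",
    lambda data: "\n \n \n \n",      # simulates the blank output described in the question
    lambda data: "recognized text",  # simulates an OCR result
)
print(result)
```

Treating whitespace-only output as "nothing" matters here, since the content reported in the question consists only of newlines and spaces rather than an empty string.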

You can find a full description of the process on our blog: https://blog.ambar.cloud/ingest-attachment-plugin-for-elasticsearch-should-you-use-it/

Hope it helps.

P.S. Our integrated solution: https://github.com/RD17/ambar


2017-04-17 09:17:39 SochiX
