Extracting a PDF from Apache Solr

I am new to Solr indexing. I am using Solr 5.5 and indexed a PDF file simply by running:

#bin/post -c gettingstarted /home/ubuntu/pdf.pdf

I have since deleted the source PDF file. Is there any way I can extract the PDF file from Apache Solr? I can see that it is indexed when I open this URL:

http://localhost:8983/solr/gettingstarted/select?q=*.pdf

Thanks in advance.


2017-07-09 Saqib Iqbal

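If the PDF was indexed correctly, its text is by default indexed into a field named content, provided that field is declared properly in the schema. So search the content field for some keyword (or *).

Example: q=content:keyword (where keyword is a term that occurs in the PDF)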

http://localhost:8983/solr/gettingstarted/select?q=content:*
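To try the same search from a script, the query URL can be built like this. It is only a minimal sketch: the host, port and collection name are taken from the question, while the function name content_query_url and the keyword invoice are made up for illustration.

```python
from urllib.parse import urlencode

# Assumed base URL, copied from the local setup described in the question.
SOLR_SELECT = "http://localhost:8983/solr/gettingstarted/select"


def content_query_url(keyword, base=SOLR_SELECT):
    """Build a select URL that searches the content field for one keyword."""
    # urlencode percent-encodes the colon, so the URL is safe to paste or fetch.
    params = {"q": f"content:{keyword}", "wt": "json"}
    return f"{base}?{urlencode(params)}"


# "invoice" is a hypothetical keyword; use a word you know is in the PDF.
print(content_query_url("invoice"))
```

Opening the printed URL in a browser, or fetching it with any HTTP client, should list the matching documents if the content field exists.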

If the content field is not defined, add a field definition to the schema file.

Example: field declaration

<field name="content" type="text_general" indexed="true" stored="true" multiValued="true"/>

Field type definition

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
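If the collection uses a managed schema rather than a hand-edited schema.xml (an assumption, check your configset), the same content field can probably be added through Solr's Schema API instead of editing the file. The sketch below only prepares the request and does not send it; the helper name add_content_field_request is made up, and it assumes the text_general type already exists.

```python
import json
from urllib.request import Request

# Assumed Schema API endpoint for the collection named in the question.
SCHEMA_URL = "http://localhost:8983/solr/gettingstarted/schema"


def add_content_field_request(url=SCHEMA_URL):
    """Prepare a POST request carrying an add-field command for 'content'."""
    # Same attributes as the <field> declaration shown in this answer.
    field = {
        "name": "content",
        "type": "text_general",
        "indexed": True,
        "stored": True,
        "multiValued": True,
    }
    body = json.dumps({"add-field": field}).encode("utf-8")
    return Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = add_content_field_request()
# Passing req to urllib.request.urlopen would send it; this sketch only prints the body.
print(req.data.decode("utf-8"))
```

Documents that were indexed before the field existed will not gain a content value until they are posted again.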


2017-07-10 07:28:04 vinod

I indexed the PDF like this: 'bin/post -c gettingstarted /home/ubuntu/pdf.pdf'. 'http://localhost:8983/solr/gettingstarted/select?q=content' shows the same result as 'q=*.pdf'. 'http://localhost:8983/solr/gettingstarted/select?q=content:*' gives a 404 error. Any suggestions, please.

This may mean you do not have a content field. Search with '*:*' and apply the necessary 'fq' to find your document.

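What error did you get? As I mentioned earlier, and as @BinoyDalal said, the 'content' field may not be defined in the schema file. Check that. I suspect the PDF was not indexed correctly. – vinod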
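The suggestion in the comments, searching with '*:*' plus an 'fq' and then checking whether a content field comes back, could be sketched roughly as follows. The filter 'id:*pdf.pdf' is an assumption (it presumes the document id ends with the file name), the function names are made up, and the sample response is invented just to show the shape of the check.

```python
from urllib.parse import urlencode

# Assumed base URL, copied from the local setup described in the question.
SOLR_SELECT = "http://localhost:8983/solr/gettingstarted/select"


def match_all_url(fq, base=SOLR_SELECT):
    """Build a match-all query narrowed by one filter query, returning all stored fields."""
    params = {"q": "*:*", "fq": fq, "fl": "*", "wt": "json"}
    return f"{base}?{urlencode(params)}"


def field_names(response):
    """Collect the names of the fields present in the returned documents."""
    names = set()
    for doc in response.get("response", {}).get("docs", []):
        names.update(doc)  # iterating a dict yields its keys
    return sorted(names)


# Hypothetical filter: assumes the id of the indexed document ends in pdf.pdf.
print(match_all_url("id:*pdf.pdf"))

# Invented response, standing in for the parsed JSON that Solr would return.
sample = {"response": {"docs": [{"id": "/home/ubuntu/pdf.pdf", "title": ["Example"]}]}}
print("content" in field_names(sample))
```

If content is missing from the field names of your real response, that points to the schema problem discussed above.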
