Indexing PDF file content with Apache Solr

I use the Solr PHP extension to interact with Apache Solr. I am indexing data from a database, and I also want to index the content of external files (such as PDF and PPTX).

The indexing logic is as follows. Suppose schema.xml defines these fields:

<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" /> 
<field name="created" type="tlong" indexed="true" stored="true" /> 
<field name="name" type="text_general" indexed="true" stored="true"/> 
<field name="filepath" type="text_general" indexed="false" stored="true"/> 
<field name="filecontent" type="text_general" indexed="false" stored="true"/>

A given database entry may or may not have a stored file associated with it.

So, here is my indexing code:

// $post is a stdClass object holding the database content; $client is an existing SolrClient
$doc = new SolrInputDocument(); 
$doc->addField('id', $post->id); 
$doc->addField('name', $post->name); 
.... 
.... 
$res = $client->addDocument($doc); 
$client->commit();

Next, I want to add the content of the PDF file to the same Solr document as above.

Here is the cURL code:

$ch = curl_init('http://localhost:8010/solr/update/extract?');
curl_setopt($ch, CURLOPT_POST, 1);
// PHP 5.5+ deprecates the '@' upload prefix; use new CURLFile($post->filepath) there
curl_setopt($ch, CURLOPT_POSTFIELDS, array('myfile' => '@' . $post->filepath));
$result = curl_exec($ch);

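However, I think I am missing something. I have read the documentation, but I cannot figure out how to retrieve the file's content and then add it to the existing Solr document in the field filecontent.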

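Edit #1: If I set literal.id=xyz in the cURL request, it creates a new Solr document with id=xyz. I do not want to create a new Solr document. I want the PDF's content to be indexed and stored as a field in the previously created Solr document.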

$doc = new SolrInputDocument();//Solr document is created 
$doc->addField('id', 98765);//The solr document created above is assigned an id=`98765` 
.... 
.... 
$ch = curl_init('http://localhost:8010/solr/update/extract?literal.id=1&literal.name=Name&commit=true');
curl_setopt ($ch, CURLOPT_POST, 1); 
curl_setopt ($ch, CURLOPT_POSTFIELDS, array('myfile'=>'@'.$post->filepath)); 
$result= curl_exec ($ch);

I want the Solr document above (id = 98765) to have a field in which the PDF's content is indexed and stored.

But the cURL request (as above) creates another new document (with id = 1). I do not want that.

2013-07-12 xan

Solr works with Apache Tika to extract the content of rich documents and add it back to the Solr document.

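From the documentation:

You may notice that although you can search on any of the text in the sample document, you may not be able to see that text when the document is retrieved. This is simply because the "content" field generated by Tika is mapped to the Solr field called "text", which is indexed but not stored. This is done via the default mapping rule in the /update/extract handler in solrconfig.xml, and it can easily be changed or overridden.

For example, to store and see all metadata and content, the documentation passes fmap.content=attr_content. This parameter overrides the default fmap.content=text, so the content is added to the attr_content field instead.

If you want a dedicated field that keeps the file content, define it in schema.xml with the attributes you need, for example:

<field name="content" type="text_general" indexed="false" stored="true" multiValued="true"/>

The mapping can also be set as a default in solrconfig.xml itself, for example fmap.content=filecontent.

If you want everything indexed in a single document, you can use the literal prefix (for example literal.id=1&literal.name=Name) to pass the document's attributes along with the file: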

$ch = curl_init('http://localhost:8010/solr/update/extract?literal.id=1&literal.name=Name&commit=true');
curl_setopt ($ch, CURLOPT_POST, 1); 
curl_setopt ($ch, CURLOPT_POSTFIELDS, array('myfile'=>'@'.$post->filepath)); 
$result= curl_exec ($ch);
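Because Solr replaces any document that shares the same unique key, one way to end up with a single document is to skip the separate addDocument() call and send every database field as a literal.* parameter in the same extract request. The sketch below only builds such a URL. The helper name buildExtractUrl is hypothetical, and mapping Tika's content to filecontent through fmap.content assumes the schema from the question.

```php
<?php
// Sketch: build one /update/extract URL that carries the database fields
// as literal.* parameters, so Solr creates a single document for row + file.

/**
 * buildExtractUrl() is a hypothetical helper, not part of the Solr PHP extension.
 * $fields maps schema field names to values taken from the database row.
 */
function buildExtractUrl(string $baseUrl, array $fields, string $contentField): string
{
    $params = [];
    foreach ($fields as $name => $value) {
        // literal.<field>=<value> sets a field on the document being created
        $params['literal.' . $name] = $value;
    }
    // Map Tika's extracted "content" onto the schema field that should hold it
    $params['fmap.content'] = $contentField;
    $params['commit'] = 'true';

    return $baseUrl . '?' . http_build_query($params, '', '&', PHP_QUERY_RFC3986);
}

$url = buildExtractUrl(
    'http://localhost:8010/solr/update/extract',
    ['id' => 98765, 'name' => 'Name'],
    'filecontent'
);
echo $url, PHP_EOL;
```

The resulting URL would take the place of the one passed to curl_init, with the file still sent as the myfile POST field. Note that filecontent would need indexed="true" in schema.xml if its text should also be searchable.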

2013-07-15 04:35:55 Jayendra

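You did not understand my question. I have completed the indexing, and searching works too. When I execute the curl command given in the wiki, it adds the file as a "new" Solr document. 'curl "http://localhost:8983/solr/update/extract?literal.id=doc1&commit=true" -F "myfile=@tutorial.html"': this command adds a new Solr document with id=doc1, indexes the content of tutorial.html, and commits. I want to add the content of the HTML/PDF file as a field in a previously defined Solr document, so that no "new" document is created and the field is added to the existing document instead. – xan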

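Do you want to add multiple rich documents to a single Solr document? Solr does not allow multiple rich documents in a single document, but you can zip the documents together and feed the archive to Solr. Check SOLR-2332. You can also look at Solr partial updates to add documents to a multivalued field. – Jayendra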
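As a rough illustration of the partial update idea mentioned in this comment, the sketch below builds the JSON body for an atomic update that sets one field on an existing document. It assumes a Solr version with atomic updates (4.0 or later), an enabled update log, and that the document's other fields are stored. The helper name buildAtomicSetPayload is hypothetical, and the text value is a placeholder for content obtained elsewhere.

```php
<?php
// Sketch: JSON body for a Solr atomic ("partial") update that sets a single
// field on an existing document, identified by its unique key.

/**
 * buildAtomicSetPayload() is a hypothetical helper for this example.
 */
function buildAtomicSetPayload(string $id, string $field, string $value): string
{
    // {"set": value} replaces the field's current value on the matching document
    $doc = ['id' => $id, $field => ['set' => $value]];

    // Solr's JSON update format accepts an array of documents
    return json_encode([$doc], JSON_UNESCAPED_UNICODE | JSON_UNESCAPED_SLASHES);
}

$payload = buildAtomicSetPayload('98765', 'filecontent', 'Text extracted from the PDF');
echo $payload, PHP_EOL;
```

The payload would then be posted to the update handler with a Content-Type of application/json, followed by a commit.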

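No, not multiple documents. '$doc = new SolrInputDocument()' creates a new Solr document. Then I add fields ('id', 'name', 'title', etc.). I want to add the content of the PDF file to this same document only. However, when I fire the cURL request (as in the code above), it creates another new Solr document with its own fields. – xan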
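One possible approach to what this last comment describes is to ask the extract handler for the text only, using its extractOnly=true parameter (for example with extractFormat=text and wt=json), and then add that text to the SolrInputDocument as an ordinary field. The sketch below covers only the parsing step. The response layout it assumes (one key per uploaded stream holding the text, plus a sibling key ending in _metadata) should be verified against the Solr version in use, and the helper name and sample response are hypothetical.

```php
<?php
// Sketch: pull the extracted text out of an extractOnly=true JSON response.
// Assumed layout: "responseHeader", one key per stream holding the text,
// and a "<key>_metadata" sibling. Verify this against your Solr version.

/**
 * extractTextFromResponse() is a hypothetical helper for this example.
 * Returns the extracted text, or null if none can be found.
 */
function extractTextFromResponse(string $json): ?string
{
    $data = json_decode($json, true);
    if (!is_array($data)) {
        return null; // not valid JSON
    }
    foreach ($data as $key => $value) {
        $key = (string) $key;
        if ($key === 'responseHeader' || substr($key, -9) === '_metadata') {
            continue; // skip the status block and the metadata entries
        }
        if (is_string($value)) {
            return trim($value); // first plain-text entry is taken as the content
        }
    }
    return null;
}

// Hypothetical response, shaped like the assumption described above
$sample = '{"responseHeader":{"status":0},"report.pdf":" Hello PDF ","report.pdf_metadata":[]}';
$text = extractTextFromResponse($sample);
echo $text, PHP_EOL;
```

With the text in hand, $doc->addField('filecontent', $text) before $client->addDocument($doc) would keep everything in the one document built in PHP, at the cost of an extra round trip to Solr per file.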
