2011-04-17

How do I index the content of PDFs with SolrJ?

I am trying to index some PDF documents using SolrJ, as described at http://wiki.apache.org/solr/ContentStreamUpdateRequestExample. Here is my code:

import static org.apache.solr.handler.extraction.ExtractingParams.LITERALS_PREFIX;
import static org.apache.solr.handler.extraction.ExtractingParams.MAP_PREFIX;
import static org.apache.solr.handler.extraction.ExtractingParams.UNKNOWN_FIELD_PREFIX;

import java.io.File;
import java.io.IOException;
import java.util.Map.Entry;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;
import org.apache.solr.common.util.NamedList;
... 
public static void indexFilesSolrCell(String fileName) throws IOException, SolrServerException { 

    String urlString = "http://localhost:8080/solr"; 
    SolrServer server = new CommonsHttpSolrServer(urlString); 

    ContentStreamUpdateRequest up = new ContentStreamUpdateRequest("/update/extract"); 
    up.addFile(new File(fileName)); 
    String id = fileName.substring(fileName.lastIndexOf('/')+1); 
    System.out.println(id); 

    up.setParam(LITERALS_PREFIX + "id", id); 
    up.setParam(LITERALS_PREFIX + "location", fileName); // this field doesn't exist in schema.xml, so it will be created as attr_location
    up.setParam(UNKNOWN_FIELD_PREFIX, "attr_"); 
    up.setParam(MAP_PREFIX + "content", "attr_content"); 
    up.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true); 

    // server.request(up) returns Solr's response; print each entry in it
    NamedList<Object> response = server.request(up);
    for (Entry<String, Object> entry : response) {
        System.out.println(entry.getKey());
        System.out.println(entry.getValue());
    }
} 
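
As a side note on the id used above: taking the substring after the last '/' assumes that paths always use '/' as the separator. A minimal sketch of an alternative, using only java.io.File, is shown below. The class and method names (DocumentIdSketch, idFor) are made up for illustration and are not part of the original code.

```java
import java.io.File;

public class DocumentIdSketch {

    // Hypothetical helper: use the last path component as the document id.
    // File.getName() uses the platform's own separator rules.
    static String idFor(String fileName) {
        return new File(fileName).getName();
    }

    public static void main(String[] args) {
        // Path taken from the question's sample document
        System.out.println(idFor("/home/alex/Documents/lsp.pdf")); // prints "lsp.pdf"
    }
}
```

The resulting value can be passed to up.setParam(LITERALS_PREFIX + "id", ...) exactly as before; this changes only how the id is derived, not what gets extracted.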

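Unfortunately, when I query for *:* I get the list of indexed documents, but the content field is empty. How can I change the code above so that the document's content is extracted as well?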

Below is the XML fragment describing this document:

<doc>
    <arr name="attr_content">
        <str>   </str>
    </arr>
    <arr name="attr_location">
        <str>/home/alex/Documents/lsp.pdf</str>
    </arr>
    <arr name="attr_meta">
        <str>stream_size</str>
        <str>31203</str>
        <str>Content-Type</str>
        <str>application/pdf</str>
    </arr>
    <arr name="attr_stream_size">
        <str>31203</str>
    </arr>
    <arr name="content_type">
        <str>application/pdf</str>
    </arr>
    <str name="id">lsp.pdf</str>
</doc>
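
Note that in this fragment attr_content is present but holds only whitespace, which is different from the field being absent from the result. The following sketch, using only the JDK's XML parser, lists the multi-valued fields of such a doc element whose text is blank. The class and method names (BlankFieldCheck, blankArrayFields) are hypothetical, and the sample XML is a shortened copy of the fragment above.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class BlankFieldCheck {

    // Hypothetical helper: returns the names of <arr> fields whose combined
    // text content consists of whitespace only.
    static List<String> blankArrayFields(String docXml) {
        try {
            Document dom = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(docXml)));
            List<String> blank = new ArrayList<String>();
            NodeList arrays = dom.getElementsByTagName("arr");
            for (int i = 0; i < arrays.getLength(); i++) {
                Element arr = (Element) arrays.item(i);
                if (arr.getTextContent().trim().isEmpty()) {
                    blank.add(arr.getAttribute("name"));
                }
            }
            return blank;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse document XML", e);
        }
    }

    public static void main(String[] args) {
        // Shortened copy of the fragment from the question
        String docXml =
              "<doc>"
            + "<arr name=\"attr_content\"><str>   </str></arr>"
            + "<arr name=\"content_type\"><str>application/pdf</str></arr>"
            + "<str name=\"id\">lsp.pdf</str>"
            + "</doc>";
        System.out.println(blankArrayFields(docXml)); // prints "[attr_content]"
    }
}
```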

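I don't think this problem is related to an incorrect installation of Apache Tika: earlier I was getting some ServerException errors, but I have since installed the required jars in the correct path. I also tried indexing a txt file with the same class, and the attr_content field is still always empty.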
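
One way to check whether extraction itself produces any text, independently of the schema, is the extractOnly=true parameter of the extracting request handler, which asks Solr to return the extracted content instead of indexing it. The sketch below uses only the JDK and assumes that the handler is registered at /update/extract under the same base URL as in the question, and that it accepts the file as a raw POST body. The class and method names (ExtractOnlyCheck, extractOnlyUrl, extract) are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class ExtractOnlyCheck {

    // Builds the extract-only URL (assumed handler path: /update/extract).
    static String extractOnlyUrl(String solrBaseUrl) {
        String base = solrBaseUrl.endsWith("/")
                ? solrBaseUrl.substring(0, solrBaseUrl.length() - 1)
                : solrBaseUrl;
        return base + "/update/extract?extractOnly=true&extractFormat=text";
    }

    // Posts the raw file bytes and returns the response body as a string.
    static String extract(String solrBaseUrl, String fileName, String contentType)
            throws Exception {
        HttpURLConnection conn =
                (HttpURLConnection) new URL(extractOnlyUrl(solrBaseUrl)).openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", contentType);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(Files.readAllBytes(Paths.get(fileName)));
        }
        try (InputStream in = conn.getInputStream()) {
            ByteArrayOutputStream body = new ByteArrayOutputStream();
            byte[] chunk = new byte[8192];
            int n;
            while ((n = in.read(chunk)) != -1) {
                body.write(chunk, 0, n);
            }
            return new String(body.toByteArray(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws Exception {
        String base = "http://localhost:8080/solr"; // base URL from the question
        if (args.length == 0) {
            // No file given: just show the URL that would be called
            System.out.println(extractOnlyUrl(base));
        } else {
            System.out.println(extract(base, args[0], "application/pdf"));
        }
    }
}
```

If the response contains the document's text, extraction is working and the field mapping or schema is the more likely place to look; if it is empty too, the problem is probably on the extraction side.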

Answer

In your schema.xml, have you set stored="true" on the content field? Here is an example from my schema.xml, which I use to store the content of PDFs and other binary files:

<field name="text" type="textgen" indexed="true" stored="true" required="false" multiValued="true"/>

What do you think?

Héctor