2012-10-03

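Solr Tika XPath exception

I am trying to index HTML documents using Apache Solr and TikaEntityProcessor, the idea being that I can use XPath to select specific elements from the HTML.

I am following the advanced example shown at the bottom of the TikaEntityProcessor Solr Wiki page.

When I run the data import command (full-import), I get the following error messages: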

03-Oct-2012 16:39:48 org.apache.solr.handler.dataimport.DataImporter doFullImport 
INFO: Starting Full Import 
03-Oct-2012 16:39:48 org.apache.solr.core.SolrCore execute 
INFO: [htmlTest] webapp=/apache-solr-3.6.1 path=/dataimport params={command=full-import} status=0 QTime=31 
03-Oct-2012 16:39:48 org.apache.solr.handler.dataimport.SimplePropertiesWriter readIndexerProperties 
INFO: Read dataimport.properties 
03-Oct-2012 16:39:48 org.apache.solr.update.DirectUpdateHandler2 deleteAll 
INFO: [htmlTest] REMOVING ALL DOCUMENTS FROM INDEX 
03-Oct-2012 16:39:48 org.apache.solr.core.SolrDeletionPolicy onInit 
INFO: SolrDeletionPolicy.onInit: commits:num=1 
    commit{dir=C:\Program Files\Apache Tomcat\conf\apache-solr-3.5.0\htmlTest\data\index,segFN=segments_1e,version=1349187077567,generation=50,filenames=[_u.fnm, _u.nrm, _u.tis, _u.prx, _u.frq, _u.fdx, _u.fdt, _u.tii, segments_1e] 
03-Oct-2012 16:39:48 org.apache.solr.core.SolrDeletionPolicy updateCommits 
INFO: newest commit = 1349187077567 
03-Oct-2012 16:39:48 org.apache.solr.handler.dataimport.SqlEntityProcessor initQuery 
SEVERE: The query failed 'null' 
java.lang.NullPointerException 
    at java.io.File.<init>(File.java:222) 
    at org.apache.solr.handler.dataimport.FileDataSource.getFile(FileDataSource.java:96) 
    at org.apache.solr.handler.dataimport.BinFileDataSource.getData(BinFileDataSource.java:53) 
    at org.apache.solr.handler.dataimport.BinFileDataSource.getData(BinFileDataSource.java:44) 
    at org.apache.solr.handler.dataimport.SqlEntityProcessor.initQuery(SqlEntityProcessor.java:59) 
    at org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:73) 
    at org.apache.solr.handler.dataimport.EntityProcessorWrapper.pullRow(EntityProcessorWrapper.java:330) 
    at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:296) 
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:683) 
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:709) 
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:619) 
    at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:327) 
    at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:225) 
    at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:375) 
    at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:445) 
    at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:426) 
03-Oct-2012 16:39:48 org.apache.solr.common.SolrException log 
SEVERE: Exception while processing: tika-test document : SolrInputDocument[{text=text(1.0)={<html> 

<meta name="Content-Encoding" content="ISO-8859-1"> 
<meta name="Content-Type" content="text/html"> 
<title></title> 

<body> 
    <h1>This is my first heading</h1> 


     This is some content 


    <h1>This is my second heading</h1> 


     This is some more content 


</body></html>}}]:org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.NullPointerException 
    at org.apache.solr.handler.dataimport.SqlEntityProcessor.initQuery(SqlEntityProcessor.java:65) 
    at org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:73) 
    at org.apache.solr.handler.dataimport.EntityProcessorWrapper.pullRow(EntityProcessorWrapper.java:330) 
    at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:296) 
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:683) 
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:709) 
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:619) 
    at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:327) 
    at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:225) 
    at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:375) 
    at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:445) 
    at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:426) 
Caused by: java.lang.NullPointerException 
    at java.io.File.<init>(File.java:222) 
    at org.apache.solr.handler.dataimport.FileDataSource.getFile(FileDataSource.java:96) 
    at org.apache.solr.handler.dataimport.BinFileDataSource.getData(BinFileDataSource.java:53) 
    at org.apache.solr.handler.dataimport.BinFileDataSource.getData(BinFileDataSource.java:44) 
    at org.apache.solr.handler.dataimport.SqlEntityProcessor.initQuery(SqlEntityProcessor.java:59) 
    ... 11 more 

03-Oct-2012 16:39:48 org.apache.solr.update.processor.LogUpdateProcessor finish 
INFO: {deleteByQuery=*:*} 0 31 
03-Oct-2012 16:39:48 org.apache.solr.common.SolrException log 
SEVERE: Full Import failed:java.lang.RuntimeException: java.lang.RuntimeException: org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.NullPointerException 
    at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:264) 
    at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:375) 
    at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:445) 
    at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:426) 
Caused by: java.lang.RuntimeException: org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.NullPointerException 
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:621) 
    at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:327) 
    at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:225) 
    ... 3 more 
Caused by: org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.NullPointerException 
    at org.apache.solr.handler.dataimport.SqlEntityProcessor.initQuery(SqlEntityProcessor.java:65) 
    at org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:73) 
    at org.apache.solr.handler.dataimport.EntityProcessorWrapper.pullRow(EntityProcessorWrapper.java:330) 
    at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:296) 
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:683) 
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:709) 
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:619) 
    ... 5 more 
Caused by: java.lang.NullPointerException 
    at java.io.File.<init>(File.java:222) 
    at org.apache.solr.handler.dataimport.FileDataSource.getFile(FileDataSource.java:96) 
    at org.apache.solr.handler.dataimport.BinFileDataSource.getData(BinFileDataSource.java:53) 
    at org.apache.solr.handler.dataimport.BinFileDataSource.getData(BinFileDataSource.java:44) 
    at org.apache.solr.handler.dataimport.SqlEntityProcessor.initQuery(SqlEntityProcessor.java:59) 
    ... 11 more 

03-Oct-2012 16:39:48 org.apache.solr.update.DirectUpdateHandler2 rollback 
INFO: start rollback 
03-Oct-2012 16:39:48 org.apache.solr.update.DirectUpdateHandler2 rollback 
INFO: end_rollback 

My data import configuration is:

<dataConfig> 
    <dataSource type="BinFileDataSource"/> 
    <dataSource type="FieldReaderDataSource" name="fld"/> 
    <document> 
     <entity name="tika-test" processor="TikaEntityProcessor" 
       url="C:/Program Files/Apache Tomcat/conf/apache-solr-3.5.0/htmlTest/data/html_basic.html" format="html"> 
       <field column="text"/> 
       <entity type="XPathEntityProcessor" forEach="/html" dataField="text"> 
        <field xpath="//h1" column="date" /> 
       </entity> 
     </entity> 
    </document> 
</dataConfig> 

And the HTML document that Solr is indexing is:

<html> 
<head> 
</head> 
<body> 
    <h1>This is my first heading</h1> 
    <div> 
     This is some content 
    </div> 
    <h1>This is my second heading</h1> 
    <div> 
     This is some more content 
    </div> 
</body> 
</html> 

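Just to add some more information: it appears that XPathEntityProcessor is defaulting to SqlEntityProcessor as its source. For some reason, I don't think it can bind to TikaEntityProcessor (if that is how it is supposed to work).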

Answer

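You seem to be missing a reference to the correct data source. It has to be an attribute called dataSource on the entity, and its value must match the name attribute on the data source definition itself. You have defined the name fld but never reference it.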

I would recommend doing this explicitly for both data sources and their corresponding entities to avoid confusion.
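
Here is a rough sketch of what that could look like for the configuration in the question. It is untested and is not a confirmed fix. The data source name bin and the inner entity name headings are made up for illustration; fld is the name the question already defines. Two changes go beyond the advice above and are assumptions on my part: the inner entity uses processor= rather than type= (the stack trace shows SqlEntityProcessor handling that entity, which suggests type= was not picked up), and dataField is qualified with the outer entity's name, which is the form the FieldReaderDataSource examples use.

```xml
<dataConfig>
    <!-- "bin" is a hypothetical name; any unique name works as long as the entity references it -->
    <dataSource type="BinFileDataSource" name="bin"/>
    <dataSource type="FieldReaderDataSource" name="fld"/>
    <document>
        <!-- Outer entity: reads the file through the binary data source, referenced by name -->
        <entity name="tika-test" processor="TikaEntityProcessor" dataSource="bin"
                url="C:/Program Files/Apache Tomcat/conf/apache-solr-3.5.0/htmlTest/data/html_basic.html" format="html">
            <field column="text"/>
            <!-- Inner entity: reads the "text" field produced above through "fld" -->
            <!-- Assumptions: processor= instead of type=, and the entity-qualified dataField -->
            <entity name="headings" processor="XPathEntityProcessor" dataSource="fld"
                    forEach="/html" dataField="tika-test.text">
                <field xpath="//h1" column="date"/>
            </entity>
        </entity>
    </document>
</dataConfig>
```

If the XPath step then fails while parsing the extracted text, the output format of the outer entity may be worth checking: the log in the question shows unclosed meta tags in the text field, which would not be well-formed XML.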