Nutch 1.13 index-links configuration

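I am trying to extract the webgraph structure during a crawl run with Apache Nutch 1.13 and Solr 4.10.4.

According to the documentation, the index-links plugin adds outlinks and inlinks to the collection.

I have changed my collection in Solr accordingly (by adding the respective fields to schema.xml and restarting Solr) and adjusted the solr-mapping file, but to no avail. The resulting error is shown below.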

bin/nutch index -D solr.server.url=http://localhost:8983/solr/collection1 crawl/crawldb/ -linkdb crawl/linkdb/ crawl/segments/* -filter -normalize -deleteGone 
Segment dir is complete: crawl/segments/20170503114357. 
Indexer: starting at 2017-05-03 11:47:02 
Indexer: deleting gone documents: true 
Indexer: URL filtering: true 
Indexer: URL normalizing: true 
Active IndexWriters : 
SOLRIndexWriter 
    solr.server.url : URL of the SOLR instance 
    solr.zookeeper.hosts : URL of the Zookeeper quorum 
    solr.commit.size : buffer size when sending to SOLR (default 1000) 
    solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml) 
    solr.auth : use authentication (default false) 
    solr.auth.username : username for authentication 
    solr.auth.password : password for authentication 


Indexing 1/1 documents 
Deleting 0 documents 
Indexing 1/1 documents 
Deleting 0 documents 
Indexer: java.io.IOException: Job failed! 
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:865) 
    at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:147) 
    at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:230) 
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70) 
    at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:239)
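Note that the trace above only reports that the job failed. In a local (non-distributed) run, the underlying exception, for example a Solr complaint about an unknown or single-valued field, is normally written to Nutch's own log file. The following is a minimal sketch for pulling it out, assuming the default log location `logs/hadoop.log` relative to the directory from which `bin/nutch` is run; adjust `LOG_FILE` if logging has been reconfigured.

```shell
# Assumption: default Nutch log location, relative to the Nutch runtime directory.
LOG_FILE="logs/hadoop.log"

if [ -f "$LOG_FILE" ]; then
  # Show the most recent lines mentioning an exception, with some context.
  grep -n -i -A 5 "exception" "$LOG_FILE" | tail -n 40
else
  echo "No log file found at $LOG_FILE" >&2
fi
```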

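Interestingly, my own research led me to assume that this is actually non-trivial, since the resulting parse (without the plugin) looks like this: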

bin/nutch indexchecker http://www.my-domain.com/ 
fetching: http://www.my-domain.com/ 
robots.txt whitelist not configured. 
parsing: http://www.my-domain.com/ 
contentType: application/xhtml+xml 
tstamp : Wed May 03 11:40:57 CEST 2017 
digest : e549a51553a0fb3385926c76c52e0d79 
host : http://www.my-domain.com/ 
id : http://www.my-domain.com/ 
title : Startseite 
url : http://www.my-domain.com/ 
content : bla bla bla bla.

However, as soon as I enable index-links, the output suddenly looks like this:

bin/nutch indexchecker http://www.my-domain.com/ 
fetching: http://www.my-domain.com/ 
robots.txt whitelist not configured. 
parsing: http://www.my-domain.com/ 
contentType: application/xhtml+xml 
tstamp : Wed May 03 11:40:57 CEST 2017 
outlinks : http://www.my-domain.com/2-uncategorised/331-links-administratives 
outlinks : http://www.my-domain.com/2-uncategorised/332-links-extern 
outlinks : http://www.my-domain.com/impressum.html 
id : http://www.my-domain.com/ 
title : Startseite 
url : http://www.my-domain.com/ 
content : bla bla bla

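Obviously, this cannot be put into a single-valued field, but I just want a list of all the outlinks (I have read that inlinks do not work, but I do not need them anyway).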


2017-05-03 dennlinger

You have to specify the fields in solrindex-mapping.xml like this:

<field dest="inlinks" source="inlinks"/> 
<field dest="outlinks" source="outlinks"/>

After that, make sure to unload and reload the collection, including a complete restart of Solr.
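As a rough sketch of the reload step: Solr's CoreAdmin API offers a `RELOAD` action that makes a core pick up a changed schema.xml. The base URL and the core name `collection1` below are taken from the `solr.server.url` in the question and are assumptions about your setup. This does not replace the full restart recommended above; it is only one way to trigger the reload.

```shell
# Assumptions: Solr runs on localhost:8983 and the core is named "collection1",
# as in the solr.server.url used in the question.
SOLR_BASE="http://localhost:8983/solr"
CORE="collection1"
RELOAD_URL="${SOLR_BASE}/admin/cores?action=RELOAD&core=${CORE}"

# Ask Solr to reload the core so that schema.xml changes take effect.
curl -sS --max-time 10 "$RELOAD_URL" \
  || echo "Could not reach Solr at $SOLR_BASE" >&2
```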

You did not say exactly which fields you defined in schema.xml, but the following works for me:

<!-- fields for index-links plugin --> 
<field name="inlinks" type="url" stored="true" indexed="false" multiValued="true"/> 
<field name="outlinks" type="url" stored="true" indexed="false" multiValued="true"/>
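To check that the multi-valued fields actually arrive in the index after a successful run, you can ask Solr for a few documents and look at the stored `outlinks` values. This is only a sketch, again assuming the `collection1` core on localhost from the question. Because the fields are declared with `indexed="false"`, they cannot be searched on, so the query matches all documents and only restricts the returned fields.

```shell
# Assumptions: same Solr instance and core name as in the question.
SELECT_URL="http://localhost:8983/solr/collection1/select"

# Fetch a handful of documents and return only the id and outlinks fields.
curl -sS --max-time 10 -G "$SELECT_URL" \
  --data-urlencode "q=*:*" \
  --data-urlencode "fl=id,outlinks" \
  --data-urlencode "wt=json" \
  --data-urlencode "rows=5" \
  || echo "Could not reach Solr at $SELECT_URL" >&2
```

If the mapping and schema are in place, each returned document should carry `outlinks` as a JSON array rather than a single string.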

Best regards, and good luck!


2017-05-03 11:25:50

Ah, thanks for the tip! I did not think it was actually necessary to reload the collection... silly me. – dennlinger
