2012-01-09 70 views

Nutch: unable to successfully parse content

I am trying to crawl with Nutch 1.4, but I am running into a parse error. This is the log file:

2012-01-09 09:12:02,696 INFO parse.ParseSegment - ParseSegment: starting at   2012-01-09 09:12:02 
2012-01-09 09:12:02,697 INFO parse.ParseSegment - ParseSegment: segment: crawl/segments/20120109091153 
2012-01-09 09:12:03,416 WARN parse.ParseUtil - Unable to successfully parse content http://sujitpal.blogspot.com/ of type application/xhtml+xml 
2012-01-09 09:12:03,417 INFO parse.ParseSegment - Parsing: http://sujitpal.blogspot.com/ 
2012-01-09 09:12:03,418 WARN parse.ParseSegment - Error parsing: http://sujitpal.blogspot.com/: failed(2,200): org.apache.nutch.parse.ParseException: Unable to successfully parse content 
2012-01-09 09:12:03,419 INFO crawl.SignatureFactory - Using Signature impl: org.apache.nutch.crawl.MD5Signature 

Checking conf/nutch-site.xml, I can see that html|text|xhtml|xml are included in the plugin.includes property:

<property> 
<name>plugin.includes</name> 
<value>myplugins|protocol-httpclient|query-(basic|site|url)|summary-basic|urlfilter-regex|parse-(xml|xhtml|html|tika|text|js)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)|query-(basic|site|url)|response-(json|xml)
</value> 
<description>Regular expression naming plugin directory names to 
include. Any plugin not matching this expression is excluded. 
In any case you need at least include the nutch-extensionpoints plugin. By 
default Nutch includes crawling just HTML and plain text via HTTP, 
and basic indexing and search plugins. In order to use HTTPS please enable 
protocol-httpclient, but be aware of possible intermittent problems with the 
underlying commons-httpclient library. 
</description> 
</property> 
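
Since the description says plugin.includes is a regular expression over plugin directory names, one quick way to see which parsers that value actually enables is to test names against it. The following is only a minimal sketch: it assumes the whole directory name must match the expression, it joins the value onto one line (treating the wrapping in the post as a display artifact), and the list of candidate names is hypothetical rather than a listing of a real Nutch install.

```python
import re

# plugin.includes value from the question, joined onto one line
# (assumption: the line wraps in the post are a display artifact).
PLUGIN_INCLUDES = (
    "myplugins|protocol-httpclient|query-(basic|site|url)|summary-basic"
    "|urlfilter-regex|parse-(xml|xhtml|html|tika|text|js)"
    "|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)"
    "|query-(basic|site|url)|response-(json|xml)"
)


def is_included(plugin_id: str, pattern: str = PLUGIN_INCLUDES) -> bool:
    """Return True if the whole plugin directory name matches the pattern."""
    return re.fullmatch(pattern, plugin_id) is not None


# Hypothetical directory names to probe; not taken from a real plugins folder.
candidates = [
    "parse-html",
    "parse-tika",
    "parse-pdf",
    "protocol-http",
    "protocol-httpclient",
]

for name in candidates:
    status = "included" if is_included(name) else "excluded"
    print(f"{name}: {status}")
```

A name that matches the expression is only eligible to load; whether a plugin directory with that name exists in the installation is a separate question this sketch does not answer.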

Why can't it parse xhtml/xml, or even text/xml?

Answer


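Which plugins have you configured? If you are using tika, then tika maps the MIME type (such as xhtml/xml) to a parser. If there is no entry for it in the configuration file, nothing happens.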
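
The "no entry, nothing happens" behaviour can be pictured as a plain lookup with no fallback. This is a rough sketch only: it assumes the MIME-to-parser mapping lives in a file shaped like Nutch's conf/parse-plugins.xml, and the sample entries below are made up for illustration, not copied from the asker's setup or from the shipped defaults.

```python
import xml.etree.ElementTree as ET

# Illustrative mapping in the shape of conf/parse-plugins.xml.
# These entries are assumptions for the example, not the asker's actual file.
SAMPLE = """
<parse-plugins>
  <mimeType name="text/html">
    <plugin id="parse-html" />
  </mimeType>
  <mimeType name="application/xhtml+xml">
    <plugin id="parse-html" />
  </mimeType>
</parse-plugins>
"""


def parsers_for(mime_type: str, xml_text: str = SAMPLE) -> list:
    """Return the plugin ids mapped to mime_type, or [] if there is no entry."""
    root = ET.fromstring(xml_text)
    for node in root.findall("mimeType"):
        if node.get("name") == mime_type:
            return [plugin.get("id") for plugin in node.findall("plugin")]
    return []  # no entry for this MIME type, so no parser is selected


print(parsers_for("application/xhtml+xml"))  # ['parse-html']
print(parsers_for("text/xml"))               # []
```

In this toy mapping text/xml has no entry, so the lookup comes back empty, which is the situation the answer describes.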

You can disable tika and use only the parse-html plugin.

I tested your site with our default plugin configuration:

protocol-http|urlfilter-regex|parse-(html)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)

and your page was parsed:

Parsed (32ms):http://sujitpal.blogspot.com/ 

素不相識 JPEE