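How to create a Nutch plugin that returns the raw HTML to the parser
I am trying to create a plugin for Nutch. I am using Nutch 1.7 and Solr. I have worked through a lot of different tutorials. I want to implement a plugin that returns the raw HTML data. I used the standard Nutch wiki and the following tutorial: http://sujitpal.blogspot.nl/2009/07/nutch-custom-plugin-to-parse-and-add.html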
I created two files, getDivinfohtml.java and getDivinfo.java.
getDivinfohtml.java needs to read the content and return the complete source code, or at least part of it.
package org.apache.nutch.indexer;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.metadata.Metadata;
import org.apache.nutch.parse.HTMLMetaTags;
import org.apache.nutch.parse.HtmlParseFilter;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.ParseResult;
import org.apache.nutch.protocol.Content;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.w3c.dom.DocumentFragment;

public class getDivInfohtml implements HtmlParseFilter
{
    private static final Log LOG = LogFactory.getLog(getDivInfohtml.class);
    private Configuration conf;
    public static final String TAG_KEY = "source";

    // Logger logger = Logger.getLogger("mylog");
    // FileHandler fh;
    // FileSystem fs = FileSystem.get(conf);
    // Path file = new Path(segment, Content.DIR_NAME + "/part-00000/data");
    // SequenceFile.Reader reader = new SequenceFile.Reader(fs, file, conf);
    // Text key = new Text();
    // Content content = new Content();
    // fh = new FileHandler("/root/JulienKulkerNutch/mylogfile.log");
    // logger.addHandler(fh);
    // SimpleFormatter formatter = new SimpleFormatter();
    // fh.setFormatter(formatter);

    public ParseResult filter(Content content, ParseResult parseResult, HTMLMetaTags metaTags, DocumentFragment doc)
    {
        try
        {
            LOG.info("Parsing Url:" + content.getUrl());
            // Note: substring throws if the page has no "<!DOCTYPE html" marker (indexOf returns -1).
            LOG.info("Julien: " + content.toString().substring(content.toString().indexOf("<!DOCTYPE html")));
            Parse parse = parseResult.get(content.getUrl());
            Metadata metadata = parse.getData().getParseMeta();
            String fullContent = metadata.get("fullcontent");
            Document document = Jsoup.parse(fullContent);
            // Select the job description div and keep only its text.
            Element contentwrapper = document.select("div#jobBodyContent").first();
            String source = contentwrapper.text();
            // Stored under the literal key "SOURCE", which differs from TAG_KEY ("source").
            metadata.add("SOURCE", source);
            return parseResult;
        }
        catch (Exception e)
        {
            LOG.info(e);
        }
        return parseResult;
    }

    public Configuration getConf()
    {
        return conf;
    }

    public void setConf(Configuration conf)
    {
        this.conf = conf;
    }
}
It now reads the complete content and then extracts the text of jobBodyContent.
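For reference, the doctype lookup used in the log statement could be guarded so that a page without a doctype does not abort the filter. The sketch below is only an illustration: `RawHtml` and `fromDoctype` are hypothetical names that are not part of Nutch, and it assumes the page bytes are UTF-8.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Nutch: decodes raw page bytes and drops
// everything before the doctype, without failing when no doctype is present.
final class RawHtml {
    private static final String DOCTYPE = "<!DOCTYPE html";

    private RawHtml() {}

    static String fromDoctype(byte[] rawBytes) {
        // Assumption: the page is UTF-8; a real plugin would use the detected encoding.
        String text = new String(rawBytes, StandardCharsets.UTF_8);
        int start = text.indexOf(DOCTYPE);
        // indexOf returns -1 when the marker is missing; substring(-1) would throw.
        return start >= 0 ? text.substring(start) : text;
    }

    public static void main(String[] args) {
        byte[] page = "some header text\n<!DOCTYPE html><html></html>"
                .getBytes(StandardCharsets.UTF_8);
        System.out.println(fromDoctype(page));
    }
}
```

Inside the filter, the bytes could come from content.getContent(), which returns the fetched page as a byte array.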
Next we need to put the data into the field.
getDivinfo (parser)
public NutchDocument filter(NutchDocument doc, Parse parse, Text url, CrawlDatum datum, Inlinks inlinks)
{
    // LOG.info("Julien is sukkel");
    try
    {
        fh = new FileHandler("/root/JulienKulkerNutch/mylogfile2.log");
        SimpleFormatter formatter = new SimpleFormatter();
        fh.setFormatter(formatter);
        logger.info("Julien is sukkel");
        Metadata metadata = parse.getData().getParseMeta();
        logger.info("julien is gek:");
        // Read the value the parse filter stored under "SOURCE".
        String fullContent = metadata.get("SOURCE");
        logger.info("Output:" + metadata);
        logger.info(fullContent);
        // javac reports the error on this line.
        String fullSource = parse.getData().getParseMeta().getValues(getDivInfohtml.TAG_KEY);
        logger.info(fullSource);
        doc.add("divcontent", fullContent);
    }
    catch (Exception e)
    {
        // LOG.info(e);
    }
    return doc;
}
The error is in the getDivinfo parser, on this line: String fullSource = parse.getData().getParseMeta().getValues(getDivInfohtml.TAG_KEY);
[javac] /root/JulienKulkerNutch/apache-nutch-1.8/src/plugin/myDivSelector/src/java/org/apache/nutch/indexer/getDivInfo.java:58: error: cannot find symbol
[javac] String fullSource = parse.getData().getParseMeta().getValues(getDivInfohtml.TAG_KEY);
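Separately from the reported error, the failing line assigns the result of getValues to a String. In Nutch's Metadata class, getValues returns a String[] while get returns a single String, so this may be worth checking too. The sketch below only illustrates that difference: `TinyMetadata` is a hypothetical stand-in written for this post, not the Nutch class, and it is not claimed to be the cause of the "cannot find symbol" message.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a multi-valued metadata store; NOT the Nutch class.
final class TinyMetadata {
    private final Map<String, List<String>> values = new HashMap<>();

    // Appends a value under the given name (names are case-sensitive here).
    void add(String name, String value) {
        values.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    // First value stored for the name, or null when the name is absent.
    String get(String name) {
        List<String> list = values.get(name);
        return list == null ? null : list.get(0);
    }

    // All values stored for the name; an empty array when the name is absent.
    String[] getValues(String name) {
        List<String> list = values.get(name);
        return list == null ? new String[0] : list.toArray(new String[0]);
    }

    public static void main(String[] args) {
        TinyMetadata meta = new TinyMetadata();
        meta.add("SOURCE", "job text");
        String single = meta.get("SOURCE");      // one value
        String[] all = meta.getValues("SOURCE"); // every value
        // String wrong = meta.getValues("SOURCE"); // would not compile: String[] is not a String
        System.out.println(single + " / " + all.length + " / " + meta.get("source"));
    }
}
```

In this stand-in, looking up "source" after storing under "SOURCE" finds nothing, which mirrors the mismatch between TAG_KEY and the key used in metadata.add in the code above.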