2014-03-25

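How to create a Nutch plugin that returns raw HTML to the parser

I am trying to create a plugin for Nutch. I am using Nutch 1.7 and Solr, and I have worked through a number of tutorials. I want to implement a plugin that returns the raw HTML data. I am following the standard Nutch wiki and this tutorial: http://sujitpal.blogspot.nl/2009/07/nutch-custom-plugin-to-parse-and-add.html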

I created two files, getDivinfohtml.java and getDivinfo.java.

getDivinfohtml.java needs to read the content and return the complete source code, or at least part of it.

package org.apache.nutch.indexer;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.metadata.Metadata;
import org.apache.nutch.parse.HTMLMetaTags;
import org.apache.nutch.parse.HtmlParseFilter;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.ParseResult;
import org.apache.nutch.protocol.Content;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.w3c.dom.DocumentFragment;

public class getDivInfohtml implements HtmlParseFilter
{
    private static final Log LOG = LogFactory.getLog(getDivInfohtml.class);
    public static final String TAG_KEY = "source";

    private Configuration conf;

    public ParseResult filter(Content content, ParseResult parseResult, HTMLMetaTags metaTags, DocumentFragment doc)
    {
        try
        {
            LOG.info("Parsing Url:" + content.getUrl());
            // Log everything from the doctype onwards.
            LOG.info("Julien: " + content.toString().substring(content.toString().indexOf("<!DOCTYPE html")));

            Parse parse = parseResult.get(content.getUrl());
            Metadata metadata = parse.getData().getParseMeta();
            // Expects the full page to be present in the parse metadata under "fullcontent".
            String fullContent = metadata.get("fullcontent");

            // Extract the text of <div id="jobBodyContent"> and store it in the parse metadata.
            Document document = Jsoup.parse(fullContent);
            Element contentwrapper = document.select("div#jobBodyContent").first();
            String source = contentwrapper.text();
            metadata.add("SOURCE", source);
        }
        catch (Exception e)
        {
            LOG.info(e);
        }

        return parseResult;
    }

    public Configuration getConf()
    {
        return conf;
    }

    public void setConf(Configuration conf)
    {
        this.conf = conf;
    }
}

It now reads the complete content and then extracts the text of jobBodyContent.
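Since the goal is the raw HTML, one option is to decode the fetched bytes directly instead of slicing `content.toString()`. In Nutch 1.x, `Content.getContent()` returns the page as a `byte[]`. The sketch below uses only the JDK; the class `RawHtml` and its methods are hypothetical helpers written for this example, not part of Nutch, and the UTF-8 fallback is an assumption.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper (not part of Nutch): turns fetched bytes into an HTML string.
final class RawHtml {

    private RawHtml() {
    }

    // Decodes raw bytes with the given charset name; falls back to UTF-8
    // (an assumption) when the name is missing, illegal, or unsupported.
    static String decode(byte[] raw, String charsetName) {
        Charset charset = StandardCharsets.UTF_8;
        try {
            if (charsetName != null && Charset.isSupported(charsetName)) {
                charset = Charset.forName(charsetName);
            }
        } catch (IllegalArgumentException e) {
            // Illegal charset name: keep the UTF-8 fallback.
        }
        return new String(raw, charset);
    }

    // Returns the text starting at the first "<!DOCTYPE" (matched case-insensitively),
    // or the whole text when no doctype is present. This avoids the exception that
    // substring(indexOf(...)) throws when indexOf returns -1.
    static String fromDoctype(String text) {
        String marker = "<!DOCTYPE";
        for (int i = 0; i + marker.length() <= text.length(); i++) {
            if (text.regionMatches(true, i, marker, 0, marker.length())) {
                return text.substring(i);
            }
        }
        return text;
    }
}
```

Inside the filter this could be called as `RawHtml.decode(content.getContent(), charsetName)`, where `charsetName` would have to come from the response headers or a charset detector; that part is not shown here.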

Next we need to put the data into a field.

getDivinfo (parser)

// Note: fh and logger are not declared in this excerpt.
public NutchDocument filter(NutchDocument doc, Parse parse, Text url, CrawlDatum datum, Inlinks inlinks)
{
    try
    {
        fh = new FileHandler("/root/JulienKulkerNutch/mylogfile2.log");
        SimpleFormatter formatter = new SimpleFormatter();
        fh.setFormatter(formatter);
        logger.info("Julien is sukkel");

        // Read the value stored by the parse filter under "SOURCE".
        Metadata metadata = parse.getData().getParseMeta();
        logger.info("julien is gek:");
        String fullContent = metadata.get("SOURCE");
        logger.info("Output:" + metadata);
        logger.info(fullContent);

        // javac reports the error on the next line.
        String fullSource = parse.getData().getParseMeta().getValues(getDivInfohtml.TAG_KEY);
        logger.info(fullSource);

        doc.add("divcontent", fullContent);
    }
    catch (Exception e)
    {
        //LOG.info(e);
    }

    return doc;
}

The error is in the getDivinfo parser, on this line: String fullSource = parse.getData().getParseMeta().getValues(getDivInfohtml.TAG_KEY);

[javac] /root/JulienKulkerNutch/apache-nutch-1.8/src/plugin/myDivSelector/src/java/org/apache/nutch/indexer/getDivInfo.java:58: error: cannot find symbol
[javac] String fullSource = parse.getData().getParseMeta().getValues(getDivInfohtml.TAG_KEY);
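Apart from the missing symbol, two details of that line may be worth checking. In Nutch's `Metadata` class, `get(name)` returns a single `String` while `getValues(name)` returns a `String[]`, so the result of `getValues` cannot be assigned to a `String`. Also, the parse filter stores the text under `"SOURCE"` while `TAG_KEY` is `"source"`; assuming names are matched case-sensitively, as with a plain hash map, those are two different entries. The sketch below illustrates both points with a stand-in class, `MiniMetadata`, written only for this example; it is not the Nutch class.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Stand-in for a multi-valued metadata container (hypothetical, not Nutch's class).
final class MiniMetadata {

    private final Map<String, String[]> values = new HashMap<>();

    // Appends a value under the given name; names are case-sensitive.
    void add(String name, String value) {
        String[] old = values.get(name);
        if (old == null) {
            values.put(name, new String[] {value});
        } else {
            String[] grown = Arrays.copyOf(old, old.length + 1);
            grown[old.length] = value;
            values.put(name, grown);
        }
    }

    // Returns the first value stored for the name, or null when absent.
    String get(String name) {
        String[] found = values.get(name);
        return found == null ? null : found[0];
    }

    // Returns a copy of all values stored for the name; an empty array when absent.
    String[] getValues(String name) {
        String[] found = values.get(name);
        return found == null ? new String[0] : found.clone();
    }
}
```

With this stand-in, `String[] all = meta.getValues("SOURCE");` compiles, whereas assigning the same call to a `String` does not, and `meta.get("source")` returns `null` after `meta.add("SOURCE", ...)`.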

Answer


You may need to implement HTMLParser. In your getFields implementation:

private static final Collection<WebPage.Field> FIELDS = new HashSet<WebPage.Field>();

static {
    FIELDS.add(WebPage.Field.CONTENT);
    FIELDS.add(WebPage.Field.OUTLINKS);
}

public Collection<WebPage.Field> getFields() {
    return FIELDS;
}
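The field set above could also be built once and exposed read-only, so callers cannot modify the shared collection. The sketch below is self-contained and uses a stand-in enum `Field` and a class `FieldAwareFilter`, both hypothetical and defined here only so the example compiles without Nutch; with the real API the element type would be `WebPage.Field`, as in the answer.

```java
import java.util.Collection;
import java.util.Collections;
import java.util.EnumSet;

// Hypothetical stand-in for the field enum used in the answer.
enum Field { CONTENT, OUTLINKS, TITLE }

// Hypothetical filter-like class that declares which fields it needs.
final class FieldAwareFilter {

    // Built once; wrapped so callers cannot modify the shared set.
    private static final Collection<Field> FIELDS =
        Collections.unmodifiableSet(EnumSet.of(Field.CONTENT, Field.OUTLINKS));

    Collection<Field> getFields() {
        return FIELDS;
    }
}
```

Using `EnumSet` here is a design choice for the sketch, not something the answer prescribes; a `HashSet` as in the original snippet behaves the same for lookups.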