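How do I extract the value of a specific div from HTML while crawling with Apache Nutch?

I am crawling with Nutch 2.2, and the data I retrieve is the meta tags. How can I extract the value of a specific div from the HTML while crawling with Apache Nutch?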

2016-09-27 mohamed

You need to override the parse filter and use a Jsoup selector to pick out the specific div.
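To illustrate what "pick out the specific div" amounts to, here is a minimal sketch that uses only the JDK instead of Jsoup, so it runs without extra dependencies. It walks a W3C DOM tree (the same kind of tree a Nutch parse filter receives as its `DocumentFragment` argument) and returns the text of the first div with a given id. The class name `DivExtractor`, both helper names, and the sample markup are assumptions made up for this example. With Jsoup, the same lookup would be a CSS selector such as `div#price` passed to `select`.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class DivExtractor {

    /** Depth-first search for the first div whose id equals divId; returns its trimmed text, or null. */
    public static String findDivText(Node node, String divId) {
        if (node instanceof Element) {
            Element element = (Element) node;
            // HTML DOM builders often upper-case tag names, so compare case-insensitively.
            if ("div".equalsIgnoreCase(element.getTagName())
                    && divId.equals(element.getAttribute("id"))) {
                return element.getTextContent().trim();
            }
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            String found = findDivText(children.item(i), divId);
            if (found != null) {
                return found;
            }
        }
        return null;
    }

    /** Convenience wrapper for trying the search on a well-formed XHTML string. */
    public static String findDivTextInXhtml(String xhtml, String divId) {
        try {
            Node root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
            return findDivText(root, divId);
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XHTML", e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<html><body><div id=\"intro\">Hello</div>"
                + "<div id=\"price\"> 42 USD </div></body></html>";
        System.out.println(findDivTextInXhtml(xhtml, "price")); // prints "42 USD"
    }
}
```

Inside a parse filter you would call `findDivText(doc, "your-id")` on the `DocumentFragment` and store the result wherever you need it, for example in the parse metadata.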

2016-09-28 04:52:30 Abhishek

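You will have to write a plugin that extends HtmlParseFilter to achieve your goal.

You can use an HTML parser such as Jsoup to extract the URLs you want from the page and add them as outlinks.

Sample HtmlParseFilter implementation: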

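 import java.net.MalformedURLException;
 import java.nio.charset.StandardCharsets;
 import java.util.ArrayList;
 import java.util.LinkedHashSet;
 import java.util.List;
 import java.util.Set;
 import org.apache.nutch.parse.*; // HTMLMetaTags, Outlink, Parse, ParseData, ParseResult, ParseText
 import org.apache.nutch.protocol.Content;
 import org.jsoup.Jsoup;
 import org.jsoup.nodes.Document;
 import org.jsoup.nodes.Element;
 import org.jsoup.select.Elements;
 import org.w3c.dom.DocumentFragment;

 // Method of a class that implements org.apache.nutch.parse.HtmlParseFilter
 // (this signature matches the Nutch 1.x API).
 public ParseResult filter(Content content, ParseResult parseResult,
     HTMLMetaTags metaTags, DocumentFragment doc) {
   // Decode the raw page bytes (assumes the page is UTF-8).
   String htmlContent = new String(content.getContent(), StandardCharsets.UTF_8);
   // Parse the HTML with Jsoup; the base URL lets absUrl() resolve relative links.
   Document document = Jsoup.parse(htmlContent, content.getUrl());
   // Replace the placeholder with your CSS selector. select() never returns null.
   Elements elements = document.select("<your_css_selector_query>");
   Parse parse = parseResult.get(content.getUrl());
   if (parse == null) {
     return parseResult; // no existing parse for this URL, nothing to modify
   }
   // Keep each accepted link once, in document order.
   Set<String> seen = new LinkedHashSet<String>();
   List<Outlink> outLinks = new ArrayList<Outlink>();
   for (Element element : elements) {
     String absoluteUrl = element.absUrl("href");
     // includeLinks(...) and value are not defined in this answer;
     // they stand for your own link-filtering rule.
     if (includeLinks(absoluteUrl, value) && seen.add(absoluteUrl)) {
       try {
         outLinks.add(new Outlink(absoluteUrl, element.text()));
       } catch (MalformedURLException e) {
         // Skip links that Nutch cannot turn into a valid outlink.
       }
     }
   }
   Outlink[] newOutLinks = outLinks.toArray(new Outlink[outLinks.size()]);
   ParseData oldData = parse.getData();
   // Rebuild the parse data with the page title and the filtered outlinks.
   ParseData parseData = new ParseData(oldData.getStatus(), document.title(), newOutLinks,
       oldData.getContentMeta(), oldData.getParseMeta());
   // The parse text becomes the text of the selected elements.
   parseResult.put(content.getUrl(), new ParseText(elements.text()), parseData);
   // Return the parse result with the modified outlinks.
   return parseResult;
 }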
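The sample calls `includeLinks(absoluteUrl, value)` without showing it. One possible shape for such a helper, purely as an assumption about what it might do, is a host check: accept only absolute http(s) URLs whose host equals an allowed host or is a subdomain of it. The class name `LinkFilter` and the interpretation of the second argument as a host name are hypothetical, not part of the original answer.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

public class LinkFilter {

    /**
     * Hypothetical link rule: true only for http(s) URLs whose host is
     * allowedHost or one of its subdomains.
     */
    public static boolean includeLinks(String absoluteUrl, String allowedHost) {
        // Jsoup's absUrl() returns an empty string when it cannot build an absolute URL.
        if (absoluteUrl == null || absoluteUrl.isEmpty() || allowedHost == null) {
            return false;
        }
        try {
            URI uri = new URI(absoluteUrl);
            String scheme = uri.getScheme();
            String host = uri.getHost();
            if (scheme == null || host == null) {
                return false; // relative or opaque URI such as mailto:
            }
            if (!scheme.equalsIgnoreCase("http") && !scheme.equalsIgnoreCase("https")) {
                return false;
            }
            String normalizedHost = host.toLowerCase(Locale.ROOT);
            String normalizedAllowed = allowedHost.toLowerCase(Locale.ROOT);
            return normalizedHost.equals(normalizedAllowed)
                    || normalizedHost.endsWith("." + normalizedAllowed);
        } catch (URISyntaxException e) {
            return false; // not a syntactically valid URI
        }
    }

    public static void main(String[] args) {
        System.out.println(includeLinks("https://shop.example.com/item?id=1", "example.com")); // prints "true"
        System.out.println(includeLinks("https://notexample.com/", "example.com"));            // prints "false"
    }
}
```

Any other rule (a regular expression, a path prefix, a whitelist loaded from configuration) would slot into the filter the same way, as long as it returns a boolean per URL.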

Build the new plugin with Ant and add it to plugin.includes in nutch-site.xml:

<property>
    <name>plugin.includes</name>
    <value>protocol-httpclient|<custom_plugin>|urlfilter-regex|parse-(tika|html|js|css)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|indexer-elastic</value>
</property>

Then, in parse-plugins.xml, you can map the HTML MIME types to your custom plugin instead of the default plugin, like this:

<!--
    <mimeType name="text/html">
        <plugin id="parse-html" />
    </mimeType>

    <mimeType name="application/xhtml+xml">
        <plugin id="parse-html" />
    </mimeType>
-->

    <mimeType name="text/xml">
        <plugin id="parse-tika" />
        <plugin id="feed" />
    </mimeType>

    <mimeType name="text/html">
        <plugin id="<custom_plugin>" />
    </mimeType>

    <mimeType name="application/xhtml+xml">
        <plugin id="<custom_plugin>" />
    </mimeType>

2016-09-28 04:54:22 Sachin
