2014-11-17

Crawler4j - NoSuchMethodError getOutgoingUrls()

I am trying to set up crawler4j. I built it from source in NetBeans, using crawler4j version 3.5, and my calling classes are copied from the example given on the project site, reproduced below:

import java.util.List;
import java.util.regex.Pattern;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.url.WebURL;

public class MyCrawler extends WebCrawler { 

    private final static Pattern FILTERS = Pattern.compile(".*(\\.(css|js|bmp|gif|jpe?g" 
                 + "|png|tiff?|mid|mp2|mp3|mp4" 
                 + "|wav|avi|mov|mpeg|ram|m4v|pdf" 
                 + "|rm|smil|wmv|swf|wma|zip|rar|gz))$"); 

    /** 
    * You should implement this function to specify whether 
    * the given url should be crawled or not (based on your 
    * crawling logic). 
    */ 
    @Override 
    public boolean shouldVisit(WebURL url) { 
      String href = url.getURL().toLowerCase(); 
      return !FILTERS.matcher(href).matches() && href.startsWith("http://www.ics.uci.edu/"); 
    } 

    /** 
    * This function is called when a page is fetched and ready 
    * to be processed by your program. 
    */ 
    @Override 
    public void visit(Page page) {   
      String url = page.getWebURL().getURL(); 
      System.out.println("URL: " + url); 

      if (page.getParseData() instanceof HtmlParseData) { 
        HtmlParseData htmlParseData = (HtmlParseData) page.getParseData(); 
        String text = htmlParseData.getText(); 
        String html = htmlParseData.getHtml(); 
        List<WebURL> links = htmlParseData.getOutgoingUrls(); 

        System.out.println("Text length: " + text.length()); 
        System.out.println("Html length: " + html.length()); 
        System.out.println("Number of outgoing links: " + links.size()); 
      } 
    } 

}

import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class Controller { 
    public static void main(String[] args) throws Exception { 
      String crawlStorageFolder = "/data/crawl/root"; 
      int numberOfCrawlers = 7; 

      CrawlConfig config = new CrawlConfig(); 
      config.setCrawlStorageFolder(crawlStorageFolder); 

      /* 
      * Instantiate the controller for this crawl. 
      */ 
      PageFetcher pageFetcher = new PageFetcher(config); 
      RobotstxtConfig robotstxtConfig = new RobotstxtConfig(); 
      RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher); 
      CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer); 

      /* 
      * For each crawl, you need to add some seed urls. These are the first 
      * URLs that are fetched and then the crawler starts following links 
      * which are found in these pages 
      */ 
      controller.addSeed("http://www.ics.uci.edu/~welling/"); 
      controller.addSeed("http://www.ics.uci.edu/~lopes/"); 
      controller.addSeed("http://www.ics.uci.edu/"); 

      /* 
      * Start the crawl. This is a blocking operation, meaning that your code 
      * will reach the line after this only when crawling is finished. 
      */ 
      controller.start(MyCrawler.class, numberOfCrawlers);  
    } 

}

The code compiles successfully but throws a runtime exception. Please advise.

Exception in thread "Crawler 1" java.lang.NoSuchMethodError: edu.uci.ics.crawler4j.parser.HtmlParseData.getOutgoingUrls()Ljava/util/Set; 
    at MyCrawler.visit(MyCrawler.java:42) 
    at edu.uci.ics.crawler4j.crawler.WebCrawler.processPage(WebCrawler.java:351) 
    at edu.uci.ics.crawler4j.crawler.WebCrawler.run(WebCrawler.java:220) 
    at java.lang.Thread.run(Thread.java:744) 

I dug through the code again and found a class with the same name, but the error persists.


If you find my answer acceptable, could you please accept it? – Chaiavi

Answer


Your code looks fine.

You have probably picked up a dependency problem on your classpath somehow - perhaps you have two different versions of the crawler4j library?
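One quick way to confirm which copy of a class wins at runtime is to ask the JVM where it loaded the class from. Below is a minimal sketch; the `WhichJar` class name is just an illustration, and once crawler4j is on your classpath you would pass in `HtmlParseData.class` instead:

```java
public class WhichJar {

    // Returns the jar or directory a class was loaded from. Classes from the
    // bootstrap classloader have no CodeSource, so we return a marker instead.
    static String locationOf(Class<?> cls) {
        java.security.CodeSource src = cls.getProtectionDomain().getCodeSource();
        return (src == null || src.getLocation() == null)
                ? "(bootstrap classloader)"
                : src.getLocation().toString();
    }

    public static void main(String[] args) {
        // Swap in edu.uci.ics.crawler4j.parser.HtmlParseData.class to see
        // which crawler4j jar is actually resolved on your runtime classpath.
        System.out.println(locationOf(WhichJar.class));
    }
}
```

If the printed path points at an old 3.5 jar while you compiled against a newer one, that mismatch is exactly the kind of thing that produces a `NoSuchMethodError`.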

In any case, I suggest the following: take a look at the new crawler4j on GitHub: https://github.com/yasserg/crawler4j

Use the Maven dependency system and all of your troubles will disappear:

<dependency> 
    <groupId>edu.uci.ics</groupId> 
    <artifactId>crawler4j</artifactId> 
    <version>4.1</version> 
</dependency> 

You will get the latest version (now hosted on GitHub rather than Google Code), and by using Maven you will automatically escape all of the classpath hell...

In the latest version I have fixed many bugs, so I really recommend moving to the latest and greatest.