
Why doesn't Crawler4j's non-blocking start wait for links added to the queue? Given this simple code:

CrawlConfig config = new CrawlConfig();
config.setMaxDepthOfCrawling(1);
config.setPolitenessDelay(1000);
config.setResumableCrawling(false);
config.setIncludeBinaryContentInCrawling(false);
config.setCrawlStorageFolder(Config.get(Config.CRAWLER_SHARED_DIR) + "test/");
// Keep the crawler alive when the frontier runs empty,
// so seeds can still be added at runtime.
config.setShutdownOnEmptyQueue(false);
PageFetcher pageFetcher = new PageFetcher(config);
RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
robotstxtConfig.setEnabled(false);
RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);
controller.addSeed("http://localhost/test");

// Start a single crawler thread without blocking the main thread.
controller.startNonBlocking(WebCrawler.class, 1);

long counter = 1;
// Effectively an infinite loop: feed four new seeds every 5 seconds.
while (Thread.currentThread().isAlive()) {
    System.out.println(config.toString());
    for (int i = 0; i < 4; i++) {
        System.out.println("Adding link");
        controller.addSeed("http://localhost/test" + ++counter + "/");
    }

    try {
        TimeUnit.SECONDS.sleep(5);
    } catch (InterruptedException e) {
        e.printStackTrace();
    }
}

The program's output is:

18:48:02.411 [main] INFO - Obtained 6791 TLD from packaged file tld-names.txt 
18:48:02.441 [main] INFO - Deleted contents of: /home/scraper/test/frontier (as you have configured resumable crawling to false) 
18:48:02.636 [main] INFO - Crawler 1 started 
18:48:02.636 [Crawler 1] INFO - Crawler Crawler 1 started! 
Adding link 
Adding link 
Adding link 
Adding link 
18:48:02.685 [Crawler 1] WARN - Skipping URL: http://localhost/test, StatusCode: 404, text/html; charset=iso-8859-1, Not Found 
18:48:03.642 [Crawler 1] WARN - Skipping URL: http://localhost/test2/, StatusCode: 404, text/html; charset=iso-8859-1, Not Found 
18:48:04.642 [Crawler 1] WARN - Skipping URL: http://localhost/test3/, StatusCode: 404, text/html; charset=iso-8859-1, Not Found 
18:48:05.643 [Crawler 1] WARN - Skipping URL: http://localhost/test4/, StatusCode: 404, text/html; charset=iso-8859-1, Not Found 
18:48:06.642 [Crawler 1] WARN - Skipping URL: http://localhost/test5/, StatusCode: 404, text/html; charset=iso-8859-1, Not Found 
Adding link 
Adding link 
Adding link 
Adding link 
Adding link 
Adding link 
Adding link 
Adding link 

Why doesn't crawler4j visit test6, test7, and so on?

As you can see, the four links added before them were all queued and visited correctly.

When I set "http://localhost/" as the seed URL (before starting the crawler), it processes up to 13 links and then the same problem appears.

What I want to achieve is being able to add URLs to a running crawler from other threads, at runtime.
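
To be explicit, this is the usage pattern I am after; a minimal sketch, assuming controller is the CrawlController from the snippet above and that addSeed may be called concurrently (fetchUrlsFromRemoteSource() is a hypothetical stand-in for my real source of links):

// Hypothetical producer: some other thread discovers URLs at runtime
// and hands them to the already-running crawler.
Thread producer = new Thread(() -> {
    for (String url : fetchUrlsFromRemoteSource()) { // hypothetical helper
        controller.addSeed(url); // schedule the URL on the crawler's frontier
    }
});
producer.start();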

@EDIT: Following @Seth's suggestion I took a thread dump, but I can't figure out from it why this doesn't work.

"Thread-1" #25 prio=5 os_prio=0 tid=0x00007ff32854b800 nid=0x56e3 waiting on condition [0x00007ff2de403000] 
    java.lang.Thread.State: TIMED_WAITING (sleeping) 
    at java.lang.Thread.sleep(Native Method) 
    at edu.uci.ics.crawler4j.crawler.CrawlController.sleep(CrawlController.java:367) 
    at edu.uci.ics.crawler4j.crawler.CrawlController$1.run(CrawlController.java:243) 
    - locked <0x00000005959baff8> (a java.lang.Object) 
    at java.lang.Thread.run(Thread.java:745) 

    Locked ownable synchronizers: 
    - None 

"Crawler 1" #24 prio=5 os_prio=0 tid=0x00007ff328544000 nid=0x56e2 in Object.wait() [0x00007ff2de504000] 
    java.lang.Thread.State: WAITING (on object monitor) 
    at java.lang.Object.wait(Native Method) 
    - waiting on <0x0000000596afdd28> (a java.lang.Object) 
    at java.lang.Object.wait(Object.java:502) 
    at edu.uci.ics.crawler4j.frontier.Frontier.getNextURLs(Frontier.java:151) 
    - locked <0x0000000596afdd28> (a java.lang.Object) 
    at edu.uci.ics.crawler4j.crawler.WebCrawler.run(WebCrawler.java:259) 
    at java.lang.Thread.run(Thread.java:745) 

    Locked ownable synchronizers: 
    - None 
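
The dump shows "Crawler 1" parked in Object.wait() inside Frontier.getNextURLs, i.e. it is blocked on a monitor until some other thread calls notify/notifyAll on it. For illustration only (this is plain Java, not crawler4j's actual code), the sketch below reproduces that symptom: if a producer enqueues work without notifying the monitor the consumer waits on, the consumer stays in the WAITING state even though the queue is non-empty:

import java.util.ArrayDeque;
import java.util.Queue;

final Queue<String> queue = new ArrayDeque<>();
final Object waitingList = new Object();

Thread consumer = new Thread(() -> {
    while (true) {
        String url;
        synchronized (queue) {
            url = queue.poll();
        }
        if (url != null) {
            System.out.println("Processing " + url);
            continue;
        }
        synchronized (waitingList) {
            try {
                waitingList.wait(); // parks here -- the WAITING state in the dump
            } catch (InterruptedException e) {
                return;
            }
        }
    }
});
consumer.start();

synchronized (queue) {
    queue.add("http://localhost/test6/");
    // Missing: synchronized (waitingList) { waitingList.notifyAll(); }
    // Without it, the consumer never re-checks the queue.
}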

Why the for loop with a limit of 4? – Seth


@Seth I wanted to simulate adding 4 links for crawling from a remote source. The exact number doesn't matter here. –


Noted; I'm currently trying to get a feel for your crawler, which is why I asked. – Seth

Answer