How can I reduce or change the delay after crawling?

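I followed the example on the project page to implement my own crawler. The crawler works fine and crawls very fast. The only problem is that I always get a delay of 20-30 seconds. Is there a way to avoid this waiting time?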

2014-03-12 user3411187

Do you mean processing time or waiting time? The only wait-related setting I know of is the [politeness delay](https://code.google.com/p/crawler4j/wiki/Configurations#Politeness).

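I just checked the crawler4j source code. The CrawlController.start method contains several fixed 10-second "pauses" that make sure the threads have finished and are ready to be cleaned up.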

// Make sure again that none of the threads are alive.
logger.info("It looks like no thread is working, waiting for 10 seconds to make sure..."); 
sleep(10); 

// ... more code ... 

logger.info("No thread is working and no more URLs are in queue waiting for another 10 seconds to make sure..."); 
sleep(10); 

// ... more code ... 

logger.info("Waiting for 10 seconds before final clean up..."); 
sleep(10);

In addition, the main loop only checks every 10 seconds whether the crawler threads have finished:

while (true) { 
    sleep(10); 
    // code to check if some thread is still working 
} 

protected void sleep(int seconds) {
    try {
        Thread.sleep(seconds * 1000);
    } catch (Exception ignored) {
    }
}

So it may be worth tweaking these calls and reducing the sleep times.
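As a rough illustration of what a shorter wait could look like, here is a minimal sketch that polls a list of worker threads at a configurable interval instead of a hard-coded 10 seconds. It is not crawler4j code: the class name ShortPollSketch, the method waitUntilDone, and the pollMillis parameter are all hypothetical names chosen for this example.

```java
import java.util.Collections;
import java.util.List;

// Hypothetical sketch, not part of crawler4j.
final class ShortPollSketch {

    // Returns true if at least one of the given threads is still running.
    static boolean anyAlive(List<Thread> threads) {
        for (Thread t : threads) {
            if (t.isAlive()) {
                return true;
            }
        }
        return false;
    }

    // Polls every pollMillis milliseconds until no thread is alive.
    // Returns the approximate time spent waiting, in milliseconds.
    static long waitUntilDone(List<Thread> threads, long pollMillis) {
        long start = System.nanoTime();
        while (anyAlive(threads)) {
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                // Restore the interrupt flag and stop waiting.
                Thread.currentThread().interrupt();
                break;
            }
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        Thread worker = new Thread(() -> {
            try {
                Thread.sleep(300); // stand-in for crawling work
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        worker.start();
        long waited = waitUntilDone(Collections.singletonList(worker), 50);
        System.out.println("Workers finished after about " + waited + " ms");
    }
}
```

With a 50 ms interval the caller notices completion at most about 50 ms late, rather than up to 10 seconds late. The trade-off is slightly more frequent wake-ups of the controlling thread.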

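A better solution, if you can spare some time, would be to rewrite this method. I would use an ExecutorService instead of List<Thread> threads; its awaitTermination method would be especially handy. Unlike sleep, awaitTermination(10, TimeUnit.SECONDS) returns immediately once all tasks have completed (after shutdown() has been called), instead of always waiting the full 10 seconds.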
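The following is a minimal sketch of the ExecutorService idea, assuming the crawler workers can be expressed as Runnable tasks. It is not a patch for CrawlController: the class AwaitTerminationSketch and the method runAndWait are hypothetical names, and the sleeping task in main only stands in for real crawling work.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch, not part of crawler4j.
final class AwaitTerminationSketch {

    // Submits `workers` copies of the task, then waits up to timeoutSeconds.
    // Returns the milliseconds spent waiting, or -1 on timeout or interruption.
    static long runAndWait(int workers, Runnable task, long timeoutSeconds) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        for (int i = 0; i < workers; i++) {
            pool.submit(task);
        }
        // No new tasks are accepted; tasks already submitted still run.
        pool.shutdown();
        long start = System.nanoTime();
        try {
            // Returns as soon as every task has finished, or after the timeout.
            boolean finished = pool.awaitTermination(timeoutSeconds, TimeUnit.SECONDS);
            if (!finished) {
                pool.shutdownNow(); // ask remaining tasks to stop
                return -1;
            }
            return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        } catch (InterruptedException e) {
            pool.shutdownNow();
            Thread.currentThread().interrupt();
            return -1;
        }
    }

    public static void main(String[] args) {
        Runnable fakeCrawl = () -> {
            try {
                Thread.sleep(200); // stand-in for crawling work
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
        long waited = runAndWait(4, fakeCrawl, 10);
        System.out.println("Waited about " + waited + " ms");
    }
}
```

Here the 10-second value acts only as an upper bound: when the tasks finish early, the controlling thread continues right away.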


2014-05-02 17:01:19
