Hi, I want to create a web crawler in Java. I want to retrieve some data from web pages, such as the title and the description, and store that data in a database. How do I create a web crawler in Java?
Answers
If you want to do it yourself, use the HttpClient that ships with the Android API. Here is an example of how to use it (you still need to parse out the data you are interested in yourself):
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.util.HashSet;
import java.util.LinkedList;
import java.util.List;
import java.util.Set;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;
import org.apache.http.HttpResponse;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;

public class HttpTest {

    /* URLs that have already been crawled, so each page is visited only once. */
    static Set<String> checked = new HashSet<String>();

    public static void main(String... args) throws ClientProtocolException, IOException {
        crawlPage("http://www.google.com/");
    }

    private static void crawlPage(String url) throws ClientProtocolException, IOException {
        if (checked.contains(url))
            return;
        checked.add(url);
        System.out.println("Crawling: " + url);

        /* fetch the page with Apache HttpClient */
        HttpClient client = new DefaultHttpClient();
        HttpGet request = new HttpGet(url);
        HttpResponse response = client.execute(request);

        Reader reader = null;
        try {
            reader = new InputStreamReader(response.getEntity().getContent());
            /* parse the HTML and collect all anchor hrefs */
            Links links = new Links();
            new ParserDelegator().parse(reader, links, true);
            /* recursively crawl every absolute http:// link found on the page */
            for (String link : links.list)
                if (link.startsWith("http://"))
                    crawlPage(link);
        } finally {
            if (reader != null) {
                try {
                    reader.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
    }

    /* callback that collects the href attribute of every <a> tag */
    static class Links extends HTMLEditorKit.ParserCallback {
        List<String> list = new LinkedList<String>();

        @Override
        public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
            if (t == HTML.Tag.A) {
                Object href = a.getAttribute(HTML.Attribute.HREF);
                if (href != null) {  /* anchors without an href would otherwise throw a NullPointerException */
                    list.add(href.toString());
                }
            }
        }
    }
}
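For the database part of the question, which none of the snippets in this thread cover, a minimal sketch might use jsoup (also used in the WebCollector answer below) to extract the title and meta description, and plain JDBC to store them. The pages table, its columns, and the connection URL below are placeholder assumptions, not something defined by any answer here:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class PageStore {
    public static void main(String[] args) throws Exception {
        String url = "http://news.yahoo.com/";
        /* fetch and parse the page with jsoup */
        Document doc = Jsoup.connect(url).get();
        String title = doc.title();
        /* <meta name="description"> is a common, but not guaranteed, source of a description */
        String description = doc.select("meta[name=description]").attr("content");

        /* hypothetical schema: CREATE TABLE pages(url VARCHAR(512), title TEXT, description TEXT) */
        try (Connection con = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/crawler", "user", "password");
             PreparedStatement ps = con.prepareStatement(
                "INSERT INTO pages (url, title, description) VALUES (?, ?, ?)")) {
            ps.setString(1, url);
            ps.setString(2, title);
            ps.setString(3, description);
            ps.executeUpdate();
        }
    }
}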
You can use crawler4j. Crawler4j is an open-source Java crawler that provides a simple interface for fetching web pages; with it you can set up a multi-threaded web crawler in a few hours.
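A rough outline of a crawler4j-based crawler, following the pattern used in the crawler4j examples, is shown below. Method signatures such as shouldVisit(Page, WebURL) vary between crawler4j versions, and the seed URL and storage folder are arbitrary choices, so treat this as a sketch rather than a drop-in example:

import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;
import edu.uci.ics.crawler4j.url.WebURL;

public class NewsCrawler extends WebCrawler {

    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        /* stay inside one site; adjust the prefix to your target */
        return url.getURL().startsWith("http://news.yahoo.com/");
    }

    @Override
    public void visit(Page page) {
        if (page.getParseData() instanceof HtmlParseData) {
            HtmlParseData html = (HtmlParseData) page.getParseData();
            /* print the URL and page title; store them in a database here if needed */
            System.out.println(page.getWebURL().getURL() + " -> " + html.getTitle());
        }
    }

    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/tmp/crawl");  /* intermediate crawl data */
        PageFetcher fetcher = new PageFetcher(config);
        RobotstxtServer robots = new RobotstxtServer(new RobotstxtConfig(), fetcher);
        CrawlController controller = new CrawlController(config, fetcher, robots);
        controller.addSeed("http://news.yahoo.com/");
        controller.start(NewsCrawler.class, 4);      /* 4 crawler threads */
    }
}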
You can use WebCollector: https://github.com/CrawlScript/WebCollector
A demo based on WebCollector 2.05:
import cn.edu.hfut.dmic.webcollector.crawler.BreadthCrawler;
import cn.edu.hfut.dmic.webcollector.model.Links;
import cn.edu.hfut.dmic.webcollector.model.Page;
import java.util.regex.Pattern;
import org.jsoup.nodes.Document;

/**
 * Crawl news from Yahoo News.
 *
 * @author hu
 */
public class YahooCrawler extends BreadthCrawler {

    /**
     * @param crawlPath crawlPath is the path of the directory which maintains
     *                  information of this crawler
     * @param autoParse if autoParse is true, BreadthCrawler will automatically extract
     *                  links matching the regex rules from each page
     */
    public YahooCrawler(String crawlPath, boolean autoParse) {
        super(crawlPath, autoParse);
        /* start page */
        this.addSeed("http://news.yahoo.com/");
        /* fetch urls like http://news.yahoo.com/xxxxx */
        this.addRegex("http://news.yahoo.com/.*");
        /* do not fetch urls like http://news.yahoo.com/xxxx/xxx */
        this.addRegex("-http://news.yahoo.com/.+/.*");
        /* do not fetch jpg|png|gif */
        this.addRegex("-.*\\.(jpg|png|gif).*");
        /* do not fetch urls containing # */
        this.addRegex("-.*#.*");
    }

    @Override
    public void visit(Page page, Links nextLinks) {
        String url = page.getUrl();
        /* if the page is a news page */
        if (Pattern.matches("http://news.yahoo.com/.+html", url)) {
            /* we use jsoup to parse the page */
            Document doc = page.getDoc();
            /* extract title and content of the news item by css selector */
            String title = doc.select("h1[class=headline]").first().text();
            String content = doc.select("div[class=body yom-art-content clearfix]").first().text();
            System.out.println("URL:\n" + url);
            System.out.println("title:\n" + title);
            System.out.println("content:\n" + content);
            /* If you want to add urls to crawl, add them to nextLinks. */
            /* WebCollector automatically filters links that have been fetched before. */
            /* If autoParse is true and a link added to nextLinks does not match the regex rules, that link will also be filtered. */
            // nextLinks.add("http://xxxxxx.com");
        }
    }

    public static void main(String[] args) throws Exception {
        YahooCrawler crawler = new YahooCrawler("crawl", true);
        crawler.setThreads(50);
        crawler.setTopN(100);
        //crawler.setResumable(true);
        /* start crawling with a depth of 4 */
        crawler.start(4);
    }
}
I like HtmlUnit, but I don't know how well it works on Android... – MatrixFrog 2010-11-09 06:54:58
Can you show me how to create a web crawler with HtmlUnit? First I want to parse some data and store it in a database. – 2010-11-09 07:16:43
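A minimal HtmlUnit sketch along the lines of that comment might look like this (assuming a recent HtmlUnit 2.x release; the options and close APIs differ in older versions, and the database step is omitted here):

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class HtmlUnitCrawler {
    public static void main(String[] args) throws Exception {
        try (WebClient webClient = new WebClient()) {
            /* plain HTML scraping usually does not need JavaScript or CSS processing */
            webClient.getOptions().setJavaScriptEnabled(false);
            webClient.getOptions().setCssEnabled(false);

            HtmlPage page = webClient.getPage("http://news.yahoo.com/");
            System.out.println("Title: " + page.getTitleText());

            /* collect the links a crawler would follow next */
            for (HtmlAnchor anchor : page.getAnchors()) {
                System.out.println("Link: " + anchor.getHrefAttribute());
            }
        }
    }
}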