I want to use the basic crawler example from crawler4j. I took the code from the crawler4j website here. Why does the crawler4j example throw an error?
package edu.crawler;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.url.WebURL;
import java.util.List;
import java.util.regex.Pattern;
import org.apache.http.Header;

public class MyCrawler extends WebCrawler {

    private final static Pattern FILTERS = Pattern.compile(
            ".*(\\.(css|js|bmp|gif|jpe?g|png|tiff?|mid|mp2|mp3|mp4"
            + "|wav|avi|mov|mpeg|ram|m4v|pdf|rm|smil|wmv|swf|wma|zip|rar|gz))$");

    /**
     * You should implement this function to specify whether the given URL
     * should be crawled or not (based on your crawling logic).
     */
    @Override
    public boolean shouldVisit(WebURL url) {
        String href = url.getURL().toLowerCase();
        return !FILTERS.matcher(href).matches() && href.startsWith("http://www.ics.uci.edu/");
    }

    /**
     * This function is called when a page is fetched and ready to be processed
     * by your program.
     */
    @Override
    public void visit(Page page) {
        int docid = page.getWebURL().getDocid();
        String url = page.getWebURL().getURL();
        String domain = page.getWebURL().getDomain();
        String path = page.getWebURL().getPath();
        String subDomain = page.getWebURL().getSubDomain();
        String parentUrl = page.getWebURL().getParentUrl();
        String anchor = page.getWebURL().getAnchor();

        System.out.println("Docid: " + docid);
        System.out.println("URL: " + url);
        System.out.println("Domain: '" + domain + "'");
        System.out.println("Sub-domain: '" + subDomain + "'");
        System.out.println("Path: '" + path + "'");
        System.out.println("Parent page: " + parentUrl);
        System.out.println("Anchor text: " + anchor);

        if (page.getParseData() instanceof HtmlParseData) {
            HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
            String text = htmlParseData.getText();
            String html = htmlParseData.getHtml();
            List<WebURL> links = htmlParseData.getOutgoingUrls();

            System.out.println("Text length: " + text.length());
            System.out.println("Html length: " + html.length());
            System.out.println("Number of outgoing links: " + links.size());
        }

        Header[] responseHeaders = page.getFetchResponseHeaders();
        if (responseHeaders != null) {
            System.out.println("Response headers:");
            for (Header header : responseHeaders) {
                System.out.println("\t" + header.getName() + ": " + header.getValue());
            }
        }
        System.out.println("=============");
    }
}
The above is the code for the crawler class from the example.
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class Controller {
    public static void main(String[] args) throws Exception {
        String crawlStorageFolder = "../data/";
        int numberOfCrawlers = 7;

        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder(crawlStorageFolder);

        /*
         * Instantiate the controller for this crawl.
         */
        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
        RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
        CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

        /*
         * For each crawl, you need to add some seed URLs. These are the first
         * URLs that are fetched, and then the crawler starts following links
         * found in these pages.
         */
        controller.addSeed("http://www.ics.uci.edu/~welling/");
        controller.addSeed("http://www.ics.uci.edu/~lopes/");
        controller.addSeed("http://www.ics.uci.edu/");

        /*
         * Start the crawl. This is a blocking operation, meaning that your code
         * will reach the line after this only when crawling is finished.
         */
        controller.start(MyCrawler.class, numberOfCrawlers);
    }
}
The above is the controller class for the web crawler. When I try to run the Controller class from my IDE (IntelliJ), I get the following error:
Exception in thread "main" java.lang.UnsupportedClassVersionError: edu/uci/ics/crawler4j/crawler/CrawlConfig : Unsupported major.minor version 51.0
Is there something to be found here about the Maven configuration that I should know? Do I have to use a different version or something else?
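Class file version 51.0 corresponds to Java 7 (50.0 is Java 6), so the error means the crawler4j jar was compiled for Java 7 while the JVM launching it is older. Before changing anything in Maven, it is worth confirming which JVM your IntelliJ run configuration actually uses; a minimal sketch (the class name VersionCheck is just for illustration):

public class VersionCheck {
    public static void main(String[] args) {
        // e.g. "1.6.0_45" here would explain the error: the crawler4j jar needs Java 7+.
        System.out.println("java.version: " + System.getProperty("java.version"));
        // The highest class file version this JVM can load; it must be >= 51.0
        // (Java 7) to load classes such as edu.uci.ics.crawler4j.crawler.CrawlConfig.
        System.out.println("java.class.version: " + System.getProperty("java.class.version"));
    }
}

If this prints a class version below 51.0, pointing the project SDK and the run configuration at a Java 7 or newer JDK should resolve the error without touching the Maven setup.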
From the sound of it, you're trying to execute code that was compiled on a later version of Java than the one you're running. E.g. the code was compiled with Java 7 and you're running Java 6, or it was compiled with Java 6 and you're running Java 5... – MadProgrammer 2013-03-14 00:29:45
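To confirm that diagnosis from the other side, you can read the compiled version straight out of a class file header (magic number, then minor and major version). A sketch, assuming you have extracted a file such as CrawlConfig.class from the crawler4j jar; ClassVersion is a hypothetical helper name:

import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;

public class ClassVersion {
    public static void main(String[] args) throws IOException {
        // Pass the path to a .class file, e.g. one extracted from the crawler4j jar.
        DataInputStream in = new DataInputStream(new FileInputStream(args[0]));
        try {
            if (in.readInt() != 0xCAFEBABE) {
                throw new IOException("Not a class file: " + args[0]);
            }
            int minor = in.readUnsignedShort();
            int major = in.readUnsignedShort(); // 51 = Java 7, 50 = Java 6, 49 = Java 5
            System.out.println("major.minor = " + major + "." + minor);
        } finally {
            in.close();
        }
    }
}

The sketch deliberately avoids try-with-resources so it still compiles on Java 6.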
Check out http://stackoverflow.com/questions/10382929/unsupported-major-minor-version-51-0 – Farlan 2013-03-14 00:34:08
@j.jerrod taylor.. I'm running into a problem with even the most basic program. I get:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/http/client/methods/HttpUriRequest
	at com.crawler.web.BasicCrawlController.main(BasicCrawlController.java:78)
Caused by: java.lang.ClassNotFoundException: org.apache.http.client.methods.HttpUriRequest
Please suggest whether any other jar is also required. – 2013-06-14 16:47:29
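crawler4j depends on Apache HttpClient, so a NoClassDefFoundError for org.apache.http.client.methods.HttpUriRequest usually means the httpclient jar (and its httpcore dependency) is missing from the runtime classpath; with Maven they normally arrive as transitive dependencies of crawler4j. A quick sketch to verify before starting the crawler (ClasspathCheck is a hypothetical name):

public class ClasspathCheck {
    public static void main(String[] args) {
        try {
            // Loads the class without using it; succeeds only if httpclient is present.
            Class.forName("org.apache.http.client.methods.HttpUriRequest");
            System.out.println("httpclient is on the classpath");
        } catch (ClassNotFoundException e) {
            System.out.println("httpclient is missing: add the httpclient/httpcore jars, "
                    + "or declare crawler4j as a Maven dependency so they come in transitively.");
        }
    }
}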