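Crawl only HTML pages while checking response headers

I want to collect every URL whose response carries the header Content-Type: text/html. To do that I check the response headers of each URL, and if they include Content-Type: text/html I want to print that URL. In my code, however, nothing is printed once I test whether the header is Content-Type. If I remove the if statement, it prints every link related to the URL being crawled, together with its response headers.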
public class MyCrawler extends WebCrawler {
Pattern filters = Pattern.compile(".*(\\.(css|js|bmp|gif|jpe?g"
+ "|png|tiff?|mid|mp2|mp3|mp4" + "|wav|avi|mov|mpeg|ram|m4v|pdf"
+ "|rm|smil|wmv|swf|wma|zip|rar|gz))$");
/*
Pattern filters = Pattern.compile("(\\.(html))");
*/
public MyCrawler() {
}
public boolean shouldVisit(WebURL url) {
String href = url.getURL().toLowerCase();
//System.out.println("Href: " +href);
/*
if (filters.matcher(href).matches()) {
return false;
}*/
if (href.startsWith("http://www.somehost.com/")) {
return true;
}
return false;
}
public void visit(Page page) {
int docid = page.getWebURL().getDocid();
String url = page.getWebURL().getURL();
String text = page.getText();
List<WebURL> links = page.getURLs();
int parentDocid = page.getWebURL().getParentDocid();
//HttpGet httpget = new HttpGet(url);
try {
URL url1 = new URL(url);
URLConnection connection = url1.openConnection();
Map responseMap = connection.getHeaderFields();
for (Iterator iterator = responseMap.keySet().iterator(); iterator.hasNext();)
{
String key = (String) iterator.next();
if(key==("Content-Type")) //(Anything wrong with this if loop)
{
System.out.print(key + " = ");
List values = (List) responseMap.get(key);
for (int i = 0; i < values.size(); i++) {
Object o = values.get(i);
System.out.print(o + ", ");
}
System.out.println("");
System.out.println(url1);
}
}
} catch (MalformedURLException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
}
//System.out.println("Docid: " + docid);
//System.out.println("URL: " + url);
//System.out.println("Text length: " + text.length());
//System.out.println("Number of links: " + links.size());
//System.out.println("Docid of parent page: " + parentDocid);
System.out.println("=============");
}
}
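
A likely reason the `if` never matches: in Java, `==` on strings compares object references, not contents, and the header names returned by `getHeaderFields()` are separate objects from the literal `"Content-Type"` (the map also contains a `null` key for the status line). The comparison also needs to look at the header's value, not just its name. Below is a minimal sketch of one way to do the check, not a definitive fix; the class `ContentTypeCheck` and the helpers `isHtml` and `printIfHtml` are hypothetical names introduced here, and it uses `URLConnection.getContentType()` in place of the manual loop over the header map.

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLConnection;
import java.util.Locale;

public class ContentTypeCheck {

    // True when a Content-Type header value denotes an HTML document.
    static boolean isHtml(String contentType) {
        if (contentType == null) {
            return false; // the server sent no Content-Type header
        }
        // Drop parameters such as "; charset=UTF-8" before comparing.
        String mediaType = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        return mediaType.equals("text/html"); // equals() compares contents, unlike ==
    }

    // Opens a connection and prints the URL only if the server reports HTML.
    static void printIfHtml(String address) throws IOException {
        URLConnection connection = new URL(address).openConnection();
        String contentType = connection.getContentType();
        if (isHtml(contentType)) {
            System.out.println("Content-Type = " + contentType);
            System.out.println(address);
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(isHtml("text/html; charset=UTF-8")); // true
        System.out.println(isHtml("image/png"));                // false
        if (args.length > 0) {
            printIfHtml(args[0]); // performs a real network request
        }
    }
}
```

Inside `visit(Page page)`, the same idea would be a call such as `printIfHtml(url)` in place of the header loop. If you prefer to keep the loop, `"Content-Type".equalsIgnoreCase(key)` is a null-safe replacement for `key==("Content-Type")`.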
Thanks for the example.. :) What if I want the URLs whose content type is not text/html? What can be done then? Here we check for HTML with `if(key.contains("text/html"))`. – ferhan
Yes, if you want to find something other than HTML, you can write `if(key.contains("image/png"))`. Or use `filters.matcher(key)`, where `filters` contains the allowed content types. –
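
The two suggestions in this comment could be sketched roughly as follows. This is an illustration under stated assumptions, not the commenter's actual code: the class `ContentTypeFilter`, the helper names, and the particular allow-list in `ALLOWED` are all hypothetical, and the check is applied to the header's value (for example `image/png`) since that is where the media type appears.

```java
import java.util.Locale;
import java.util.regex.Pattern;

public class ContentTypeFilter {

    // Hypothetical allow-list of media types; adjust to the types you want to keep.
    static final Pattern ALLOWED =
            Pattern.compile("image/(png|jpeg|gif)|application/pdf");

    // Strips parameters such as "; charset=UTF-8" and normalizes case.
    static String mediaType(String contentType) {
        if (contentType == null) {
            return ""; // no Content-Type header was sent
        }
        return contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
    }

    // True for any response that declares a type other than text/html.
    static boolean isNonHtml(String contentType) {
        String type = mediaType(contentType);
        return !type.isEmpty() && !type.equals("text/html");
    }

    // True only when the whole media type matches the allow-list.
    static boolean isAllowed(String contentType) {
        return ALLOWED.matcher(mediaType(contentType)).matches();
    }

    public static void main(String[] args) {
        System.out.println(isNonHtml("image/png"));                // true
        System.out.println(isNonHtml("text/html; charset=UTF-8")); // false
        System.out.println(isAllowed("application/pdf"));          // true
        System.out.println(isAllowed("text/css"));                 // false
    }
}
```

Using `matches()` on the normalized media type avoids accidental hits that `contains` would allow, such as a longer type that merely includes the searched text.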