2011-07-08 24 views
0

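Crawl only HTML pages by checking the response headers

I want to get all URLs whose response carries the header Content-Type: text/html. To do that I check the response headers of every URL, and if they include Content-Type: text/html, I want to print that URL. But in my code, when I check whether the header is Content-Type, nothing is displayed. If I remove the if statement, it prints every link related to the URL being crawled, together with its response headers.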

public class MyCrawler extends WebCrawler { 

    Pattern filters = Pattern.compile(".*(\\.(css|js|bmp|gif|jpe?g" 
      + "|png|tiff?|mid|mp2|mp3|mp4" + "|wav|avi|mov|mpeg|ram|m4v|pdf" 
      + "|rm|smil|wmv|swf|wma|zip|rar|gz))$"); 


    /* 
    Pattern filters = Pattern.compile("(\\.(html))"); 
*/ 
    public MyCrawler() { 
    } 

    public boolean shouldVisit(WebURL url) { 
     String href = url.getURL().toLowerCase(); 
     //System.out.println("Href: " +href); 
     /* 
     if (filters.matcher(href).matches()) { 
      return false; 
     }*/ 
     if (href.startsWith("http://www.somehost.com/")) { 
      return true; 
     } 
     return false; 
    } 

    public void visit(Page page) { 

     int docid = page.getWebURL().getDocid(); 

     String url = page.getWebURL().getURL();   
     String text = page.getText(); 
     List<WebURL> links = page.getURLs(); 
     int parentDocid = page.getWebURL().getParentDocid(); 


     //HttpGet httpget = new HttpGet(url); 


     try { 
      URL url1 = new URL(url); 
      URLConnection connection = url1.openConnection(); 

      Map responseMap = connection.getHeaderFields(); 
     for (Iterator iterator = responseMap.keySet().iterator(); iterator.hasNext();) 
    { 
       String key = (String) iterator.next(); 
       if(key==("Content-Type")) //(Anything wrong with this if loop) 
       { 
       System.out.print(key + " = "); 

       List values = (List) responseMap.get(key); 
       for (int i = 0; i < values.size(); i++) { 
        Object o = values.get(i); 
        System.out.print(o + ", "); 
       } 
       System.out.println(""); 
System.out.println(url1); 
       } 

      } 
     } catch (MalformedURLException e) { 
      e.printStackTrace(); 
     } catch (IOException e) { 
      e.printStackTrace(); 
     } 


     //System.out.println("Docid: " + docid); 
     //System.out.println("URL: " + url); 
     //System.out.println("Text length: " + text.length()); 
     //System.out.println("Number of links: " + links.size()); 
     //System.out.println("Docid of parent page: " + parentDocid); 
     System.out.println("============="); 
    } 
} 

Answers

2

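The key variable in the loop below holds the whole header entry, for example:

Content-Type=[text/html; charset=ISO-8859-1]

so it can never be == or .equals("Content-Type"). In your original loop over keySet(), the key is only the header name, but == compares object references rather than characters, so that check fails as well.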

Try running the following code and see what it prints:

URLConnection connection = url1.openConnection(); 

Map responseMap = connection.getHeaderFields(); 
Iterator iterator = responseMap.entrySet().iterator(); 
while (iterator.hasNext()) 
{ 
    String key = iterator.next().toString(); 
    if (key.contains("Content-Type")) 
    { 
     System.out.println(key); 
     // Content-Type=[text/html; charset=ISO-8859-1] 
     // Caution: Pattern.matcher() never returns null, so this condition is always true. 
     if (filters.matcher(key) != null){ 
      System.out.println(url1); 
      // http://google.com 
     } 
    } 
} 

Here is the output:

Content-Type=[text/html; charset=ISO-8859-1] 
http://google.com 

It looks like you could also just use a single if statement, as follows:

while (iterator.hasNext()) 
{ 
    String key = iterator.next().toString(); 
    if (key.contains("text/html")) 
    { 
     System.out.println(url1); 
     // http://google.com 
    } 
} 
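
A substring test on the whole entry will also match any other header whose value happens to mention text/html. As a minimal sketch, assuming you read the value with URLConnection.getContentType() (the class and method names ContentTypeCheck, mediaType and isHtml are made up for illustration), you can strip the parameters and compare only the media type:

```java
import java.util.Locale;

class ContentTypeCheck {
    // Returns the bare media type, e.g. "text/html" for "text/html; charset=ISO-8859-1".
    static String mediaType(String contentType) {
        if (contentType == null) {
            return ""; // the server sent no Content-Type header
        }
        int semicolon = contentType.indexOf(';');
        String type = semicolon >= 0 ? contentType.substring(0, semicolon) : contentType;
        return type.trim().toLowerCase(Locale.ROOT);
    }

    // True only when the media type itself is text/html.
    static boolean isHtml(String contentType) {
        return mediaType(contentType).equals("text/html");
    }

    public static void main(String[] args) {
        // In the crawler this string would come from connection.getContentType().
        System.out.println(isHtml("text/html; charset=ISO-8859-1")); // true
        System.out.println(isHtml("image/png"));                     // false
        System.out.println(isHtml(null));                            // false
    }
}
```

Inside visit() the check would then read something like if (ContentTypeCheck.isHtml(connection.getContentType())) System.out.println(url1);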

String comparison in Java is not very intuitive, BTW; it gets me every time!

+0

Thanks for the example.. :) What if I want the URLs whose content type is not text/html? What can be done then? Here we check for HTML with 'if(key.contains("text/html"))'. – ferhan

+1

Yes, if you want to find something other than HTML, you can write if(key.contains("image/png")). Or use filters.matcher(key), where filters contains the allowed content types –
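
To make the filters idea concrete, here is a rough sketch of an allow-list for content types. The pattern, the list of types, and the names ContentTypeFilter, ALLOWED, isAllowed and isNotHtml are assumptions chosen for illustration, not part of the crawler library. Note that matcher() only creates a Matcher; you still have to call matches() or find() on it to get a boolean.

```java
import java.util.regex.Pattern;

class ContentTypeFilter {
    // Hypothetical allow-list: a media type, optionally followed by parameters such as "; charset=...".
    static final Pattern ALLOWED = Pattern.compile(
            "\\s*(image/png|image/jpeg|application/pdf)\\s*(;.*)?",
            Pattern.CASE_INSENSITIVE);

    // True when the Content-Type value is one of the allowed media types.
    static boolean isAllowed(String contentType) {
        return contentType != null && ALLOWED.matcher(contentType).matches();
    }

    // True for any known content type other than text/html.
    static boolean isNotHtml(String contentType) {
        return contentType != null
                && !contentType.trim().toLowerCase().startsWith("text/html");
    }

    public static void main(String[] args) {
        System.out.println(isAllowed("image/png"));                // true
        System.out.println(isAllowed("text/html; charset=UTF-8")); // false
        System.out.println(isNotHtml("application/pdf"));          // true
    }
}
```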

0

For string comparison, use .equals()
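
A small self-contained demonstration of the difference (the class name StringCompareDemo and the helper isContentTypeKey are made up for this sketch; new String(...) stands in for a header name that was built at runtime):

```java
class StringCompareDemo {
    // Compares by content; calling equals on the literal keeps this safe for a null key.
    static boolean isContentTypeKey(String key) {
        return "Content-Type".equals(key);
    }

    public static void main(String[] args) {
        // A distinct String object with the same characters as the literal.
        String key = new String("Content-Type");

        System.out.println(key == "Content-Type");      // false: == compares references
        System.out.println(key.equals("Content-Type")); // true: equals compares characters
        System.out.println(isContentTypeKey(null));     // false, and no NullPointerException
    }
}
```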

0

It should be:

if (key != null && key.equals("Content-Type"))
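
The null check matters because the map returned by getHeaderFields() typically stores the HTTP status line under a null key. HTTP header names are also case-insensitive, so a server may send content-type in lower case. A sketch that handles both, using a hand-built map in place of a live connection (HeaderLookup and findHeader are hypothetical names):

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class HeaderLookup {
    // Returns the values of the named header, ignoring case and skipping the null key.
    static List<String> findHeader(Map<String, List<String>> headers, String name) {
        for (Map.Entry<String, List<String>> entry : headers.entrySet()) {
            String key = entry.getKey();
            if (key != null && key.equalsIgnoreCase(name)) {
                return entry.getValue();
            }
        }
        return Collections.emptyList();
    }

    public static void main(String[] args) {
        // Stand-in for connection.getHeaderFields(); the null key mimics the status line.
        Map<String, List<String>> headers = new LinkedHashMap<>();
        headers.put(null, Arrays.asList("HTTP/1.1 200 OK"));
        headers.put("content-type", Arrays.asList("text/html; charset=ISO-8859-1"));

        System.out.println(findHeader(headers, "Content-Type"));
        // prints [text/html; charset=ISO-8859-1]
    }
}
```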