Need help getting the HTML of a website in Java

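I got some code from "java httpurlconnection cutting off html", and I am using pretty much the same code to fetch HTML from websites in Java. It works, except for one particular website that I cannot get it to work with.

I am trying to get the HTML from this website: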

http://www.geni.com/genealogy/people/William-Jefferson-Blythe-Clinton/6000000001961474289

But I keep getting junk characters, even though the same code works fine with any other website, such as http://www.google.com.

This is the code I am using:

public static String PrintHTML() {
    URL url = null;
    try {
        url = new URL("http://www.geni.com/genealogy/people/William-Jefferson-Blythe-Clinton/6000000001961474289");
    } catch (MalformedURLException e1) {
        // TODO Auto-generated catch block
        e1.printStackTrace();
    }
    HttpURLConnection connection = null;
    try {
        connection = (HttpURLConnection) url.openConnection();
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    connection.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.6) Gecko/20100625 Firefox/3.6.6");
    try {
        System.out.println(connection.getResponseCode());
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    String line;
    StringBuilder builder = new StringBuilder();
    BufferedReader reader = null;
    try {
        reader = new BufferedReader(new InputStreamReader(connection.getInputStream()));
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    try {
        while ((line = reader.readLine()) != null) {
            builder.append(line);
            builder.append("\n");
        }
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    String html = builder.toString();
    System.out.println("HTML " + html);
    return html;
}

I don't understand why it doesn't work with the URL I mentioned above.

Any help would be appreciated.

Asked 2010-08-04 by bits

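That site is incorrectly gzipping the response regardless of the client's capabilities. Normally, a server should only gzip the response when the client has said it supports it (by sending Accept-Encoding: gzip). You need to decompress it using GZIPInputStream.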

reader = new BufferedReader(new InputStreamReader(new GZIPInputStream(connection.getInputStream()), "UTF-8"));

Note that I also added the correct charset to the InputStreamReader constructor. Normally you would want to extract it from the Content-Type header of the response.
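The two points above (decompress only when the body really is gzipped, and take the charset from Content-Type) could be combined roughly as follows. This is only a sketch under some assumptions: the class and method names (GzipAwareReader, maybeGunzip, charsetOf, readBody) are made up for illustration, gzip is detected by sniffing the two magic bytes 0x1f 0x8b rather than by trusting response headers, and UTF-8 is assumed as the fallback when no usable charset is declared. The main method builds a gzipped body in memory so it runs without touching the network.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper class; none of these names come from the original answer.
class GzipAwareReader {

    /** Wraps the stream in GZIPInputStream only if it starts with the gzip magic bytes. */
    public static InputStream maybeGunzip(InputStream raw) throws IOException {
        PushbackInputStream in = new PushbackInputStream(raw, 2);
        byte[] head = new byte[2];
        int n = 0;
        while (n < 2) {                       // read up to two bytes, tolerating short reads
            int r = in.read(head, n, 2 - n);
            if (r < 0) break;
            n += r;
        }
        if (n > 0) in.unread(head, 0, n);     // push the peeked bytes back for the real reader
        boolean gzipped = n == 2 && (head[0] & 0xff) == 0x1f && (head[1] & 0xff) == 0x8b;
        return gzipped ? new GZIPInputStream(in) : in;
    }

    /** Extracts the charset parameter from a Content-Type value; assumes UTF-8 if absent or invalid. */
    public static Charset charsetOf(String contentType) {
        if (contentType != null) {
            for (String part : contentType.split(";")) {
                String p = part.trim();
                if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                    String name = p.substring("charset=".length()).trim().replace("\"", "");
                    try {
                        return Charset.forName(name);
                    } catch (IllegalArgumentException e) {
                        break;                // unknown or malformed name: use the fallback
                    }
                }
            }
        }
        return StandardCharsets.UTF_8;
    }

    /** Reads the whole body as text, decompressing and decoding as needed. */
    public static String readBody(InputStream raw, String contentType) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(maybeGunzip(raw), charsetOf(contentType)))) {
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line).append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Simulate a server that gzips its response without being asked to.
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            gz.write("<html>caf\u00e9</html>".getBytes(StandardCharsets.UTF_8));
        }
        String html = readBody(new ByteArrayInputStream(buf.toByteArray()),
                "text/html; charset=UTF-8");
        System.out.print(html);
    }
}
```

With a real connection, the call would presumably look like readBody(connection.getInputStream(), connection.getContentType()). Sniffing the magic bytes means the same code should keep working whether or not a given server compresses its output.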

For more hints, see also "How to use URLConnection to fire and handle HTTP requests?" If all you really want is to parse or extract information from the HTML, then I strongly recommend using an HTML parser such as Jsoup.

Answered 2010-08-04 14:06:46 by BalusC

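Wow, it worked. Thanks for the explanation, and thanks for the snippet. I initially tried using HTMLCleaner as my parser, but I ran into the same problem. Now I will feed this HTML string to HTMLCleaner. – bits 2010-08-04 14:20:06

You're welcome. – BalusC 2010-08-04 14:20:35

By the way, jsoup (1.3.1) now handles gzipped output correctly when using Jsoup.connect(url).get(). – 2010-08-23 10:20:50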
