How to scrape HTML content from an (internal) HTTPS page

I am trying to download the source of a page on my intranet. I can reach every page in the browser without logging in explicitly.

When I run the code below to fetch the page content, I get the error shown after it:

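import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;
import javax.net.ssl.HttpsURLConnection;

public void scrape() throws IOException {

    String httpsURL = "https://myurl.aspx";
    URL myurl = new URL(httpsURL);
    HttpsURLConnection con = (HttpsURLConnection) myurl.openConnection();
    InputStream ins = con.getInputStream(); // breaks here
    InputStreamReader isr = new InputStreamReader(ins);
    BufferedReader in = new BufferedReader(isr);

    String inputLine;

    // Print the page source line by line
    while ((inputLine = in.readLine()) != null) {
        System.out.println(inputLine);
    }

    in.close();
}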

Error: Exception in thread "main" java.io.IOException: Server returned HTTP response code: 500 for URL: https://myurl.aspx

It breaks specifically on this line -> InputStream ins = con.getInputStream();

I don't know how to fix this. Any ideas?

2012-04-30 rockit

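You are evidently sending invalid query parameters, an invalid request body, or an invalid URI. Open the page as you normally would, with the Chrome debugger or Firebug enabled, and look at exactly which URL it hits, which parameters and headers it sends, and what is in the request body. You can also try reading the body of the 500 response to see whether it contains anything useful. – nsfyn55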
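
As a rough sketch of that last suggestion: `getInputStream()` throws an `IOException` on a 500, but `getErrorStream()` still exposes whatever body the server sent. The class name `ErrorBodyReader` and the helpers `readAll` and `fetch` are made up for illustration, and UTF-8 is an assumed encoding.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the original question or answer.
final class ErrorBodyReader {

    // Reads a stream to the end; returns "" when there is no stream.
    static String readAll(InputStream in) throws IOException {
        if (in == null) {
            return "";
        }
        StringBuilder body = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                body.append(line).append('\n');
            }
        }
        return body.toString();
    }

    // Returns the page body on success, or the server's error body on 4xx/5xx.
    static String fetch(String address) throws IOException {
        HttpURLConnection con = (HttpURLConnection) new URL(address).openConnection();
        int status = con.getResponseCode(); // returns the code instead of throwing on a 500
        if (status >= 400) {
            // getErrorStream() may be null if the server sent no body
            return "HTTP " + status + "\n" + readAll(con.getErrorStream());
        }
        return readAll(con.getInputStream());
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java ErrorBodyReader <url>");
            return;
        }
        System.out.print(fetch(args[0]));
    }
}
```

ASP.NET pages often put the exception message in the 500 body, so printing it is usually the quickest way to see what the server objected to.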

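The first thing to do, as nsfyn55's comment says, is to inspect the headers with your browser. Some sites check the User-Agent HTTP header before returning a response. The second thing is that, when using HTTPS, you need to initialize the security layer properly. Have a look at this class: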

import java.security.KeyManagementException;
import java.security.NoSuchAlgorithmException;
import java.security.SecureRandom;
import java.security.cert.X509Certificate;
import javax.net.ssl.HostnameVerifier;
import javax.net.ssl.HttpsURLConnection;
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLSession;
import javax.net.ssl.TrustManager;
import javax.net.ssl.X509TrustManager;

public class SSLConfiguration {

    private static boolean isSslInitialized = false;
    private static final String PROTOCOL = "TLS";
    public static boolean ACCEPT_ALL_CERTS = true;

    // Installs the JVM-wide HTTPS defaults once; later calls do nothing.
    public static synchronized void initializeSSLConnection() {
        if (!isSslInitialized) {
            if (ACCEPT_ALL_CERTS) {
                initInsecure();
            } else {
                initSsl();
            }
        }
    }

    // Trusts every certificate and every hostname. Insecure: for testing only.
    private static void initInsecure() {
        TrustManager[] trustAllCerts = new TrustManager[]{
            new X509TrustManager() {

                @Override
                public X509Certificate[] getAcceptedIssuers() {
                    return new X509Certificate[0]; // the contract requires non-null
                }

                @Override
                public void checkClientTrusted(
                        X509Certificate[] certs, String authType) {
                }

                @Override
                public void checkServerTrusted(
                        X509Certificate[] certs, String authType) {
                }
            }
        };

        // Install the all-trusting trust manager
        try {
            SSLContext sc = SSLContext.getInstance(PROTOCOL);
            sc.init(null, trustAllCerts, new SecureRandom());
            HttpsURLConnection.setDefaultSSLSocketFactory(sc.getSocketFactory());
        } catch (NoSuchAlgorithmException | KeyManagementException e) {
            // Fail loudly instead of silently keeping the old defaults
            throw new RuntimeException(e);
        }
        HttpsURLConnection.setDefaultHostnameVerifier(
                new HostnameVerifier() {

                    @Override
                    public boolean verify(String string, SSLSession ssls) {
                        return true;
                    }
                });
        isSslInitialized = true;
    }

    // Uses the JVM's default trust store and checks the peer host name.
    private static void initSsl() {
        SSLContext sc = null;
        try {
            sc = SSLContext.getInstance(PROTOCOL);
        } catch (NoSuchAlgorithmException ex) {
            throw new RuntimeException(ex);
        }
        try {
            sc.init(null, null, new SecureRandom());
        } catch (KeyManagementException ex) {
            throw new RuntimeException(ex);
        }
        HttpsURLConnection.setDefaultSSLSocketFactory(sc.getSocketFactory());
        HostnameVerifier hv = new HostnameVerifier() {

            @Override
            public boolean verify(String urlHostName, SSLSession session) {
                /* This is to avoid spoofing */
                return (urlHostName.equals(session.getPeerHost()));
            }
        };

        HttpsURLConnection.setDefaultHostnameVerifier(hv);
        isSslInitialized = true;
    }
}

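Most likely the connection is failing, especially if the site does not have a proper certificate. In the constructor of your class, insert the following line: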

SSLConfiguration.initializeSSLConnection();
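
If changing the defaults for every HTTPS connection in the JVM is more than you want, the same relaxed checks can be applied to a single connection instead. The following is only a sketch of that variation, not part of the answer's class: `PerConnectionTrust` and `trustAll` are hypothetical names, and like `initInsecure()` it disables certificate and hostname validation entirely, so it is suitable for diagnosing the problem rather than for production use.

```java
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.security.cert.X509Certificate;
import javax.net.ssl.HostnameVerifier;
import javax.net.ssl.HttpsURLConnection;
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLSession;
import javax.net.ssl.SSLSocketFactory;
import javax.net.ssl.TrustManager;
import javax.net.ssl.X509TrustManager;

// Hypothetical helper: relaxes certificate checks for ONE connection only.
final class PerConnectionTrust {

    // Installs a trust-all socket factory and hostname verifier on con,
    // leaving the JVM-wide defaults untouched. Returns the installed factory.
    static SSLSocketFactory trustAll(HttpsURLConnection con)
            throws GeneralSecurityException {
        TrustManager[] trustAllCerts = {
            new X509TrustManager() {
                @Override
                public X509Certificate[] getAcceptedIssuers() {
                    return new X509Certificate[0];
                }

                @Override
                public void checkClientTrusted(X509Certificate[] certs, String authType) {
                }

                @Override
                public void checkServerTrusted(X509Certificate[] certs, String authType) {
                }
            }
        };
        SSLContext context = SSLContext.getInstance("TLS");
        context.init(null, trustAllCerts, new SecureRandom());
        SSLSocketFactory factory = context.getSocketFactory();
        con.setSSLSocketFactory(factory);
        con.setHostnameVerifier(new HostnameVerifier() {
            @Override
            public boolean verify(String hostname, SSLSession session) {
                return true; // accept any host name
            }
        });
        return factory;
    }
}
```

Call `PerConnectionTrust.trustAll(con)` right after `openConnection()` and before `getInputStream()`; both setters must run before the connection is established.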

One more thing to consider: after openConnection, I suggest adding the following:

con.setRequestMethod(METHOD); // METHOD is a placeholder, e.g. "GET"
con.setDoInput(true);
con.setDoOutput(true); // only needed if you intend to write a request body
con.setUseCaches(false);

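I am inclined to believe, however, that since you do get a response from the remote server, the fix is more likely to be sending the right headers, particularly User-Agent and Accept. If none of the above helps, print the stack trace of the error and read the error stream from the remote server to get a more meaningful error message. If you use Firefox, Live HTTP Headers is a very handy add-on. cURL is also a first-rate command-line tool for working with HTTP requests.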
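
A minimal sketch of setting those two headers, assuming a plain GET is what the browser sends. The class `BrowserLikeRequest`, the method `prepare`, the header values, and the 10-second timeouts are all illustrative assumptions; copy the real header values from your browser's network panel.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical helper: prepares a browser-like GET request without sending it.
final class BrowserLikeRequest {

    // Example values only; replace them with what your browser actually sends.
    static final String USER_AGENT = "Mozilla/5.0 (compatible; ExampleScraper/1.0)";
    static final String ACCEPT = "text/html,application/xhtml+xml;q=0.9,*/*;q=0.8";

    static HttpURLConnection prepare(String address) throws IOException {
        // openConnection() only creates the object; nothing is sent yet
        HttpURLConnection con = (HttpURLConnection) new URL(address).openConnection();
        con.setRequestMethod("GET");
        con.setRequestProperty("User-Agent", USER_AGENT);
        con.setRequestProperty("Accept", ACCEPT);
        con.setUseCaches(false);
        con.setConnectTimeout(10_000); // milliseconds
        con.setReadTimeout(10_000);
        return con;
    }
}
```

The request is sent only when you call `getResponseCode()` or `getInputStream()` on the returned connection; if that fails, `getErrorStream()` gives you the server's error body.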

2012-04-30 18:32:39

Thanks for trying, but I still cannot download the page source. – rockit

@rockit If you give me the URL, I can try to find a solution; it should not be hard. Can you post more information? The stack trace of your exception? –

I think I am being blocked outright. Thanks anyway; I accepted your answer because it is technically correct. – rockit
