2012-09-02

Java: downloading web page content from Google over HTTPS

I want to download the content of a Google HTTPS web page from the link below.

link to download

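With the code below, I first disable certificate validation for testing purposes and trust all certificates, then download the page as I would over regular HTTP, but for some reason it does not succeed: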

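import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.security.GeneralSecurityException;
import java.security.cert.X509Certificate;
import javax.net.ssl.HttpsURLConnection;
import javax.net.ssl.SSLContext;
import javax.net.ssl.TrustManager;
import javax.net.ssl.X509TrustManager;

public static void downloadWeb() {
    // Create a trust manager that trusts all certificates.
    // For testing only: this disables HTTPS certificate verification.
    TrustManager[] trustAllCerts = new TrustManager[] { new X509TrustManager() {
        public X509Certificate[] getAcceptedIssuers() {
            return new X509Certificate[0];
        }

        public void checkClientTrusted(X509Certificate[] certs, String authType) {
        }

        public void checkServerTrusted(X509Certificate[] certs, String authType) {
        }
    } };

    // Activate the new trust manager
    try {
        SSLContext sc = SSLContext.getInstance("SSL");
        sc.init(null, trustAllCerts, new java.security.SecureRandom());
        HttpsURLConnection.setDefaultSSLSocketFactory(sc.getSocketFactory());
    } catch (GeneralSecurityException e) {
        e.printStackTrace(); // report setup failures instead of hiding them
    }

    // Begin download as with regular HTTP
    String wordAddress = "https://www.google.com/webhp?hl=en&tab=ww#hl=en&tbs=dfn:1&sa=X&ei=obxCUKm7Ic3GqAGvoYGIBQ&ved=0CDAQBSgA&q=pronunciation&spell=1&bav=on.2,or.r_gc.r_pw.r_cp.r_qf.&fp=c5bfe0fbd78a3271&biw=1024&bih=759";
    try {
        URLConnection yc = new URL(wordAddress).openConnection();
        // try-with-resources closes the reader even if reading fails
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(yc.getInputStream()))) {
            String inputLine;
            while ((inputLine = in.readLine()) != null) {
                System.out.println(inputLine); // print the page content, not the URL
            }
        }
    } catch (IOException e) {
        e.printStackTrace(); // an empty catch block would hide the reason for the failure
    }
}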
Do you have to use Java?

Yes, but do you have suggestions for other languages? – DavidNg

If I did not have to use Java, I would use 'wget' or 'cURL' and write a shell script (or batch file).

Answer

You need to fake the HTTP headers so that Google thinks you are downloading the page from a web browser. Here is sample code using HttpClient:

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import org.apache.http.HttpResponse;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;

public class App1 {

    public static void main(String[] args) throws IOException {
        // DefaultHttpClient is the HttpClient 4.x entry point (deprecated since 4.3)
        HttpClient httpclient = new DefaultHttpClient();
        HttpGet httpget = new HttpGet("http://_google_url_");
        // Present the request as coming from a desktop browser
        httpget.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 5.1; rv:8.0) Gecko/20100101 Firefox/8.0");
        HttpResponse response = httpclient.execute(httpget);
        // Save the response body to google.html in the working directory
        File file = new File("google.html");
        FileOutputStream fout = null;
        try {
            fout = new FileOutputStream(file);
            response.getEntity().writeTo(fout);
        } finally {
            if (fout != null) {
                fout.close();
            }
        }
    }
}

Warning: I am not responsible if you use this code and violate Google's terms of service.
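The same header trick can also be tried with the JDK's own `HttpURLConnection`, for readers who would rather not add the Apache libraries. The following is only a minimal sketch: the class name `UserAgentDownload` and its helper methods are made up for illustration, the response is assumed to be UTF-8 text, and whether Google accepts the request is not guaranteed.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the original answer.
class UserAgentDownload {

    // Same browser-like User-Agent string as in the HttpClient example.
    static final String USER_AGENT =
            "Mozilla/5.0 (Windows NT 5.1; rv:8.0) Gecko/20100101 Firefox/8.0";

    // Prepares the request with the User-Agent header set; does not connect yet.
    static HttpURLConnection open(String address) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(address).toURL().openConnection();
        conn.setRequestProperty("User-Agent", USER_AGENT);
        return conn;
    }

    // Connects and returns the response body as text (assumes UTF-8).
    static String download(String address) throws IOException {
        HttpURLConnection conn = open(address);
        StringBuilder body = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                body.append(line).append('\n');
            }
        } finally {
            conn.disconnect();
        }
        return body.toString();
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java UserAgentDownload <url>");
            return;
        }
        System.out.println(download(args[0]));
    }
}
```

Keeping `open` separate from `download` makes it possible to inspect the request headers without touching the network.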

Thanks, I just have a question. Do I need to install something from Apache? – DavidNg

Download HttpClient and HttpCore from the download link on the page linked above. – gigadot

I can now get content from Google, but what I want is the Dictionary page. For example, the dictionary page for the word "pronunciation": https://www.google.com/search?q=download+pronunciation+google+java&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-US:official&client=firefox-a#hl=en&client=firefox-a&rls=org.mozilla:en-US:official&q=pronunciation&tbs=dfn:1&tbo=u&sa=X&ei=wKdCUPnfJ4nZqgHi1IGADw&ved=0CB0QkQ4&fp=1&biw=1280&bih=721&bav=on.2,or.r_gc.r_pw.r_cp.r_qf.&cad=b&sei=IKlCUMbtOIGGqgH9xYGoDA – DavidNg
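One possible reason the Dictionary page does not come back: in the URL above, the dictionary parameters (such as `tbs` with the value `dfn:1`) come after the `#`. That part is the URL fragment, which a browser keeps to itself and an HTTP client never sends to the server. The sketch below shows what actually gets requested and one way to move the fragment into the query string. The class `FragmentToQuery`, its method names, and the shortened example URL are made up for illustration, and it is an untested assumption that Google's `/search` endpoint honours these parameters when they arrive as a query.

```java
import java.net.URI;

// Hypothetical helper class for illustration only.
class FragmentToQuery {

    // Returns the path and query that an HTTP client sends; the fragment is dropped.
    static String requestTarget(String address) {
        URI u = URI.create(address);
        String query = u.getRawQuery();
        return u.getRawPath() + (query == null ? "" : "?" + query);
    }

    // Builds a /search URL whose query string is the original fragment.
    // Assumption: the server accepts the same parameters as a query.
    static String moveFragmentToQuery(String address) {
        URI u = URI.create(address);
        String fragment = u.getRawFragment();
        if (fragment == null) {
            return address; // nothing to move
        }
        return u.getScheme() + "://" + u.getRawAuthority() + "/search?" + fragment;
    }

    public static void main(String[] args) {
        // Shortened stand-in for the long URLs in the thread.
        String address = "https://www.google.com/webhp?hl=en#q=pronunciation&tbs=dfn:1";
        System.out.println(requestTarget(address));
        System.out.println(moveFragmentToQuery(address));
    }
}
```

The rewritten URL can then be fetched with the same browser-like User-Agent header as in the answer.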