2011-12-23 113 views
9

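How to get a web page's source code from Java

I just want to get the source code of any web page from Java. I have found many solutions so far, but I could not find any code that works for all of the links below:

For me, the main problem is that some code retrieves the source of some web pages but fails on others. For example, the code below does not work for the first link.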

InputStream is = fURL.openStream(); // fURL is a java.net.URL for one of the links above
StringBuilder builder = new StringBuilder();
BufferedReader buffer = new BufferedReader(new InputStreamReader(is, "iso-8859-9"));

int charRead; // read() returns one character (not a byte), or -1 at end of stream
while ((charRead = buffer.read()) != -1) {
    builder.append((char) charRead);
}
buffer.close();
System.out.println(builder.toString());
+1

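Note that you only get the source that is initially delivered when the URL is opened. Additional content may be loaded via AJAX, and you will not see it if you only read the initial stream. For example, open http://demo.vaadin.com/sampler in Firefox and then view the page source: you will not see the source for everything that is displayed. – Thomas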

+0

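@cerq: Depending on your definition of "web page source code", you may or may not be able to do this. For example, one could argue that the "source code" of a web page generated by a .jsp is the .jsp file itself, not the generated HTML... What you are after is the HTML, not the "source code". In many cases the "source code" lives on the server, and short of breaking into the server, you simply cannot access it. – TacticalCoder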

+0

@Thomas I think my problem is related to what you describe. So is there any way to get the source of all the displayed content? – brtb

Answers

22

Try the following code, which adds a request property:

import java.io.BufferedReader; 
import java.io.IOException; 
import java.io.InputStream; 
import java.io.InputStreamReader; 
import java.net.URL; 
import java.net.URLConnection; 

public class SocketConnection 
{ 
    public static String getURLSource(String url) throws IOException 
    { 
     URL urlObject = new URL(url); 
     URLConnection urlConnection = urlObject.openConnection(); 
     urlConnection.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.95 Safari/537.11"); 

     return toString(urlConnection.getInputStream()); 
    } 

    private static String toString(InputStream inputStream) throws IOException 
    { 
     try (BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(inputStream, "UTF-8"))) 
     { 
      String inputLine; 
      StringBuilder stringBuilder = new StringBuilder(); 
      while ((inputLine = bufferedReader.readLine()) != null) 
      { 
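       // readLine() strips line terminators, so add one back to preserve the page's line breaks
       stringBuilder.append(inputLine).append('\n');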
      } 

      return stringBuilder.toString(); 
     } 
    } 
} 
+0

Neither your code nor the code I wrote works for the link http://www.cumhuriyet.com.tr?hn=298710. Please test your code first. – brtb

+2

System.out.println(getUrlSource("http://cumhuriyet.com.tr/?hn=298710")); works fine. –

1
URL yahoo = new URL("http://www.yahoo.com/"); 
BufferedReader in = new BufferedReader(
      new InputStreamReader(
      yahoo.openStream())); 

String inputLine; 

while ((inputLine = in.readLine()) != null) 
    System.out.println(inputLine); 

in.close(); 
+0

I don't want code that works for yahoo.com or google.com. Please read my post again. – brtb

3

I'm sure you found a solution at some point in the past 2 years, but below is a solution that works for the site you asked about.

package javasandbox; 

import java.io.BufferedReader; 
import java.io.IOException; 
import java.io.InputStreamReader; 
import java.net.HttpURLConnection; 
import java.net.MalformedURLException; 
import java.net.URL; 

/** 
* 
* @author Ryan.Oglesby 
*/ 
public class JavaSandbox { 

private static String sURL; 

/** 
* @param args the command line arguments 
*/ 
public static void main(String[] args) throws MalformedURLException, IOException { 
    sURL = "http://www.cumhuriyet.com.tr/?hn=298710"; 
    System.out.println(sURL); 
    URL url = new URL(sURL); 
    HttpURLConnection httpCon = (HttpURLConnection) url.openConnection(); 
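    // Set HTTP request headers to mimic a browser.
    // Note: the JDK's HttpURLConnection treats "Host" and "Connection" as restricted headers
    // and, by default, may silently ignore values set for them here.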
      httpCon.addRequestProperty("Host", "www.cumhuriyet.com.tr"); 
      httpCon.addRequestProperty("Connection", "keep-alive"); 
      httpCon.addRequestProperty("Cache-Control", "max-age=0"); 
      httpCon.addRequestProperty("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8"); 
      httpCon.addRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36"); 
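      // Advertising gzip means the server may send a compressed body, which the plain UTF-8 reader below would not decode.
      httpCon.addRequestProperty("Accept-Encoding", "gzip,deflate,sdch"); 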
      httpCon.addRequestProperty("Accept-Language", "en-US,en;q=0.8"); 
      //httpCon.addRequestProperty("Cookie", "JSESSIONID=EC0F373FCC023CD3B8B9C1E2E2F7606C; lang=tr; __utma=169322547.1217782332.1386173665.1386173665.1386173665.1; __utmb=169322547.1.10.1386173665; __utmc=169322547; __utmz=169322547.1386173665.1.1.utmcsr=stackoverflow.com|utmccn=(referral)|utmcmd=referral|utmcct=/questions/8616781/how-to-get-a-web-pages-source-code-from-java; __gads=ID=3ab4e50d8713e391:T=1386173664:S=ALNI_Mb8N_wW0xS_wRa68vhR0gTRl8MwFA; scrElm=body"); 
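      // Redirects are disabled (globally via the static call, and for this connection), so a 3xx response is returned as-is.
      HttpURLConnection.setFollowRedirects(false); 
      httpCon.setInstanceFollowRedirects(false); 
      // setDoOutput is only needed when sending a request body; it is unnecessary for this GET.
      httpCon.setDoOutput(true); 
      httpCon.setUseCaches(true); 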

      httpCon.setRequestMethod("GET"); 

      BufferedReader in = new BufferedReader(new InputStreamReader(httpCon.getInputStream(), "UTF-8")); 
      String inputLine; 
      StringBuilder a = new StringBuilder(); 
      while ((inputLine = in.readLine()) != null) 
       a.append(inputLine).append('\n'); // readLine() drops the line terminator, so restore it
      in.close(); 

      System.out.println(a.toString()); 

      httpCon.disconnect(); 
} 
} 
+0

It's never too late to help. But I tried your code, and it does not work for many web pages. –

+1

I agree that this will not work for every web page, because different pages return data in different formats, and in some cases you may need to follow redirects. In some cases you may receive a gzip-compressed response, which you can handle as follows: InputStream gzippedResponse = httpCon.getInputStream(); InputStream ungzippedResponse = new GZIPInputStream(gzippedResponse); InputStreamReader reader = new InputStreamReader(ungzippedResponse, "UTF-8"); StringWriter writer = new StringWriter(); – Roglesby
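
The gzip handling mentioned in the comment above can be sketched as a small helper. This is only a minimal sketch, not code from the thread: the class name ResponseDecoder and the methods decodeBody and gzip are hypothetical names chosen for illustration, and it assumes the caller passes in the value of the response's Content-Encoding header and the charset to decode with. The main method round-trips an in-memory gzip payload so the sketch can be run without a network connection.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper class (not from the thread).
public class ResponseDecoder {

    /**
     * Reads a response body as text, unwrapping gzip only when the
     * Content-Encoding header value says the body is gzip-compressed.
     * A null or different header value means the body is read as-is.
     */
    public static String decodeBody(InputStream body, String contentEncoding, Charset charset)
            throws IOException {
        InputStream in = body;
        if (contentEncoding != null && contentEncoding.trim().equalsIgnoreCase("gzip")) {
            in = new GZIPInputStream(body);
        }
        StringBuilder text = new StringBuilder();
        // Reading in char blocks (not readLine) keeps the original line breaks.
        try (Reader reader = new InputStreamReader(in, charset)) {
            char[] buffer = new char[4096];
            int count;
            while ((count = reader.read(buffer)) != -1) {
                text.append(buffer, 0, count);
            }
        }
        return text.toString();
    }

    /** Demo helper: gzip-compresses a string so the sketch can run offline. */
    public static byte[] gzip(String s, Charset charset) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bytes)) {
            out.write(s.getBytes(charset));
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        String html = "<html>\n<body>merhaba</body>\n</html>";
        byte[] compressed = gzip(html, StandardCharsets.UTF_8);
        String decoded = decodeBody(new ByteArrayInputStream(compressed), "gzip", StandardCharsets.UTF_8);
        System.out.println(decoded.equals(html)); // prints true
    }
}
```

With the connection from the answer, a call might look like decodeBody(httpCon.getInputStream(), httpCon.getContentEncoding(), StandardCharsets.UTF_8). The sketch does not handle the deflate or sdch encodings that the answer's Accept-Encoding header also advertises, and UTF-8 is an assumption; a page served in another charset (the question uses iso-8859-9) would need that charset passed instead.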