How to get a web page's source code from Java

I just want to get the source code of any web page from Java. I have found many solutions so far, but I could not find any code that works for all of the links below:

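For me, the main problem is that a given piece of code retrieves the source of some web pages but fails on others. For example, the code below does not work for the first link.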

InputStream is = fURL.openStream(); // fURL can be any one of the links above
StringBuilder builder = new StringBuilder();
BufferedReader buffer = new BufferedReader(new InputStreamReader(is, "iso-8859-9"));

int charRead; // read() returns one character at a time, or -1 at end of stream
while ((charRead = buffer.read()) != -1) {
    builder.append((char) charRead);
}
buffer.close();
System.out.println(builder.toString());


2011-12-23 brtb

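Note that you can only get the source that is initially delivered when the URL is opened. Additional content may be loaded via AJAX, and you will not see it if you only read the initial stream. For example, open http://demo.vaadin.com/sampler in Firefox and then view the page source. You will not see the source of everything that is displayed. – Thomas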

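@cerq: Depending on your definition of "web page source code", you may or may not be able to do this. For example, one could argue that the "source code" of a web page generated by a .jsp is the .jsp file itself, not the generated HTML. What you want is the HTML, not the "source code". In many cases the "source code" lives on the server and, short of hacking the server, you simply cannot access it. – TacticalCoder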

@Thomas I think my problem is exactly what you describe. So is there any way to get the source of all the displayed content? – brtb

Try the following code, which adds a request property:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

public class SocketConnection
{
    public static String getURLSource(String url) throws IOException
    {
        URL urlObject = new URL(url);
        URLConnection urlConnection = urlObject.openConnection();
        // Some servers reject requests that do not look like they come from a browser.
        urlConnection.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.95 Safari/537.11");

        return toString(urlConnection.getInputStream());
    }

    // Reads the whole stream as UTF-8 text; the reader (and the stream) is closed on exit.
    private static String toString(InputStream inputStream) throws IOException
    {
        try (BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(inputStream, StandardCharsets.UTF_8)))
        {
            String inputLine;
            StringBuilder stringBuilder = new StringBuilder();
            while ((inputLine = bufferedReader.readLine()) != null)
            {
                // readLine() strips the line terminator, so add one back.
                stringBuilder.append(inputLine).append('\n');
            }

            return stringBuilder.toString();
        }
    }
}
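The code above always decodes the response as UTF-8, while the question's snippet used iso-8859-9. A minimal sketch of one way to pick the charset from the response's Content-Type header instead, assuming the server actually declares one there (some pages declare it only in a meta tag, which this does not handle). The class name CharsetSniffer and the method charsetFromContentType are hypothetical names for illustration, not part of any answer in this thread.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

public class CharsetSniffer {

    // Hypothetical helper: extracts the charset from a Content-Type header value
    // such as "text/html; charset=ISO-8859-9". Falls back to UTF-8 (an assumption)
    // when the header is missing or names a charset this JVM does not know.
    public static Charset charsetFromContentType(String contentType) {
        if (contentType != null) {
            for (String part : contentType.split(";")) {
                String trimmed = part.trim();
                if (trimmed.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                    String name = trimmed.substring("charset=".length()).trim().replace("\"", "");
                    try {
                        return Charset.forName(name);
                    } catch (IllegalArgumentException e) {
                        // Illegal or unsupported charset name: use the default below.
                    }
                }
            }
        }
        return StandardCharsets.UTF_8;
    }
}
```

With this in place, the reader could be built as new InputStreamReader(urlConnection.getInputStream(), CharsetSniffer.charsetFromContentType(urlConnection.getContentType())).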


2011-12-23 13:46:47

Neither your code nor the code I wrote works for the link http://www.cumhuriyet.com.tr?hn=298710. Please test your code first. – brtb

System.out.println(getUrlSource("http://cumhuriyet.com.tr/?hn=298710")); works fine. –

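// Needs java.io.BufferedReader, java.io.InputStreamReader and java.net.URL,
// and must run inside a method that declares "throws IOException".
URL yahoo = new URL("http://www.yahoo.com/");
BufferedReader in = new BufferedReader(
        new InputStreamReader(
                yahoo.openStream()));

String inputLine;

// Print the response line by line (decoded with the platform default charset).
while ((inputLine = in.readLine()) != null)
    System.out.println(inputLine);

in.close();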


2011-12-23 13:51:54 subodh

I don't want code that works for yahoo.com or google.com. Please read my post again. – brtb

I'm sure you found a solution somewhere in the past 2 years, but below is a working solution for the site you asked about:

package javasandbox;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.zip.GZIPInputStream;

/**
 *
 * @author Ryan.Oglesby
 */
public class JavaSandbox {

    private static String sURL;

    /**
     * @param args the command line arguments
     */
    public static void main(String[] args) throws IOException {
        sURL = "http://www.cumhuriyet.com.tr/?hn=298710";
        System.out.println(sURL);
        URL url = new URL(sURL);
        HttpURLConnection httpCon = (HttpURLConnection) url.openConnection();
        // Set HTTP request headers so the request resembles one sent by a browser.
        // HttpURLConnection may ignore restricted headers such as Host and Connection.
        httpCon.addRequestProperty("Host", "www.cumhuriyet.com.tr");
        httpCon.addRequestProperty("Connection", "keep-alive");
        httpCon.addRequestProperty("Cache-Control", "max-age=0");
        httpCon.addRequestProperty("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8");
        httpCon.addRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36");
        // Advertise only gzip, because that is the only encoding decoded below.
        httpCon.addRequestProperty("Accept-Encoding", "gzip");
        httpCon.addRequestProperty("Accept-Language", "en-US,en;q=0.8");
        //httpCon.addRequestProperty("Cookie", "JSESSIONID=EC0F373FCC023CD3B8B9C1E2E2F7606C; lang=tr; __utma=169322547.1217782332.1386173665.1386173665.1386173665.1; __utmb=169322547.1.10.1386173665; __utmc=169322547; __utmz=169322547.1386173665.1.1.utmcsr=stackoverflow.com|utmccn=(referral)|utmcmd=referral|utmcct=/questions/8616781/how-to-get-a-web-pages-source-code-from-java; __gads=ID=3ab4e50d8713e391:T=1386173664:S=ALNI_Mb8N_wW0xS_wRa68vhR0gTRl8MwFA; scrElm=body");
        httpCon.setInstanceFollowRedirects(false); // redirects are not followed for this connection
        httpCon.setUseCaches(true);

        httpCon.setRequestMethod("GET");

        // Unwrap the body if the server answered with gzip compression.
        InputStream body = httpCon.getInputStream();
        if ("gzip".equalsIgnoreCase(httpCon.getContentEncoding())) {
            body = new GZIPInputStream(body);
        }

        BufferedReader in = new BufferedReader(new InputStreamReader(body, "UTF-8"));
        String inputLine;
        StringBuilder a = new StringBuilder();
        while ((inputLine = in.readLine()) != null)
            a.append(inputLine).append('\n'); // readLine() strips the line terminator
        in.close();

        System.out.println(a.toString());

        httpCon.disconnect();
    }
}


2013-12-04 16:29:41 Roglesby

It's never too late to help. But I tried your code, and it doesn't work on many web pages. –

I agree that this won't work for every web page, because different pages return data in different formats, and in some cases you may need to follow redirects. In some cases you may receive the response gzip-compressed, which you can handle as follows: InputStream gzippedResponse = httpCon.getInputStream(); InputStream ungzippedResponse = new GZIPInputStream(gzippedResponse); InputStreamReader reader = new InputStreamReader(ungzippedResponse, "UTF-8"); StringWriter writer = new StringWriter(); – Roglesby
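The two cases mentioned in the comment above, gzip-compressed responses and redirects, can be sketched as follows. This is an illustrative sketch under stated assumptions, not the answerer's code: the class FetchHelpers and the methods readBody and openFollowingRedirects are hypothetical names, the body is assumed to be UTF-8 text, and only the gzip encoding is handled. The redirect loop is written by hand because HttpURLConnection does not automatically follow redirects that switch between http and https.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class FetchHelpers {

    // Hypothetical helper: reads the whole body, un-gzipping it only when the
    // Content-Encoding header value is "gzip". Text is decoded as UTF-8 (an assumption).
    public static String readBody(InputStream raw, String contentEncoding) throws IOException {
        InputStream in = "gzip".equalsIgnoreCase(contentEncoding) ? new GZIPInputStream(raw) : raw;
        try (InputStream body = in) {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            byte[] chunk = new byte[8192];
            int n;
            while ((n = body.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
            return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
        }
    }

    // Hypothetical helper: follows 3xx responses by hand, up to maxHops redirects,
    // and returns the first connection whose response is not a redirect.
    public static HttpURLConnection openFollowingRedirects(String address, int maxHops) throws IOException {
        URL url = new URL(address);
        for (int hop = 0; hop <= maxHops; hop++) {
            HttpURLConnection con = (HttpURLConnection) url.openConnection();
            con.setInstanceFollowRedirects(false);
            con.setRequestProperty("Accept-Encoding", "gzip");
            int status = con.getResponseCode();
            String location = con.getHeaderField("Location");
            if (status < 300 || status >= 400 || location == null) {
                return con;
            }
            url = new URL(url, location); // resolves relative Location values too
            con.disconnect();
        }
        throw new IOException("Too many redirects for " + address);
    }

    // Offline demonstration: gzip a string in memory and decode it again.
    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream compressed = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(compressed)) {
            gzip.write("<html>merhaba</html>".getBytes(StandardCharsets.UTF_8));
        }
        System.out.println(readBody(new ByteArrayInputStream(compressed.toByteArray()), "gzip"));
    }
}
```

Against a live site the two helpers would be combined as HttpURLConnection con = FetchHelpers.openFollowingRedirects(sURL, 5); followed by FetchHelpers.readBody(con.getInputStream(), con.getContentEncoding()). Whether that is enough for a particular page still depends on what the server sends.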
