2014-10-19 94 views

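Downloading an entire web page

An entire web page can be downloaded with HTMLEditorKit. However, I need to download a page that has to be scrolled before all of its content is loaded. This behaviour is usually implemented with JavaScript combined with Ajax.

Q1: Is there a way to trick such a page, using only Java code, into delivering its full content for download?

Q2: If this is not possible with Java alone, can it be done in combination with JavaScript?

For reference, this is what I have written so far: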

import java.awt.image.BufferedImage;
import java.io.BufferedReader;
import java.io.File;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import javax.imageio.ImageIO;
import javax.swing.text.AttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class PageDownload {

    public static void main(String[] args) throws Exception {
        String webUrl = "...";
        URL url = new URL(webUrl);
        URLConnection connection = url.openConnection();

        HTMLEditorKit htmlKit = new HTMLEditorKit();
        HTMLDocument htmlDoc = (HTMLDocument) htmlKit.createDefaultDocument();
        HTMLEditorKit.Parser parser = new ParserDelegator();
        HTMLEditorKit.ParserCallback callback = htmlDoc.getReader(0);

        // try-with-resources closes the reader and the underlying stream
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(connection.getInputStream()))) {
            parser.parse(br, callback, true);
        }

        // Walk over every <img> tag found in the static HTML
        for (HTMLDocument.Iterator iterator = htmlDoc.getIterator(HTML.Tag.IMG);
                iterator.isValid(); iterator.next()) {
            AttributeSet attributes = iterator.getAttributes();
            String imgSrc = (String) attributes.getAttribute(HTML.Attribute.SRC);
            if (imgSrc != null && (imgSrc.endsWith(".jpg") || imgSrc.endsWith(".jpeg")
                    || imgSrc.endsWith(".png") || imgSrc.endsWith(".ico")
                    || imgSrc.endsWith(".bmp"))) {
                try {
                    downloadImage(webUrl, imgSrc);
                } catch (IOException ex) {
                    System.out.println(ex.getMessage());
                }
            }
        }
    }

    private static void downloadImage(String pageUrl, String imgSrc) throws IOException {
        // Resolve the src against the page URL; plain string concatenation
        // breaks for paths such as "/img/a.jpg" or "../a.jpg"
        URL imageUrl = new URL(new URL(pageUrl), imgSrc);
        String fileName = imgSrc.substring(imgSrc.lastIndexOf("/") + 1);
        String imageFormat = fileName.substring(fileName.lastIndexOf(".") + 1);
        String imgPath = "..." + fileName;

        // ImageIO.read returns null when no registered reader understands the format
        BufferedImage image = ImageIO.read(imageUrl);
        if (image != null) {
            ImageIO.write(image, imageFormat, new File(imgPath));
        }
    }

}
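
A side note on the image paths: a relative src has to be resolved against the URL of the page it came from, not appended to it as a plain string. The following is only a minimal sketch of that resolution step; the class name `ImageUrlResolver` and the example.com addresses are made up for illustration and are not part of the code above.

```java
import java.net.MalformedURLException;
import java.net.URL;

final class ImageUrlResolver {

    /** Resolves an img src (absolute or relative) against the URL of its page. */
    static URL resolve(String pageUrl, String imgSrc) throws MalformedURLException {
        // The two-argument constructor treats the first URL as the context
        // and applies relative-reference resolution to the second argument.
        return new URL(new URL(pageUrl), imgSrc);
    }

    public static void main(String[] args) throws MalformedURLException {
        // Hypothetical page address, used only to show the three cases
        String page = "http://example.com/gallery/index.html";
        System.out.println(resolve(page, "img/a.jpg"));                    // relative to the page's directory
        System.out.println(resolve(page, "/static/b.png"));                // relative to the host root
        System.out.println(resolve(page, "http://cdn.example.com/c.gif")); // already absolute
    }
}
```

With this in place, the `startsWith("http")` check is no longer needed, because an absolute src simply replaces the context URL.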

Could you give an example of such a website/page? – 2014-10-24 22:33:35

Answers

1

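Yes, you can trick a web page into being downloaded to your local machine with Java code. You cannot download static HTML content through JavaScript: JavaScript does not let you create files the way Java does.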

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;


public class HttpDownloadUtility {
    private static final int BUFFER_SIZE = 4096;

    /**
     * Downloads a file from a URL
     * @param fileURL HTTP URL of the file to be downloaded
     * @param saveDir path of the directory to save the file
     * @throws IOException
     */
    public static void downloadFile(String fileURL, String saveDir)
            throws IOException {
        URL url = new URL(fileURL);
        HttpURLConnection httpConn = (HttpURLConnection) url.openConnection();
        try {
            int responseCode = httpConn.getResponseCode();

            // always check HTTP response code first
            if (responseCode == HttpURLConnection.HTTP_OK) {
                String fileName = "";
                String disposition = httpConn.getHeaderField("Content-Disposition");
                String contentType = httpConn.getContentType();
                int contentLength = httpConn.getContentLength();

                if (disposition != null) {
                    // extracts file name from header field (indexOf returns -1 if absent)
                    int index = disposition.indexOf("filename=");
                    if (index >= 0) {
                        // skip "filename=" (9 characters) and drop optional quotes
                        fileName = disposition.substring(index + 9).replace("\"", "").trim();
                    }
                }
                if (fileName.isEmpty()) {
                    // no usable header: extracts file name from URL
                    fileName = fileURL.substring(fileURL.lastIndexOf("/") + 1);
                }
                if (fileName.isEmpty()) {
                    // URL ends with "/": fall back to an assumed default name
                    fileName = "index.html";
                }

                System.out.println("Content-Type = " + contentType);
                System.out.println("Content-Disposition = " + disposition);
                System.out.println("Content-Length = " + contentLength);
                System.out.println("fileName = " + fileName);

                String saveFilePath = saveDir + File.separator + fileName;

                // try-with-resources closes both streams even if the copy fails
                try (InputStream inputStream = httpConn.getInputStream();
                     FileOutputStream outputStream = new FileOutputStream(saveFilePath)) {
                    byte[] buffer = new byte[BUFFER_SIZE];
                    int bytesRead;
                    while ((bytesRead = inputStream.read(buffer)) != -1) {
                        outputStream.write(buffer, 0, bytesRead);
                    }
                }

                System.out.println("File downloaded");
            } else {
                System.out.println("No file to download. Server replied HTTP code: " + responseCode);
            }
        } finally {
            httpConn.disconnect();
        }
    }
}
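
One detail worth isolating is the file-name extraction: servers send the `filename` parameter of `Content-Disposition` both quoted and unquoted, so fixed offsets are fragile. Below is a minimal sketch of a more tolerant parser. The class name `DispositionFileName` is hypothetical, and the sketch deliberately ignores the extended `filename*=` form and quoted values that themselves contain a semicolon.

```java
import java.util.Locale;

final class DispositionFileName {

    /** Returns the filename parameter of a Content-Disposition value, or null if absent. */
    static String parse(String disposition) {
        if (disposition == null) {
            return null;
        }
        // Parameters are separated by ';', e.g.  attachment; filename="a.pdf"
        for (String part : disposition.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("filename=")) {
                String value = p.substring("filename=".length()).trim();
                // Strip one pair of surrounding double quotes, if present
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value.isEmpty() ? null : value;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(parse("attachment; filename=\"report.pdf\""));
        System.out.println(parse("attachment; filename=report.pdf"));
        System.out.println(parse("inline"));
    }
}
```

A `null` result would be the signal to fall back to the last path segment of the URL, as the utility does when the header is missing.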

Insanovation, does my answer make sense for your question? – UtkarshBhavsar 2014-10-27 10:50:07

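I'm really busy right now, but I will get back to this question as soon as possible (within 7 hours). Once I have studied your proposed solution, your help will be rewarded. Thanks for your understanding. – Insanovation 2014-10-27 13:07:58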

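Great, it works. However, I tested it on 9gag.com and it did not download the whole content. If you scroll 9gag for about 30 seconds you reach the bottom of the page. Up to that point there are many images, and their names, ending in .jpg or .gif, do not appear in the file downloaded by your code. I think your approach may be the only one offered here... If no more effective code is posted, the bounty will go to you. Thanks. – Insanovation 2014-10-27 20:58:12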
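
The missing images are the ones the page requests only while it is being scrolled, so a single GET never sees them. One way to approach this with Java alone is to request those follow-up chunks directly, one after another. The sketch below shows only the shape of such a loop and rests on loud assumptions: the `data-next` attribute that names the next chunk is invented for illustration (the real request has to be looked up in the browser's network tab and will differ per site), the class name `PagedImageCollector` is hypothetical, and the regular expressions are a rough stand-in for a real HTML parser.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PagedImageCollector {

    // Captures the src value of <img> tags (skips attributes such as data-src)
    private static final Pattern IMG_SRC = Pattern.compile(
            "<img[^>]*\\ssrc=[\"']([^\"']+)[\"']", Pattern.CASE_INSENSITIVE);

    // HYPOTHETICAL marker: assumes each chunk names the next one as data-next="URL"
    private static final Pattern NEXT = Pattern.compile("data-next=[\"']([^\"']+)[\"']");

    /**
     * Follows the chain of chunks starting at startUrl and collects image sources.
     * fetch maps a URL to its HTML (or null); maxPages bounds the loop.
     */
    static List<String> collect(String startUrl, Function<String, String> fetch, int maxPages) {
        List<String> images = new ArrayList<>();
        String url = startUrl;
        for (int page = 0; url != null && page < maxPages; page++) {
            String html = fetch.apply(url);
            if (html == null) {
                break; // nothing returned for this chunk
            }
            Matcher img = IMG_SRC.matcher(html);
            while (img.find()) {
                images.add(img.group(1));
            }
            Matcher next = NEXT.matcher(html);
            url = next.find() ? next.group(1) : null; // stop when no next chunk is advertised
        }
        return images;
    }

    public static void main(String[] args) {
        // In-memory stand-in for the network, so the sketch runs offline
        Map<String, String> fake = new HashMap<>();
        fake.put("page1", "<img src=\"a.jpg\"><div data-next=\"page2\"></div>");
        fake.put("page2", "<img class='x' src='b.gif'>");
        System.out.println(collect("page1", fake::get, 10));
    }
}
```

In real use, `fetch` would wrap an HTTP GET such as the one in the answer, and each collected src could then be handed to the download routine. If the site builds its requests entirely in JavaScript and no paged URL can be identified, this approach does not apply.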