2012-02-12 71 views

Answers

-1

Just use wget!

Linux wget examples: http://linuxreviews.org/quicktips/wget/

Wget for Windows:

+0

Why the downvote? Wget can be used standalone or invoked from Java, and it provides all the required functionality in one robust, well-tested package. – dotancohen 2012-03-01 10:08:23
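
To illustrate the "invoked from Java" part of the comment above, here is a minimal sketch using `ProcessBuilder`. It assumes that `wget` is installed and on the `PATH`; the class name `WgetRunner`, the example URL, and the target directory are hypothetical and not from the original thread. The flags used (`--recursive`, `--level`, `--no-parent`, `--directory-prefix`) are standard GNU Wget options.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

class WgetRunner {

    /** Builds the wget argument list for a depth-limited recursive download. */
    static List<String> buildCommand(String startUrl, String targetDir, int depth) {
        List<String> cmd = new ArrayList<>();
        cmd.add("wget");
        cmd.add("--recursive");                      // follow links found in downloaded pages
        cmd.add("--level=" + depth);                 // limit the recursion depth
        cmd.add("--no-parent");                      // never ascend above the start directory
        cmd.add("--directory-prefix=" + targetDir);  // directory in which files are stored
        cmd.add(startUrl);
        return cmd;
    }

    /** Runs wget (assumed to be on the PATH) and returns its exit code; 0 means success. */
    static int download(String startUrl, String targetDir, int depth)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(startUrl, targetDir, depth))
                .inheritIO() // forward wget's progress output to our console
                .start();
        return process.waitFor();
    }

    public static void main(String[] args) {
        // Hypothetical URL and directory, for illustration only.
        // This only prints the command; call download(...) to actually run it.
        System.out.println(String.join(" ",
                buildCommand("http://example.com/docs/", "downloads", 2)));
    }
}
```

Passing the arguments as a list rather than as one command string avoids shell quoting problems with URLs that contain `&` or spaces.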

0

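I don't know whether there is an Apache library for this, but I use HtmlUnit to crawl a web page and all of its subpages, as follows. The download itself can then be done via URLConnection; see e.g. this page.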


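    // Imports required by the methods below (they go at the top of the enclosing class file).
    // The calls target the HtmlUnit 2.x API this answer was written against
    // (e.g. BrowserVersion.FIREFOX_3_6, WebClient.setTimeout).
    import java.io.IOException;
    import java.net.URL;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    import org.xml.sax.SAXException;

    import com.gargoylesoftware.htmlunit.BrowserVersion;
    import com.gargoylesoftware.htmlunit.IncorrectnessListener;
    import com.gargoylesoftware.htmlunit.Page;
    import com.gargoylesoftware.htmlunit.SilentCssErrorHandler;
    import com.gargoylesoftware.htmlunit.WebClient;
    import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
    import com.gargoylesoftware.htmlunit.html.HtmlPage;

    public static void walkAllHtmlPages(final String startURL) throws IOException, SAXException {
        final WebClient webClient = createWebClient();

        try {
            final HtmlPage page = webClient.getPage(startURL);
            try {
                // hrefs seen so far, so that no link is followed twice
                Set<String> visitedURLs = new HashSet<String>();

                List<HtmlAnchor> links = page.getAnchors();

                // now recursively walk all pages
                recursivelyFollowLinks(webClient, links, visitedURLs);
            } finally {
                if (page != null) {
                    page.cleanUp();
                }
            }
        } finally {
            webClient.closeAllWindows();
        }
    }

    public static WebClient createWebClient() {
        final WebClient webClient = new WebClient(BrowserVersion.FIREFOX_3_6);
        webClient.setTimeout(30000); // milliseconds
        webClient.setJavaScriptEnabled(false);
        webClient.setCssEnabled(true);
        webClient.setAppletEnabled(true);
        webClient.setRedirectEnabled(true); // follow old-school HTTP 302 redirects - standard behaviour

        // silence parser, incorrectness and CSS warnings
        webClient.setHTMLParserListener(null);
        webClient.setIncorrectnessListener(new IncorrectnessListener() {
            @Override
            public void notify(String message, Object origin) {
                // Swallow for now, but maybe collect it for optional retrieval?
            }
        });
        webClient.setCssErrorHandler(new SilentCssErrorHandler());

        return webClient;
    }

    private static void recursivelyFollowLinks(WebClient webClient, List<HtmlAnchor> links,
            Set<String> visitedURLs) throws SAXException, IOException {
        try {
            for (HtmlAnchor link : links) {
                String url = link.getHrefAttribute();

                // Set.add returns false if the href was already visited
                if (visitedURLs.add(url)) {
                    visitSubLink(webClient, visitedURLs, link, url);
                }
            }
        } catch (RuntimeException e) {
            throw new IllegalArgumentException("While retrieving links: " + getLinksAsString(links), e);
        }
    }

    private static void visitSubLink(WebClient webClient,
            Set<String> visitedURLs, HtmlAnchor link, String url) throws IOException, SAXException {
        URL current = link.getPage().getUrl();

        try {
            Page target = link.click();
            if (!(target instanceof HtmlPage)) {
                return; // images, PDFs etc. contain no anchors to follow
            }

            List<HtmlAnchor> sublinks = ((HtmlPage) target).getAnchors();

            recursivelyFollowLinks(webClient, sublinks, visitedURLs);
        } catch (RuntimeException e) { // NOPMD
            throw new RuntimeException("While clicking link: " + link.getId() + " to " + url
                    + " on page " + current, e);
        }
    }

    // getLinksAsString is called above but was not included in the original snippet;
    // this is a minimal assumed implementation that joins the hrefs for error messages.
    private static String getLinksAsString(List<HtmlAnchor> links) {
        StringBuilder sb = new StringBuilder();
        for (HtmlAnchor link : links) {
            if (sb.length() > 0) {
                sb.append(", ");
            }
            sb.append(link.getHrefAttribute());
        }
        return sb.toString();
    }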
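
For the download step mentioned above (via `URLConnection`), a minimal sketch could look like the following. The class name `UrlDownloader` and the 30-second timeouts are assumptions chosen for illustration; they are not taken from the linked page, which is not reproduced here.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.net.URL;
import java.net.URLConnection;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

class UrlDownloader {

    /** Copies the resource behind the URL into the target file; returns the bytes written. */
    static long download(URL source, Path target) throws IOException {
        URLConnection connection = source.openConnection();
        connection.setConnectTimeout(30_000); // milliseconds, assumed value
        connection.setReadTimeout(30_000);    // milliseconds, assumed value
        try (InputStream in = connection.getInputStream()) {
            // Overwrite the target if it already exists.
            return Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: UrlDownloader <url> <target-file>");
            return;
        }
        long bytes = download(URI.create(args[0]).toURL(), Paths.get(args[1]));
        System.out.println("Downloaded " + bytes + " bytes to " + args[1]);
    }
}
```

In the crawler above, this could be called for each URL collected in `visitedURLs`, after resolving relative hrefs against the page they were found on.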
