2013-02-23

HtmlUnit XPath getElement

I am trying to create a web content crawler for a specific website:

http://v1000.vn/bang-xep-hang?ref=bang-xep-hang-1000-doanh-nghiep-dong-thue-thu-nhap-nhieu-nhat-2012

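In short, my XPath lookup for the link that changes the page (via JavaScript) is not working: it finds nothing, which leads to a NullPointerException. I have tried modifying the XPath in various ways, but nothing works.

Also, do I need to call any method to get the new page after the script has run?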

package gimasys.webService;

import java.io.IOException; 
import java.net.MalformedURLException; 
import com.gargoylesoftware.htmlunit.BrowserVersion; 
import com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException; 
import com.gargoylesoftware.htmlunit.ThreadedRefreshHandler; 
import com.gargoylesoftware.htmlunit.WebClient; 
import com.gargoylesoftware.htmlunit.html.HtmlAnchor; 
import com.gargoylesoftware.htmlunit.html.HtmlButton; 
import com.gargoylesoftware.htmlunit.html.HtmlLink; 
import com.gargoylesoftware.htmlunit.html.HtmlPage; 

public class Crawlv1000 { 

    /** 
    * @param args 
    */ 
    public static void main(String[] args) { 
     // TODO Auto-generated method stub 

     final WebCrawler wc = new WebCrawler(); 
     final PageCrawler pc = new PageCrawler(); 

     final WebClient webClient = new WebClient(BrowserVersion.CHROME_16); 
     webClient.setRefreshHandler(new ThreadedRefreshHandler()); // This is to allow handling the page operation using threads else an exception will pop up 
     try { 
      HtmlPage page = webClient.getPage("http://v1000.vn/bang-xep-hang?ref=bang-xep-hang-1000-doanh-nghiep-dong-thue-thu-nhap-nhieu-nhat-2012"); 
      HtmlAnchor link = page.getFirstByXPath("//a[@href='javascript:loadRankingTable(3)']"); 
         link.click(); 
         System.out.println(page.getTextContent()); 

     } catch (FailingHttpStatusCodeException | IOException e) { 
      // TODO Auto-generated catch block 
      e.printStackTrace(); 
     } 
     /* 
     wc.crawl("http://v1000.vn/bang-xep-hang?ref=bang-xep-hang-1000-doanh-nghiep-dong-thue-thu-nhap-nhieu-nhat-2012"); 

     for (String url:wc.urlList) 
     { 
      pc.crawl(url); 
     } 
     */ 
    } 
} 

Thanks, Minh Nguyen

Answers

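You made a very small mistake: the semicolon is missing. The string in the XPath has to match the href exactly, including the trailing semicolon: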

HtmlAnchor link = page.getFirstByXPath("//a[@href='javascript:loadRankingTable(3);']"); 
link.click();
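
To see why the semicolon matters, here is a minimal sketch using only the JDK's built-in XPath engine rather than HtmlUnit. The sample markup is an assumption (the real page's HTML is not shown in this thread), and the class name `XPathHrefDemo` and the helper `first` are made up for illustration. It shows that `=` in XPath is an exact string comparison, so an expression without the trailing semicolon matches nothing, and that `starts-with` is a more forgiving alternative.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class XPathHrefDemo {

    // Hypothetical markup: assumes the anchor's href ends with a semicolon.
    static final String SAMPLE =
            "<div><a href=\"javascript:loadRankingTable(3);\">3</a></div>";

    /** Returns the first node matching the expression, or null if nothing matches. */
    static Node first(String xml, String expression) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        return (Node) XPathFactory.newInstance().newXPath()
                .evaluate(expression, doc, XPathConstants.NODE);
    }

    public static void main(String[] args) throws Exception {
        // Exact comparison without the semicolon: no match, so the result is null.
        String noSemicolon = "//a[@href='javascript:loadRankingTable(3)']";
        // Exact comparison with the semicolon: matches the sample anchor.
        String withSemicolon = "//a[@href='javascript:loadRankingTable(3);']";
        // Prefix comparison: matches whether or not the semicolon is present.
        String prefix = "//a[starts-with(@href, 'javascript:loadRankingTable(3)')]";

        System.out.println("no semicolon matched:   " + (first(SAMPLE, noSemicolon) != null));
        System.out.println("with semicolon matched: " + (first(SAMPLE, withSemicolon) != null));
        System.out.println("starts-with matched:    " + (first(SAMPLE, prefix) != null));
    }
}
```

The same applies in HtmlUnit: when `getFirstByXPath` finds nothing it returns null, so calling `click()` on the result throws a NullPointerException; checking the result for null before clicking gives a clearer failure. As for getting the new page, as far as I know `click()` returns the page that results from the click, and `webClient.waitForBackgroundJavaScript(timeoutMillis)` can be called to give asynchronous scripts time to finish before the content is read.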