2013-05-02

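Exact XPath location in Java

I am trying to get exact XPath query expressions so that I can scrape a website with RapidMiner. I need a separate query that isolates each of these lines: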

Wed 7/11/2012

TROLL

9999999999999

07.11.12

CONNOTE FILE LODGED

Tue 20/11/2012 1:12 PM

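So far, all I have is //td[@class='select']/text()

Note: the values will change, so each query needs to target a specific position rather than match on content.

What would the six separate queries be, one for each value?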

 <tr> 
      <td class="select" onClick="javascript:window.location.href = 'consignmentDetails.do;jsessionid=7e6a45cbddf07ecba7741e5020b4bfe76e53b8f5df9ea83eaf2040b991792d25.e3iMc3eQax8Re34Qb3aKbNmOch90?consignment=1388730000024&recordCreatedBy=FIMS&groupId=';" onMouseOver="backColorChange(this,'FFFFCC')" onMouseOut="backColorChange(this,'ffffff')"> 
      Wed 7/11/2012<br> 
      TROLL&nbsp; 

      </td> 
      <td class="select" align="center" onClick="javascript:window.location.href = 'consignmentDetails.do;jsessionid=7e6a45cbddf07ecba7741e5020b4bfe76e53b8f5df9ea83eaf2040b991792d25.e3iMc3eQax8Re34Qb3aKbNmOch90?consignment=1388730000024&recordCreatedBy=FIMS&groupId=';" onMouseOver="backColorChange(this,'FFFFCC')" onMouseOut="backColorChange(this,'ffffff')"> 
      9999999999999 
      <br>07.11.12 

      &nbsp; 
      </td> 
      <td class="select" onClick="javascript:window.location.href = 'consignmentDetails.do;jsessionid=7e6a45cbddf07ecba7741e5020b4bfe76e53b8f5df9ea83eaf2040b991792d25.e3iMc3eQax8Re34Qb3aKbNmOch90?consignment=1388730000024&recordCreatedBy=FIMS&groupId=';" onMouseOver="backColorChange(this,'FFFFCC')" onMouseOut="backColorChange(this,'ffffff')"> 




       CONNOTE FILE LODGED <br> 
       Tue 20/11/2012 1:12 PM 
       &nbsp; 



&nbsp; 
      </td> 

     </tr> 

    </table> 

Answer

Tested with the Ruby library Nokogiri (which sits on top of libxml2, which implements XPath 1.0):

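require 'nokogiri' 

# One positional XPath per value: the <br> in each cell splits its text 
# into two text nodes, so text()[1] and text()[2] select them separately. 
XPATHS = %w{ 
    //tr/td[1]/text()[1] 
    //tr/td[1]/text()[2] 
    //tr/td[2]/text()[1] 
    //tr/td[2]/text()[2] 
    //tr/td[3]/text()[1] 
    //tr/td[3]/text()[2] 
} 

# `html` is assumed to hold the markup from the question as a String. 
d = Nokogiri.HTML(html) 

# Print the content of the first node matched by each expression. 
XPATHS.each{ |expression| p d.at_xpath(expression).content } 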
#=> "\n   Wed 7/11/2012" 
#=> "\n   TROLL\u00A0\n\n   " 
#=> "\n   9999999999999\n   " 
#=> "07.11.12\n\n   \u00A0\n   " 
#=> "\n\n\n\n\n    CONNOTE FILE LODGED " 
#=> "\n    Tue 20/11/2012 1:12 PM\n    \u00A0\n\n\n\n\u00A0\n   " 

As you can see, the text nodes contain a lot of extra leading and trailing whitespace that you probably want to remove. We can do that with normalize-space:

XPATHS = %w{ 
    normalize-space(//tr/td[1]/text()[1]) 
    normalize-space(//tr/td[1]/text()[2]) 
    normalize-space(//tr/td[2]/text()[1]) 
    normalize-space(//tr/td[2]/text()[2]) 
    normalize-space(//tr/td[3]/text()[1]) 
    normalize-space(//tr/td[3]/text()[2]) 
} 

XPATHS.each{ |expression| p d.xpath(expression) } 
#=> "Wed 7/11/2012" 
#=> "TROLL\u00A0" 
#=> "9999999999999" 
#=> "07.11.12 \u00A0" 
#=> "CONNOTE FILE LODGED" 
#=> "Tue 20/11/2012 1:12 PM \u00A0 \u00A0"
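
Several of the normalized results still end in \u00A0. That is because normalize-space only strips and collapses XML whitespace (space, tab, carriage return, line feed), and a non-breaking space is not one of those characters. Below is a minimal sketch of one way to clean that up on the Ruby side, assuming the strings have already been extracted. The helper name clean is hypothetical and is not part of Nokogiri.

```ruby
# Hypothetical helper (not part of Nokogiri): collapses runs of ASCII
# whitespace and non-breaking spaces (U+00A0) into one space, then trims.
def clean(text)
  text.gsub(/[\s\u00A0]+/, ' ').strip
end

p clean("TROLL\u00A0")
p clean("07.11.12 \u00A0")
p clean("Tue 20/11/2012 1:12 PM \u00A0 \u00A0")
#=> "TROLL"
#=> "07.11.12"
#=> "Tue 20/11/2012 1:12 PM"
```

If the cleanup has to happen inside the XPath expression itself (for example in RapidMiner), another option may be to wrap the node in the XPath 1.0 translate() function to map U+00A0 to a plain space before calling normalize-space. That variant is not tested here.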