How to use XPath with Nokogiri to parse inner_html inside a for loop

I'm having trouble parsing the inner_html that I get inside a loop. I just want to use XPath again on that content. I'm new to Ruby, so better solutions are on the table.

#!/usr/bin/ruby -w 

require 'rubygems' 
require 'nokogiri' 

page1 = Nokogiri::HTML(open('mycontacts.html')) 


# Search for nodes by xpath 
page1.xpath('//html/body/form/div[2]/span/table/tbody/tr').each do |row| 
    #puts a_tag.content 
    puts "new row" 
    row_html = row.inner_html 

    puts row_html 
    puts "" 

    name = row_html.xpath("/td[1]").text 
    puts "name is " + name 

end

The output I get for every row in the for loop is the same thing:

new row 
<th>First Name</th> 
<th>Last Name</th> 
<th>Phone</th>

Here is the error I get:

screenscraper.rb:20:in `block in <main>': undefined method `xpath' for # (NoMethodError)

I want to parse each tr and get data like this: Barney Rubble, Fred Flintstone

<table> 
    <tbody> 
     <tr> 
      <th>First Name</th> 
      <th>Last Name</th> 
     </tr> 
     <tr> 
      <td>Fred</td> 
      <td>Flintstone</td> 
     </tr> 
     <tr> 
      <td>Barney</td> 
      <td>Rubble</td> 
     </tr> 
    </tbody> 
</table>

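I'm open to suggestions. I thought it would be easier to parse just the inner_html inside the for loop, but if there is a simpler way to get at the nodes inside the for loop, that works too.

Thanks....

Source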

2013-06-19 Nick N

Please share the HTML you are parsing.

I've updated the question to include a sample section of HTML like what I'm trying to parse.

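You can fix it: instead of name = row_html.xpath("/td[1]").text, use name = Nokogiri::HTML(row_html).xpath("/td[1]").text. Although, if you share the full HTML you are working with, there are nicer techniques for doing this.

Nokogiri::HTML(row_html) gives you an instance of the class Nokogiri::HTML::Document. #xpath, #css and #search are all instance methods of Nokogiri::HTML::Document.

Assuming your inner_html produces the HTML table you provided, you could consider the following.

I wrote some test code, and I hope it gives you the result you want: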

require "nokogiri" 

doc = Nokogiri::HTML(<<-eohl) 
<table> 
    <tbody> 
     <tr> 
      <th>First Name</th> 
      <th>Last Name</th> 
     </tr> 
     <tr> 
      <td>Fred</td> 
      <td>Flintstone</td> 
     </tr> 
     <tr> 
      <td>Barney</td> 
      <td>Rubble</td> 
     </tr> 
    </tbody> 
</table> 
eohl 

doc.css("table > tbody > tr").each do |nd|
  # Print the text of each child, skipping whitespace-only text nodes
  nd.children.each { |i| print i.text.strip, " " unless i.text.strip == "" }
  print "\n"
end
# >> First Name Last Name 
# >> Fred Flintstone 
# >> Barney Rubble

Now look at what #inner_html gives you, which in turn answers why you got the no-such-method error:

require "nokogiri" 

doc = Nokogiri::HTML(<<-eohl) 
<table> 
    <tbody> 
     <tr> 
      <th>First Name</th> 
      <th>Last Name</th> 
     </tr> 
     <tr> 
      <td>Fred</td> 
      <td>Flintstone</td> 
     </tr> 
     <tr> 
      <td>Barney</td> 
      <td>Rubble</td> 
     </tr> 
    </tbody> 
</table> 
eohl 

doc.search("table > tbody > tr").each do |nd|
  p nd.inner_html.class
end

# >> String 
# >> String 
# >> String

Source

2013-06-19 14:41:16

Excellent suggestion. I'll give your example a shot. Thanks. I'll report back later.

The problem is that row_html, obtained from Nokogiri::XML::Node#inner_html, is just a String. To call xpath on it again, you first have to parse the string with Nokogiri again, using Nokogiri::HTML(row_html).

A better approach is to not call inner_html in the first place: keep row as a Nokogiri::XML::Node and call row.xpath(...) on it.

For example, for a table like yours and the output you want:

page1.xpath('//html/body/form/div[2]/span/table/tbody/tr').each do |row| 
    # element_children skips the whitespace-only text nodes between the cells 
    cells = row.element_children 
    puts "#{cells[0].text} #{cells[1].text}" 
end

Source

2013-06-19 14:50:08 robertjlooby

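Excellent suggestion. I'll try dropping the inner_html call and just use row.xpath. Also, I've noticed that Firebug produces some xpath expressions that don't work well with Nokogiri (or its dependency). I'm having better luck with Chrome's Debug XPath output.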

...I've noticed that Firebug produces some xpath expressions that don't work well with Nokogiri (or its dependency). I'm having better luck with Chrome's Debug XPath output.

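The problem with the XPath output from Firebug, and from many other browser tools, is that the browser follows the HTML spec and synthesizes a <tbody> tag when it builds the page, even if the original source has none. The generated XPath reflects that.

We then hand the raw HTML to Nokogiri for parsing, together with the wrong XPath, and Nokogiri can't find the <table><tbody><tr> chain.

Here is an example. Start with this HTML: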

<html> 
    <body> 
    <table> 
     <tr> 
     <td> 
      foo 
     </td> 
     </tr> 
    </table> 
    </body> 
</html>

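Save it to a file, open it in Firefox, Chrome, or Safari, then view the source and look at it in Firebug or its equivalent.

You'll see something like this, which came from Firefox: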

<table> 
    <tbody><tr> 
    <td> 
     foo 
    </td> 
    </tr> 
</tbody></table>

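To fix this, don't rely on the XPath generated by the browser, and confirm the table's structure by looking only at the RAW HTML in a text editor. The "View Source" option is useful for some things, but treat any <tbody> tag you see as suspect and fall back to checking in an editor.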

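Also, you don't need the whole chain of tags to reach an inner tag. Instead, look for landmarks that help you find the target node. Most HTML pages these days have class and id parameters on their important tags. id parameters are especially valuable because they have to be unique. Other parameters that happen to be unique can work too.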

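Sometimes you won't find an identifying tag immediately around the one you want, but something embedded inside it will have one. In that case, find the embedded tag and walk up the chain until you reach the one you want. With XPath you can use .. (parent), but with CSS you have to rely on Nokogiri::XML::Node's parent method, because Nokogiri and CSS don't support a parent selector (yet).

Source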

2013-06-21 06:37:01
