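Nokogiri supports two main types of searches, search and at. The search method returns a NodeSet, which you should think of as being like an array. The at method returns a single Node. Either one can take a CSS or an XPath expression. I prefer CSS because it is more readable, but sometimes you can't easily get to where you want with it, so try the other.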
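For your problem, it's important to call text on the specific node you want to extract text from. If the node is too broad, you'll also get the whitespace and text that sit between its child tags, in addition to the text inside the tag you care about. To avoid that, drill down to the node nearest to what you want to read: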
require 'nokogiri'
doc = Nokogiri::XML(<<EOT)
<release>
  <artists>
    <artist>
      <name>Johnny Mnemonic</name>
    </artist>
    <artist>
      <name>Constantine</name>
    </artist>
  </artists>
</release>
EOT
Because these look specifically for the name node, the desired text is easy to get without any junk:
doc.at('name').text # => "Johnny Mnemonic"
doc.at('artist name').text # => "Johnny Mnemonic"
doc.at('artists artist name').text # => "Johnny Mnemonic"
These are looser searches, so more junk is returned:
doc.at('artist').text # => "\n      Johnny Mnemonic\n    "
doc.at('artists').text # => "\n    \n      Johnny Mnemonic\n    \n    \n      Constantine\n    \n  "
Using search returns multiple nodes:
doc.search('name').map(&:text)
[
    [0] "Johnny Mnemonic",
    [1] "Constantine"
]
doc.search('artist').map(&:text)
[
    [0] "\n      Johnny Mnemonic\n    ",
    [1] "\n      Constantine\n    "
]
The only real difference between search and at is that at is like search(...).first.
Also see "How to avoid joining all text from Nodes when scraping".
Nokogiri also has some additional convenience methods that make the selector type explicit: at_css and css, and at_xpath and xpath.
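Here are alternate ways to get the names using the CSS and XPath accessors, snipped from Pry. The output lists one artist per release, so it reflects a document with more than one release rather than the single-release sample above: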
[5] (pry) main: 0> # using CSS with Ruby
[6] (pry) main: 0> artists = doc.search('release').map{ |release| release.at('artist').text.strip }
[
[0] "Johnny Mnemonic",
[1] "Speed"
]
[7] (pry) main: 0> # using CSS with less Ruby
[8] (pry) main: 0> artists = doc.search('release artists artist:nth-child(1) name').map{ |n| n.text }
[
[0] "Johnny Mnemonic",
[1] "Speed"
]
[9] (pry) main: 0>
[10] (pry) main: 0> # using XPath
[11] (pry) main: 0> artists = doc.search('release/artists/artist[1]/name').map{ |t| t.content }
[
[0] "Johnny Mnemonic",
[1] "Speed"
]
[12] (pry) main: 0> # using more XPath
[13] (pry) main: 0> artists = doc.search('release/artists/artist[1]/name/text()').map{ |t| t.content }
[
[0] "Johnny Mnemonic",
[1] "Speed"
]
page.xpath("release/artists/artist").first? – ted 2013-03-18 20:22:26