2013-11-26 53 views

I'm trying to scrape a page of financial data with Nokogiri and Ruby 1.9.3. How do I specify an XPath or CSS selector in Nokogiri to scrape the page's table data?

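I can't work out the right XPath or CSS selector to grab the table that holds the data, then iterate over the rows and assemble them so the output can be written to a CSV file like this: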

Date, Company,Symbol,ReportedEPS,Consensus EPS 
20130828,CDN WESTERN BANK,CWB.TO,0.60,0.59 

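I used Firebug to get the XPath and CSS paths shown in the comments below. What is the correct XPath or CSS expression to extract the table, so that I can iterate through its rows and assemble them for output to a file?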

require 'rubygems' 
require 'mechanize' 
require 'nokogiri' 
require 'uri' 

@agent = Mechanize.new do |a|
  a.user_agent_alias = "Windows IE 6"
end

url = "http://biz.yahoo.com/z/20130828.html" 
page = @agent.get(url) 
doc = Nokogiri::HTML(page.body) 
puts doc.inspect 

#~ from firebug 
#~ xpath  /html/body/p[3]/table/tbody 
#~ css  html body p table tbody 

Answers

1

For readability, I usually prefer CSS over XPath. This is roughly what I would use:

require 'open-uri' 
require 'nokogiri' 

URL = "http://biz.yahoo.com/z/20130828.html" 
# Kernel#open accepts URLs on Ruby 1.9.3; on Ruby 3.0+ use URI.open(URL) instead.
doc = Nokogiri::HTML(open(URL))
table = doc.css('table')[4] 

data = table.search('tr')[2..-1].map { |row| 
    row.search('td').map(&:text) 
} 

data 
# => [["CDN WESTERN BANK", 
#  "CWB.TO", 
#  "1.69", 
#  "0.60", 
#  "0.59", 
#  "N/A", 
#  "Quote, Chart, News, ProfileReports, Research"], 
#  ["Casella Waste Systems, Inc.", 
#  "CWST", 
#  "71.43", 
#  "-0.02", 
#  "-0.07", 
#  "N/A", 
#  "Quote, Chart, News, ProfileReports, Research, Msgs, Insider, Analyst Ratings"], 
#  ["Culp, Inc. Common Stock", 
#  "CFI", 
#  "5.56", 
#  "0.38", 
#  "0.36", 
#  "Listen", 
#  "Quote, Chart, News, ProfileReports, Research, Msgs, Insider, Analyst Ratings"], 

A lot more data is returned, but this is enough to show what the code grabs.
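The rows in `data` can then be assembled into the CSV layout the question asks for. The following is only a minimal sketch: it assumes the date is taken from the URL (`20130828`), that the third column of each row is the one the desired output omits, and it uses two sample rows copied from the output above instead of fetching the page. The header names are illustrative.

```ruby
require 'csv'

# Assumption: the date comes from the URL (http://biz.yahoo.com/z/20130828.html).
date = "20130828"

# Sample rows copied from the output above; the real `data` array has more.
data = [
  ["CDN WESTERN BANK", "CWB.TO", "1.69", "0.60", "0.59", "N/A",
   "Quote, Chart, News, ProfileReports, Research"],
  ["Casella Waste Systems, Inc.", "CWST", "71.43", "-0.02", "-0.07", "N/A",
   "Quote, Chart, News, ProfileReports, Research, Msgs, Insider, Analyst Ratings"]
]

csv_text = CSV.generate do |csv|
  csv << ["Date", "Company", "Symbol", "ReportedEPS", "ConsensusEPS"]
  data.each do |row|
    # Keep company, symbol, reported EPS and consensus EPS; skip the third column.
    company, symbol, _skipped, reported, consensus = row
    csv << [date, company, symbol, reported, consensus]
  end
end

puts csv_text
```

Using the standard `csv` library rather than `join(",")` matters here because some company names, such as "Casella Waste Systems, Inc.", contain commas and need to be quoted. To write a file instead of a string, `CSV.open("earnings.csv", "w")` takes the same kind of block (the file name is a placeholder).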

There is no need to use Mechanize for this task. Unless you need to navigate a site, Mechanize doesn't help you much, so I would use OpenURI.

See also "How to avoid joining all text from Nodes when scraping".


Exactly what I wanted. Thank you. – user2720047

2

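Some browsers add a <tbody> to a <table> when they parse, validate, or fix up the input HTML. Firefox is one of these browsers. The XPath and CSS expressions you get from Firefox describe the HTML as Firefox sees it, which is not necessarily the HTML that Nokogiri will see.

Remove the <tbody> and try this XPath: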

/html/body/p[3]/table 

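to find the table. You can also look at the raw HTML and check whether the table has an id or class attribute that you could use in a CSS id (#the-id) or class (.the-class) selector instead of a long element path.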