2012-04-29
0

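Parsing frame data with Mechanize and Ruby

I want to scrape the following website, because the XML is malformed and doesn't contain all of the data I need: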

http://www.cafebonappetit.com/menu/your-cafe/pitzer

When I fetch the document with Mechanize, however, I only get:

{meta_refresh} 
{title "Collins | Claremont McKenna Cafés | Café Bon Appétit"} 
{iframes} 
{frames} 
{links 
#<Mechanize::Page::Link "Welcome" "http://www.cafebonappetit.com/"> 
#<Mechanize::Page::Link "Our Approach" "javascript://"> 
#<Mechanize::Page::Link 
"Kitchen Principles" 
"http://www.cafebonappetit.com/our-approach/kitchen-principles"> 
..... 
} 

Unfortunately, what I actually need is the content of the tables (I'm guessing they are inside iframes). Any ideas?

Thanks!

+2

The page doesn't have any frames or iframes. Mechanize is just reporting that there are 0 iframes, 0 frames, N links, and 1 title. To find the tables, just use 'page.search('table')' –

+0

Thanks! #railsnewb – AlexSBerman

Answers

3

Here is a simple Mechanize + Nokogiri script for scraping the menu items.

require 'rubygems' 
require 'mechanize' 
require 'pp' 

agent = Mechanize.new 
url = "http://www.cafebonappetit.com/menu/your-cafe/pitzer" 
page = agent.get(url) 

# Grab each daily menu
page.search('div#menu-items > table.my-day-menu-table').each do |menu|
    day = menu.xpath('preceding-sibling::div[1]/a').text.strip
    puts day
    fare = []
    # Collect the menu items
    menu.xpath('tr').each do |item|
        fare << item.xpath('td/strong').map(&:text).join(": ")
    end
    pp fare
end

Result (excerpt):

Sunday, May 6th, 2012 
["Brunch", 
"chef's table: custom omelet bar", 
"main plate: chicken sanchez", 
"meatless chicken and sauce", 
"options: banana pancakes", 
"stocks: beed barley", 
"vegetable minestrone", 
"Lunch", 
"main plate: steamed broccoli", 
"Dinner", 
"chef's table: pasta bar", 
"farm to fork: sauteed rainbow chard", 
"options: mozzarella sticks", 
"ovens: pizza bar", 
"main plate: roasted herb chicken", 
"baked ziti pasta", 
"steamed carrots and parsnips"]
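
The list above is flat: meal headings, "station: dish" entries, and bare dishes that seem to continue the previous station all sit side by side. If a nested structure is more convenient, something like the following might work. It is a rough sketch under two assumptions that the page itself does not confirm: that the meal headings are limited to the names in MEAL_NAMES, and that an entry without a "station: " prefix belongs to the most recent station. The names MEAL_NAMES, SAMPLE_FARE and group_fare are made up for this example.

```ruby
# Assumption: these are the only meal headings that appear in the list.
MEAL_NAMES = %w[Breakfast Brunch Lunch Dinner].freeze

# A shortened copy of the output above, used as sample input.
SAMPLE_FARE = [
  'Brunch',
  "chef's table: custom omelet bar",
  'main plate: chicken sanchez',
  'meatless chicken and sauce',
  'Lunch',
  'main plate: steamed broccoli'
].freeze

# Turns the flat list into { meal => { station => [dishes] } }.
def group_fare(fare)
  grouped = {}
  meal = nil
  station = nil
  fare.each do |entry|
    next if entry.empty?

    if MEAL_NAMES.include?(entry)
      meal = entry
      station = nil
      grouped[meal] = {}
      next
    end
    next if meal.nil? # ignore anything before the first meal heading

    name, dish = entry.split(': ', 2)
    if dish
      station = name            # "station: dish" starts a new station
    else
      dish = name               # bare dish: reuse the current station
      station ||= 'unlabeled'
    end
    (grouped[meal][station] ||= []) << dish
  end
  grouped
end

p group_fare(SAMPLE_FARE)
```

With the sample input, "meatless chicken and sauce" ends up under "main plate" in "Brunch", next to "chicken sanchez".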