2015-04-16 47 views

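Iterating over scraped pages with Mechanize

I want to automate a process by using Mechanize to scrape some web pages and save information from them.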

The page is Lookbook North America.

I want to iterate over the ul with id="looks" and, within that iteration, click through to each user in a look. The element looks like this:

<a href="/luciamouet" data-page-track="user name click" data-track="user name click | byline" target="_blank" title="Lucia Mouet">Lucia M.</a> 

I want to visit each user's page and store some information from it.

This is what I have so far, but I'm stumped on how to iterate and follow the link for each user:

require 'rubygems' 
require 'mechanize' 
require 'nokogiri' 
require 'open-uri' 

agent = Mechanize.new 

page = agent.get('http://lookbook.nu/north-america') 

looks = page.parser.css('#looks p') 

looks.each do |x| 
    puts x 
end 

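Unless you are using a very old version of Ruby, you don't need require 'rubygems'. You don't need require 'nokogiri' either, because it is already a dependency of Mechanize. You probably don't need require 'open-uri' as well, because Mechanize provides its own mechanism for fetching pages. –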

Answers

1

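You already have everything you need to build the URLs of the detail pages. Grab the relative URL (I'll call it the path), append it to the base URL, and issue a new request.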

require 'mechanize' 

agent = Mechanize.new 
agent.pluggable_parser.default = Mechanize::Page 

base = 'http://lookbook.nu' 
page = agent.get(base + '/north-america') 

detail_pages = page.search("//div[contains(@class, 'look_meta_container')]/p/a[1]/@href").map(&:text) 
# ["/user/1069907-Veronica-P", "/elliott_alexzander", "/neno", "/skirtsofurban", "/tovogueorbust", "/dthutt", "/ryapie", "/lovebetweentheracks", "/lonleyboy", "/bobbyraffin", "/tsangtastic", "/user/737385-Katia-H"] 

detail_pages.each do |path| 
    page = agent.get(base + path) 

    name = page.search("//div[@id='userheader']//h1/a").text 
    fans = page.search("//span[contains(text(), 'Fans')]/../span[1]").text 

    puts name + " have " + fans + " fans" 
end 

=>

Veronica P have 26,044 fans 
Elliott Alexzander have 3,409 fans 
Neno Neno have 15,304 fans 
Laura P have 975 fans 
Alexandra G. have 620 fans 
Dayeanne Hutton have 336 fans 
Mariah Alysz have 288 fans 
Lina Dinh have 11,675 fans 
Talal Amine have 882 fans 
Bobby Raffin have 72,469 fans 
Jenny Tsang have 8,909 fans 
Katia H. have 282 fans 

Note: I set #pluggable_parser.default in order to get a Mechanize::Page response. Normally you don't need this, but the site doesn't set the content type correctly.
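Since the goal is to store this information, keep in mind that fans comes back as a display string such as "26,044". Here is a minimal sketch of converting it to an integer before saving, using two of the values from the output above. The helper name parse_count and the records array are hypothetical and only there for illustration.

```ruby
# Hypothetical helper: turn a display count like "26,044" into an Integer.
def parse_count(text)
  text.delete(',').strip.to_i
end

# Sample values copied from the output above; in the real loop these
# would come from the name and fans variables.
records = [
  { name: 'Veronica P', fans: parse_count('26,044') },
  { name: 'Laura P',    fans: parse_count('975') }
]

records.each { |r| puts "#{r[:name]}: #{r[:fans]}" }
```

With integers in hand you can sort or compare the counts, which doesn't work reliably on the comma-formatted strings.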

1

Instead of messing around with base + path as suggested by @radubogdan, just use page.uri:

page.search('#looks h1 a').each do |a| 
    url = page.uri.merge a[:href] 
    page2 = agent.get url 
    puts page2.title 
end 
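To see what the merge call does without hitting the site, here is a minimal sketch that uses only Ruby's standard uri library. It assumes page_uri stands in for what page.uri would return after fetching the listing page; the two paths are taken from the list in the first answer.

```ruby
require 'uri'

# Stand-in for page.uri after agent.get('http://lookbook.nu/north-america').
page_uri = URI.parse('http://lookbook.nu/north-america')

# Relative links as they appear in the href attributes.
paths = ['/user/1069907-Veronica-P', '/neno']

# URI#merge resolves each relative reference against the page's own URI,
# so the scheme and host never have to be hard-coded.
urls = paths.map { |path| page_uri.merge(path).to_s }

puts urls
```

Because the link is resolved against the URI of the page it came from, this keeps working if the site redirects or if an href turns out to be an absolute URL.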

Yes, that's clear to me, but I think it's confusing for beginners. It's good that you brought it up, but you could have done that in a comment. – radubogdan


I can put 5 lines of code in a comment? That would have been a mess. – pguardiario