Data scraping with Nokogiri

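I am able to scrape http://www.example.com/view-books/0/new-releases using Nokogiri, but how do I scrape all the pages? This one has five pages, but how do I proceed without knowing which page is the last one?

This is the program I wrote: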

require 'rubygems' 
require 'nokogiri' 
require 'open-uri' 
require 'csv' 

urls=Array['http://www.example.com/view-books/0/new-releases?layout=grid&_pop=flyout', 
      'http://www.example.com/view-books/1/bestsellers', 
      'http://www.example.com/books/pre-order?query=book&cid=1&layout=list&ref=4b116001-01a6-4f53-8da7-945b74fdb253' 
     ] 

@titles=Array.new 
@prices=Array.new 
@descriptions=Array.new 
@page=Array.new 

urls.each do |url| 
    doc=Nokogiri::HTML(open(url)) 
    puts doc.at_css("title").text 

    doc.css('.fk-inf-scroll-item').each do |item| 
    @prices << item.at_css(".final-price").text 
    @titles << item.at_css(".fk-srch-title-text").text 
    @descriptions << item.at_css(".fk-item-specs-section").text 
    @page << item.at_css(".fk-inf-pageno").text rescue nil 
    end 

    (0..@titles.length - 1).each do |index| 
    puts "title: #{@titles[index]}" 
    puts "price: #{@prices[index]}" 
    puts "description: #{@descriptions[index]}" 
    # puts "pageno. : #{@page[index]}" 
    puts "" 
    end 
end 

CSV.open("result.csv", "wb") do |row| 
    row << ["title", "price", "description","pageno"] 
    (0..@titles.length - 1).each do |index| 
    row << [@titles[index], @prices[index], @descriptions[index],@page[index]] 
    end 
end

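As you can see, I have hard-coded the URLs. How would you suggest I scrape the entire books category? I have been trying Anemone but could not get it to work.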

2012-09-04 Aayush

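The page is not fully loaded in the HTML source; the rest is loaded by some JS as the user browses the page. You need to simulate user actions or execute the JS. This has nothing to do with Nokogiri. Maybe the 'watir' gem can help. – halfelf

OK, will try it out... – Aayush

It always helps to show the code you have written, so we can help you modify it, rather than expecting us to make wild guesses about what you may or may not have written.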

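If you inspect what actually happens when more results are loaded, you will see that the site uses JSON to fetch the information with an offset.

So you can get the five pages like this: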

http://www.flipkart.com/view-books/0/new-releases?response-type=json&inf-start=0 
http://www.flipkart.com/view-books/0/new-releases?response-type=json&inf-start=20 
http://www.flipkart.com/view-books/0/new-releases?response-type=json&inf-start=40 
http://www.flipkart.com/view-books/0/new-releases?response-type=json&inf-start=60 
http://www.flipkart.com/view-books/0/new-releases?response-type=json&inf-start=80

Basically, you keep incrementing inf-start until you get a result set smaller than 20, which should be your last page.
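
For illustration, that loop could be sketched roughly as follows. This is a minimal sketch, not tested against the live site: `page_url`, `collect_all`, and the `fetch` callable are hypothetical names, and the page size of 20 is assumed from the offsets listed above.

```ruby
require 'uri'

# Builds the URL for one page of results at the given offset.
# The parameter names come from the URLs listed above.
def page_url(base, offset)
  uri = URI(base)
  uri.query = URI.encode_www_form('response-type' => 'json', 'inf-start' => offset)
  uri.to_s
end

# Collects items page by page until a short page is returned.
# `fetch` is any callable that takes an offset and returns an array of
# items for that page (hypothetical; it stands in for the download and
# parsing step, which depends on the site's actual response format).
def collect_all(fetch, page_size = 20)
  items = []
  offset = 0
  loop do
    batch = fetch.call(offset)
    items.concat(batch)
    # A page with fewer than page_size items is taken to be the last one.
    break if batch.size < page_size
    offset += page_size
  end
  items
end
```

In practice `fetch` would download `page_url(base, offset)` and return the items parsed from the response. If the total happens to be an exact multiple of 20, the loop makes one extra request that returns an empty page and then stops.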

2012-09-04 08:37:34 ddb

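Wow, they use JSON to deliver HTML fragments - that is a bit ridiculous.

If you use 'response-type=html', they return the results as HTML.

Here is an untested sample of code that does what yours does, only written a bit more concisely: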

require 'nokogiri' 
require 'open-uri' 
require 'csv' 

urls = %w[ 
    http://www.flipkart.com/view-books/0/new-releases?layout=grid&_pop=flyout 
    http://www.flipkart.com/view-books/1/bestsellers 
    http://www.flipkart.com/books/pre-order?query=book&cid=1&layout=list&ref=4b116001-01a6-4f53-8da7-945b74fdb253 
] 

CSV.open('result.csv', 'wb') do |row|

    row << ['title', 'price', 'description', 'pageno']

    urls.each do |url|

        # URI.open is needed on Ruby 3.0+, where Kernel#open no longer fetches URLs.
        doc = Nokogiri::HTML(URI.open(url))
        puts doc.at_css('title').text

        doc.css('.fk-inf-scroll-item').each do |item|

            page = {
                titles:       item.at_css('.fk-srch-title-text').text,
                prices:       item.at_css('.final-price').text,
                descriptions: item.at_css('.fk-item-specs-section').text,
                # &. yields nil when the item has no page-number element.
                pageno:       item.at_css('.fk-inf-pageno')&.text,
            }

            page.each do |k, v|
                puts '%s: %s' % [k.to_s, v]
            end

            row << page.values
        end
    end
end

The page also contains some useful pieces of data that you can use to figure out how many records you need to retrieve:

var config = {container: "#search_results", page_size: 20, counterSelector: ".fk-item-count", totalResults: 88, "startParamName" : "inf-start", "startFrom": 20};

To access the values, use something like this:

doc.at('script[type="text/javascript+fk-onload"]').text =~ /page_size: (\d+).+totalResults: (\d+).+"startFrom": (\d+)/ 
page_size, total_results, start_from = $1, $2, $3
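
As a rough sketch of how those values might be used, the snippet below parses the sample config line shown above and lists the inf-start offsets that would cover all results. `parse_config` and `offsets_for` are hypothetical helper names; the regex is the one above, and the real page markup may differ.

```ruby
# Extracts page_size, totalResults and startFrom as integers.
# Returns nil when the text does not match the expected config layout.
def parse_config(text)
  m = text.match(/page_size: (\d+).+totalResults: (\d+).+"startFrom": (\d+)/)
  return nil unless m
  { page_size: m[1].to_i, total_results: m[2].to_i, start_from: m[3].to_i }
end

# Lists every inf-start offset needed to cover total_results records.
def offsets_for(page_size, total_results)
  (0...total_results).step(page_size).to_a
end

# Sample config text copied from the script block shown above.
config = 'var config = {container: "#search_results", page_size: 20, ' \
         'counterSelector: ".fk-item-count", totalResults: 88, ' \
         '"startParamName" : "inf-start", "startFrom": 20};'

cfg = parse_config(config)
p offsets_for(cfg[:page_size], cfg[:total_results])
```

With the sample values (page size 20, 88 results) this gives the same five offsets as the URLs listed earlier: 0, 20, 40, 60 and 80. Converting the captures with `to_i` matters because `$1`, `$2` and `$3` are strings.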

2012-09-05 06:18:03

Thanks for the help! Will work on this. – Aayush
