2014-01-19 45 views
Navigating to the next page with Nokogiri using a while loop

I wrote code to scrape and parse this website => www.africanbookscollective.com/browse/african-literature/fiction

require 'rubygems'
require 'nokogiri' 
require 'open-uri' 
require 'ap' 
require 'debugger' 
require 'csv' 

#collect all the authors, books, ISBN, publisher info 
#==================================================== 
url = 'http://www.africanbookscollective.com/browse/african-literature/fiction' 
page = Nokogiri::HTML(open(url)) 

# create an array for every book content on each page that has element of form 
# [<ISBN Number>, <Book Pages>, <Book Dimensions>, <First Published>, <Publisher>,<CoverType>] 
# save array into a csv file with the columns of: 
# <ISBN Number> <Book Pages> <Book Dimensions> <First Published> <Publisher> <CoverType> 

# opens a csv file and shovels column titles into the first row 
CSV.open("bookinfo.csv", "w+") do |csv| 
    csv << ["ISBN Number", "Book Pages", "Book Dimensions", "First Published", "Publisher", "CoverType"] 
end 

# initializes the page_num variable
page_num = 0 

# the while loop runs as long as the statement below evaluates to true 
#while page_num < 390 
new_page = Nokogiri::HTML(open("http://www.africanbookscollective.com/browse/african-studies?b_start:int=#{page_num+10}&-C="))
    # search for the context-details of each book 
    books = page.css('p.context-details').map do |book|
      book.text.gsub(/\s{2,}/, "").chomp.split(" |")
    end


    #appends context-details onto the csv we already created 
    CSV.open("bookinfo.csv", "a+") do |csv|
      books.each do |book|
        csv << book
      end
    end
    page_num += 10 
#end 

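This code only gives me the information from the first page; it doesn't pick up any of the remaining pages (1 - 38). I think this has to do with how my while loop is structured, right?

Why doesn't it move on to the next page using the string interpolation in new_page?

Thanks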


It should be "http://www.africanbookscollective.com/browse/african-studies?b_start:int=#{page_num}&-C=" –


I tried that suggestion before posting; it didn't work. – Uzzar


Change books = page.css('p.context-details') to books = new_page.css('p.context-details') –
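
The two comments above can be combined into one loop: build each URL from the offset itself, and read the books from the page that was just fetched rather than from the first one. The following is only a minimal sketch under some assumptions: the helper names (page_url, parse_details, fetch_books, scrape_all) and the fetcher argument are made up for illustration, and URI.open assumes Ruby 2.5 or newer (older Rubies would use plain open, as in the question).

```ruby
require 'csv'

# Listing URL taken from the question; the offset goes into b_start:int.
BASE_URL = 'http://www.africanbookscollective.com/browse/african-studies'

# Builds the URL of the page that starts at the given offset (0, 10, 20, ...).
def page_url(offset)
  "#{BASE_URL}?b_start:int=#{offset}&-C="
end

# Splits the text of one p.context-details node into fields, as in the question.
def parse_details(text)
  text.gsub(/\s{2,}/, '').chomp.split(' |')
end

# Downloads one page and returns one array of fields per book.
# Nokogiri and open-uri are loaded here so the helpers above work without them.
def fetch_books(url)
  require 'nokogiri'
  require 'open-uri'
  new_page = Nokogiri::HTML(URI.open(url))
  # Note: the freshly fetched new_page is queried, not the first page.
  new_page.css('p.context-details').map { |book| parse_details(book.text) }
end

# Appends the books of every page from offset 0 up to last_offset to csv_path.
# The fetcher argument is hypothetical; it defaults to fetch_books.
def scrape_all(csv_path, last_offset, fetcher = method(:fetch_books))
  CSV.open(csv_path, 'a+') do |csv|
    0.step(last_offset, 10) do |offset|
      fetcher.call(page_url(offset)).each { |book| csv << book }
    end
  end
end
```

With the bound of 390 from the commented-out while loop, a call such as scrape_all("bookinfo.csv", 380) would visit offsets 0 through 380. Whether the site really serves ten books per page at those offsets is an assumption carried over from the question.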

Answers


Forget the numbers and iterate by following the "Next" links. It should look something like this:

# page 1 
page = Nokogiri::HTML(open(start_url)) 
do_something_with page 

# repeat until no more "next" links 
while a = page.at('a[title="Next page"]') 
    page = Nokogiri::HTML(open(a[:href])) 
    do_something_with page 
end 
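
The loop above ends because page.at returns nil when no element matches the selector, so the last page stops the iteration. A slightly more defensive sketch of the same idea follows. Everything named here is an assumption for illustration: each_page, the FETCH and NEXT_HREF defaults, and the max_pages cap are hypothetical, and the a[title="Next page"] selector is only a guess at the site's markup. URI.join is used because the href may be relative.

```ruby
require 'uri'

# Default fetcher: downloads and parses a page (loads Nokogiri lazily).
FETCH = lambda do |url|
  require 'nokogiri'
  require 'open-uri'
  Nokogiri::HTML(URI.open(url))
end

# Default link finder: href of the "Next page" anchor, or nil on the last page.
NEXT_HREF = lambda do |page|
  link = page.at('a[title="Next page"]')
  link && link[:href]
end

# Yields every page, starting at start_url and following "next" links
# until next_href returns nil. max_pages is a hypothetical safety cap.
def each_page(start_url, fetch: FETCH, next_href: NEXT_HREF, max_pages: 100)
  url = start_url
  max_pages.times do
    page = fetch.call(url)
    yield page
    href = next_href.call(page)
    break if href.nil?
    # Resolve relative hrefs against the URL of the current page.
    url = URI.join(url, href).to_s
  end
end
```

A call would look like each_page(start_url) { |page| do_something_with page }, where start_url is whichever listing URL the crawl should begin from.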

Hello! I just tried your suggestion but got the error "undefined local variable or method `start_url' for main:Object (NameError)". Could you clarify what start_url means? Thanks – Uzzar


How about you take a guess at which URL start_url is? – pguardiario


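start_url would be "http://africanbookscollective.com/browse/african-studies?b_start:int=0&-C=", right? If so, how does nokogiri know what the next page is? I ask because I inspected the page and there is no link pointing to its next page. Apologies for the noob questions... I'm still trying to understand how certain things work. Thanks for your help – Uzzar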