Mechanize HTML scraping problem

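I'm trying to extract the emails from my website using Ruby with Mechanize and Hpricot. What I'm doing is looping over all the pages of my administration side and parsing each page with Hpricot. So far so good. Then I get: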

Exception `Net::HTTPBadResponse' at /usr/lib/ruby/1.8/net/http.rb:2022 - wrong status line: *SOME HTML CODE HERE*

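After it has parsed a bunch of pages, it starts with a timeout and then prints the HTML code of the page. I don't know why. How can I debug this? It seems that Mechanize can't get more than 10 pages in a row. Is that possible? Thanks.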

 

require 'logger'
require 'rubygems'
require 'mechanize'
require 'hpricot'
require 'open-uri'

class Harvester

  def initialize(page)
    @page = page
    @agent = WWW::Mechanize.new { |a| a.log = Logger.new("logs.log") }
    @agent.keep_alive = false
    @agent.read_timeout = 15
  end

  def login
    f = @agent.get("http://****.com/admin/index.asp").forms.first
    f.set_fields(:username => "user", :password => "pass")
    f.submit
  end

  def harvest(s)
    pageNumber = 1
    #@agent.read_timeout =
    s.upto(@page) do |pagenb|
      puts "*************************** page= #{pagenb}/#{@page}***************************************"
      begin
        #time=Time.now
        #search=@agent.get("http://****.com/admin/members.asp?action=search&term=&state_id=&r=500&p=#{page}")
        extract(pagenb)
      rescue => e
        puts "unknown #{e.to_s}"
        #puts "url:http://****.com/admin/members.asp?action=search&term=&state_id=&r=500&p=#{page}"
        #sleep(2)
        extract(pagenb)
      rescue Net::HTTPBadResponse => e
        puts "net exception" + e.to_s
      rescue WWW::Mechanize::ResponseCodeError => ex
        puts "mechanize error: " + ex.response_code
      rescue Timeout::Error => e
        puts "timeout: " + e.to_s
      end
    end
  end

  def extract(page)
    #puts search.body
    search = @agent.get("http://***.com/admin/members.asp?action=search&term=&state_id=&r=500&p=#{page}")
    doc = Hpricot(search.body)

    #remove titles
    #~ doc.search("/html/body/div/table[2]/tr/td[2]/table[3]/tr[1]").remove

    (doc/"/html/body/div/table[2]/tr/td[2]/table[3]//tr").each do |tr|
      #delete the phone number from the html
      temp = tr.search("/td[2]").inner_html
      index = temp.index('<')
      email = temp[0..index-1]
      puts email
      f = File.open("./emails", 'a')
      f.puts(email)
      f.close
    end
  end

end

puts "starting extracting emails ..."

start = ARGV[0].to_i

h = Harvester.new(186)
h.login
h.harvest(start)
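One thing worth checking in the code above is the order of the rescue clauses. Ruby tries them top to bottom, and a bare `rescue => e` matches every StandardError, so a later clause for a StandardError subclass is never reached. An exception raised by the second `extract(pagenb)` call inside the rescue body is also not handled by the sibling clauses. The following is only a sketch of one possible arrangement, using the standard library alone: `BadResponse` and `fetch_with_retries` are hypothetical names invented for illustration, and the block stands in for the real page fetch.

```ruby
require 'timeout'

# Hypothetical stand-in for Net::HTTPBadResponse so the sketch runs offline.
class BadResponse < StandardError; end

# Hypothetical helper: runs the block, retrying up to max_attempts times.
def fetch_with_retries(max_attempts = 3)
  attempts = 0
  begin
    attempts += 1
    yield attempts
  rescue BadResponse, Timeout::Error => e
    # Specific errors come first. `retry` re-enters the begin block, so the
    # next attempt is still covered by these rescue clauses.
    retry if attempts < max_attempts
    "gave up after #{attempts} attempts: #{e.class}"
  rescue StandardError => e
    # The generic fallback comes last so it cannot shadow the clauses above.
    "unexpected error: #{e.class}: #{e.message}"
  end
end

# Usage: fail twice with BadResponse, then succeed on the third attempt.
result = fetch_with_retries do |attempt|
  raise BadResponse, "wrong status line" if attempt < 3
  "page fetched on attempt #{attempt}"
end
puts result
```

Bounding the number of retries keeps a persistently failing page from looping forever, while the log line for each failure type stays distinguishable.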

Source

2009-05-24 fenec

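Mechanize puts the full content of each page into its history, which can cause problems when browsing through many pages. To limit the size of the history, try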

@mech = WWW::Mechanize.new do |agent| 
    agent.history.max_size = 1 
end
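To see why capping the history helps, here is a minimal sketch of a size-limited history in plain Ruby. `BoundedHistory` is a hypothetical class written for illustration and is not Mechanize's own implementation; it only shows the idea that, with a maximum size of 1, visiting 186 pages keeps a single page body in memory instead of all of them.

```ruby
# Hypothetical illustration only; this is not Mechanize's history class.
class BoundedHistory
  attr_reader :max_size

  def initialize(max_size)
    @max_size = max_size
    @pages = []
  end

  # Store a page body and drop the oldest entries beyond max_size.
  def push(body)
    @pages << body
    @pages.shift while @pages.size > @max_size
    self
  end

  # Number of page bodies currently retained.
  def size
    @pages.size
  end

  # Total number of bytes currently retained.
  def bytes_held
    @pages.inject(0) { |sum, body| sum + body.bytesize }
  end
end

# Simulate visiting 186 pages with a history capped at one entry.
history = BoundedHistory.new(1)
186.times { |n| history.push("<html>page #{n}</html>" * 1000) }
puts history.size
puts history.bytes_held
```

Without the cap, the retained bytes would grow with every page visited, which is the kind of growth the setting above is meant to avoid.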

Source

2009-08-27 14:12:22 Fluffy
