2011-02-04 40 views

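Problem using Scrapy to scrape Yahoo Groups

I am new to web scraping and have just started experimenting with Scrapy, a scraping framework written in Python. My goal is to scrape an old Yahoo Group, since Yahoo provides no API or any other means of retrieving the message archive. The group is set up so that you must log in before you can view the archive.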

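The steps I need to accomplish, I think, are:

  1. Log in to Yahoo
  2. Visit the URL of the first message and scrape it
  3. Repeat step 2 for the next message, and so on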
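
Step 3 amounts to walking a range of message IDs. A minimal sketch of that iteration, assuming the message URL pattern that appears in the log output later in this post; `message_urls` is a hypothetical helper, not part of Scrapy:

```python
# URL template assumed from the log output in this post (group name as shown there).
MSG_URL = "http://launch.groups.yahoo.com/group/MyYahooGroup/message/%d"


def message_urls(first_id, last_id, template=MSG_URL):
    """Yield the URL of every message from first_id to last_id, inclusive."""
    for msg_id in range(first_id, last_id + 1):
        yield template % msg_id


# Example: the first three message URLs.
urls = list(message_urls(1, 3))
```

In a spider, each of these URLs would become one request whose callback scrapes the message.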

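I started roughing out a Scrapy spider to do the above, and here is what I have so far. All I want to observe for now is that the login works and that I can retrieve the first message.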

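from scrapy.spider import BaseSpider  # import path used by early (0.x) Scrapy releases
from scrapy.http import FormRequest, Request

# These constants are not shown in the original post; the URLs are inferred
# from the log output below, and the credentials are placeholders.
LOGIN_URL = 'https://login.yahoo.com/config/login'
MSG_URL = 'http://launch.groups.yahoo.com/group/MyYahooGroup/message/%d'
LOGIN = 'username'
PASSWORD = 'password'


class Sg101Spider(BaseSpider):
    name = "sg101"
    msg_id = 1          # current message to retrieve
    max_msg_id = 21399  # last message to retrieve

    def start_requests(self):
        # Step 1: submit the login form.
        return [FormRequest(LOGIN_URL,
                            formdata={'login': LOGIN, 'passwd': PASSWORD},
                            callback=self.logged_in)]

    def logged_in(self, response):
        # A successful login ends up on the my.yahoo.com landing page.
        if response.url == 'http://my.yahoo.com':
            self.log("Successfully logged in. Now requesting 1st message.")
            return Request(MSG_URL % self.msg_id, callback=self.parse_msg,
                           errback=self.error)
        else:
            self.log("Login failed.")

    def parse_msg(self, response):
        # Step 2: for now, just dump the page to confirm it was retrieved.
        self.log("Got message!")
        print(response.body)

    def error(self, failure):
        self.log("I haz an error")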
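Once I get this much working, I will complete the rest. When I run the spider, though, I see it log in and issue the request for the first message. However, all I see in Scrapy's debug output are 3 redirects, which ultimately arrive back at the URL I asked for in the first place. Scrapy never calls my parse_msg() callback, and the crawl stops. Here is a snippet of the Scrapy output: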

2011-02-03 19:50:10-0600 [sg101] INFO: Spider opened 
2011-02-03 19:50:10-0600 [sg101] DEBUG: Redirecting (302) to <GET https://login.yahoo.com/config/verify?.done=http%3a//my.yahoo.com> from <POST https://login.yahoo.com/config/login> 
2011-02-03 19:50:10-0600 [sg101] DEBUG: Redirecting (meta refresh) to <GET http://my.yahoo.com> from <GET https://login.yahoo.com/config/verify?.done=http%3a//my.yahoo.com> 
2011-02-03 19:50:12-0600 [sg101] DEBUG: Crawled (200) <GET http://my.yahoo.com> (referer: None) 
2011-02-03 19:50:12-0600 [sg101] DEBUG: Successfully logged in. Now requesting 1st message. 
2011-02-03 19:50:12-0600 [sg101] DEBUG: Redirecting (302) to <GET http://launch.groups.yahoo.com/group/MyYahooGroup/auth?done=http%3A%2F%2Flaunch.groups.yahoo.com%2Fgroup%2FMyYahooGroup%2Fmessage%2F1> from <GET http://launch.groups.yahoo.com/group/MyYahooGroup/message/1> 
2011-02-03 19:50:12-0600 [sg101] DEBUG: Redirecting (302) to <GET http://launch.groups.yahoo.com/group/MyYahooGroup/auth?check=G&done=http%3A%2F%2Flaunch%2Egroups%2Eyahoo%2Ecom%2Fgroup%2FMyYahooGroup%2Fmessage%2F1> from <GET http://launch.groups.yahoo.com/group/MyYahooGroup/auth?done=http%3A%2F%2Flaunch.groups.yahoo.com%2Fgroup%2FMyYahooGroup%2Fmessage%2F1> 
2011-02-03 19:50:13-0600 [sg101] DEBUG: Redirecting (302) to <GET http://launch.groups.yahoo.com/group/MyYahooGroup/message/1> from <GET http://launch.groups.yahoo.com/group/MyYahooGroup/auth?check=G&done=http%3A%2F%2Flaunch%2Egroups%2Eyahoo%2Ecom%2Fgroup%2FMyYahooGroup%2Fmessage%2F1> 
2011-02-03 19:50:13-0600 [sg101] INFO: Closing spider (finished) 
2011-02-03 19:50:13-0600 [sg101] INFO: Spider closed (finished) 

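I can't make sense of this. It looks like Yahoo is redirecting the spider (perhaps for an authentication check?), but the redirects seem to end up back at the URL I wanted to visit in the first place. Yet Scrapy never calls my callback, so I have no chance to scrape the data or continue crawling.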
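
One way to confirm that the redirects really loop back is to extract the hops from the debug log. The following is a rough sketch that assumes the "Redirecting (...) to <...> from <...>" line format shown above; `redirect_hops` and `loops_back` are hypothetical helper names, and the sample log uses shortened stand-in URLs rather than the real ones:

```python
import re

# Matches lines such as:
#   Redirecting (302) to <GET target-url> from <GET source-url>
REDIRECT_RE = re.compile(r"Redirecting \([^)]*\) to <\w+ (\S+)> from <\w+ (\S+)>")


def redirect_hops(log_text):
    """Return a (source, target) URL pair for each redirect line in the log."""
    return [(m.group(2), m.group(1)) for m in REDIRECT_RE.finditer(log_text)]


def loops_back(hops):
    """True if the last redirect targets a URL requested earlier in the chain."""
    if not hops:
        return False
    final_target = hops[-1][1]
    return any(source == final_target for source, _ in hops)


# Shortened stand-in for the three group redirects in the log above.
sample_log = """
DEBUG: Redirecting (302) to <GET http://example.com/auth?done=/message/1> from <GET http://example.com/message/1>
DEBUG: Redirecting (302) to <GET http://example.com/auth?check=G&done=/message/1> from <GET http://example.com/auth?done=/message/1>
DEBUG: Redirecting (302) to <GET http://example.com/message/1> from <GET http://example.com/auth?check=G&done=/message/1>
"""

hops = redirect_hops(sample_log)
print(loops_back(hops))
```

If the final target matches a URL that was already requested, the chain has come full circle, which is worth knowing before looking at how the crawler treats repeated URLs.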

Does anyone have any ideas about what is going on and/or how to debug this further? Thanks!

Answer

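I think Yahoo is redirecting for an authorization check and ultimately redirects me back to the page I really wanted to get. However, Scrapy has already seen a request for that URL, so it stops rather than risk getting into a loop. The solution, in my case, is to add dont_filter=True to the Request constructor, which instructs Scrapy not to filter out duplicate requests. This is fine for me because I know ahead of time which URLs I want to crawl.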

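from scrapy import log  # provides the log.INFO and log.CRITICAL levels in Scrapy 0.x

def logged_in(self, response):
    if response.url == 'http://my.yahoo.com':
        self.log("Successfully logged in. Now requesting message page.",
                 level=log.INFO)
        # dont_filter=True keeps the duplicate filter from dropping this request
        # when the redirect chain returns to the same URL.
        return Request(MSG_URL % self.msg_id, callback=self.parse_msg,
                       errback=self.error, dont_filter=True)
    else:
        self.log("Login failed.", level=log.CRITICAL)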
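
To illustrate the reasoning, here is a toy model of duplicate filtering. It is a simplified sketch, not Scrapy's actual implementation: `SeenFilter` and `follow_chain` are hypothetical names, the URLs are shortened, and it assumes that each redirected request inherits the `dont_filter` setting of the request that started the chain.

```python
class SeenFilter:
    """Toy duplicate filter that remembers every URL it has let through."""

    def __init__(self):
        self.seen = set()

    def allow(self, url, dont_filter=False):
        if dont_filter:
            return True   # bypass the duplicate check entirely
        if url in self.seen:
            return False  # already requested once, so drop it
        self.seen.add(url)
        return True


def follow_chain(urls, dont_filter=False):
    """Return the URLs that get fetched, stopping at the first dropped one."""
    dupe_filter = SeenFilter()
    fetched = []
    for url in urls:
        if not dupe_filter.allow(url, dont_filter=dont_filter):
            break
        fetched.append(url)
    return fetched


# The message request, two auth redirects, then back to the message URL.
chain = [
    "/message/1",
    "/auth?done=/message/1",
    "/auth?check=G&done=/message/1",
    "/message/1",
]

default_run = follow_chain(chain)                       # last hop is dropped
unfiltered_run = follow_chain(chain, dont_filter=True)  # all four hops fetched
```

In this model the default run stops after the third hop because the fourth URL has already been seen, which mirrors the crawl ending without the callback ever being invoked.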