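Problem using Scrapy to scrape a Yahoo Group

I'm new to web scraping and have just started experimenting with Scrapy, a scraping framework written in Python. My goal is to scrape an old Yahoo Group, since Yahoo doesn't provide an API or any other means of retrieving the message archives. The Yahoo Group is set up so that you must log in before you can view the archives.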
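The steps I need to accomplish, I think, are:
1. Log in to Yahoo
2. Visit the URL of the first message and scrape it
3. Repeat step 2 for the next message, and so on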
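As a rough sketch of steps 2 and 3, the message URLs can be generated by walking the message ids in order. The URL pattern is the one that appears in the Scrapy output further down (with `MyYahooGroup` as a placeholder group name), and `message_urls` is only a hypothetical helper name, not part of Scrapy.

```python
# URL pattern as seen in the Scrapy output; the group name is a placeholder.
MSG_URL = "http://launch.groups.yahoo.com/group/MyYahooGroup/message/%d"


def message_urls(first_id, last_id):
    """Yield the URL of every message from first_id to last_id, inclusive."""
    for msg_id in range(first_id, last_id + 1):
        yield MSG_URL % msg_id
```

Each URL yielded this way would become one request in the spider, issued only after the login has succeeded.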
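I started roughing out a Scrapy spider to do the above, and here is what I have so far. All I want to observe for now is that the login works and that I can retrieve the first message.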
    # Legacy Scrapy (0.x) / Python 2 code.
    from scrapy.spider import BaseSpider
    from scrapy.http import FormRequest, Request

    # LOGIN_URL, LOGIN, PASSWORD and MSG_URL are not shown in this snippet.

    class Sg101Spider(BaseSpider):
        name = "sg101"
        msg_id = 1  # current message to retrieve
        max_msg_id = 21399  # last message to retrieve

        def start_requests(self):
            # Log in first by POSTing the credentials to the login form.
            return [FormRequest(LOGIN_URL,
                                formdata={'login': LOGIN, 'passwd': PASSWORD},
                                callback=self.logged_in)]

        def logged_in(self, response):
            # A successful login ends up on my.yahoo.com after the redirects.
            if response.url == 'http://my.yahoo.com':
                self.log("Successfully logged in. Now requesting 1st message.")
                return Request(MSG_URL % self.msg_id, callback=self.parse_msg,
                               errback=self.error)
            else:
                self.log("Login failed.")

        def parse_msg(self, response):
            self.log("Got message!")
            print response.body

        def error(self, failure):
            self.log("I haz an error")
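I'll finish the rest once I get this much working. When I run the spider, though, I see it log in and issue the request for the first message. However, all I see in Scrapy's debug output are 3 redirects, ultimately arriving at the URL I asked for in the first place. But Scrapy never calls my parse_msg() callback, and the crawl stops. Here is a snippet of the Scrapy output: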
2011-02-03 19:50:10-0600 [sg101] INFO: Spider opened
2011-02-03 19:50:10-0600 [sg101] DEBUG: Redirecting (302) to <GET https://login.yahoo.com/config/verify?.done=http%3a//my.yahoo.com> from <POST https://login.yahoo.com/config/login>
2011-02-03 19:50:10-0600 [sg101] DEBUG: Redirecting (meta refresh) to <GET http://my.yahoo.com> from <GET https://login.yahoo.com/config/verify?.done=http%3a//my.yahoo.com>
2011-02-03 19:50:12-0600 [sg101] DEBUG: Crawled (200) <GET http://my.yahoo.com> (referer: None)
2011-02-03 19:50:12-0600 [sg101] DEBUG: Successfully logged in. Now requesting 1st message.
2011-02-03 19:50:12-0600 [sg101] DEBUG: Redirecting (302) to <GET http://launch.groups.yahoo.com/group/MyYahooGroup/auth?done=http%3A%2F%2Flaunch.groups.yahoo.com%2Fgroup%2FMyYahooGroup%2Fmessage%2F1> from <GET http://launch.groups.yahoo.com/group/MyYahooGroup/message/1>
2011-02-03 19:50:12-0600 [sg101] DEBUG: Redirecting (302) to <GET http://launch.groups.yahoo.com/group/MyYahooGroup/auth?check=G&done=http%3A%2F%2Flaunch%2Egroups%2Eyahoo%2Ecom%2Fgroup%2FMyYahooGroup%2Fmessage%2F1> from <GET http://launch.groups.yahoo.com/group/MyYahooGroup/auth?done=http%3A%2F%2Flaunch.groups.yahoo.com%2Fgroup%2FMyYahooGroup%2Fmessage%2F1>
2011-02-03 19:50:13-0600 [sg101] DEBUG: Redirecting (302) to <GET http://launch.groups.yahoo.com/group/MyYahooGroup/message/1> from <GET http://launch.groups.yahoo.com/group/MyYahooGroup/auth?check=G&done=http%3A%2F%2Flaunch%2Egroups%2Eyahoo%2Ecom%2Fgroup%2FMyYahooGroup%2Fmessage%2F1>
2011-02-03 19:50:13-0600 [sg101] INFO: Closing spider (finished)
2011-02-03 19:50:13-0600 [sg101] INFO: Spider closed (finished)
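To make the redirect chain in the output above easier to follow, here is a small sketch that pulls the source and target URL out of each "Redirecting" line and checks whether the chain ends where it started. It uses only the standard library; `redirect_hops` and `returns_to_start` are hypothetical helper names, and the regular expression is written against the line format shown above rather than any documented Scrapy log format.

```python
import re

# Matches "Redirecting (...) to <GET target> from <METHOD source>" lines
# in the shape they have in the output above.
REDIRECT_RE = re.compile(
    r"Redirecting \((?P<kind>[^)]+)\) to <GET (?P<target>\S+)> "
    r"from <\w+ (?P<source>\S+)>"
)


def redirect_hops(log_text):
    """Return a (source, target) URL pair for each redirect line in the log."""
    return [(m.group("source"), m.group("target"))
            for m in REDIRECT_RE.finditer(log_text)]


def returns_to_start(hops):
    """True if the last hop lands on the URL the first hop started from."""
    return bool(hops) and hops[-1][1] == hops[0][0]


# The three group redirects, copied from the output above.
SAMPLE_LOG = """\
2011-02-03 19:50:12-0600 [sg101] DEBUG: Redirecting (302) to <GET http://launch.groups.yahoo.com/group/MyYahooGroup/auth?done=http%3A%2F%2Flaunch.groups.yahoo.com%2Fgroup%2FMyYahooGroup%2Fmessage%2F1> from <GET http://launch.groups.yahoo.com/group/MyYahooGroup/message/1>
2011-02-03 19:50:12-0600 [sg101] DEBUG: Redirecting (302) to <GET http://launch.groups.yahoo.com/group/MyYahooGroup/auth?check=G&done=http%3A%2F%2Flaunch%2Egroups%2Eyahoo%2Ecom%2Fgroup%2FMyYahooGroup%2Fmessage%2F1> from <GET http://launch.groups.yahoo.com/group/MyYahooGroup/auth?done=http%3A%2F%2Flaunch.groups.yahoo.com%2Fgroup%2FMyYahooGroup%2Fmessage%2F1>
2011-02-03 19:50:13-0600 [sg101] DEBUG: Redirecting (302) to <GET http://launch.groups.yahoo.com/group/MyYahooGroup/message/1> from <GET http://launch.groups.yahoo.com/group/MyYahooGroup/auth?check=G&done=http%3A%2F%2Flaunch%2Egroups%2Eyahoo%2Ecom%2Fgroup%2FMyYahooGroup%2Fmessage%2F1>
"""

hops = redirect_hops(SAMPLE_LOG)
```

For these three lines, `returns_to_start(hops)` is true: the last redirect targets exactly the message URL that was requested first.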
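I can't make sense of this. It looks like Yahoo is redirecting the spider (maybe for an authentication check?), but it appears to end up back at the URL I wanted to visit in the first place. Yet Scrapy never calls my callback, so I get no chance to scrape the data or continue crawling.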
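One hypothesis worth testing: Scrapy's scheduler drops a request it considers a duplicate of one it has already seen, unless that request was created with `dont_filter=True`. Because the last redirect points back to the very URL that was requested first, it may be getting filtered out silently. The toy model below is not Scrapy's code; `ToyScheduler` and `crawl_chain` are made up purely to illustrate the idea, and the short strings stand in for the full URLs.

```python
class ToyScheduler(object):
    """Hypothetical stand-in for a scheduler with a duplicate filter."""

    def __init__(self):
        self.seen = set()

    def enqueue(self, url, dont_filter=False):
        """Return True if the request would be scheduled, False if dropped."""
        if not dont_filter and url in self.seen:
            return False
        self.seen.add(url)
        return True


def crawl_chain(urls, dont_filter=False):
    """Feed a redirect chain through a fresh scheduler, one hop at a time."""
    scheduler = ToyScheduler()
    return [scheduler.enqueue(url, dont_filter=dont_filter) for url in urls]


# Abbreviated stand-ins for the URLs in the redirect chain.
CHAIN = ["message/1", "auth?done=...", "auth?check=G&done=...", "message/1"]

filtered = crawl_chain(CHAIN)                       # default behaviour
unfiltered = crawl_chain(CHAIN, dont_filter=True)   # filter bypassed
```

In this model the fourth hop is dropped by default and accepted once the filter is bypassed. If the same thing is happening in the real crawl, passing `dont_filter=True` to the `Request` created in `logged_in()` may let the final redirect through; this is a guess that has not been confirmed against Yahoo.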
Does anyone have any ideas about what is going on, or about how to debug this further? Thanks!