Crawling LinkedIn while authenticated with Scrapy

So I've read through Crawling with an authenticated session in Scrapy, and I'm hung up. I'm 99% sure my parse code is correct; I just don't believe the login is redirecting and succeeding.
I also have a problem with check_login_response() in that I'm not sure which page it is checking, though "Sign Out" would make sense.
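One way to make that check concrete is to factor it into a small helper that looks for a marker string which only appears for authenticated users ("Sign Out" here; the right marker is site-specific and an assumption, not something Scrapy dictates):

```python
def looks_logged_in(body, marker="Sign Out"):
    """Return True if the response body contains a string that should
    only appear when the session is authenticated. The marker is a
    site-specific assumption; inspect the logged-in page to pick one."""
    return marker in body

# Hypothetical response bodies, for illustration only:
print(looks_logged_in("<a href='/logout'>Sign Out</a>"))  # True
print(looks_logged_in("<a href='/login'>Sign In</a>"))    # False
```

A callback like check_login_response() can then just call this helper on response.body, which also makes the check easy to unit-test outside the spider.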
====== UPDATED ======
from scrapy.contrib.spiders.init import InitSpider
from scrapy.http import Request, FormRequest
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import Rule
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

from linkedpy.items import LinkedPyItem

class LinkedPySpider(InitSpider):
    name = 'LinkedPy'
    allowed_domains = ['linkedin.com']
    login_page = 'https://www.linkedin.com/uas/login'
    start_urls = ["http://www.linkedin.com/csearch/results?type=companies&keywords=&pplSearchOrigin=GLHD&pageKey=member-home&search=Search#facets=pplSearchOrigin%3DFCTD%26keywords%3D%26search%3DSubmit%26facet_CS%3DC%26facet_I%3D80%26openFacets%3DJO%252CN%252CCS%252CNFR%252CF%252CCCR%252CI"]

    def init_request(self):
        """This function is called before crawling starts."""
        return Request(url=self.login_page, callback=self.login)

    def login(self, response):
        """Generate a login request."""
        return FormRequest.from_response(response,
                    formdata={'session_key': '[email protected].com', 'session_password': 'somepassword'},
                    callback=self.check_login_response)

    def check_login_response(self, response):
        """Check the response returned by a login request to see if we
        are successfully logged in."""
        if "Sign Out" in response.body:
            self.log("\n\n\nSuccessfully logged in. Let's start crawling!\n\n\n")
            # Now the crawling can begin..
            return self.initialized() # ****THIS LINE FIXED THE LAST PROBLEM*****
        else:
            self.log("\n\n\nFailed, Bad times :(\n\n\n")
            # Something went wrong, we couldn't log in, so nothing happens.

    def parse(self, response):
        self.log("\n\n\n We got data! \n\n\n")
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//ol[@id=\'result-set\']/li')
        items = []
        for site in sites:
            item = LinkedPyItem()
            item['title'] = site.select('h2/a/text()').extract()
            item['link'] = site.select('h2/a/@href').extract()
            items.append(item)
        return items
The problem was solved by adding 'return' before self.initialized().
Thanks again! Mark
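The reason that `return` matters can be sketched without Scrapy: InitSpider's engine looks at what each initialization callback returns, and only the object produced by self.initialized() signals that it may move on to start_urls. Calling self.initialized() without returning its result discards that signal. The following toy model (class and sentinel names are invented for illustration, this is not Scrapy's actual code) shows the difference:

```python
class ToyInitSpider:
    """Toy re-creation of the InitSpider handshake, for illustration only."""

    def __init__(self):
        self.crawl_started = False

    def initialized(self):
        # Sentinel value the (toy) engine watches for.
        return "INIT_DONE"

    def _engine_step(self, callback_result):
        # The engine only starts crawling when the sentinel reaches it;
        # a callback that implicitly returns None leaves the spider stuck.
        if callback_result == "INIT_DONE":
            self.crawl_started = True

    def check_login_response_buggy(self, body):
        if "Sign Out" in body:
            self.initialized()          # result discarded: crawl never starts

    def check_login_response_fixed(self, body):
        if "Sign Out" in body:
            return self.initialized()   # sentinel propagates to the engine

spider = ToyInitSpider()
spider._engine_step(spider.check_login_response_buggy("... Sign Out ..."))
assert spider.crawl_started is False    # bug: login succeeded, nothing crawled

spider._engine_step(spider.check_login_response_fixed("... Sign Out ..."))
assert spider.crawl_started is True     # fix: crawling begins
```

The same logic explains why the failure branch, which returns nothing, quietly ends the crawl.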
What happens when you run the code above? – Acorn
'request_depth_max': 1, 'scheduler/memory_enqueued': 3, 'start_time': datetime.datetime(2012, 6, 8, 18, 31, 48, 252601)} 2012-06-08 14:31:49-0400 [LinkedPy] INFO: Spider closed (finished) 2012-06-08 14:31:49-0400 [scrapy] INFO: Dumping global stats: { – Gates
That kind of information should go in your original question, not in a comment. – Acorn