Crawling LinkedIn while authenticated with Scrapy

2012-06-08

So I've read through Crawling with an authenticated session in Scrapy and I'm stuck: I'm 99% sure my parsing code is correct, I just don't believe the login is redirecting and succeeding.

I'm also having trouble with check_login_response() in that I'm not sure which page it is checking, though checking for "Sign Out" makes sense.




====== UPDATED ======

from scrapy.contrib.spiders.init import InitSpider
from scrapy.http import Request, FormRequest
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import Rule

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

from linkedpy.items import LinkedPyItem

class LinkedPySpider(InitSpider):
    name = 'LinkedPy'
    allowed_domains = ['linkedin.com']
    login_page = 'https://www.linkedin.com/uas/login'
    start_urls = ["http://www.linkedin.com/csearch/results?type=companies&keywords=&pplSearchOrigin=GLHD&pageKey=member-home&search=Search#facets=pplSearchOrigin%3DFCTD%26keywords%3D%26search%3DSubmit%26facet_CS%3DC%26facet_I%3D80%26openFacets%3DJO%252CN%252CCS%252CNFR%252CF%252CCCR%252CI"]

    def init_request(self):
        """This function is called before crawling starts."""
        return Request(url=self.login_page, callback=self.login)

    def login(self, response):
        """Generate a login request."""
        return FormRequest.from_response(response,
            formdata={'session_key': '[email protected].com', 'session_password': 'somepassword'},
            callback=self.check_login_response)

    def check_login_response(self, response):
        """Check the response returned by a login request to see if we are
        successfully logged in."""
        if "Sign Out" in response.body:
            self.log("\n\n\nSuccessfully logged in. Let's start crawling!\n\n\n")
            # Now the crawling can begin...
            return self.initialized()  # ****THIS LINE FIXED THE LAST PROBLEM*****
        else:
            self.log("\n\n\nFailed, Bad times :(\n\n\n")
            # Something went wrong, we couldn't log in, so nothing happens.

    def parse(self, response):
        self.log("\n\n\n We got data! \n\n\n")
        hxs = HtmlXPathSelector(response)
        sites = hxs.select("//ol[@id='result-set']/li")
        items = []
        for site in sites:
            item = LinkedPyItem()
            item['title'] = site.select('h2/a/text()').extract()
            item['link'] = site.select('h2/a/@href').extract()
            items.append(item)
        return items



The problem was solved by adding 'return' in front of self.initialized().

Thanks again! Mark


What happens when you run the code above? – Acorn


'request_depth_max': 1, 'scheduler/memory_enqueued': 3, 'start_time': datetime.datetime(2012, 6, 8, 18, 31, 48, 252601)} 2012-06-08 14:31:49-0400 [LinkedPy] INFO: Spider closed (finished) 2012-06-08 14:31:49-0400 [scrapy] INFO: Dumping global stats: { – Gates


That kind of information should go into your original question, not into a comment. – Acorn

Answer

class LinkedPySpider(BaseSpider): 

should be:

class LinkedPySpider(InitSpider): 

And as I mentioned in my answer here, you shouldn't override the parse function: https://stackoverflow.com/a/5857202/crawling-with-an-authenticated-session-in-scrapy

If you don't know how to define rules for extracting links, just read the documentation properly:
http://readthedocs.org/docs/scrapy/en/latest/topics/spiders.html#scrapy.contrib.spiders.Rule
http://readthedocs.org/docs/scrapy/en/latest/topics/link-extractors.html#topics-link-extractors
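
For what it's worth, the Rule/SgmlLinkExtractor pattern those pages describe looks roughly like the sketch below. It is written against the same Scrapy 0.x contrib API used in the question; the spider name, allow pattern, and the parse_item callback are illustrative only, and note that rules are processed by CrawlSpider rather than by the plain InitSpider used above:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class ExampleSpider(CrawlSpider):
    name = 'example'
    allowed_domains = ['linkedin.com']
    start_urls = ['http://www.linkedin.com/csearch/results']

    # Each Rule says which links to follow and which callback parses the
    # pages they lead to; CrawlSpider's built-in parse() does the
    # dispatching, which is why it must not be overridden.
    rules = (
        Rule(SgmlLinkExtractor(allow=r'/companies/'),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # Illustrative callback; real extraction logic would go here.
        self.log('Crawled %s' % response.url)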


That definitely helped. I can see the success in the logs. **But** I'm not sure that 'def parse(self, response):' is actually running. I tried putting self.log() in there and nothing came back. – Gates


Also, the 'start_urls' URLs don't seem to show up in the log – Gates


It looks like 'parse()' should be 'parse_item()' – Gates
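
If the spider does move to rules as suggested in the answer, the method from the question would simply be renamed so a Rule can reference it by name instead of shadowing the built-in parse(). A minimal sketch of that method, reusing HtmlXPathSelector and LinkedPyItem exactly as imported in the question's code:

    def parse_item(self, response):
        # Same extraction as the original parse(), renamed so it no longer
        # overrides the default callback and can be wired up via a Rule.
        hxs = HtmlXPathSelector(response)
        for site in hxs.select("//ol[@id='result-set']/li"):
            item = LinkedPyItem()
            item['title'] = site.select('h2/a/text()').extract()
            item['link'] = site.select('h2/a/@href').extract()
            yield item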