2013-07-11 79 views
2

如何將下面的這個工作示例轉換爲crawlSpider並深入抓取不僅僅是第一個主頁而是深入。這個例子工作正常沒有錯誤,但我想使用爬蟲而不是InitSpider和深入爬行。由於從CrawlSpider提前scrapy將InitSpider轉換爲CrawlSpider w /登錄

from scrapy.contrib.spiders.init import InitSpider 
from scrapy.http import Request, FormRequest 
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor 
from scrapy.contrib.spiders import Rule 

from scrapy.spider import BaseSpider 
from scrapy.selector import HtmlXPathSelector 

from linkedpy.items import LinkedpyItem 

class LinkedPySpider(InitSpider): 
    name = 'LinkedPy' 
    allowed_domains = ['linkedin.com'] 
    login_page = 'https://www.linkedin.com/uas/login' 
    start_urls = ["http://www.linkedin.com/csearch/results"] 

    def init_request(self): 
    #"""This function is called before crawling starts.""" 
    return Request(url=self.login_page, callback=self.login) 

    def login(self, response): 
    #"""Generate a login request.""" 
    return FormRequest.from_response(response, 
      formdata={'session_key': '[email protected]', 'session_password': 'xxxxx'}, 
      callback=self.check_login_response) 

    def check_login_response(self, response): 
    #"""Check the response returned by a login request to see if we aresuccessfully logged in.""" 
    if "Sign Out" in response.body: 
     self.log("\n\n\nSuccessfully logged in. Let's start crawling!\n\n\n") 
     # Now the crawling can begin.. 

     return self.initialized() 

    else: 
     self.log("\n\n\nFailed, Bad times :(\n\n\n") 
     # Something went wrong, we couldn't log in, so nothing happens. 

    def parse(self, response): 
    self.log("\n\n\n We got data! \n\n\n") 
    hxs = HtmlXPathSelector(response) 
    sites = hxs.select('//ol[@id=\'result-set\']/li') 
    items = [] 
    for site in sites: 
     item = LinkedpyItem() 
     item['title'] = site.select('h2/a/text()').extract() 
     item['link'] = site.select('h2/a/@href').extract() 
     items.append(item) 
    return items 

輸出

2013-07-11 15:50:01-0500 [scrapy] INFO: Scrapy 0.16.5 started (bot: linkedpy) 
2013-07-11 15:50:01-0500 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetCon 
sole, CloseSpider, WebService, CoreStats, SpiderState 
2013-07-11 15:50:01-0500 [scrapy] DEBUG: Enabled downloader middlewares: HttpAut 
hMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, De 
faultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMi 
ddleware, ChunkedTransferMiddleware, DownloaderStats 
2013-07-11 15:50:01-0500 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMi 
ddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddle 
ware 
2013-07-11 15:50:01-0500 [scrapy] DEBUG: Enabled item pipelines: 
2013-07-11 15:50:01-0500 [LinkedPy] INFO: Spider opened 
2013-07-11 15:50:01-0500 [LinkedPy] INFO: Crawled 0 pages (at 0 pages/min), scra 
ped 0 items (at 0 items/min) 
2013-07-11 15:50:01-0500 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:602 
3 
2013-07-11 15:50:01-0500 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080 
2013-07-11 15:50:02-0500 [LinkedPy] DEBUG: Crawled (200) <GET https://www.linked 
in.com/uas/login> (referer: None) 
2013-07-11 15:50:02-0500 [LinkedPy] DEBUG: Redirecting (302) to <GET http://www. 
linkedin.com/nhome/> from <POST https://www.linkedin.com/uas/login-submit> 
2013-07-11 15:50:04-0500 [LinkedPy] DEBUG: Crawled (200) <GET http://www.linkedi 
n.com/nhome/> (referer: https://www.linkedin.com/uas/login) 
2013-07-11 15:50:04-0500 [LinkedPy] DEBUG: 


    Successfully logged in. Let's start crawling! 



2013-07-11 15:50:05-0500 [LinkedPy] DEBUG: Crawled (200) <GET http://www.linkedi 
n.com/csearch/results> (referer: http://www.linkedin.com/nhome/) 
2013-07-11 15:50:05-0500 [LinkedPy] DEBUG: 


    We got data! 



2013-07-11 15:50:05-0500 [LinkedPy] DEBUG: Scraped from <200 http://www.linkedin 
.com/csearch/results> 
    {'link': [u'/companies/1009/IBM?trk=ncsrch_hits&goback=%2Efcs_*2_*2_fals 
e_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2'], 
    'title': [u'IBM']} 
2013-07-11 15:50:05-0500 [LinkedPy] DEBUG: Scraped from <200 http://www.linkedin 
.com/csearch/results> 
    {'link': [u'/companies/1033/Accenture?trk=ncsrch_hits&goback=%2Efcs_*2_* 
2_false_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2'], 
    'title': [u'Accenture']} 
2013-07-11 15:50:05-0500 [LinkedPy] DEBUG: Scraped from <200 http://www.linkedin 
.com/csearch/results> 
    {'link': [u'/companies/1038/Deloitte?trk=ncsrch_hits&goback=%2Efcs_*2_*2 
_false_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2'], 
    'title': [u'Deloitte']} 
2013-07-11 15:50:05-0500 [LinkedPy] DEBUG: Scraped from <200 http://www.linkedin 
.com/csearch/results> 
    {'link': [u'/companies/1035/Microsoft?trk=ncsrch_hits&goback=%2Efcs_*2_* 
2_false_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2'], 
    'title': [u'Microsoft']} 
2013-07-11 15:50:05-0500 [LinkedPy] DEBUG: Scraped from <200 http://www.linkedin 
.com/csearch/results> 
    {'link': [u'/companies/1025/Hewlett-Packard?trk=ncsrch_hits&goback=%2Efc 
s_*2_*2_false_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2'], 
    'title': [u'Hewlett-Packard']} 
2013-07-11 15:50:05-0500 [LinkedPy] DEBUG: Scraped from <200 http://www.linkedin 
.com/csearch/results> 
    {'link': [u'/companies/1028/Oracle?trk=ncsrch_hits&goback=%2Efcs_*2_*2_f 
alse_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2'], 
    'title': [u'Oracle']} 
2013-07-11 15:50:05-0500 [LinkedPy] DEBUG: Scraped from <200 http://www.linkedin 
.com/csearch/results> 
    {'link': [u'/companies/1093/Dell?trk=ncsrch_hits&goback=%2Efcs_*2_*2_fal 
se_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2'], 
    'title': [u'Dell']} 
2013-07-11 15:50:05-0500 [LinkedPy] DEBUG: Scraped from <200 http://www.linkedin 
.com/csearch/results> 
    {'link': [u'/companies/1123/Bank+of+America?trk=ncsrch_hits&goback=%2Efc 
s_*2_*2_false_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2'], 
    'title': [u'Bank of America']} 
2013-07-11 15:50:05-0500 [LinkedPy] DEBUG: Scraped from <200 http://www.linkedin 
.com/csearch/results> 
    {'link': [u'/companies/1015/GE?trk=ncsrch_hits&goback=%2Efcs_*2_*2_false 
_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2'], 
    'title': [u'GE']} 
2013-07-11 15:50:05-0500 [LinkedPy] DEBUG: Scraped from <200 http://www.linkedin 
.com/csearch/results> 
    {'link': [u'/companies/1441/Google?trk=ncsrch_hits&goback=%2Efcs_*2_*2_f 
alse_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2'], 
    'title': [u'Google']} 
2013-07-11 15:50:05-0500 [LinkedPy] INFO: Closing spider (finished) 
2013-07-11 15:50:05-0500 [LinkedPy] INFO: Dumping Scrapy stats: 
    {'downloader/request_bytes': 2243, 
    'downloader/request_count': 4, 
    'downloader/request_method_count/GET': 3, 
    'downloader/request_method_count/POST': 1, 
    'downloader/response_bytes': 91349, 
    'downloader/response_count': 4, 
    'downloader/response_status_count/200': 3, 
    'downloader/response_status_count/302': 1, 
    'finish_reason': 'finished', 
    'finish_time': datetime.datetime(2013, 7, 11, 20, 50, 5, 177000), 
    'item_scraped_count': 10, 
    'log_count/DEBUG': 22, 
    'log_count/INFO': 4, 
    'request_depth_max': 2, 
    'response_received_count': 3, 
    'scheduler/dequeued': 4, 
    'scheduler/dequeued/memory': 4, 
    'scheduler/enqueued': 4, 
    'scheduler/enqueued/memory': 4, 
    'start_time': datetime.datetime(2013, 7, 11, 20, 50, 1, 649000)} 
2013-07-11 15:50:05-0500 [LinkedPy] INFO: Spider closed (finished) 
+0

如何,如果你有一個問題,關於你寫一些代碼,然後再去做你問在這裏? –

+0

@John我寫了一些代碼併發布了一個問題,但沒有人回答它[上一篇文章](http://stackoverflow.com/questions/17578727/scrapy-log-into-a-site-and-doa-a- crawlspider,但是,無工作) – Gio

回答

6

繼承,只是覆蓋start_requests,而不是init_request

def start_requests(self): 
    yield Request(
     url=self.login_page, 
     callback=self.login, 
     dont_filter=True 
    ) 

parseCrawlSpider使用實際抓取的鏈接的方法,所以將parse方法重命名爲其他內容。

此外,您還可以使用發電機,而不是做一個列表:

def parse_page(self, response): 
    self.log("\n\n\n We got data! \n\n\n") 

    hxs = HtmlXPathSelector(response) 
    sites = hxs.select('//ol[@id=\'result-set\']/li') 

    for site in sites: 
     item = LinkedpyItem() 
     item['title'] = site.select('./h2/a/text()').extract() 
     item['link'] = site.select('./h2/a/@href').extract() 

     yield item