
Scrapy: unable to scrape pages that require a login

I want to scrape the pages shown below, but they require authentication. I tried the code below, but it reports that 0 pages were scraped. I can't figure out what the problem is. Can someone please help?

from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders.init import InitSpider
from scrapy.http import Request, FormRequest
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import Rule
from kappaal.items import KappaalItem

class KappaalCrawler(InitSpider):
    name = "initkappaal"
    allowed_domains = ["http://www.kappaalphapsi1911.com/"]
    login_page = 'http://www.kappaalphapsi1911.com/login.aspx'
    #login_page = 'https://kap.site-ym.com/Login.aspx'
    start_urls = ["http://www.kappaalphapsi1911.com/search/newsearch.asp?cdlGroupID=102044"]

    rules = (Rule(SgmlLinkExtractor(allow=r'-\w$'), callback='parseItems', follow=True),)
    #rules = (Rule(SgmlLinkExtractor(allow=("*",), restrict_xpaths=("//*[contains(@id, 'SearchResultsGrid')]",)), callback="parseItems", follow=True),)

    def init_request(self):
        """This function is called before crawling starts."""
        return Request(url=self.login_page, callback=self.login)

    def login(self, response):
        """Generate a login request."""
        return FormRequest.from_response(response,
            formdata={'u': 'username', 'p': 'password'},
            callback=self.check_login_response)

    def check_login_response(self, response):
        """Check the response returned by a login request to see if we are
        successfully logged in.
        """
        if "Member Search Results" in response.body:
            self.log("Successfully logged in. Let's start crawling!")
            # Now the crawling can begin..
            self.initialized()
        else:
            self.log("Bad times :(")
            # Something went wrong, we couldn't log in, so nothing happens.

    def parseItems(self, response):
        hxs = HtmlXPathSelector(response)
        members = hxs.select('/html/body/form/div[3]/div/table/tbody/tr/td/div/table[2]/tbody')
        print members
        items = []
        for member in members:
            item = KappaalItem()
            item['Name'] = member.select("//a/text()").extract()
            item['MemberLink'] = member.select("//a/@href").extract()
            #item['EmailID'] =
            #print item['Name'], item['MemberLink']
            items.append(item)
        return items

Output from running the scraper:

2013-01-23 07:08:23+0530 [scrapy] INFO: Scrapy 0.16.3 started (bot: kappaal) 
2013-01-23 07:08:23+0530 [scrapy] DEBUG: Enabled extensions: FeedExporter, LogStats,  TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState 
2013-01-23 07:08:23+0530 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2013-01-23 07:08:23+0530 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware 
2013-01-23 07:08:23+0530 [scrapy] DEBUG: Enabled item pipelines: 
2013-01-23 07:08:23+0530 [initkappaal] INFO: Spider opened 
2013-01-23 07:08:23+0530 [initkappaal] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 
2013-01-23 07:08:23+0530 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023 
2013-01-23 07:08:23+0530 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080 
2013-01-23 07:08:26+0530 [initkappaal] DEBUG: Crawled (200) <GET https://kap.site-ym.com/Login.aspx> (referer: None) 
2013-01-23 07:08:26+0530 [initkappaal] DEBUG: Filtered offsite request to 'kap.site-ym.com': <GET https://kap.site-ym.com/search/all.asp?bst=Enter+search+criteria...&p=P%40ssw0rd&u=9900146> 
2013-01-23 07:08:26+0530 [initkappaal] INFO: Closing spider (finished) 
2013-01-23 07:08:26+0530 [initkappaal] INFO: Dumping Scrapy stats: 
     {'downloader/request_bytes': 231,
      'downloader/request_count': 1,
      'downloader/request_method_count/GET': 1,
      'downloader/response_bytes': 23517,
      'downloader/response_count': 1,
      'downloader/response_status_count/200': 1,
      'finish_reason': 'finished',
      'finish_time': datetime.datetime(2013, 1, 23, 1, 38, 26, 194000),
      'log_count/DEBUG': 8,
      'log_count/INFO': 4,
      'request_depth_max': 1,
      'response_received_count': 1,
      'scheduler/dequeued': 1,
      'scheduler/dequeued/memory': 1,
      'scheduler/enqueued': 1,
      'scheduler/enqueued/memory': 1,
      'start_time': datetime.datetime(2013, 1, 23, 1, 38, 23, 542000)}
2013-01-23 07:08:26+0530 [initkappaal] INFO: Spider closed (finished) 

I don't understand why, after getting the response above, it doesn't authenticate and then parse the start URL.


Your debug output does not match the provided code. Please update the output. – Talvalin

Answers


OK, so I can see a couple of issues. However, I can't test the code without a username and password. Is there a dummy account available for testing purposes?

  1. InitSpider does not implement rules, so while it won't cause a problem, the rules attribute should be removed.
  2. check_login_response needs to return something: initialized() hands the postponed start requests back to the crawler, so its return value must be passed on.

To wit:

def check_login_response(self, response):
    """Check the response returned by a login request to see if we are
    successfully logged in.
    """
    if "Member Search Results" in response.body:
        self.log("Successfully logged in. Let's start crawling!")
        # Now the crawling can begin..
        return self.initialized()
    else:
        self.log("Bad times :(")
        # Something went wrong, we couldn't log in, so nothing happens.
        return

This may not be the answer you're looking for, but I feel your pain...

I ran into the same problem, and I felt the documentation was not sufficient for Scrapy. I ended up using Mechanize to log in instead. If scrapy works well for you now, great, but Mechanize is very straightforward.
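
For what it's worth, a minimal Mechanize login sketch might look like the one below. The form index and the field names ('u', 'p') are assumptions carried over from the question's FormRequest; inspect the real login page to confirm them.

import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)

# Open the login page and fill in the form.
# NOTE: the form index and field names below are assumptions;
# inspect the actual login page to find the correct ones.
br.open('http://www.kappaalphapsi1911.com/login.aspx')
br.select_form(nr=0)   # assume the login form is the first form on the page
br['u'] = 'username'   # field names taken from the question's FormRequest
br['p'] = 'password'
br.submit()

# The Browser keeps the session cookie, so later requests stay authenticated.
response = br.open('http://www.kappaalphapsi1911.com/search/newsearch.asp?cdlGroupID=102044')
print response.read()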


Also, make sure you have cookies enabled, so that once you log in the session stays logged in:

COOKIES_ENABLED = True 
COOKIES_DEBUG = True 

These go in your settings.py file.


I fixed it like this:

def start_requests(self):
    return self.init_request()

def init_request(self):
    return [Request(url=self.login_page, callback=self.login)]

def login(self, response):
    return FormRequest.from_response(response,
        formdata={'username': 'username', 'password': 'password'},
        callback=self.check_login_response)

def check_login_response(self, response):
    if "Logout" in response.body:
        for url in self.start_urls:
            yield self.make_requests_from_url(url)
    else:
        self.log("Could not log in...")

By overriding start_requests, you ensure that the login process completes correctly, and only then does the actual crawling begin.

I'm doing this with a CrawlSpider, and it works perfectly! Hope it helps.
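
Put together, a minimal sketch of this pattern on a CrawlSpider (Scrapy 0.16-era API) might look like the following. The domain, URLs, form field names, and the "Logout" marker are placeholder assumptions carried over from the snippets above:

from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request, FormRequest
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class LoginCrawler(CrawlSpider):
    name = "logincrawler"
    allowed_domains = ["example.com"]  # domains only, no scheme
    login_page = "http://www.example.com/login.aspx"
    start_urls = ["http://www.example.com/search/"]

    rules = (
        Rule(SgmlLinkExtractor(allow=r'-\w+$'), callback='parse_item', follow=True),
    )

    def start_requests(self):
        # Log in first; the start_urls are only requested after a successful login.
        return [Request(url=self.login_page, callback=self.login)]

    def login(self, response):
        # The field names are placeholders; match them to the real login form.
        return FormRequest.from_response(response,
            formdata={'username': 'username', 'password': 'password'},
            callback=self.check_login_response)

    def check_login_response(self, response):
        if "Logout" in response.body:
            # Logged in: hand the real start URLs to the CrawlSpider machinery,
            # whose default parse() callback applies the rules above.
            for url in self.start_urls:
                yield self.make_requests_from_url(url)
        else:
            self.log("Could not log in...")

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        # extraction logic goes here, e.g.:
        # names = hxs.select("//a/text()").extract()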