
My previous question is here: last question (Scrapy is not crawling).

I have now done my best to rethink and improve the structure of my spider. However, for some reason, it still will not start crawling.

I have also checked the XPath expressions, and they work (in the Chrome console).

I join the page URL with the href because the href only ever returns the parameters. I attached an example of the link format in my previous question. (I want to keep this post from getting too long.)
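For reference, this is roughly the joining step I mean (a minimal sketch, not the exact spider code; response.urljoin is the built-in way to resolve a relative href against the page URL):

# sketch: build an absolute URL from an href that only carries the parameters
href = response.xpath("//li/a/@href").extract_first()
base = response.url.split('#')[0]        # drop the #{unid=...} fragment
full_url = base + href                   # what my spider does below
# full_url = response.urljoin(href)      # built-in alternative for relative hrefs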

My spider:

import scrapy
from scrapy.http import Request, FormRequest

# The item classes are assumed to live in the project's items module
# (the project is called crawlKMSS according to the log below).
from crawlKMSS.items import CrawlkmssItem, CrawlkmssFolder, CrawlkmssFile


class kmssSpider(scrapy.Spider):
    name = 'kmss'
    start_url = 'https://kmssqkr.hksarg/LotusQuickr/dept/Main.nsf/h_RoomHome/ade682e34fc59d274825770b0037d278/?OpenDocument#{unid=ADE682E34FC59D274825770B0037D278}'
    login_page = 'https://kmssqkr.hksarg/LotusQuickr/dept/Main.nsf?OpenDatabase&Login'
    allowed_domains = ["kmssqkr.hksarg"]  # note: allowed_domains, not allowed_domain

    def start_requests(self):
        yield Request(url=self.login_page, callback=self.login, dont_filter=True)

    def login(self, response):
        return FormRequest.from_response(
            response,
            formdata={'user': 'username', 'password': 'pw'},
            callback=self.check_login_response)

    def check_login_response(self, response):
        if 'Welcome' in response.body:
            self.log("\n\n\n\n Successfuly Logged in \n\n\n ")
            # no callback given, so the response is handled by self.parse
            yield Request(url=self.start_url,
                          cookies={'LtpaToken2': 'jHxHvqs+NeT...'})
        else:
            self.log("\n\n You are not logged in \n\n ")

    def parse(self, response):
        listattheleft = response.xpath("*//*[@class='qlist']/li[not(contains(@role,'menuitem'))]")
        anyfolder = response.xpath("*//*[@class='q-folderItem']/h4")
        anyfile = response.xpath("*//*[@class='q-otherItem']/h4")

        for each_tab in listattheleft:
            item = CrawlkmssItem()
            item['url'] = each_tab.xpath('a/@href').extract()
            item['title'] = each_tab.xpath('a/text()').extract()
            yield item

            # extract() returns a list, so take the first href as a string
            parameter = each_tab.xpath('a/@href').extract_first('')
            if 'unid' not in parameter:
                locatetheroom = parameter.find('PageLibrary')
                item['room'] = parameter[locatetheroom:]
                locatethestart = response.url.find('#', 0)
                full_url = response.url[:locatethestart] + parameter
                yield Request(url=full_url,
                              cookies={'LtpaToken2': 'jHxHvqs+NeT...'})

        for folder in anyfolder:
            folderparameter = folder.xpath('a/@href').extract_first('')
            locatethestart = response.url.find('#', 0)
            folder_url = response.url[:locatethestart] + folderparameter
            # the callback must be a callable, not the string 'parse_folder'
            yield Request(url=folder_url, callback=self.parse_folder,
                          cookies={'LtpaToken2': 'jHxHvqs+NeT...'})

        for each_file in anyfile:
            fileparameter = each_file.xpath('a/@href').extract_first('')
            locatethestart = response.url.find('#', 0)
            file_url = response.url[:locatethestart] + fileparameter
            yield Request(url=file_url, callback=self.parse_file,
                          cookies={'LtpaToken2': 'jHxHvqs+NeT...'})

    def parse_folder(self, response):
        findfolder = response.xpath("//div[@class='lotusHeader']")
        folderitem = CrawlkmssFolder()
        folderitem['foldername'] = findfolder.xpath('h1/span/span/text()').extract()
        folderitem['url'] = response.url[response.url.find("unid=") + 5:]
        yield folderitem

    def parse_file(self, response):
        findfile = response.xpath("//div[@class='lotusContent']")
        fileitem = CrawlkmssFile()
        fileitem['filename'] = findfile.xpath('a/text()').extract()
        fileitem['title'] = findfile.xpath(".//div[@class='qkrTitle']/span/@title").extract()
        fileitem['author'] = findfile.xpath(".//div[@class='lotusMeta']/span[3]/span/text()").extract()
        yield fileitem
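The item classes referenced above live in my items.py; sketched from the fields the spider uses (not the exact file), they look roughly like this:

import scrapy

class CrawlkmssItem(scrapy.Item):
    url = scrapy.Field()
    title = scrapy.Field()
    room = scrapy.Field()

class CrawlkmssFolder(scrapy.Item):
    foldername = scrapy.Field()
    url = scrapy.Field()

class CrawlkmssFile(scrapy.Item):
    filename = scrapy.Field()
    title = scrapy.Field()
    author = scrapy.Field()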

The information I intend to crawl:

The left-hand sidebar:

(screenshot)

Folders:

(screenshot)

The log:

c:\Users\~\crawlKMSS>scrapy crawl kmss 
2015-07-28 17:54:59 [scrapy] INFO: Scrapy 1.0.1 started (bot: crawlKMSS) 
2015-07-28 17:54:59 [scrapy] INFO: Optional features available: ssl, http11, boto 
2015-07-28 17:54:59 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'crawlKMSS.spiders', 'SPIDER_MODULES': ['crawlKMSS.spiders'], 'BOT_NAME': 'crawlKMSS'} 
2015-07-28 17:54:59 [py.warnings] WARNING: :0: UserWarning: You do not have a working installation of the service_identity module: 'No module named service_identity'. Please install it from <https://pypi.python.org/pypi/service_identity> and make sure all of its dependencies are satisfied. Without the service_identity module and a recent enough pyOpenSSL to support it, Twisted can perform only rudimentary TLS client hostname verification. Many valid certificate/hostname mappings may be rejected. 

2015-07-28 17:54:59 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState 
2015-07-28 17:54:59 [boto] DEBUG: Retrieving credentials from metadata server. 
2015-07-28 17:55:00 [boto] ERROR: Caught exception reading instance data 
Traceback (most recent call last): 
    File "C:\Users\yclam1\AppData\Local\Continuum\Anaconda\lib\site-packages\boto\utils.py", line 210, in retry_url 
    r = opener.open(req, timeout=timeout) 
    File "C:\Users\yclam1\AppData\Local\Continuum\Anaconda\lib\urllib2.py", line 431, in open 
    response = self._open(req, data) 
    File "C:\Users\yclam1\AppData\Local\Continuum\Anaconda\lib\urllib2.py", line 449, in _open 
    '_open', req) 
    File "C:\Users\yclam1\AppData\Local\Continuum\Anaconda\lib\urllib2.py", line 409, in _call_chain 
    result = func(*args) 
    File "C:\Users\yclam1\AppData\Local\Continuum\Anaconda\lib\urllib2.py", line 1227, in http_open 
    return self.do_open(httplib.HTTPConnection, req) 
    File "C:\Users\yclam1\AppData\Local\Continuum\Anaconda\lib\urllib2.py", line 1197, in do_open 
    raise URLError(err) 
URLError: <urlopen error timed out> 
2015-07-28 17:55:00 [boto] ERROR: Unable to read instance data, giving up 
2015-07-28 17:55:01 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, HttpProxyMiddleware, ChunkedTransferMiddleware, DownloaderStats 
2015-07-28 17:55:01 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware 
2015-07-28 17:55:01 [scrapy] INFO: Enabled item pipelines: 
2015-07-28 17:55:01 [scrapy] INFO: Spider opened 
2015-07-28 17:55:01 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 
2015-07-28 17:55:01 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023 
2015-07-28 17:55:05 [scrapy] DEBUG: Crawled (200) <GET https://kmssqkr.hksarg/LotusQuickr/dept/Main.nsf?OpenDatabase&Login> (referer: None) 
2015-07-28 17:55:10 [scrapy] DEBUG: Crawled (200) <POST https://kmssqkr..hksarg/names.nsf?Login> (referer: https://kmssqkr.hksarg/LotusQuickr/dept/Main.nsf?OpenDatabase&Login) 
2015-07-28 17:55:10 [kmss] DEBUG: 



Successfuly Logged in 



2015-07-28 17:55:10 [scrapy] DEBUG: Crawled (200) <GET https://kmssqkr.hksarg/LotusQuickr/dept/Main.nsf/h_RoomHome/ade682e34fc59d274825770b0037d278/?OpenDocument#%7Bunid=ADE682E34FC59D274825770B0037D278%7D> (referer: https://kmssqkr.hksarg/names.nsf?Login) 
2015-07-28 17:55:10 [scrapy] INFO: Closing spider (finished) 
2015-07-28 17:55:10 [scrapy] INFO: Dumping Scrapy stats: 
{'downloader/request_bytes': 1636, 

Any help would be appreciated!

Answers


I think you are over-complicating this. Why inherit from scrapy.Spider and do the heavy lifting yourself when you have CrawlSpider? A Spider is typically meant for scraping a list of pages, while a CrawlSpider is meant for crawling a whole website.

This is the most commonly used spider for crawling regular websites, as it provides a convenient mechanism for following links by defining a set of rules.
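A rough sketch of what that could look like here (the rule patterns, XPath restrictions and login handling are assumptions, not code tested against the actual Quickr site):

import scrapy
from scrapy.http import Request, FormRequest
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class KmssCrawlSpider(CrawlSpider):
    name = 'kmss_crawl'
    allowed_domains = ['kmssqkr.hksarg']
    login_page = 'https://kmssqkr.hksarg/LotusQuickr/dept/Main.nsf?OpenDatabase&Login'
    start_urls = ['https://kmssqkr.hksarg/LotusQuickr/dept/Main.nsf/h_RoomHome/'
                  'ade682e34fc59d274825770b0037d278/?OpenDocument']

    # Each Rule tells the CrawlSpider which links to follow and which callback
    # handles them; the restrict_xpaths values here are only placeholders.
    rules = (
        Rule(LinkExtractor(restrict_xpaths="//*[@class='q-folderItem']"),
             callback='parse_folder', follow=True),
        Rule(LinkExtractor(restrict_xpaths="//*[@class='q-otherItem']"),
             callback='parse_file'),
    )

    def start_requests(self):
        # log in first; the rules only take over once start_urls are reached
        return [Request(self.login_page, callback=self.login, dont_filter=True)]

    def login(self, response):
        return FormRequest.from_response(
            response,
            formdata={'user': 'username', 'password': 'pw'},
            callback=self.after_login)

    def after_login(self, response):
        if 'Welcome' in response.body:
            # hand control back to the CrawlSpider machinery: requests without an
            # explicit callback go through parse(), which applies the rules
            for url in self.start_urls:
                yield Request(url)

    def parse_folder(self, response):
        self.log('folder page: %s' % response.url)

    def parse_file(self, response):
        self.log('file page: %s' % response.url)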


My Python knowledge is limited, and I have trouble using CrawlSpider, especially when I need to extract items from several places on the site and the crawl does not start after logging in. – yukclam9


There is a warning in your log, and your traceback shows that the error occurs while opening an HTTP connection.

2015-07-28 17:54:59 [py.warnings] WARNING: :0: UserWarning: You do not have a working installation of the service_identity module: 'No module named service_identity'. Please install it from https://pypi.python.org/pypi/service_identity and make sure all of its dependencies are satisfied. Without the service_identity module and a recent enough pyOpenSSL to support it, Twisted can perform only rudimentary TLS client hostname verification. Many valid certificate/hostname mappings may be rejected.
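A quick way to check whether the Python environment that runs Scrapy can actually see those modules (a minimal sketch; run it with the same interpreter you use for scrapy crawl):

# run with the same interpreter that runs "scrapy crawl kmss";
# if either import fails here, Twisted/Scrapy will keep showing that warning
import service_identity
import OpenSSL

print(service_identity.__version__)
print(OpenSSL.__version__)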


It is quite tricky (I think), because I have already installed the latest versions of service_identity and pyOpenSSL. How should I deal with this error? – yukclam9


The tricky part is indeed to somehow convince Scrapy that it must be wrong, because Scrapy is telling you: **you do not have the service_identity module installed**. It is probably easier to try to fix your setup instead. –