2012-12-18

I started testing Scrapy to crawl a website, but when I test my code I hit an error that I can't figure out how to fix: an exception while crawling with Scrapy.

Here is the error output:

... 
2012-12-18 02:07:19+0000 [dmoz] DEBUG: Crawled (200) <GET http://MYURL.COM> (referer: None) 
2012-12-18 02:07:19+0000 [dmoz] ERROR: Spider error processing <GET http://MYURL.COM> 
    Traceback (most recent call last): 
     File "/usr/local/lib/python2.7/dist-packages/Twisted-12.2.0-py2.7-linux-x86_64.egg/twisted/internet/base.py", line 1178, in mainLoop 
     self.runUntilCurrent() 
     File "/usr/local/lib/python2.7/dist-packages/Twisted-12.2.0-py2.7-linux-x86_64.egg/twisted/internet/base.py", line 800, in runUntilCurrent 
     call.func(*call.args, **call.kw) 
     File "/usr/local/lib/python2.7/dist-packages/Twisted-12.2.0-py2.7-linux-x86_64.egg/twisted/internet/defer.py", line 368, in callback 
     self._startRunCallbacks(result) 
     File "/usr/local/lib/python2.7/dist-packages/Twisted-12.2.0-py2.7-linux-x86_64.egg/twisted/internet/defer.py", line 464, in _startRunCallbacks 
     self._runCallbacks() 
    --- <exception caught here> --- 
     File "/usr/local/lib/python2.7/dist-packages/Twisted-12.2.0-py2.7-linux-x86_64.egg/twisted/internet/defer.py", line 551, in _runCallbacks 
     current.result = callback(current.result, *args, **kw) 
     File "/usr/local/lib/python2.7/dist-packages/Scrapy-0.16.3-py2.7.egg/scrapy/spider.py", line 57, in parse 
     raise NotImplementedError 
    exceptions.NotImplementedError: 

2012-12-18 02:07:19+0000 [dmoz] INFO: Closing spider (finished) 
2012-12-18 02:07:19+0000 [dmoz] INFO: Dumping Scrapy stats: 
    {'downloader/request_bytes': 357, 
    'downloader/request_count': 1, 
    'downloader/request_method_count/GET': 1, 
    'downloader/response_bytes': 20704, 
    'downloader/response_count': 1, 
    'downloader/response_status_count/200': 1, 
    'finish_reason': 'finished', 
    'finish_time': datetime.datetime(2012, 12, 18, 2, 7, 19, 595977), 
    'log_count/DEBUG': 7, 
    'log_count/ERROR': 1, 
    'log_count/INFO': 4, 
    'response_received_count': 1, 
    'scheduler/dequeued': 1, 
    'scheduler/dequeued/memory': 1, 
    'scheduler/enqueued': 1, 
    'scheduler/enqueued/memory': 1, 
    'spider_exceptions/NotImplementedError': 1, 
    'start_time': datetime.datetime(2012, 12, 18, 2, 7, 18, 836322)} 

It looks like it may have something to do with my parse function and the callback. I tried removing the rule and it worked, but only for a single URL, and what I need is to crawl the entire site.

Here is my code:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from tutorial.items import DmozItem


class DmozSpider(BaseSpider):
    name = "dmoz"
    start_urls = ["http://MYURL.COM"]
    rules = (Rule(SgmlLinkExtractor(allow_domains=('http://MYURL.COM',)), callback='parse_l', follow=True),)

    def parse_l(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select("//div[@class='content']")
        items = []
        for site in sites:
            item = DmozItem()
            item['title'] = site.select("//div[@class='gig-title-g']/h1").extract()
            item['link'] = site.select("//ul[@class='gig-stats prime']/li[@class='queue ']/div[@class='big-txt']").extract()
            item['desc'] = site.select("//li[@class='thumbs'][1]/div[@class='gig-stats-numbers']/span").extract()
            items.append(item)
        return items
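As an aside (separate from the crash itself): the `select` calls inside the loop start with `//`, which Scrapy evaluates against the whole document rather than relative to each `site`; prefixing them with a dot (e.g. `.//div[@class='gig-title-g']/h1`) scopes them to the current node. The difference can be sketched with the stdlib's ElementTree standing in for Scrapy's selector:

```python
import xml.etree.ElementTree as ET

# Stand-in document: two "content" blocks, each with its own title.
doc = ET.fromstring(
    "<html><body>"
    "<div class='content'><h1>first</h1></div>"
    "<div class='content'><h1>second</h1></div>"
    "</body></html>"
)

sites = doc.findall(".//div[@class='content']")

# A document-wide query sees every title on every iteration ...
all_titles = [h1.text for h1 in doc.findall(".//h1")]

# ... while a query scoped to one site sees only that site's title.
per_site_titles = [site.find(".//h1").text for site in sites]

print(all_titles)       # both titles, regardless of which site you're in
print(per_site_titles)  # exactly one title per site
```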

Any tip pointing me in the right direction would be appreciated.

Many thanks!

+0

See the second answer here: http://stackoverflow.com/questions/5264829/why-does-scrapy-throw-an-error-for-me-when-trying-to-spider-and-parse-a-site – Pspi

Answer

3

Found the answer to this problem:

Why does scrapy throw an error for me when trying to spider and parse a site?

It looks like BaseSpider does not implement Rule. If you have stumbled onto this problem and are crawling with BaseSpider, you need to change it to CrawlSpider and import it as described at http://doc.scrapy.org/en/latest/topics/spiders.html:

from scrapy.contrib.spiders import CrawlSpider, Rule 
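The traceback can be reproduced without Scrapy at all. Here is a minimal plain-Python sketch (mimicking `scrapy/spider.py`, not the real class) of why the error fires: BaseSpider never looks at the `rules` attribute, so every downloaded response is handed to the default `parse()`, which is just a stub that raises:

```python
class BaseSpider(object):
    """Minimal mimic of Scrapy 0.16's BaseSpider (illustration only)."""
    def parse(self, response):
        # The default callback is a stub; subclasses must override it.
        raise NotImplementedError

class BrokenSpider(BaseSpider):
    # `rules` and the `parse_l` callback are only honoured by CrawlSpider;
    # BaseSpider ignores them, so parse() remains the raising stub.
    rules = ("...",)
    def parse_l(self, response):
        return [response]

spider = BrokenSpider()
try:
    spider.parse("<response body>")
except NotImplementedError:
    print("Spider error: NotImplementedError")  # matches the log above
```

Switching the base class to CrawlSpider (which does read `rules` and dispatches matched links to the named callback) is exactly the fix the linked answer describes.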
+1

My code is similar to the code posted above, and I'm using 'from scrapy.spiders import CrawlSpider, Rule', but I still get the same error. How do I fix it? **EDIT:** I want to produce a site map for the given URL –