2011-03-07 34 views
6

Scrapy crawler in Python can't follow links?

I wrote a crawler in Python using the Scrapy framework. Here is the code:

from scrapy.contrib.spiders import CrawlSpider, Rule 
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor 
from scrapy.selector import HtmlXPathSelector 
#from scrapy.item import Item 
from a11ypi.items import AYpiItem 

class AYpiSpider(CrawlSpider): 
     name = "AYpi" 
     allowed_domains = ["a11y.in"] 
     start_urls = ["http://a11y.in/a11ypi/idea/firesafety.html"] 

     rules =(
       Rule(SgmlLinkExtractor(allow =()) ,callback = 'parse_item') 
       ) 

     def parse_item(self,response): 
       #filename = response.url.split("/")[-1] 
       #open(filename,'wb').write(response.body) 
       #testing codes^(the above) 

       hxs = HtmlXPathSelector(response) 
       item = AYpiItem() 
       item["foruri"] = hxs.select("//@foruri").extract() 
       item["thisurl"] = response.url 
       item["thisid"] = hxs.select("//@foruri/../@id").extract() 
       item["rec"] = hxs.select("//@foruri/../@rec").extract() 
       return item 

But instead of following the links, it throws this error:

Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/Scrapy-0.12.0.2538-py2.6.egg/scrapy/cmdline.py", line 131, in execute
    _run_print_help(parser, _run_command, cmd, args, opts)
  File "/usr/lib/python2.6/site-packages/Scrapy-0.12.0.2538-py2.6.egg/scrapy/cmdline.py", line 97, in _run_print_help
    func(*a, **kw)
  File "/usr/lib/python2.6/site-packages/Scrapy-0.12.0.2538-py2.6.egg/scrapy/cmdline.py", line 138, in _run_command
    cmd.run(args, opts)
  File "/usr/lib/python2.6/site-packages/Scrapy-0.12.0.2538-py2.6.egg/scrapy/commands/crawl.py", line 45, in run
    q.append_spider_name(name, **opts.spargs)
--- <exception caught here> ---
  File "/usr/lib/python2.6/site-packages/Scrapy-0.12.0.2538-py2.6.egg/scrapy/queue.py", line 89, in append_spider_name
    spider = self._spiders.create(name, **spider_kwargs)
  File "/usr/lib/python2.6/site-packages/Scrapy-0.12.0.2538-py2.6.egg/scrapy/spidermanager.py", line 36, in create
    return self._spiders[spider_name](**spider_kwargs)
  File "/usr/lib/python2.6/site-packages/Scrapy-0.12.0.2538-py2.6.egg/scrapy/contrib/spiders/crawl.py", line 38, in __init__
    self._compile_rules()
  File "/usr/lib/python2.6/site-packages/Scrapy-0.12.0.2538-py2.6.egg/scrapy/contrib/spiders/crawl.py", line 82, in _compile_rules
    self._rules = [copy.copy(r) for r in self.rules]
exceptions.TypeError: 'Rule' object is not iterable

Can someone please explain what is going on here? Following what the documentation says, I left the allow field empty, so follow should default to True and the spider should follow all links. So why the error? Also, what optimizations could I apply to my crawler to speed it up?

Answer

32

From what I can see, it looks like your rules attribute is not iterable. It looks like you were trying to make rules a tuple; you should read up on tuples in the python documentation.

To fix your problem, change this:

rules =(
      Rule(SgmlLinkExtractor(allow =()) ,callback = 'parse_item') 
      ) 

to:

rules =(Rule(SgmlLinkExtractor(allow =()) ,callback = 'parse_item'),) 

Notice the comma at the end?
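To see why that comma matters, here is a minimal sketch (independent of Scrapy): in Python, parentheses alone do not create a tuple; the comma does. Without the trailing comma, `rules` is just the bare `Rule` object, and Scrapy's `for r in self.rules` loop then fails with exactly the `'Rule' object is not iterable` error above.

```python
# Parentheses without a comma are just grouping; the comma makes the tuple.
not_a_tuple = ("only-element")    # this is a plain str
real_tuple = ("only-element",)    # the trailing comma makes a 1-tuple

print(type(not_a_tuple))  # <class 'str'>
print(type(real_tuple))   # <class 'tuple'>

# Iterating over the 1-tuple yields its single element, which is what
# CrawlSpider's rule-compilation loop expects to be able to do:
for item in real_tuple:
    print(item)           # only-element
```

Equivalently, you could make `rules` a one-element list, `rules = [Rule(...)]`, which is also iterable and sidesteps the tuple-comma pitfall entirely.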

+1

That solved my problem, thanks. – alex 2017-07-30 23:36:04