
Python Scrapy - unable to crawl

I want to scrape some websites using Scrapy. Below is a sample of my code. The parse method is not getting called. I am trying to run the code through the reactor service (code provided below), so I launch it from startCrawling.py, which holds the reactor. I know I am missing something. Can you help?

Thanks,

Code - categorization.py

from scrapy.contrib.spiders.init import InitSpider
from scrapy.http import Request, FormRequest
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import Rule
from scrapy.selector import Selector
from scrapy.selector import HtmlXPathSelector
from items.items import CategorizationItem
from scrapy.contrib.spiders.crawl import CrawlSpider

class TestingSpider(CrawlSpider):
    print 'in spider'
    name = 'testSpider'
    allowed_domains = ['wikipedia.org']
    start_urls = ['http://www.wikipedia.org']

    def parse(self, response):
        # Scrape data from page
        print 'here'
        open('test.html', 'wb').write(response.body)

Code - startCrawling.py

from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy.settings import Settings
from scrapy import log, signals
from scrapy.xlib.pydispatch import dispatcher
from scrapy.utils.project import get_project_settings

from spiders.categorization import TestingSpider

# Scrapy spiders script...

def stop_reactor():
    # Stop the Twisted reactor once the spider has closed.
    reactor.stop()
    print 'hi'

dispatcher.connect(stop_reactor, signal=signals.spider_closed)
spider = TestingSpider()
crawler = Crawler(Settings())
crawler.configure()
crawler.crawl(spider)
crawler.start()
reactor.run()

Answer


You should not override the parse() method when using CrawlSpider. Instead, set a custom callback with a different name in a Rule.
Here is an excerpt from the official documentation:

When writing crawl spider rules, avoid using parse as callback, since the CrawlSpider uses the parse method itself to implement its logic. So if you override the parse method, the crawl spider will no longer work.
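For illustration, here is a minimal sketch of the spider rewritten that way, using the same contrib modules the question already imports; the callback name parse_page and the empty allow pattern are placeholders, not part of the original post:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class TestingSpider(CrawlSpider):
    name = 'testSpider'
    allowed_domains = ['wikipedia.org']
    start_urls = ['http://www.wikipedia.org']

    # Leave parse() alone so CrawlSpider can run its own logic;
    # route extracted links to a differently named callback instead.
    rules = (
        Rule(SgmlLinkExtractor(allow=()), callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        # Scrape data from page
        open('test.html', 'wb').write(response.body)

With this layout, CrawlSpider's built-in parse() handles the responses and applies the rules, and your own code runs in parse_page for every link the extractor matches.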


Thanks. I am accepting the answer. I will try this and let you know. – user1930402 2014-12-05 09:26:20


Fastest accept ever, I just clicked and it turned green :) – bosnjak 2014-12-05 09:26:46