2016-02-04

I don't know where the problem is; since I'm new to Scrapy it may be super easy to fix. Thanks for your help! Why isn't my Scrapy spider scraping anything?

My spider:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.selector import HtmlXPathSelector
from scrapy.linkextractors import LinkExtractor
from scrapy.item import Item

class ArticleSpider(CrawlSpider):
    name = "article"
    allowed_domains = ["economist.com"]
    start_urls = ['http://www.economist.com/sections/science-technology']

    rules = [
        Rule(LinkExtractor(restrict_xpaths='//article'), callback='parse_item', follow=True),
    ]

    def parse_item(self, response):
        for sel in response.xpath('//div/article'):
            item = scrapy.Item()
            item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            item['desc'] = sel.xpath('text()').extract()
            return item

Items:

import scrapy 

class EconomistItem(scrapy.Item): 
    title = scrapy.Field() 
    link = scrapy.Field() 
    desc = scrapy.Field() 

Part of the log:

INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 
Crawled (200) <GET http://www.economist.com/sections/science-technology> (referer: None) 

Edit:

After I added the changes suggested by alecxe, another problem occurred:

Log:

[scrapy] DEBUG: Crawled (200) <GET http://www.economist.com/news/science-and-technology/21688848-stem-cells-are-starting-prove-their-value-medical-treatments-curing-multiple> (referer: http://www.economist.com/sections/science-technology) 
2016-02-04 14:05:01 [scrapy] DEBUG: Crawled (200) <GET http://www.economist.com/news/science-and-technology/21689501-beating-go-champion-machine-learning-computer-says-go> (referer: http://www.economist.com/sections/science-technology) 
2016-02-04 14:05:02 [scrapy] ERROR: Spider error processing <GET http://www.economist.com/news/science-and-technology/21688848-stem-cells-are-starting-prove-their-value-medical-treatments-curing-multiple> (referer: http://www.economist.com/sections/science-technology) 
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/site-packages/scrapy/utils/defer.py", line 102, in iter_errback
    yield next(it)
  File "/usr/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/offsite.py", line 28, in process_spider_output
    for x in result:
  File "/usr/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/referer.py", line 22, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/usr/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/usr/local/lib/python2.7/site-packages/scrapy/spidermiddlewares/depth.py", line 54, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/usr/local/lib/python2.7/site-packages/scrapy/spiders/crawl.py", line 67, in _parse_response
    cb_res = callback(response, **cb_kwargs) or ()
  File "/Users/FvH/Desktop/Python/projects/economist/economist/spiders/article.py", line 18, in parse_item
    item = scrapy.Item()
NameError: global name 'scrapy' is not defined

Settings:

BOT_NAME = 'economist'

SPIDER_MODULES = ['economist.spiders']
NEWSPIDER_MODULE = 'economist.spiders'
USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.97 Safari/537.36"

If I try to export the data to a CSV file, it is obviously just empty.

Thanks

Answers


parse_item is not indented correctly; it should be:

class ArticleSpider(CrawlSpider):
    name = "article"
    allowed_domains = ["economist.com"]
    start_urls = ['http://www.economist.com/sections/science-technology']

    rules = [
        Rule(LinkExtractor(allow=r'Items'), callback='parse_item', follow=True),
    ]

    def parse_item(self, response):
        for sel in response.xpath('//div/article'):
            item = scrapy.Item()
            item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            item['desc'] = sel.xpath('text()').extract()
            return item

Aside from that, there are two things to fix:

  • The link extraction part should be fixed to match the article links:

    Rule(LinkExtractor(restrict_xpaths='//article'), callback='parse_item', follow=True), 
    
  • You need to specify the USER_AGENT setting to pretend to be a real browser. Otherwise, the response will not contain the list of articles:

    USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.97 Safari/537.36" 
    

Thanks alecxe, I added what you said, but there are other errors now, so apparently I'm still doing something wrong. Thanks – peter


@peter You only need to have 'import scrapy' inside the spider. Or, I think you actually meant to instantiate the item defined in your project, not 'scrapy.Item()'. – alecxe


You only imported Item (not the whole scrapy module):

from scrapy.item import Item 
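This is just how Python's from-import works: it binds only the imported name, never the parent module. A minimal stdlib analogy (using os instead of scrapy):

```python
from os import path  # binds "path" only; the name "os" is NOT defined

try:
    os.getcwd()  # the same kind of failure as the spider's scrapy.Item()
except NameError as exc:
    print(exc)  # name 'os' is not defined
```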

Therefore, instead of using scrapy.Item here:

for sel in response.xpath('//div/article'):
    item = scrapy.Item()
    item['title'] = sel.xpath('a/text()').extract()

you should use just Item:

for sel in response.xpath('//div/article'):
    item = Item()
    item['title'] = sel.xpath('a/text()').extract()

Or import your own item and use that. This should work (don't forget to replace project_name with the name of your project):

from project_name.items import EconomistItem
...
for sel in response.xpath('//div/article'):
    item = EconomistItem()
    item['title'] = sel.xpath('a/text()').extract()
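One more thing worth fixing while you are in parse_item: return item inside the for loop hands back only the first matched article, whereas a Scrapy callback normally yields every item. The difference in plain Python (the article titles below are made-up sample data):

```python
def first_only(values):
    for v in values:
        return v  # exits the function on the very first element


def each(values):
    for v in values:
        yield v  # produces every element, one at a time


articles = ["stem cells", "machine learning", "gene editing"]
print(first_only(articles))   # stem cells
print(list(each(articles)))   # ['stem cells', 'machine learning', 'gene editing']
```

Replacing return item with yield item in parse_item lets Scrapy collect one item per article instead of stopping at the first.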