2014-03-04 69 views

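Scrapy - Non-ASCII character declared, but no encoding declared

I'm trying to scrape some basic data from this website as an exercise to learn more about Scrapy, and as a proof of concept for a university project: http://steamdb.info/sales/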

When I use the Scrapy shell, I can get the information I want with the following XPath:

sel.xpath('//tbody/tr[1]/td[2]/a/text()').extract()

It should return the title of the game in the first row of the table, which has this structure:

<tbody> 
    <tr> 
      <td></td> 
      <td><a>stuff I want here</a></td> 
... 

And it does, in the shell.

However, when I tried to turn this into a spider (steam.py):

1 from scrapy.spider import BaseSpider 
2 from scrapy.selector import HtmlXPathSelector 
3 from steam_crawler.items import SteamItem 
4 from scrapy.selector import Selector 
5 
6 class SteamSpider(BaseSpider): 
7  name = "steam" 
8  allowed_domains = ["http://steamdb.info/"] 
9  start_urls = ['http://steamdb.info/sales/?displayOnly=all&category=0&cc=uk'] 
10  def parse(self, response): 
11   sel = Selector(response) 
12   sites = sel.xpath("//tbody") 
13   items = [] 
14   count = 1 
15   for site in sites: 
16    item = SteamItem() 
17    item ['title'] = sel.xpath('//tr['+ str(count) +']/td[2]/a/text()').extract().encode('utf-8') 
18    item ['price'] = sel.xpath('//tr['+ str(count) +']/td[@class=“price-final”]/text()').extract().encode('utf-8') 
19    items.append(item) 
20    count = count + 1 
21   return items 

I get the following error:

ricks-mbp:steam_crawler someuser$ scrapy crawl steam -o items.csv -t csv 
Traceback (most recent call last): 
    File "/usr/local/bin/scrapy", line 5, in <module> 
    pkg_resources.run_script('Scrapy==0.20.0', 'scrapy') 
    File "build/bdist.macosx-10.9-intel/egg/pkg_resources.py", line 492, in run_script 

    File "build/bdist.macosx-10.9-intel/egg/pkg_resources.py", line 1350, in run_script 
    for name in eagers: 
    File "/Library/Python/2.7/site-packages/Scrapy-0.20.0-py2.7.egg/EGG-INFO/scripts/scrapy", line 4, in <module> 
    execute() 
    File "/Library/Python/2.7/site-packages/Scrapy-0.20.0-py2.7.egg/scrapy/cmdline.py", line 143, in execute 
    _run_print_help(parser, _run_command, cmd, args, opts) 
    File "/Library/Python/2.7/site-packages/Scrapy-0.20.0-py2.7.egg/scrapy/cmdline.py", line 89, in _run_print_help 
    func(*a, **kw) 
    File "/Library/Python/2.7/site-packages/Scrapy-0.20.0-py2.7.egg/scrapy/cmdline.py", line 150, in _run_command 
    cmd.run(args, opts) 
    File "/Library/Python/2.7/site-packages/Scrapy-0.20.0-py2.7.egg/scrapy/commands/crawl.py", line 47, in run 
    crawler = self.crawler_process.create_crawler() 
    File "/Library/Python/2.7/site-packages/Scrapy-0.20.0-py2.7.egg/scrapy/crawler.py", line 87, in create_crawler 
    self.crawlers[name] = Crawler(self.settings) 
    File "/Library/Python/2.7/site-packages/Scrapy-0.20.0-py2.7.egg/scrapy/crawler.py", line 25, in __init__ 
    self.spiders = spman_cls.from_crawler(self) 
    File "/Library/Python/2.7/site-packages/Scrapy-0.20.0-py2.7.egg/scrapy/spidermanager.py", line 35, in from_crawler 
    sm = cls.from_settings(crawler.settings) 
    File "/Library/Python/2.7/site-packages/Scrapy-0.20.0-py2.7.egg/scrapy/spidermanager.py", line 31, in from_settings 
    return cls(settings.getlist('SPIDER_MODULES')) 
    File "/Library/Python/2.7/site-packages/Scrapy-0.20.0-py2.7.egg/scrapy/spidermanager.py", line 22, in __init__ 
    for module in walk_modules(name): 
    File "/Library/Python/2.7/site-packages/Scrapy-0.20.0-py2.7.egg/scrapy/utils/misc.py", line 68, in walk_modules 
    submod = import_module(fullpath) 
    File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/importlib/__init__.py", line 37, in import_module 
    __import__(name) 
    File "/xxx/scrape/steam/steam_crawler/spiders/steam.py", line 18 
SyntaxError: Non-ASCII character '\xe2' in file /xxx/scrape/steam/steam_crawler/spiders/steam.py on line 18, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details 

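I have a feeling that what I need to do is somehow tell Scrapy that these characters will be UTF-8 rather than ASCII, since some of them are not plain ASCII. But from what I can gather, Scrapy is supposed to pick this up from the head of the page, which for this site is: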

<meta charset="utf-8"> 

This leaves me baffled! I'd also be interested in any insight or reading material other than the Scrapy docs themselves.

Answers


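It looks like you are using a curly quote (“) instead of a straight double quote (") on line 18. That character is not ASCII, and Python 2 rejects non-ASCII bytes in a source file that has no encoding declaration, so the error comes from the spider file itself rather than from the page being scraped.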
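To track down characters like this, a small helper along the following lines can list every non-ASCII character in a piece of source text with its line and column. This is only a sketch and not part of Scrapy: the function name `find_non_ascii` and the sample string are made up for illustration.

```python
import unicodedata


def find_non_ascii(source):
    """Return (line, column, char, unicode name) for each non-ASCII character."""
    hits = []
    for line_no, line in enumerate(source.split("\n"), start=1):
        for col_no, ch in enumerate(line, start=1):
            if ord(ch) > 127:
                name = unicodedata.name(ch, "UNKNOWN")
                hits.append((line_no, col_no, ch, name))
    return hits


# Hypothetical sample: the second line uses curly quotes, as in the spider above.
sample = (
    "title = row.xpath('td[2]/a/text()')\n"
    "price = row.xpath('td[@class=\u201cprice-final\u201d]/text()')\n"
)

for line_no, col_no, ch, name in find_non_ascii(sample):
    print("line %d, column %d: %r (%s)" % (line_no, col_no, ch, name))
```

To check a real file, read it as UTF-8 text first and pass the resulting string to the helper.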

By the way, a better way to loop over all the table rows would be something like this:

for tr in sel.xpath("//tr"): 
    item = SteamItem() 
    item['title'] = tr.xpath('td[2]/a/text()').extract() 
    item['price'] = tr.xpath('td[@class="price-final"]/text()').extract() 
    yield item 
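
The key point in this loop is that `td[2]/a/text()` has no leading `//`, so it is evaluated relative to each `tr` rather than against the whole document. The sketch below illustrates the same idea without Scrapy, using the standard library's `xml.etree.ElementTree` (which supports only a limited XPath subset) so that it runs without a crawl. The table markup and its values are invented for the example.

```python
import xml.etree.ElementTree as ET

# Invented sample markup mirroring the table structure from the question.
HTML = """
<table>
  <tbody>
    <tr><td></td><td><a>Game A</a></td><td class="price-final">1.99</td></tr>
    <tr><td></td><td><a>Game B</a></td><td class="price-final">4.49</td></tr>
  </tbody>
</table>
""".strip()

root = ET.fromstring(HTML)

items = []
for tr in root.iter("tr"):
    # Paths without a leading "//" are resolved relative to the current row.
    title = tr.find("td[2]/a")
    price = tr.find("td[@class='price-final']")
    items.append({"title": title.text, "price": price.text})

print(items)
```

Each row yields exactly one title and one price, so no manual counter is needed.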

This looks much simpler, and it works like a dream. How did you learn Scrapy? Books/tutorials? – Lorienas