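Scrapy - Non-ASCII character declared, but no encoding declared
I am trying to scrape some basic data from this website as an exercise to learn more about Scrapy, and as a proof of concept for a university project: http://steamdb.info/sales/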
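When I use the Scrapy shell, I can get the information I want with the following XPath:
sel.xpath('//tbody/tr[1]/td[2]/a/text()').extract()
This should return the title of the game in the first row of the table, given the structure: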
<tbody>
<tr>
<td></td>
<td><a>stuff I want here</a></td>
...
And it does, in the shell.
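For reference, here is a minimal standalone sketch of what that XPath picks out. It is only an illustration under assumptions: it uses the standard library's xml.etree.ElementTree instead of Scrapy's Selector, and the sample table and game titles are made up.

```python
import xml.etree.ElementTree as ET

# Made-up sample mirroring the table structure described above.
SAMPLE = """
<table>
  <tbody>
    <tr><td></td><td><a>First Game</a></td></tr>
    <tr><td></td><td><a>Second Game</a></td></tr>
  </tbody>
</table>
"""

root = ET.fromstring(SAMPLE)

# ElementTree has no text() step, so select the <a> element and read .text.
first_title = root.find(".//tbody/tr[1]/td[2]/a").text

# Looping over the rows and using a path relative to each row
# avoids having to keep a manual row counter.
titles = [row.find("td[2]/a").text for row in root.findall(".//tbody/tr")]

print(first_title)
print(titles)
```

The same idea of querying relative to each row should carry over to Scrapy selectors, although the exact calls differ.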
However, when I try to turn this into a spider (steam.py):
 1 from scrapy.spider import BaseSpider
 2 from scrapy.selector import HtmlXPathSelector
 3 from steam_crawler.items import SteamItem
 4 from scrapy.selector import Selector
 5
 6 class SteamSpider(BaseSpider):
 7     name = "steam"
 8     allowed_domains = ["http://steamdb.info/"]
 9     start_urls = ['http://steamdb.info/sales/?displayOnly=all&category=0&cc=uk']
10     def parse(self, response):
11         sel = Selector(response)
12         sites = sel.xpath("//tbody")
13         items = []
14         count = 1
15         for site in sites:
16             item = SteamItem()
17             item ['title'] = sel.xpath('//tr['+ str(count) +']/td[2]/a/text()').extract().encode('utf-8')
18             item ['price'] = sel.xpath('//tr['+ str(count) +']/td[@class=“price-final”]/text()').extract().encode('utf-8')
19             items.append(item)
20             count = count + 1
21         return items
I get the following error:
ricks-mbp:steam_crawler someuser$ scrapy crawl steam -o items.csv -t csv
Traceback (most recent call last):
  File "/usr/local/bin/scrapy", line 5, in <module>
    pkg_resources.run_script('Scrapy==0.20.0', 'scrapy')
  File "build/bdist.macosx-10.9-intel/egg/pkg_resources.py", line 492, in run_script
  File "build/bdist.macosx-10.9-intel/egg/pkg_resources.py", line 1350, in run_script
    for name in eagers:
  File "/Library/Python/2.7/site-packages/Scrapy-0.20.0-py2.7.egg/EGG-INFO/scripts/scrapy", line 4, in <module>
    execute()
  File "/Library/Python/2.7/site-packages/Scrapy-0.20.0-py2.7.egg/scrapy/cmdline.py", line 143, in execute
    _run_print_help(parser, _run_command, cmd, args, opts)
  File "/Library/Python/2.7/site-packages/Scrapy-0.20.0-py2.7.egg/scrapy/cmdline.py", line 89, in _run_print_help
    func(*a, **kw)
  File "/Library/Python/2.7/site-packages/Scrapy-0.20.0-py2.7.egg/scrapy/cmdline.py", line 150, in _run_command
    cmd.run(args, opts)
  File "/Library/Python/2.7/site-packages/Scrapy-0.20.0-py2.7.egg/scrapy/commands/crawl.py", line 47, in run
    crawler = self.crawler_process.create_crawler()
  File "/Library/Python/2.7/site-packages/Scrapy-0.20.0-py2.7.egg/scrapy/crawler.py", line 87, in create_crawler
    self.crawlers[name] = Crawler(self.settings)
  File "/Library/Python/2.7/site-packages/Scrapy-0.20.0-py2.7.egg/scrapy/crawler.py", line 25, in __init__
    self.spiders = spman_cls.from_crawler(self)
  File "/Library/Python/2.7/site-packages/Scrapy-0.20.0-py2.7.egg/scrapy/spidermanager.py", line 35, in from_crawler
    sm = cls.from_settings(crawler.settings)
  File "/Library/Python/2.7/site-packages/Scrapy-0.20.0-py2.7.egg/scrapy/spidermanager.py", line 31, in from_settings
    return cls(settings.getlist('SPIDER_MODULES'))
  File "/Library/Python/2.7/site-packages/Scrapy-0.20.0-py2.7.egg/scrapy/spidermanager.py", line 22, in __init__
    for module in walk_modules(name):
  File "/Library/Python/2.7/site-packages/Scrapy-0.20.0-py2.7.egg/scrapy/utils/misc.py", line 68, in walk_modules
    submod = import_module(fullpath)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/importlib/__init__.py", line 37, in import_module
    __import__(name)
  File "/xxx/scrape/steam/steam_crawler/spiders/steam.py", line 18
SyntaxError: Non-ASCII character '\xe2' in file /xxx/scrape/steam/steam_crawler/spiders/steam.py on line 18, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details
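I have a feeling that what I need to do is somehow tell Scrapy that these characters will be UTF-8 rather than ASCII, as some of them are. But from what I can gather, it is supposed to pick this up from the page's head, which for this site is: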
<meta charset="utf-8">
Which leaves me baffled! I would be interested in any insight or reading beyond the Scrapy docs themselves.
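Worth noting: the traceback ends at the import of steam.py, and PEP 263 (linked in the message) concerns the encoding of Python source files, so the complaint appears to be about a character in the spider file itself rather than about the page being scraped. A minimal sketch for locating such characters follows; the helper name find_non_ascii and the two-line sample are hypothetical, made up for illustration.

```python
# -*- coding: utf-8 -*-

def find_non_ascii(source_text):
    """Return (line_number, column, character) for every non-ASCII character."""
    hits = []
    for line_no, line in enumerate(source_text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 127:
                hits.append((line_no, col, ch))
    return hits


# Made-up sample: the second line uses curly quotes (U+201C / U+201D),
# whose UTF-8 encoding starts with the byte 0xE2.
sample = (
    u"title = sel.xpath('//td[2]/a/text()')\n"
    u"price = sel.xpath('//td[@class=\u201cprice-final\u201d]/text()')\n"
)

hits = find_non_ascii(sample)
for line_no, col, ch in hits:
    print("line %d, column %d: %r" % (line_no, col, ch))
```

Replacing any curly quotes it reports with plain ASCII quotes, or adding a `# -*- coding: utf-8 -*-` line at the top of the file as PEP 263 describes, should clear the SyntaxError.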
This looks a lot simpler, and it works like a dream. How did you learn Scrapy? Books/tutorials? – Lorienas