0
創建弱參考SCRAPY「海峽」對象我已經寫了使用scrapy在python像這樣下面蜘蛛:類型錯誤:無法在Python
#!/usr/bin/python
from twisted.internet import reactor
import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.selector import Selector
class GivenSpider(scrapy.Spider):
name = "dmoz"
allowed_domains = ["dmoz.org"]
start_urls = [
"http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
"http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
]
def parse(self, response):
select = Selector(response.body)
title = select.xpath("//a[@class=listinglink]/@href").extract()
print title
# for t in title:
# title4 = MyItem()
# title4['content'] = t
# yield title4
# filename = response.url.split("/")[-2] + '.html'
# with open(filename, 'wb') as f:
# f.write(response.body)
configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
runner = CrawlerRunner()
d = runner.crawl(GivenSpider)
d.addBoth(lambda _: reactor.stop())
reactor.run()
我運行它:
$ python runTimeSpider.py
我給出的以下輸出是:
INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
INFO: Enabled item pipelines:
INFO: Spider opened
INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
DEBUG: Telnet console listening on 127.0.0.1:6023
DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None)
DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/> (referer: None)
ERROR: Spider error processing <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None)
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/twisted/internet/defer.py", line 588, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "runTimeSpider.py", line 17, in parse
select = Selector(str(response.body))
File "/usr/local/lib/python2.7/dist-packages/scrapy/selector/unified.py", line 80, in __init__
_root = LxmlDocument(response, self._parser)
File "/usr/local/lib/python2.7/dist-packages/scrapy/selector/lxmldocument.py", line 24, in __new__
cache = cls.cache.setdefault(response, {})
File "/usr/lib/python2.7/weakref.py", line 433, in setdefault
return self.data.setdefault(ref(key, self._remove),default)
TypeError: cannot create weak reference to 'str' object
ERROR: Spider error processing <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/> (referer: None)
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/twisted/internet/defer.py", line 588, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "runTimeSpider.py", line 17, in parse
select = Selector(str(response.body))
File "/usr/local/lib/python2.7/dist-packages/scrapy/selector/unified.py", line 80, in __init__
_root = LxmlDocument(response, self._parser)
File "/usr/local/lib/python2.7/dist-packages/scrapy/selector/lxmldocument.py", line 24, in __new__
cache = cls.cache.setdefault(response, {})
File "/usr/lib/python2.7/weakref.py", line 433, in setdefault
return self.data.setdefault(ref(key, self._remove),default)
TypeError: cannot create weak reference to 'str' object
INFO: Closing spider (finished)
INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 514,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 16284,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 1, 21, 8, 28, 26, 17960),
'log_count/DEBUG': 3,
'log_count/ERROR': 2,
'log_count/INFO': 7,
'response_received_count': 2,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'spider_exceptions/TypeError': 2,
'start_time': datetime.datetime(2016, 1, 21, 8, 28, 24, 986319)}
INFO: Spider closed (finished)
如何打印標題? Ut有錯誤:
TypeError: cannot create weak reference to 'str' object
謝謝verymuch – MLSC