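Running Scrapy from a script, need help understanding it
I'm fairly new to Python, so any help or advice is appreciated.
I want to build a script that runs a Scrapy spider. So far I have the code below: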
from scrapy.contrib.loader import XPathItemLoader
from scrapy.item import Item, Field
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider
from scrapy.crawler import CrawlerProcess
class QuestionItem(Item):
    """Our SO Question Item"""
    title = Field()
    summary = Field()
    tags = Field()
    user = Field()
    posted = Field()
    votes = Field()
    answers = Field()
    views = Field()
class MySpider(BaseSpider):
    """Our ad-hoc spider"""
    name = "myspider"
    start_urls = ["http://stackoverflow.com/"]
    question_list_xpath = '//div[@id="content"]//div[contains(@class, "question-summary")]'
    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        for qxs in hxs.select(self.question_list_xpath):
            loader = XPathItemLoader(QuestionItem(), selector=qxs)
            loader.add_xpath('title', './/h3/a/text()')
            loader.add_xpath('summary', './/h3/a/@title')
            loader.add_xpath('tags', './/a[@rel="tag"]/text()')
            loader.add_xpath('user', './/div[@class="started"]/a[2]/text()')
            loader.add_xpath('posted', './/div[@class="started"]/a[1]/span/@title')
            loader.add_xpath('votes', './/div[@class="votes"]/div[1]/text()')
            loader.add_xpath('answers', './/div[contains(@class, "answered")]/div[1]/text()')
            loader.add_xpath('views', './/div[@class="views"]/div[1]/text()')
            yield loader.load_item()
class CrawlerWorker(Process):
    def __init__(self, spider, results):
        Process.__init__(self)
        self.results = results
        self.crawler = CrawlerProcess(settings)
        if not hasattr(project, 'crawler'):
            self.crawler.install()
        self.crawler.configure()
        self.items = []
        self.spider = spider
        dispatcher.connect(self._item_passed, signals.item_passed)
    def _item_passed(self, item):
        self.items.append(item)
    def run(self):
        self.crawler.crawl(self.spider)
        self.crawler.start()
        self.crawler.stop()
        self.results.put(self.items)
def main():
    results = Queue()
    crawler = CrawlerWorker(MySpider(BaseSpider), results)
    crawler.start()
    for item in results.get():
        pass # Do something with item
I get the following error:
ERROR: An unexpected error occurred while tokenizing input
The following traceback may be corrupted or invalid
The error message is: ('EOF in multi-line statement', (157, 0))
...
C:\Python27\lib\site-packages\twisted\internet\win32eventreactor.py:64: UserWarning: Reliable disconnection notification requires pywin32 215 or later
  category=UserWarning)
ERROR: An unexpected error occurred while tokenizing input
The following traceback may be corrupted or invalid
The error message is: ('EOF in multi-line statement', (157, 0))
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "C:\Python27\lib\multiprocessing\forking.py", line 374, in main
self = load(from_parent)
File "C:\Python27\lib\pickle.py", line 1378, in load
ERROR: An unexpected error occurred while tokenizing input
The following traceback may be corrupted or invalid
The error message is: ('EOF in multi-line statement', (157, 0))
return Unpickler(file).load()
File "C:\Python27\lib\pickle.py", line 858, in load
dispatch[key](self)
File "C:\Python27\lib\pickle.py", line 1090, in load_global
klass = self.find_class(module, name)
File "C:\Python27\lib\pickle.py", line 1124, in find_class
__import__(module)
File "Webscrap.py", line 53, in <module>
class CrawlerWorker(Process):
NameError: name 'Process' is not defined
ERROR: An unexpected error occurred while tokenizing input
The following traceback may be corrupted or invalid
The error message is: ('EOF in multi-line statement', (157, 0))
...
"PicklingError: <function remove at 0x07871CB0>: Can't pickle <function remove at 0x077F6BF0>: it's not found as weakref.remove".
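I realize that something in what I'm doing is logically wrong, but I'm new to this and can't spot it. Can anyone help me get this code running?
Ultimately, I just want a script that runs, scrapes the required data, and stores it in a database, but first I'd like to get just the scraping working. I thought this code would do it, but no luck so far.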
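For reference, the traceback shows the child process re-importing Webscrap.py (via multiprocessing\forking.py and pickle) and failing at `class CrawlerWorker(Process)` because `Process` is never imported. The PicklingError likely has a related cause: on Windows the `Process` object itself has to be pickled for the child. Below is a minimal sketch of the worker-plus-queue pattern using only the standard library. It is written in Python 3 syntax, and `fake_crawl` and `Worker` are hypothetical names standing in for the real crawl; no Scrapy calls are involved, so this is an illustration of the process handling only, not a fix for the spider.

```python
import multiprocessing


def fake_crawl(urls):
    """Hypothetical stand-in for a real crawl: returns one dict per URL."""
    return [{"url": url, "length": len(url)} for url in urls]


class Worker(multiprocessing.Process):
    """Runs fake_crawl in a child process and hands the items back via a queue."""

    def __init__(self, urls, results):
        super().__init__()
        # Keep only simple, picklable attributes here: on Windows the
        # Process object is pickled and sent to the child process.
        self.urls = urls
        self.results = results

    def run(self):
        # Executes in the child process; heavy objects belong here,
        # not in __init__.
        self.results.put(fake_crawl(self.urls))


def main():
    results = multiprocessing.Queue()
    worker = Worker(["http://stackoverflow.com/"], results)
    worker.start()
    items = results.get()  # read before join() so the queue can drain
    worker.join()
    return items


if __name__ == "__main__":
    # The guard keeps the child's re-import of this module from
    # starting another worker.
    for item in main():
        print(item)
```

In this sketch the imports sit at module level so they are available when the child re-imports the file, and anything that cannot be pickled would be created inside `run()` rather than in `__init__`.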