Python newbie here, coming from PHP. I want to use Scrapy to scrape some websites, and I worked through the tutorial and simple scripts just fine. Now that I'm writing the real thing, I get this error: Scrapy passes a response, but a positional argument is reported missing.
Traceback (most recent call last):
  File "C:\Users\Naltroc\Miniconda3\lib\site-packages\twisted\internet\defer.py", line 653, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "C:\Users\Naltroc\Documents\Python Scripts\tutorial\tutorial\spiders\quotes_spider.py", line 52, in parse
    self.dispatcher[site](response)
TypeError: thesaurus() missing 1 required positional argument: 'response'
The object is instantiated automatically by Scrapy when the shell command scrapy crawl words is invoked.
As far as I know, self is the first parameter of any class method, and you do not pass self explicitly when calling the method.
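A quick illustration (not from the question, just a minimal sketch with a hypothetical Greeter class) of how Python fills in self when a method is called on an instance:

```python
class Greeter:
    def greet(self, name):
        return "hello " + name

g = Greeter()
# Called on an instance, greet is a *bound method*: self is filled in.
print(g.greet("world"))           # hello world
# The same call written explicitly, passing the instance by hand:
print(Greeter.greet(g, "world"))  # hello world
```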
Here is what gets called first:
# Scrapy automatically provides `response` to `parse()` when coming from `start_requests()`
def parse(self, response):
    site = response.meta['site']
    # same as "site = 'thesaurus'"
    self.dispatcher[site](response)
    # same as "self.dispatcher['thesaurus'](response)"
Then:
def thesaurus(self, response):
    filename = 'thesaurus.txt'
    words = ''
    ul = response.css('.relevancy-block ul')
    for idx, u in enumerate(ul):
        if idx == 1:
            break
        words = u.css('.text::text').extract()
    self.save_words(filename, words)
In PHP, this should be the equivalent of calling $this->thesaurus($response). parse is clearly sending response as an argument, but Python says it is missing. Where did it go?
The full code is here:
import scrapy

class WordSpider(scrapy.Spider):
    def __init__(self, keyword='apprehensive'):
        self.k = keyword

    name = "words"

    # Utilities
    def make_csv(self, words):
        csv = ''
        for word in words:
            csv += word + ','
        return csv

    def save_words(self, filename, words):
        with open(filename, 'w') as f:
            f.seek(0)
            f.truncate()
            csv = self.make_csv(words)
            f.write(csv)

    # site-specific parsers
    def thesaurus(self, response):
        filename = 'thesaurus.txt'
        words = ''
        print("in func self is defined as ", self)
        ul = response.css('.relevancy-block ul')
        for idx, u in enumerate(ul):
            if idx == 1:
                break
            words = u.css('.text::text').extract()
        print("words is ", words)
        self.save_words(filename, words)

    def oxford(self):
        filename = 'oxford.txt'
        words = ''

    def collins(self):
        filename = 'collins.txt'
        words = ''

    # site/function mapping
    dispatcher = {
        'thesaurus': thesaurus,
        'oxford': oxford,
        'collins': collins,
    }

    def parse(self, response):
        site = response.meta['site']
        self.dispatcher[site](response)

    def start_requests(self):
        urls = {
            'thesaurus': 'http://www.thesaurus.com/browse/%s?s=t' % self.k,
            #'collins': 'https://www.collinsdictionary.com/dictionary/english-thesaurus/%s' % self.k,
            #'oxford': 'https://en.oxforddictionaries.com/thesaurus/%s' % self.k,
        }
        for site, url in urls.items():
            print(site, url)
            yield scrapy.Request(url, meta={'site': site}, callback=self.parse)
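The traceback above can be reproduced without Scrapy at all. This is a minimal sketch (a hypothetical DemoSpider with a fake response string, no real requests) of why the class-level dict stores plain functions, and how either passing the instance explicitly or looking the method up with getattr fixes it:

```python
class DemoSpider:
    def thesaurus(self, response):
        return ('thesaurus', response)

    # Built at class-definition time, so the dict stores the *plain
    # function*, not a bound method.
    dispatcher = {'thesaurus': thesaurus}

    def parse(self, response):
        try:
            # The dict lookup returns the unbound function, so `response`
            # lands in the `self` slot and the real `response` parameter
            # is reported as missing.
            self.dispatcher['thesaurus'](response)
        except TypeError as e:
            print(e)  # ... missing 1 required positional argument: 'response'
        # Fix 1: pass the instance explicitly.
        print(self.dispatcher['thesaurus'](self, response))
        # Fix 2: look the method up on the instance; getattr returns a
        # bound method, so `self` is filled in automatically.
        print(getattr(self, 'thesaurus')(response))

DemoSpider().parse('fake-response')
```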
Thanks for the comment. 1. Is there a reason to add `**kwargs` to `__init__` if I know it will only ever be called with `keyword` as an argument? 2. It looks like the `parse` function acts as a controller: it first picks the right parser and then passes the data along. That seems reasonable, but is it the only way to send the `response` data? 3. Why does using `getattr(self, response.meta['site'])` let me call the appropriate method without prefixing it with `self.`? – Naltroc
Regarding #1: since you inherit from Spider, you want to pass kwargs through to the parent class. There is nothing worth passing here, but it is a pattern that future-proofs the code. 2. You have misunderstood how Scrapy works: by default, a spider starts a chain of requests, one for every url in `start_urls`, with the default callback `parse()`, where `response` is the response object for one of those start_urls. 3. You have also misunderstood what `self` is: `self` is a reference to the current instance, so when you use `getattr` you don't need the `self.` prefix, because `getattr` already hands you a bound reference. – Granitosaurus
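The `**kwargs` pattern from point #1 can be sketched like this (BaseSpider here is a hypothetical stand-in for scrapy.Spider, whose `__init__` accepts a name plus keyword arguments):

```python
class BaseSpider:
    # Stand-in for scrapy.Spider: accepts a name plus arbitrary kwargs.
    def __init__(self, name=None, **kwargs):
        self.name = name

class WordSpider(BaseSpider):
    def __init__(self, keyword='apprehensive', **kwargs):
        super().__init__(**kwargs)  # forward anything unknown to the parent
        self.k = keyword

s = WordSpider(keyword='happy', name='words')
print(s.k, s.name)  # happy words
```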