Scrapy recursive link crawler with login - help me improve

To the best of my current knowledge, I have written a small web spider/crawler that can recursively crawl pages to a configurable nesting depth and can optionally log in via POST/GET before crawling (if required).

Since I am a complete beginner, I would like to get some feedback, improvements, or anything else you can throw at me.

I am only including the `parse` function here. The whole source code can be viewed on GitHub: https://github.com/cytopia/crawlpy

What I really want to be sure of is that the recursion combined with `yield` is as efficient as possible, and that I am going about it the right way.

Any comments on the code and coding style are very welcome.
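For context, the login step mentioned above follows the usual Scrapy pattern of submitting the login form first and only starting to crawl once the session is authenticated. The snippet below is a generic sketch of that pattern, not crawlpy's actual login code; the spider name, URLs, and credentials are placeholders:

```python
import scrapy


class LoginSpider(scrapy.Spider):
    """Minimal login-then-crawl sketch (placeholder URLs and credentials)."""
    name = 'login-example'

    def start_requests(self):
        # Fetch the login page first so its form (and any CSRF token) is available
        yield scrapy.Request('http://example.com/login', callback=self.login)

    def login(self, response):
        # from_response() pre-fills hidden form fields from the login page
        return scrapy.FormRequest.from_response(
            response,
            formdata={'username': 'user', 'password': 'secret'},
            callback=self.after_login,
        )

    def after_login(self, response):
        # The session cookie is now set; start the actual crawl
        yield scrapy.Request('http://example.com/', callback=self.parse)
```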
```python
import re

from scrapy.http import Request
from scrapy.selector import Selector

# Assumed import path for the project's item definition
from crawlpy.items import CrawlpyItem


def parse(self, response):
    """
    Scrapy parse callback
    """
    # Get current nesting level (defaults to 1 for the start URL)
    curr_depth = response.meta.get('depth', 1)

    # Only crawl the current page if we hit a HTTP-200
    if response.status == 200:
        hxs = Selector(response)
        links = hxs.xpath("//a/@href").extract()

        # We store already crawled links in this list
        crawled_links = []

        # Pattern to check for a proper absolute http(s) link
        linkPattern = re.compile(
            r"^(?:http|https):\/\/(?:[\w\.\-\+]+:{0,1}[\w\.\-\+]*@)?(?:[a-z0-9\-\.]+)"
            r"(?::[0-9]+)?(?:\/|\/(?:[\w#!:\.\?\+=&%@!\-\/\(\)]+)|\?(?:[\w#!:\.\?\+=&%@!\-\/\(\)]+))?$"
        )

        for link in links:
            # Link could be a relative url from response.url,
            # such as link: '../test', response.url: http://dom.tld/foo/bar
            if link.find('../') == 0:
                link = response.url + '/' + link
            # Prepend BASE URL if it does not have it
            elif 'http://' not in link and 'https://' not in link:
                link = self.base_url + link

            # If it is a proper link and not checked yet, yield it to the spider
            if (link
                    and linkPattern.match(link)
                    and link.find(self.base_url) == 0):
                #and link not in crawled_links
                #and link not in uniques):

                # Check if this url has already been seen
                re_exists = re.compile('^' + link + '$')
                exists = False
                for i in self.uniques:
                    if re_exists.match(i):
                        exists = True
                        break

                if not exists:
                    # Store the link
                    crawled_links.append(link)
                    self.uniques.append(link)

                    # Do we recurse?
                    if curr_depth < self.depth:
                        request = Request(link, self.parse)
                        # Add meta-data about the current recursion depth
                        request.meta['depth'] = curr_depth + 1
                        yield request
                    # else: nesting level too deep

            # else: link did not match the conditions

        #
        # Final return (yield) to the caller
        #
        for url in crawled_links:
            item = CrawlpyItem()
            item['url'] = url
            item['depth'] = curr_depth
            yield item
    # else: not HTTP 200
```
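For comparison, a more compact version of the same logic that leans on Scrapy built-ins might look roughly like the sketch below. It assumes `self.uniques` is changed from a list to a set (membership tests become O(1) instead of scanning the list with a compiled regex), and it uses `response.urljoin()` to resolve the relative-URL cases that the original patches by hand; the item import path is assumed:

```python
import scrapy

from crawlpy.items import CrawlpyItem  # project item; import path assumed


def parse(self, response):
    """Sketch of the same callback using Scrapy built-ins."""
    curr_depth = response.meta.get('depth', 1)

    if response.status != 200:
        return

    for href in response.xpath('//a/@href').extract():
        # urljoin() resolves '../test' and other relative forms against response.url
        link = response.urljoin(href)

        if link.startswith(self.base_url) and link not in self.uniques:
            self.uniques.add(link)

            item = CrawlpyItem()
            item['url'] = link
            item['depth'] = curr_depth
            yield item

            # Recurse until the configured nesting depth is reached
            if curr_depth < self.depth:
                yield scrapy.Request(
                    link,
                    callback=self.parse,
                    meta={'depth': curr_depth + 1},
                )
```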
Thanks for the simplification. That looks much cleaner. However, there are two problems with it: I get a lot of duplicate links (e.g. when saving to JSON), and links outside my allowed domain get stored as well. What am I overlooking? – cytopia
@cytopia Good catch! There was a huge flaw in the spider: it returned URLs before downloading them, so Scrapy's duplicate filter and `allowed_domains` filter were never actually applied. I've fixed that now! – Granitosaurus
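The pattern described here is to yield follow-up `Request` objects and only emit an item once a page has actually been downloaded, so that the scheduler's duplicate filter and the offsite middleware (which enforces `allowed_domains`) get to run. A rough sketch of that pattern, reusing the names from the question's code and not the reviewer's exact edit:

```python
def parse(self, response):
    # This page was actually downloaded, so its request already passed
    # Scrapy's duplicate filter and the offsite middleware (allowed_domains).
    item = CrawlpyItem()
    item['url'] = response.url
    item['depth'] = response.meta.get('depth', 0)
    yield item

    # Yield follow-up *requests*, not raw urls: the built-in filters
    # only apply to requests that go through the scheduler.
    for href in response.xpath('//a/@href').extract():
        yield Request(
            response.urljoin(href),
            callback=self.parse,
            meta={'depth': response.meta.get('depth', 0) + 1},
        )
```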
@Granitosaurus Thanks for your edit. A couple of notes: with `item['url'] = link.url`, `link` is now used before it is defined in the for loop. Or did you mean `response.url`? Also, could you explain why you default `depth` to zero rather than 1? – cytopia
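On the depth question: Scrapy's built-in `DepthMiddleware` also starts counting at zero. Start requests get `meta['depth'] == 0` and each followed link increments it by one, so defaulting to zero matches the framework's convention and lets the `DEPTH_LIMIT` setting replace the manual bookkeeping entirely. A minimal sketch:

```python
# settings.py - sketch: with the default DepthMiddleware enabled,
# start requests have response.meta['depth'] == 0 and every followed
# link adds one, so a hard cap needs no manual curr_depth tracking.
DEPTH_LIMIT = 3
```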