Scrapy: crawling multiple domains, each with reoccurring URLs

I am trying to crawl a few selected domains, and only the essential pages from those websites. My approach is to crawl one page of a domain and collect a set of URLs from it; those URLs are then crawled to find the URLs that reoccur from the first page. In this way I try to eliminate all URLs that do not reoccur (content URLs such as products etc.). The reason I am asking for help is that scrapy.Request is not executed more than once. This is what I have so far:
import urlparse

import scrapy
from scrapy import Request
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule


class Finder(scrapy.Spider):
    name = "finder"
    start_urls = ['http://www.nu.nl/']
    uniqueDomainUrl = dict()
    maximumReoccurringPages = 5

    rules = (
        Rule(
            LinkExtractor(
                allow=('.nl', '.nu', '.info', '.net', '.com', '.org'),
                deny=('facebook', 'amazon', 'wordpress', 'blogspot', 'free', 'reddit',
                      'videos', 'youtube', 'google', 'doubleclick', 'microsoft', 'yahoo',
                      'bing', 'znet', 'stackexchang', 'twitter', 'wikipedia', 'creativecommons',
                      'mediawiki', 'wikidata'),
            ),
            process_request='parse',
            follow=True
    def parse(self, response):
        self.logger.info('Entering URL: %s', response.url)
        currentUrlParse = urlparse.urlparse(response.url)
        currentDomain = currentUrlParse.hostname

        # Skip domains that were already handled
        if currentDomain in self.uniqueDomainUrl:
            return
        self.uniqueDomainUrl[currentDomain] = currentDomain

        item = ImportUrlList()
        response.meta['item'] = item

        # Reoccurring URLs
        item = self.findReoccurringUrls(response)
        urlList = item['list']
        self.logger.info('Output: %s', urlList)

        # Crawl reoccurring urls
        # for href in urlList:
        #     yield scrapy.Request(response.urljoin(href), callback=self.parse)
    def findReoccurringUrls(self, response):
        self.logger.info('Finding reoccurring URLs in: %s', response.url)
        item = response.meta['item']
        urls = self.findUrlsOnCurrentPage(response)
        item['list'] = urls
        response.meta['item'] = item

        # Get all URLs on each web page (limit 5 pages)
        i = 0
        for value in urls:
            i += 1
            if i > self.maximumReoccurringPages:
                break
            self.logger.info('Parse: %s', value)
            request = Request(value, callback=self.test, meta={'item': item})
            item = request.meta['item']
        return item
    def test(self, response):
        self.logger.info('Page title: %s', response.css('title').extract())
        item = response.meta['item']
        urls = self.findUrlsOnCurrentPage(response)

        # Keep only the URLs that occur on both pages
        item['list'] = set(item['list']) & set(urls)
        return item
    def findUrlsOnCurrentPage(self, response):
        newUrls = []
        currentUrlParse = urlparse.urlparse(response.url)
        currentDomain = currentUrlParse.hostname
        currentUrl = currentUrlParse.scheme + '://' + currentUrlParse.hostname
        for href in response.css('a::attr(href)').extract():
            newUrl = urlparse.urljoin(currentUrl, href)
            urlParse = urlparse.urlparse(newUrl)
            domain = urlParse.hostname

            # Skip in-page anchors and links to other domains
            if href.startswith('#'):
                continue
            if domain != currentDomain:
                continue
            if newUrl not in newUrls:
                newUrls.append(newUrl)
        return newUrls
This seems to execute only the first page; the other Request() calls are never made, as far as I can tell from the callbacks.
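For reference, this is the basic follow-up pattern I am trying to apply, as a simplified self-contained sketch rather than my actual spider (the spider name and the blanket link-following are illustrative only):

import scrapy

class FollowSpider(scrapy.Spider):
    name = "follow"  # illustrative name, not part of the real project
    start_urls = ['http://www.nu.nl/']

    def parse(self, response):
        # Scrapy only schedules a Request once the callback yields it
        # back to the engine; merely constructing one does nothing.
        for href in response.css('a::attr(href)').extract():
            yield scrapy.Request(response.urljoin(href), callback=self.parse)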
ImportUrlList just contains one field: list = dict(). I wanted to reuse findUrlsOnCurrentPage, so I made a separate function for the callback; since I am still experimenting with it, I called it test. On the first call, parse has already fetched the page, so I do not need to issue another request there.
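For completeness, a minimal sketch of the item, assuming the single field is declared in the standard scrapy.Item way (the exact original definition is not shown above):

import scrapy

class ImportUrlList(scrapy.Item):
    # assumption: one plain Field holding the collected URL list
    list = scrapy.Field()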