So I built a scrapy spider that crawls all of the internal links within a website. However, when I run the spider, some sites have large sections that are unrelated to their actual content. For example, one site runs Jenkins, and my spider spends a lot of time crawling those Jenkins pages even though they have nothing to do with the site itself. How can I prevent a scrapy spider from spending too long crawling one part of a website?
One approach would be to create a blacklist and add paths such as the Jenkins one to it, but I'm wondering whether there is a better way to handle this.
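Apart from a blacklist, one built-in lever is Scrapy's `DEPTH_LIMIT` setting, which caps how many link hops the crawler will follow from the seed URLs, so deep, self-similar sections (like a Jenkins job tree) cannot consume the whole crawl budget. A minimal sketch, assuming a per-spider setting is acceptable (`DEPTH_LIMIT` and `custom_settings` are real Scrapy features; the limit of 5 is an arbitrary example):

```python
import scrapy


class WebsiteSpider(scrapy.Spider):
    name = "Website"
    # Stop following links more than 5 hops away from the seed URL;
    # pages beyond this depth are simply never requested.
    custom_settings = {"DEPTH_LIMIT": 5}
```

The same setting can also be applied crawl-wide in `settings.py` instead of per spider.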
import csv
from urllib.parse import urlparse

import scrapy
from scrapy import Request
from scrapy.exceptions import CloseSpider
from scrapy.item import BaseItem
from scrapy.linkextractors import LinkExtractor
from scrapy.loader import ItemLoader


class MappingItem(dict, BaseItem):
    pass


class WebsiteSpider(scrapy.Spider):
    name = "Website"

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        item = MappingItem()
        self.loader = ItemLoader(item)
        self.filter_urls = list()

    def start_requests(self):
        filename = "filename.csv"
        try:
            with open(filename, 'r') as csv_file:
                reader = csv.reader(csv_file)
                header = next(reader)
                for row in reader:
                    seed_url = row[1].strip()
                    base_url = urlparse(seed_url).netloc
                    self.filter_urls.append(base_url)
                    request = Request(seed_url, callback=self.parse_seed)
                    request.meta['base_url'] = base_url
                    yield request
        except IOError:
            raise CloseSpider("A list of websites is needed")

    def parse_seed(self, response):
        base_url = response.meta['base_url']
        # handle external redirect while still allowing internal redirect
        if urlparse(response.url).netloc != base_url:
            return
        external_le = LinkExtractor(deny_domains=base_url)
        external_links = external_le.extract_links(response)
        for external_link in external_links:
            if urlparse(external_link.url).netloc in self.filter_urls:
                self.loader.add_value(base_url, external_link.url)
        internal_le = LinkExtractor(allow_domains=base_url)
        internal_links = internal_le.extract_links(response)
        for internal_link in internal_links:
            request = Request(internal_link.url, callback=self.parse_seed)
            request.meta['base_url'] = base_url
            request.meta['dont_redirect'] = True
            yield request
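The blacklist idea mentioned above can be sketched as a small path filter applied to `internal_links` before yielding requests in `parse_seed`. This is a minimal standalone sketch; `DENY_PATH_PREFIXES` and `should_follow` are hypothetical names introduced here for illustration:

```python
from urllib.parse import urlparse

# Hypothetical blacklist of URL path prefixes to skip, e.g. a Jenkins
# instance mounted under /jenkins. Trailing slashes avoid accidentally
# matching unrelated paths like /circle against /ci.
DENY_PATH_PREFIXES = ("/jenkins", "/ci/")


def should_follow(url):
    """Return False for links whose path starts with a blacklisted prefix."""
    path = urlparse(url).path.lower()
    # str.startswith accepts a tuple, so one call checks every prefix.
    return not path.startswith(DENY_PATH_PREFIXES)
```

In the spider this would become `for internal_link in internal_links: if should_follow(internal_link.url): yield ...`. Note that `LinkExtractor` also accepts a `deny` parameter (a regex or list of regexes), so the same filtering can alternatively be pushed into the extractor itself.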
Are you using link extractors? Showing the relevant part of your spider code might help here. Thanks! – alecxe