Scrapy broad crawl - allow only internal links, allowed_domains permits too many domains
I need to scrape only the first 10-20 internal links from each site during a broad crawl, so that I don't hammer the web servers, but there are far too many domains to list in allowed_domains. I'm asking here because the Scrapy documentation doesn't cover this and I couldn't find an answer through Google.
from urllib.parse import urlparse

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy.item import Item, Field

class DomainLinks(Item):
    links = Field()

class ScapyProject(CrawlSpider):
    name = 'scapyproject'

    #allowed_domains = []
    start_urls = ['big domains list loaded from database']

    rules = (Rule(LxmlLinkExtractor(allow=()), callback='parse_links', follow=True),)

    def parse_start_url(self, response):
        # Return the item so results from the start URLs are not silently discarded
        return self.parse_links(response)

    def parse_links(self, response):
        item = DomainLinks()
        item['links'] = []
        # str.strip() removes a character set, not a prefix, so the original
        # chained strip() calls raised a TypeError; parse the host instead
        domain = urlparse(response.url).netloc
        for prefix in ("www.", "ww2."):
            if domain.startswith(prefix):
                domain = domain[len(prefix):]
        links = LxmlLinkExtractor(allow=(), deny=()).extract_links(response)
        links = [link for link in links if domain in link.url]
        # Filter duplicates and append to the item's link list
        for link in links:
            if link.url not in item['links']:
                item['links'].append(link.url)
        return item
Is the following comprehension the best way to filter links without using an allowed_domains list or the LxmlLinkExtractor allow filter? Both of those appear to use regular expressions, which would hurt performance and limit how large the allowed-domains list can be, since every extracted link would be regex-matched against every domain in the list:
links = [link for link in links if domain in link.url]
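For what it's worth, a plain substring test can produce false positives (for instance, example.com also matches notexample.com.evil.net), and comparing parsed hosts stays regex-free while being safer. A minimal sketch, assuming a hypothetical helper internal_links() and an arbitrary cap of 20 links per page:

from urllib.parse import urlparse

def internal_links(response, links, limit=20):
    # Hypothetical helper: keep only links whose host matches the host
    # of the response they were extracted from, de-duplicated and capped
    # at `limit`. Plain string comparison, no regex involved.
    origin = urlparse(response.url).netloc.lower()
    kept, seen = [], set()
    for link in links:
        host = urlparse(link.url).netloc.lower()
        if host == origin and link.url not in seen:
            seen.add(link.url)
            kept.append(link.url)
            if len(kept) >= limit:
                break  # stop after the first N internal links
    return kept

Host equality also sidesteps the www./ww2. prefix stripping, as long as internal links on a site use the same host as the page itself.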
The other problem I'm struggling with: how do I make the spider follow only internal links, without using an allowed_domains list at all? A custom middleware?
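One regex-free option would be a small spider middleware modelled on Scrapy's built-in OffsiteMiddleware: drop any request whose host differs from the host of the response that produced it, so no allowed_domains list is needed. A rough sketch (the class name and module path are assumptions, not tested code):

from urllib.parse import urlparse
from scrapy.http import Request

class InternalLinksMiddleware(object):
    # Spider middleware sketch: only pass through requests that stay on
    # the same host as the page they were extracted from.
    def process_spider_output(self, response, result, spider):
        origin = urlparse(response.url).netloc
        for request_or_item in result:
            if isinstance(request_or_item, Request) and \
                    urlparse(request_or_item.url).netloc != origin:
                continue  # drop the cross-domain request
            yield request_or_item

It would then be enabled in settings.py with something like SPIDER_MIDDLEWARES = {'myproject.middlewares.InternalLinksMiddleware': 550} (module path and priority assumed).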
Thanks