I'm a beginner with Scrapy/Python. I've built a crawler that finds expired domains and checks each one against an SEO API.
The crawler works, but I'm fairly sure it isn't fully optimized.
Are there any tips to improve it? Any help making the crawler faster would be appreciated.
expired.py:
import json
import urllib

import tldextract
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
from twisted.internet.error import DNSLookupError


class HttpbinSpider(CrawlSpider):
    name = "expired"

    # Link-extractor patterns are regular expressions, so the dots are escaped.
    rules = (
        Rule(LxmlLinkExtractor(allow=(r'\.com', r'\.fr', r'\.net', r'\.org',
                                      r'\.info', r'\.casino', r'\.eu'),
                               deny=('facebook', 'amazon', 'wordpress',
                                     'blogspot', 'free', 'reddit')),
             callback='parse_obj',
             process_request='add_errback',
             follow=True),
    )

    def __init__(self, domains=None, **kwargs):
        # The start URLs are passed in as a JSON-encoded spider argument.
        self.start_urls = json.loads(domains)
        super(HttpbinSpider, self).__init__(**kwargs)

    def parse_obj(self, response):
        # Successful responses need no processing; only DNS failures matter.
        pass

    def add_errback(self, request):
        # Attach an errback to every request so failures can be inspected.
        return request.replace(errback=self.errback_httpbin)

    def errback_httpbin(self, failure):
        # A DNS lookup failure suggests the linked domain may have expired.
        if failure.check(DNSLookupError):
            request = failure.request
            ext = tldextract.extract(request.url)
            domain = ext.registered_domain
            if domain != '':
                domain = domain.replace("%20", "")
                self.check_domain(domain)

    def check_domain(self, domain):
        if self.is_available(domain) == 'AVAILABLE':
            self.logger.info('## Domain Expired : %s', domain)
            # Pull backlink metrics for the expired domain from Majestic.
            url = ('http://api.majestic.com/api/json?app_api_key=API'
                   '&cmd=GetIndexItemInfo&items=1&item0=' + domain +
                   '&datasource=fresh')
            response = urllib.urlopen(url)
            data = json.loads(response.read())
            response.close()

            result = data['DataTables']['Results']['Data'][0]
            TrustFlow = result['TrustFlow']
            CitationFlow = result['CitationFlow']
            RefDomains = result['RefDomains']
            ExtBackLinks = result['ExtBackLinks']

            if RefDomains > 20 and TrustFlow > 4 and CitationFlow > 4:
                # insert_table is a database helper defined elsewhere.
                insert_table(domain, TrustFlow, CitationFlow,
                             RefDomains, ExtBackLinks)

    def is_available(self, domain):
        # Availability check via the internet.bs API.
        url = ('https://api.internet.bs/Domain/Check?ApiKey=KEY&Password=PSWD'
               '&responseformat=json&domain=' + domain)
        response = urllib.urlopen(url)
        data = json.loads(response.read())
        response.close()
        return data['status']
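One thing I already suspect is not optimal: check_domain and is_available call urllib.urlopen synchronously, so the whole spider blocks while each API request is in flight. Below is a rough sketch of how I think those lookups could be routed through Scrapy's own downloader instead, so they run concurrently with the crawl. It assumes the rest of the class (name, rules, __init__, add_errback) stays as above; parse_availability and parse_majestic are helper names I made up for the sketch:

import scrapy  # in addition to the imports above


class HttpbinSpider(CrawlSpider):
    # name, rules, __init__ and add_errback unchanged from above.

    CHECK_URL = ('https://api.internet.bs/Domain/Check?ApiKey=KEY&Password=PSWD'
                 '&responseformat=json&domain=')
    MAJESTIC_URL = ('http://api.majestic.com/api/json?app_api_key=API'
                    '&cmd=GetIndexItemInfo&items=1&item0=%s&datasource=fresh')

    def errback_httpbin(self, failure):
        if failure.check(DNSLookupError):
            ext = tldextract.extract(failure.request.url)
            domain = ext.registered_domain.replace("%20", "")
            if domain:
                # Hand the availability check to Scrapy's downloader instead
                # of blocking the reactor with urllib.
                yield scrapy.Request(self.CHECK_URL + domain,
                                     callback=self.parse_availability,
                                     meta={'domain': domain},
                                     dont_filter=True)

    def parse_availability(self, response):
        domain = response.meta['domain']
        if json.loads(response.text)['status'] == 'AVAILABLE':
            self.logger.info('## Domain Expired : %s', domain)
            yield scrapy.Request(self.MAJESTIC_URL % domain,
                                 callback=self.parse_majestic,
                                 meta={'domain': domain})

    def parse_majestic(self, response):
        result = json.loads(response.text)['DataTables']['Results']['Data'][0]
        if (result['RefDomains'] > 20 and result['TrustFlow'] > 4
                and result['CitationFlow'] > 4):
            insert_table(response.meta['domain'], result['TrustFlow'],
                         result['CitationFlow'], result['RefDomains'],
                         result['ExtBackLinks'])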
Thanks a lot.
Thanks a lot for your help and the improved code. I'll try it! – Pixel
I'm using a BloomFilter, and the logs are flooded with 'raise IndexError("BloomFilter is at capacity")' errors. Do you know why? – Pixel
@Pixel Sorry, I'm not too familiar with it. AFAIK, according to the pybloom wiki, it should open a new, bigger instance. You could try specifying a capacity when creating the filter. I don't think this is Scrapy-related, though; you might want to open a new question for it :) – Granitosaurus
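For reference, a minimal sketch of the two options Granitosaurus mentions, using pybloom (the capacity and error_rate values here are arbitrary examples):

from pybloom import BloomFilter, ScalableBloomFilter

# Fixed-size filter: add() raises IndexError('BloomFilter is at capacity')
# once more than `capacity` elements have been inserted.
seen_fixed = BloomFilter(capacity=1000000, error_rate=0.001)

# Scalable variant: chains extra internal filters as it fills up,
# so it never hits a hard capacity limit.
seen = ScalableBloomFilter(initial_capacity=1000, error_rate=0.001)

seen.add('example.com')           # returns False on first insert
print('example.com' in seen)      # True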