2016-10-03 93 views
-1

I'm a beginner with Scrapy/Python, and I've built a crawler that finds expired domains and scans each one with an SEO API.

My crawler works correctly, but I'm fairly sure it isn't 100% optimized.

Are there any tricks to improve the crawler? Any help improving it would be appreciated.

expired.py:

import json
import urllib

import tldextract
from scrapy import Rule
from scrapy.spiders import CrawlSpider
from scrapy.linkextractors import LxmlLinkExtractor
from twisted.internet.error import DNSLookupError


class HttpbinSpider(CrawlSpider):
    name = "expired"

    rules = (
        Rule(LxmlLinkExtractor(allow=('.com', '.fr', '.net', '.org', '.info', '.casino', '.eu'),
                               deny=('facebook', 'amazon', 'wordpress', 'blogspot', 'free', 'reddit')),
             callback='parse_obj',
             process_request='add_errback',
             follow=True),
    )

    def __init__(self, domains=None, **kwargs):
        self.start_urls = json.loads(domains)
        super(HttpbinSpider, self).__init__(**kwargs)

    def add_errback(self, request):
        # Attach an errback to every request the rule generates.
        return request.replace(errback=self.errback_httpbin)

    def errback_httpbin(self, failure):
        if failure.check(DNSLookupError):
            request = failure.request
            ext = tldextract.extract(request.url)
            domain = ext.registered_domain
            if domain != '':
                domain = domain.replace("%20", "")
                self.check_domain(domain)

    def check_domain(self, domain):
        if self.is_available(domain) == 'AVAILABLE':

            self.logger.info('## Domain Expired : %s', domain)

            url = 'http://api.majestic.com/api/json?app_api_key=API&cmd=GetIndexItemInfo&items=1&item0=' + domain + '&datasource=fresh'
            response = urllib.urlopen(url)
            data = json.loads(response.read())
            response.close()

            TrustFlow = data['DataTables']['Results']['Data'][0]['TrustFlow']
            CitationFlow = data['DataTables']['Results']['Data'][0]['CitationFlow']
            RefDomains = data['DataTables']['Results']['Data'][0]['RefDomains']
            ExtBackLinks = data['DataTables']['Results']['Data'][0]['ExtBackLinks']

            if (RefDomains > 20) and (TrustFlow > 4) and (CitationFlow > 4):
                insert_table(domain, TrustFlow, CitationFlow, RefDomains, ExtBackLinks)

    def is_available(self, domain):
        url = 'https://api.internet.bs/Domain/Check?ApiKey=KEY&Password=PSWD&responseformat=json&domain=' + domain
        response = urllib.urlopen(url)
        data = json.loads(response.read())
        response.close()
        return data['status']

Thanks a lot.

Answer

1

The biggest problem in your code is the urllib requests, which block the whole asynchronous scrapy routine. You can easily replace them with a chain of scrapy requests by yielding `scrapy.Request` objects.

Something like this:

def errback_httpbin(self, failure):
    if not failure.check(DNSLookupError):
        return
    request = failure.request
    ext = tldextract.extract(request.url)
    domain = ext.registered_domain
    if domain == '':
        logging.debug('no domain: {}'.format(request.url))
        return
    domain = domain.replace("%20", "")
    url = 'https://api.internet.bs/Domain/Check?ApiKey=KEY&Password=PSWD&responseformat=json&domain=' + domain
    # carry the domain along the request chain via meta
    return Request(url, self.parse_checkdomain, meta={'domain': domain})

def parse_checkdomain(self, response):
    """check whether domain is available"""
    data = json.loads(response.text)
    domain = response.meta['domain']
    if data['status'] == 'AVAILABLE':
        self.logger.info('Domain Expired : {}'.format(domain))
        url = 'http://api.majestic.com/api/json?app_api_key=API&cmd=GetIndexItemInfo&items=1&item0=' + domain + '&datasource=fresh'
        return Request(url, self.parse_claim, meta={'domain': domain})

def parse_claim(self, response):
    """save available domain's details"""
    data = json.loads(response.text)
    domain = response.meta['domain']
    # eliminate redundancy
    results = data['DataTables']['Results']['Data'][0]
    # snake_case is more pythonic
    trust_flow = results['TrustFlow']
    citation_flow = results['CitationFlow']
    ref_domains = results['RefDomains']
    ext_back_links = results['ExtBackLinks']

    # no need to wrap everything in ()
    if ref_domains > 20 and trust_flow > 4 and citation_flow > 4:
        insert_table(domain, trust_flow, citation_flow, ref_domains, ext_back_links)

This way your code isn't blocked and is fully asynchronous. In general, when doing http in a scrapy spider, you don't want to use anything other than scrapy requests.
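On a related note, the spider builds its API URLs by string concatenation, which is what makes the manual `.replace("%20", "")` cleanup necessary; `urllib.parse.urlencode` escapes query parameters for you. A minimal sketch (the `build_check_url` helper and its default values are hypothetical, modeled on the internet.bs URL used above):

```python
from urllib.parse import urlencode

def build_check_url(domain, api_key="KEY", password="PSWD"):
    """Build the availability-check URL with proper query escaping."""
    params = {
        "ApiKey": api_key,
        "Password": password,
        "responseformat": "json",
        "domain": domain,
    }
    return "https://api.internet.bs/Domain/Check?" + urlencode(params)

# Spaces and special characters are encoded automatically,
# so no manual "%20" cleanup is needed.
print(build_check_url("example.com"))
```

The same approach works for the majestic.com URL; concatenation bugs like a missing `=` before a parameter value become impossible.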

+0

Thank you very much for your help and the improved code. I'll give it a try! – Pixel

+0

I'm using a BloomFilter, and a lot of errors are raised in the logs: `raise IndexError("BloomFilter is at capacity")`. Do you know why? – Pixel

+2

@Pixel Sorry, I'm not very familiar with it. AFAIK, according to the pybloom wiki, it should open a new, bigger instance. You could try specifying the capacity when creating the filter. I don't think this is scrapy-related; you might want to open a new question for it :) – Granitosaurus
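For context on the comment above: a Bloom filter is sized for a fixed expected number of elements, and a plain fixed-size filter loses its false-positive guarantee once that capacity is exceeded, which is why pybloom refuses further adds (its scalable variant allocates a bigger filter instead). A toy pure-stdlib sketch, just to illustrate why capacity is a constructor parameter (all names here are illustrative, not pybloom's API):

```python
import hashlib

class SimpleBloomFilter:
    """Toy fixed-capacity Bloom filter (illustrative only, not pybloom)."""

    def __init__(self, capacity, num_bits=8192, num_hashes=4):
        self.capacity = capacity      # max elements before adds are refused
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)
        self.count = 0

    def _positions(self, item):
        # Derive k bit positions from salted SHA-1 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha1("{}:{}".format(i, item).encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        if self.count >= self.capacity:
            # Beyond capacity the error-rate guarantee no longer holds,
            # hence the refusal (mirrors the IndexError seen in the logs).
            raise IndexError("BloomFilter is at capacity")
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)
        self.count += 1

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))
```

With a scalable variant, hitting capacity triggers allocation of a larger underlying filter; with a fixed one, the capacity passed to the constructor is a hard limit, so it needs to be sized for the number of URLs the crawl is expected to see.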