I wrote a Scrapy spider that has many start_urls and extracts email addresses from those URLs. The script takes a long time to run, so I would like to tell Scrapy to stop crawling a particular site once it finds an email address and move on to the next site. How do I tell a Python Scrapy spider to move on to the next start URL?

Edit: code added

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item
import csv
from urlparse import urlparse

from entreprise.items import MailItem

class MailSpider(CrawlSpider):
    name = "mail"
    start_urls = []
    allowed_domains = []

    # Build start_urls and allowed_domains from the URL column of the CSV.
    with open('scraped_data.csv', 'rb') as csvfile:
        reader = csv.reader(csvfile, delimiter=',', quotechar='"')
        next(reader)  # skip the header row
        for row in reader:
            url = row[5].strip()
            if url != "":
                start_urls.append(url)
                fragments = urlparse(url).hostname.split(".")
                hostname = ".".join(len(fragments[-2]) < 4 and fragments[-3:] or fragments[-2:])
                allowed_domains.append(hostname)

    rules = [
        Rule(SgmlLinkExtractor(allow=('.+')), follow=True, callback='parse_item'),
        Rule(SgmlLinkExtractor(allow=('.+')), callback='parse_item')
    ]

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        items = []
        for mail in hxs.select('//body//text()').re(r'[\w.-]+@[\w.-]+'):
            item = MailItem()
            item['url'] = response.url
            item['mail'] = mail
            items.append(item)
        return items

Could you please show the spider's code? It would help in answering. – alecxe

Answers

The idea is to use the start_requests method to decide which URLs to crawl next, and to keep track, in the parsed_hostnames class-level set, of whether an email address has already been parsed for a given hostname.

Also, I've changed the way the hostname is extracted from the URL; it now uses urlparse.

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item, Field
import csv
from urlparse import urlparse


class MailItem(Item):
    url = Field()
    mail = Field()


class MailSpider(CrawlSpider):
    name = "mail"

    parsed_hostnames = set()
    allowed_domains = []

    rules = [
        Rule(SgmlLinkExtractor(allow=('.+')), follow=True, callback='parse_item'),
        Rule(SgmlLinkExtractor(allow=('.+')), callback='parse_item')
    ]

    def start_requests(self):
        with open('scraped_data.csv', 'rb') as csvfile:
            reader = csv.reader(csvfile, delimiter=',', quotechar='"')
            next(reader)  # skip the header row

            for row in reader:
                url = row[5].strip()
                if url:
                    hostname = urlparse(url).hostname
                    # Only request URLs whose hostname has not produced an email yet.
                    if hostname not in self.parsed_hostnames:
                        if hostname not in self.allowed_domains:
                            self.allowed_domains.append(hostname)
                            self.rules[0].link_extractor.allow_domains.add(hostname)
                            self.rules[1].link_extractor.allow_domains.add(hostname)

                        yield self.make_requests_from_url(url)
                    else:
                        self.allowed_domains.remove(hostname)
                        self.rules[0].link_extractor.allow_domains.remove(hostname)
                        self.rules[1].link_extractor.allow_domains.remove(hostname)

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        items = []
        for mail in hxs.select('//body//text()').re(r'[\w.-]+@[\w.-]+'):
            item = MailItem()
            item['url'] = response.url
            item['mail'] = mail
            items.append(item)

        # Remember that this hostname has already yielded at least one address.
        hostname = urlparse(response.url).hostname
        self.parsed_hostnames.add(hostname)

        return items

Should work, in theory. Hope that helps.

The allowed domains are not set correctly; I tested the code and the spider crawled twitter, which is not in the list. – madmed

Well, that's because allow_domains wasn't set for your rules' link extractors. I've edited the code; give it a try. – alecxe
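
For context, allow_domains can also be passed to the extractor when it is constructed instead of mutated afterwards. A minimal sketch, using the same SgmlLinkExtractor as the rest of this thread; 'example.com' is only a placeholder domain:

from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

# Minimal sketch: restrict extracted links to a placeholder domain at construction time.
extractor = SgmlLinkExtractor(allow=('.+',), allow_domains=['example.com'])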

link_extractor.allow_domains is a set, not a list, so I used add instead of append. The script still doesn't stop crawling the current domain once an email address is found, so nothing has changed. – madmed

I ended up using process_links:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item, Field
import csv
from urlparse import urlparse

class MailItem(Item):
    url = Field()
    mail = Field()

class MailSpider(CrawlSpider):
    name = "mail"

    parsed_hostnames = set()

    rules = [
        Rule(SgmlLinkExtractor(allow=('.+')), follow=True, callback='parse_item', process_links='process_links'),
        Rule(SgmlLinkExtractor(allow=('.+')), callback='parse_item', process_links='process_links')
    ]

    start_urls = []
    allowed_domains = []

    # Build start_urls and allowed_domains from the URL column of the CSV.
    with open('scraped_data.csv', 'rb') as csvfile:
        reader = csv.reader(csvfile, delimiter=',', quotechar='"')
        next(reader)  # skip the header row
        for row in reader:
            url = row[5].strip()
            if url != "":
                start_urls.append(url)
                hostname = urlparse(url).hostname
                allowed_domains.append(hostname)

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        items = []
        mails = hxs.select('//body//text()').re(r'[\w.-]+@[\w.-]+')
        if mails:
            for mail in mails:
                item = MailItem()
                item['url'] = response.url
                item['mail'] = mail
                items.append(item)
                # Mark this hostname as done as soon as an address is found.
                hostname = urlparse(response.url).hostname
                self.parsed_hostnames.add(hostname)

        return items

    def process_links(self, links):
        # Drop links pointing to hostnames that have already produced an email.
        return [l for l in links if urlparse(l.url).hostname not in self.parsed_hostnames]

Thanks for your help. I found the solution; the trick was to look at CrawlSpider's source code and understand how it works. – madmed
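
For anyone else following this: the hook that makes the process_links approach work lives in CrawlSpider._requests_to_follow. Roughly paraphrased from the Scrapy source of that era (exact details vary between versions; HtmlResponse and Request come from scrapy.http), it shows that process_links filters the extracted links before any request is scheduled:

# Roughly paraphrased from CrawlSpider._requests_to_follow; not a verbatim copy.
def _requests_to_follow(self, response):
    if not isinstance(response, HtmlResponse):
        return
    seen = set()
    for n, rule in enumerate(self._rules):
        # Extract candidate links for this rule, skipping ones already seen.
        links = [l for l in rule.link_extractor.extract_links(response) if l not in seen]
        if links and rule.process_links:
            # The hook: links whose hostname is already in parsed_hostnames are
            # filtered out here, so no further requests are made for that domain.
            links = rule.process_links(links)
        for link in links:
            seen.add(link)
            r = Request(url=link.url, callback=self._response_downloaded)
            r.meta.update(rule=n, link_text=link.text)
            yield rule.process_request(r)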
