I wrote a Scrapy spider that has many start_urls and extracts email addresses from those URLs. The script takes a long time to run, so I would like to tell Scrapy to stop crawling a particular site once it finds an email address and move on to the next site. How do I tell a Python Scrapy spider to move on to the next start URL?

Edit: code added

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item
import csv
from urlparse import urlparse

from entreprise.items import MailItem

class MailSpider(CrawlSpider):
    name = "mail"
    start_urls = []
    allowed_domains = []

    # Build start_urls and allowed_domains from the URL column of the CSV.
    with open('scraped_data.csv', 'rb') as csvfile:
        reader = csv.reader(csvfile, delimiter=',', quotechar='"')
        next(reader)  # skip the header row
        for row in reader:
            url = row[5].strip()
            if url != "":
                start_urls.append(url)
                fragments = urlparse(url).hostname.split(".")
                hostname = ".".join(len(fragments[-2]) < 4 and fragments[-3:] or fragments[-2:])
                allowed_domains.append(hostname)

    rules = [
        Rule(SgmlLinkExtractor(allow=('.+')), follow=True, callback='parse_item'),
        Rule(SgmlLinkExtractor(allow=('.+')), callback='parse_item')
    ]

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        items = []
        for mail in hxs.select('//body//text()').re(r'[\w.-]+@[\w.-]+'):
            item = MailItem()
            item['url'] = response.url
            item['mail'] = mail
            items.append(item)
        return items

Could you please show the spider's code? It would help in answering. – alecxe

Answers

The idea is to use the start_requests method to decide which URLs to crawl next, and to keep track, in the parsed_hostnames class-level set, of whether an email address has already been parsed for a given hostname.

Also, I've changed the way the hostname is extracted from the URL; it now uses urlparse.

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item, Field
import csv
from urlparse import urlparse


class MailItem(Item):
    url = Field()
    mail = Field()


class MailSpider(CrawlSpider):
    name = "mail"

    parsed_hostnames = set()
    allowed_domains = []

    rules = [
        Rule(SgmlLinkExtractor(allow=('.+')), follow=True, callback='parse_item'),
        Rule(SgmlLinkExtractor(allow=('.+')), callback='parse_item')
    ]

    def start_requests(self):
        with open('scraped_data.csv', 'rb') as csvfile:
            reader = csv.reader(csvfile, delimiter=',', quotechar='"')
            next(reader)  # skip the header row

            for row in reader:
                url = row[5].strip()
                if url:
                    hostname = urlparse(url).hostname
                    # Only request URLs whose hostname has not produced an email yet.
                    if hostname not in self.parsed_hostnames:
                        if hostname not in self.allowed_domains:
                            self.allowed_domains.append(hostname)
                            self.rules[0].link_extractor.allow_domains.add(hostname)
                            self.rules[1].link_extractor.allow_domains.add(hostname)

                        yield self.make_requests_from_url(url)
                    else:
                        self.allowed_domains.remove(hostname)
                        self.rules[0].link_extractor.allow_domains.remove(hostname)
                        self.rules[1].link_extractor.allow_domains.remove(hostname)

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        items = []
        for mail in hxs.select('//body//text()').re(r'[\w.-]+@[\w.-]+'):
            item = MailItem()
            item['url'] = response.url
            item['mail'] = mail
            items.append(item)

        # Remember that this hostname has already yielded at least one address.
        hostname = urlparse(response.url).hostname
        self.parsed_hostnames.add(hostname)

        return items

Should work, in theory. Hope that helps.

The allowed domains are not set correctly; I tested the code and the spider crawled twitter, which is not in the list. – madmed

Well, that's because allow_domains wasn't set for your rules' link extractors. I've edited the code; give it a try. – alecxe
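
For context, allow_domains can also be passed to the extractor when it is constructed instead of mutated afterwards. A minimal sketch, using the same SgmlLinkExtractor as the rest of this thread; 'example.com' is only a placeholder domain:

from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

# Minimal sketch: restrict extracted links to a placeholder domain at construction time.
extractor = SgmlLinkExtractor(allow=('.+',), allow_domains=['example.com'])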

link_extractor.allow_domains is a set, not a list, so I used add instead of append. The script still doesn't stop crawling the current domain once an email address is found, so nothing has changed. – madmed

I ended up using process_links:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item, Field
import csv
from urlparse import urlparse

class MailItem(Item):
    url = Field()
    mail = Field()

class MailSpider(CrawlSpider):
    name = "mail"

    parsed_hostnames = set()

    rules = [
        Rule(SgmlLinkExtractor(allow=('.+')), follow=True, callback='parse_item', process_links='process_links'),
        Rule(SgmlLinkExtractor(allow=('.+')), callback='parse_item', process_links='process_links')
    ]

    start_urls = []
    allowed_domains = []

    # Build start_urls and allowed_domains from the URL column of the CSV.
    with open('scraped_data.csv', 'rb') as csvfile:
        reader = csv.reader(csvfile, delimiter=',', quotechar='"')
        next(reader)  # skip the header row
        for row in reader:
            url = row[5].strip()
            if url != "":
                start_urls.append(url)
                hostname = urlparse(url).hostname
                allowed_domains.append(hostname)

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        items = []
        mails = hxs.select('//body//text()').re(r'[\w.-]+@[\w.-]+')
        if mails:
            for mail in mails:
                item = MailItem()
                item['url'] = response.url
                item['mail'] = mail
                items.append(item)
                # Mark this hostname as done as soon as an address is found.
                hostname = urlparse(response.url).hostname
                self.parsed_hostnames.add(hostname)

        return items

    def process_links(self, links):
        # Drop links pointing to hostnames that have already produced an email.
        return [l for l in links if urlparse(l.url).hostname not in self.parsed_hostnames]

Thanks for your help. I found the solution; the trick was to look at CrawlSpider's source code and understand how it works. – madmed
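
For anyone else following this: the hook that makes the process_links approach work lives in CrawlSpider._requests_to_follow. Roughly paraphrased from the Scrapy source of that era (exact details vary between versions; HtmlResponse and Request come from scrapy.http), it shows that process_links filters the extracted links before any request is scheduled:

# Roughly paraphrased from CrawlSpider._requests_to_follow; not a verbatim copy.
def _requests_to_follow(self, response):
    if not isinstance(response, HtmlResponse):
        return
    seen = set()
    for n, rule in enumerate(self._rules):
        # Extract candidate links for this rule, skipping ones already seen.
        links = [l for l in rule.link_extractor.extract_links(response) if l not in seen]
        if links and rule.process_links:
            # The hook: links whose hostname is already in parsed_hostnames are
            # filtered out here, so no further requests are made for that domain.
            links = rule.process_links(links)
        for link in links:
            seen.add(link)
            r = Request(url=link.url, callback=self._response_downloaded)
            r.meta.update(rule=n, link_text=link.text)
            yield rule.process_request(r)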
