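I wrote a Scrapy spider that has many start_urls and extracts email addresses from those URLs. The script takes a very long time to run, so I want to tell Scrapy to stop crawling a particular site as soon as it finds an email there and move on to the next site. How can I tell Python Scrapy to move on to the next start URL?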
EDIT: added the code
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item
import csv
from urlparse import urlparse

from entreprise.items import MailItem


class MailSpider(CrawlSpider):
    name = "mail"
    start_urls = []
    allowed_domains = []

    # Build start_urls and allowed_domains from the sixth column of the CSV.
    with open('scraped_data.csv', 'rb') as csvfile:
        reader = csv.reader(csvfile, delimiter=',', quotechar='"')
        next(reader)  # skip the first row (presumably the header)
        for row in reader:
            url = row[5].strip()
            if (url.strip() != ""):
                start_urls.append(url)
                fragments = urlparse(url).hostname.split(".")
                # Keep three labels when the second-to-last one is short (e.g. "co" in "co.uk").
                hostname = ".".join(len(fragments[-2]) < 4 and fragments[-3:] or fragments[-2:])
                allowed_domains.append(hostname)

    rules = [
        Rule(SgmlLinkExtractor(allow=('.+')), follow=True, callback='parse_item'),
        Rule(SgmlLinkExtractor(allow=('.+')), callback='parse_item')
    ]

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        items = []
        # Collect every email-like string found in the page body text.
        for mail in hxs.select('//body//text()').re(r'[\w.-]+@[\w.-]+'):
            item = MailItem()
            item['url'] = response.url
            item['mail'] = mail
            items.append(item)
        return items
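To make the desired behaviour concrete, here is a minimal sketch in plain Python (no Scrapy involved) of the bookkeeping it would need: remember which sites have already yielded an email, and skip further URLs from them. The names `domain_key` and `FoundTracker` are hypothetical, the domain heuristic is copied from the spider above, and the import uses Python 3's `urllib.parse` rather than the Python 2 `urlparse` module in the spider.

```python
from urllib.parse import urlparse  # Python 3; the spider above uses Python 2's urlparse


def domain_key(url):
    """Reduce a URL to a site key, using the same heuristic as the spider."""
    fragments = urlparse(url).hostname.split(".")
    # Keep three labels when the second-to-last one is short (e.g. "co" in "co.uk").
    return ".".join(fragments[-3:] if len(fragments[-2]) < 4 else fragments[-2:])


class FoundTracker(object):
    """Hypothetical helper: remembers which sites already yielded an email."""

    def __init__(self):
        self.done = set()

    def mark_found(self, url):
        # Call this once an email has been extracted from a page on the site.
        self.done.add(domain_key(url))

    def should_skip(self, url):
        # True when some page of the same site has already produced an email.
        return domain_key(url) in self.done


tracker = FoundTracker()
tracker.mark_found("http://www.example.co.uk/contact")
print(tracker.should_skip("http://www.example.co.uk/about"))  # True
print(tracker.should_skip("http://www.example.com/"))         # False
```

In a spider, a check like `should_skip` could plausibly sit at the top of `parse_item`, or in a downloader middleware whose `process_request` raises `scrapy.exceptions.IgnoreRequest` for finished sites. That placement is an assumption, not something tested against this spider.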
Could you please show the spider's code? That would help in answering. – alecxe