Sending an email with attachments after crawling a site

I am using Scrapy for a school project to find dead links and missing pages. I have written a pipeline that writes text files with the relevant scraped information. What I'm having trouble figuring out is how to send an email when the spider run finishes, with the files that were created as attachments.

Scrapy has built-in email functionality, and it fires a signal when a spider finishes, but how to put all the pieces together in a sensible way is escaping me. Any help would be greatly appreciated.

Here is my pipeline that creates the files with the scraped data:

class saveToFile(object):

    def __init__(self):
        # open files
        self.old = open('old_pages.txt', 'wb')
        self.date = open('pages_without_dates.txt', 'wb')
        self.missing = open('missing_pages.txt', 'wb')

        # write table headers
        line = "{0:15} {1:40} {2:} \n\n".format("Domain","Last Updated","URL")
        self.old.write(line)

        line = "{0:15} {1:} \n\n".format("Domain","URL")
        self.date.write(line)

        line = "{0:15} {1:70} {2:} \n\n".format("Domain","Page Containing Broken Link","URL of Broken Link")
        self.missing.write(line)

    def process_item(self, item, spider):
        # add items to file as they are scraped
        if item['group'] == "Old Page":
            line = "{0:15} {1:40} {2:} \n".format(item['domain'],item["lastUpdated"],item["url"])
            self.old.write(line)
        elif item['group'] == "No Date On Page":
            line = "{0:15} {1:} \n".format(item['domain'],item["url"])
            self.date.write(line)
        elif item['group'] == "Page Not Found":
            line = "{0:15} {1:70} {2:} \n".format(item['domain'],item["referrer"],item["url"])
            self.missing.write(line)

        return item
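(Note for anyone adapting this pipeline: the files are never closed, so a component that reads them at the end of the run may see incomplete, still-buffered data. A minimal, hypothetical addition to saveToFile, using the standard close_spider hook that Scrapy calls on item pipelines when the spider finishes:)

def close_spider(self, spider):
    # flush and close the output files so anything reading them afterwards
    # (e.g. a mail extension) sees complete data
    self.old.close()
    self.date.close()
    self.missing.close()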

I would like to create a separate pipeline that sends the email. Here is what I have so far:

from scrapy.mail import MailSender
from scrapy.signals import spider_closed, spider_opened
from scrapy.xlib.pydispatch import dispatcher

class emailResults(object):

    def __init__(self):
        dispatcher.connect(self.spider_closed, spider_closed)
        dispatcher.connect(self.spider_opened, spider_opened)

        old = open('old_pages.txt', 'wb')
        date = open('pages_without_dates.txt', 'wb')
        missing = open('missing_pages.txt', 'wb')
        oldOutput = open('twenty_oldest_pages.txt', 'wb')

        self.attachments = [
            ("old_pages", "text/plain", old),
            ("date", "text/plain", date),
            ("missing", "text/plain", missing),
            ("oldOutput", "text/plain", oldOutput),
        ]

        self.mailer = MailSender()

    def spider_closed(SPIDER_NAME):
        self.mailer.send(to=["[email protected]"], attachs=self.attachments,
                         subject="test email", body="Some body")

It seems that in previous versions of Scrapy you could pass self into the spider_closed function, but in the current version (0.21) spider_closed is only passed the spider name.
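(For reference, a minimal sketch, not from the original post, assuming Scrapy 0.21's dispatcher API: because dispatcher.connect() is given a bound method, self is already captured by the binding, so the handler only needs to accept the spider argument.)

from scrapy.mail import MailSender
from scrapy.signals import spider_closed
from scrapy.xlib.pydispatch import dispatcher

class EmailResultsSketch(object):

    def __init__(self):
        self.mailer = MailSender()
        # connecting the *bound* method keeps access to self
        dispatcher.connect(self.spider_closed, spider_closed)

    def spider_closed(self, spider):
        # 'spider' is the spider instance that just finished
        self.mailer.send(to=["[email protected]"], subject="test email",
                         body="Some body")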

Any help and/or advice would be greatly appreciated.

Answer


Creating the mail-sending class as a pipeline is not the best idea. It is better to create it as an extension of your own. You can read more about extensions here: http://doc.scrapy.org/en/latest/topics/extensions.html

The most important part is the classmethod from_crawler. It is called for every crawler, and it lets you register your callbacks for the signals you want to intercept. This method from my mailer class, for example, looks like this:

@classmethod
def from_crawler(cls, crawler):
    recipients = crawler.settings.getlist('STATUSMAILER_RECIPIENTS')
    if not recipients:
        raise NotConfigured

    mail = MailSender.from_settings(crawler.settings)
    instance = cls(recipients, mail, crawler)

    crawler.signals.connect(instance.item_scraped, signal=signals.item_scraped)
    crawler.signals.connect(instance.spider_error, signal=signals.spider_error)
    crawler.signals.connect(instance.spider_closed, signal=signals.spider_closed)
    crawler.signals.connect(instance.item_dropped, signal=signals.item_dropped)

    return instance
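(A minimal sketch of how the rest of such an extension might look. This is not the answerer's actual class: the __init__ signature simply matches the cls(recipients, mail, crawler) call above, the file names come from the question's pipeline, and the handler signatures follow the Scrapy 0.21 signals documentation. The from_crawler shown above completes the class.)

from scrapy import signals
from scrapy.exceptions import NotConfigured
from scrapy.mail import MailSender

class StatusMailer(object):

    def __init__(self, recipients, mail, crawler):
        self.recipients = recipients
        self.mail = mail
        self.crawler = crawler

    def item_scraped(self, item, response, spider):
        pass  # could count items for the report

    def item_dropped(self, item, spider, exception):
        pass  # could log dropped items

    def spider_error(self, failure, response, spider):
        pass  # could collect tracebacks for the mail body

    def spider_closed(self, spider):
        # attach the files written by the question's pipeline and send
        names = ['old_pages.txt', 'pages_without_dates.txt', 'missing_pages.txt']
        attachs = [(name, 'text/plain', open(name, 'rb')) for name in names]
        return self.mail.send(
            to=self.recipients,
            subject="Crawl finished: %s" % spider.name,
            body="Results attached.",
            attachs=attachs)

Returning the Deferred from mail.send in the spider_closed handler lets Scrapy wait for the mail to be delivered before it shuts down.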

For everything to work, remember to set all of the required values in your settings:

EXTENSIONS = { 
    'your.mailer': 80 
} 

STATUSMAILER_RECIPIENTS = ["who should get mail"] 

MAIL_HOST = '***' 
MAIL_PORT = *** 
MAIL_USER = '***' 
MAIL_PASS = '***' 
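(Side note: MailSender.from_settings(crawler.settings) in the snippet above is what reads the MAIL_* values. If you prefer to construct the sender by hand, an equivalent sketch using the keyword arguments documented for MailSender would be:)

from scrapy.mail import MailSender

# hypothetical explicit construction; the placeholder values
# mirror the MAIL_* settings above
mailer = MailSender(smtphost='***', mailfrom='***', smtpuser='***',
                    smtppass='***', smtpport=25)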

Thank you for the advice, very helpful. – bornytm