Sending an email with attachments after crawling a site

I'm working on a school project using Scrapy to find dead links and outdated pages. I have written a pipeline that writes the relevant scraped information to text files. What I'm having trouble with is sending an email at the end of the spider run, with the files that were created attached.

Scrapy has built-in email functionality and fires a signal when the spider finishes, but how to put all of that together has so far escaped me. Any help would be greatly appreciated.

Here is my pipeline that writes the scraped data to files:
class saveToFile(object):

    def __init__(self):
        # open the output files in text mode ('wb' would require bytes)
        self.old = open('old_pages.txt', 'w')
        self.date = open('pages_without_dates.txt', 'w')
        self.missing = open('missing_pages.txt', 'w')
        # write table headers
        line = "{0:15} {1:40} {2:} \n\n".format("Domain", "Last Updated", "URL")
        self.old.write(line)
        line = "{0:15} {1:} \n\n".format("Domain", "URL")
        self.date.write(line)
        line = "{0:15} {1:70} {2:} \n\n".format("Domain", "Page Containing Broken Link", "URL of Broken Link")
        self.missing.write(line)

    def process_item(self, item, spider):
        # append items to the matching file as they are scraped
        if item['group'] == "Old Page":
            line = "{0:15} {1:40} {2:} \n".format(item['domain'], item["lastUpdated"], item["url"])
            self.old.write(line)
        elif item['group'] == "No Date On Page":
            line = "{0:15} {1:} \n".format(item['domain'], item["url"])
            self.date.write(line)
        elif item['group'] == "Page Not Found":
            line = "{0:15} {1:70} {2:} \n".format(item['domain'], item["referrer"], item["url"])
            self.missing.write(line)
        return item

    def close_spider(self, spider):
        # flush and close the files so a later pipeline can attach them
        for f in (self.old, self.date, self.missing):
            f.close()
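For reference, both pipelines have to be enabled in the project's settings.py; the `myproject` module path below is an assumption about the project layout. In Scrapy 0.21 `ITEM_PIPELINES` was a plain list; newer versions use a dict mapping each pipeline to an order number, along these lines:

```python
# settings.py -- pipeline registration (module path is hypothetical)
# Scrapy 0.21 style:
ITEM_PIPELINES = [
    'myproject.pipelines.saveToFile',
    'myproject.pipelines.emailResults',
]

# Later Scrapy versions instead expect a dict; lower numbers run first:
# ITEM_PIPELINES = {
#     'myproject.pipelines.saveToFile': 300,
#     'myproject.pipelines.emailResults': 800,
# }
```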
I want the email to be sent by a separate pipeline. What I have so far is below:
from scrapy import signals
from scrapy.mail import MailSender
from scrapy.xlib.pydispatch import dispatcher

class emailResults(object):

    def __init__(self):
        dispatcher.connect(self.spider_closed, signals.spider_closed)
        dispatcher.connect(self.spider_opened, signals.spider_opened)
        self.mailer = MailSender()

    def spider_opened(self, spider):
        pass

    def spider_closed(self, spider):
        # open the finished files for reading so they can be attached
        # (opening them with 'wb' here would truncate them instead)
        old = open('old_pages.txt', 'rb')
        date = open('pages_without_dates.txt', 'rb')
        missing = open('missing_pages.txt', 'rb')
        oldOutput = open('twenty_oldest_pages.txt', 'rb')
        attachments = [
            ("old_pages", "text/plain", old),
            ("date", "text/plain", date),
            ("missing", "text/plain", missing),
            ("oldOutput", "text/plain", oldOutput),
        ]
        self.mailer.send(to=["[email protected]"], attachs=attachments,
                         subject="test email", body="Some body")
It seems that in previous versions of Scrapy you could pass self into the spider_closed function, but in the current version (0.21) spider_closed is only passed the spider.

Any help and/or suggestions would be appreciated.
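Independent of how the signal wiring is fixed, it may help to see what an attachment tuple ultimately turns into. A minimal stdlib sketch (the helper name and file paths are hypothetical, not part of Scrapy) that builds the same kind of message with `email.mime`, without sending anything:

```python
import os
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

def build_report_email(body, paths):
    """Build a multipart message with one text/plain attachment per file."""
    msg = MIMEMultipart()
    msg['Subject'] = 'test email'
    msg.attach(MIMEText(body))          # plain-text body comes first
    for path in paths:
        with open(path) as f:
            part = MIMEText(f.read())   # file contents as a text/plain part
        # mark the part as an attachment with the bare file name
        part.add_header('Content-Disposition', 'attachment',
                        filename=os.path.basename(path))
        msg.attach(part)
    return msg

# demo with a throwaway report file
with open('old_pages.txt', 'w') as f:
    f.write("Domain          Last Updated   URL\n")
msg = build_report_email('Some body', ['old_pages.txt'])
print(msg.get_payload()[1].get_filename())  # → old_pages.txt
```

The resulting message could be handed to `smtplib.SMTP.send_message`, or the same file objects passed to MailSender as in the pipeline above.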
Thanks for the suggestions, very helpful. – bornytm