I store the URLs I have already crawled in a MySQL database. When Scrapy crawls the site again, the scheduler or downloader should only hit/crawl/download pages whose URLs are not in the database; URLs already saved in the database should not be crawled.
#settings.py
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.RandomUserAgentMiddleware': 400,
    'myproject.middlewares.ProxyMiddleware': 410,
    'myproject.middlewares.DupFilterMiddleware': 390,
    # disable the built-in UserAgentMiddleware so RandomUserAgentMiddleware takes over
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
}
#middlewares.py
import MySQLdb

from scrapy import log
from scrapy.exceptions import IgnoreRequest


class DupFilterMiddleware(object):
    def process_request(self, request, response, spider):
        pass

    def process_response(self, request, response, spider):
        conn = MySQLdb.connect(user='dbuser', passwd='dbpass', db='dbname',
                               host='localhost', charset='utf8', use_unicode=True)
        cursor = conn.cursor()
        log.msg("Make mysql connection", level=log.INFO)
        cursor.execute("""SELECT id FROM scrapy WHERE url = %s""", (response.url,))
        if cursor.fetchone() is None:
            # not seen before: let the response through to the spider
            return response
        else:
            raise IgnoreRequest("Duplicate --db-- item found: %s" % response.url)
#spider.py
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.loader import XPathItemLoader
from scrapy.contrib.loader.processor import MapCompose
from w3lib.html import replace_escape_chars

from myproject.items import testItem  # adjust to wherever testItem is defined


class TestSpider(CrawlSpider):
    name = "test_spider"
    allowed_domains = ["test.com"]
    start_urls = ["http://test.com/company/JV-Driver-Jobs-dHJhZGVzODkydGVhbA%3D%3D"]

    rules = [
        Rule(SgmlLinkExtractor(allow=("http://example.com/job/(.*)",)), callback="parse_items"),
        Rule(SgmlLinkExtractor(allow=("http://example.com/company/",)), follow=True),
    ]

    def parse_items(self, response):
        l = XPathItemLoader(testItem(), response=response)
        l.default_output_processor = MapCompose(lambda v: v.strip(), replace_escape_chars)
        l.add_xpath('job_title', '//h1/text()')
        l.add_value('url', response.url)
        l.add_xpath('job_description', '//tr[2]/td[2]')
        l.add_value('job_code', '99')
        return l.load_item()
It works, but I get an error whenever raise IgnoreRequest() fires: the request is logged as "Error downloading". Is that supposed to happen?
2013-10-15 17:54:16-0600 [test_spider] ERROR: Error downloading <GET http://example.com/job/aaa>: Duplicate --db-- item found: http://example.com/job/aaa
Another problem with my approach is that I have to query the database for every single URL I am about to crawl. Say I have 10k URLs to crawl; that means I hit the MySQL server 10k times. How can I do this in one MySQL query? (For example, fetch all crawled URLs up front, store them somewhere, and then check each requested URL against them.)
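For illustration only, the batching idea in that parenthetical could look something like this hypothetical helper (it assumes the same scrapy table and an open MySQLdb cursor; it is not part of my project code):

def filter_unseen(cursor, urls):
    """Return only the urls that are not already stored in the scrapy table."""
    if not urls:
        return []
    # one query per batch of candidate urls instead of one query per url
    placeholders = ','.join(['%s'] * len(urls))
    cursor.execute("SELECT url FROM scrapy WHERE url IN (" + placeholders + ")", tuple(urls))
    seen = set(row[0] for row in cursor.fetchall())
    return [u for u in urls if u not in seen]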
Update:
Following audiodude's suggestion, this is my latest code. However, DupFilterMiddleware stopped working: it runs __init__ but process_request is never called. Removing __init__ makes process_request work again. What am I doing wrong?
class DupFilterMiddleware(object):
    def __init__(self):
        self.conn = MySQLdb.connect(user='myuser', passwd='mypw', db='mydb',
                                    host='localhost', charset='utf8', use_unicode=True)
        self.cursor = self.conn.cursor()
        self.url_set = set()
        self.cursor.execute('SELECT url FROM scrapy')
        for url in self.cursor.fetchall():
            self.url_set.add(url)
        print self.url_set
        log.msg("DupFilterMiddleware Initialize mysql connection", level=log.INFO)

    def process_request(self, request, spider):
        log.msg("Process Request URL:{%s}" % request.url, level=log.WARNING)
        if request.url in url_set:
            log.msg("IgnoreRequest Exception {%s}" % request.url, level=log.WARNING)
            raise IgnoreRequest()
        else:
            return None
Thanks for the reply. I updated my question following your suggestion. However, as soon as I have '__init__', process_request no longer runs. –
Are you sure it isn't running, or is it just not filtering anything? Maybe you should add a log message before 'return None'. I have a feeling 'fetchall()' returns a list of lists, i.e. [["http://url1.com"], ["http://url2.com"]], so your URLs are never found in the set. – audiodude
I'm 100% sure. If process_request ran, it should output 'log.msg("Process Request URL:{%s}" % request.url, level=log.WARNING)'. Instead, I get nothing at all. If I remove __init__, the "Process Request URL" messages show up. –
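For reference, a minimal sketch of the middleware with audiodude's hint applied: unpack each row returned by fetchall() and reference the set through self. It assumes the same table and connection settings as above and is not tested against the asker's project:

import MySQLdb
from scrapy import log
from scrapy.exceptions import IgnoreRequest

class DupFilterMiddleware(object):
    def __init__(self):
        conn = MySQLdb.connect(user='myuser', passwd='mypw', db='mydb',
                               host='localhost', charset='utf8', use_unicode=True)
        cursor = conn.cursor()
        cursor.execute('SELECT url FROM scrapy')
        # fetchall() returns one row (a tuple) per record, so take the url column
        self.url_set = set(row[0] for row in cursor.fetchall())
        conn.close()
        log.msg("DupFilterMiddleware loaded %d urls" % len(self.url_set), level=log.INFO)

    def process_request(self, request, spider):
        if request.url in self.url_set:  # self.url_set, not the bare name url_set
            raise IgnoreRequest("Duplicate --db-- item found: %s" % request.url)
        return None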