Scrapy middleware bottleneck: MySQL SELECT. I have tested where the bottleneck is, and it comes from the SELECT query in my downloader middleware, which runs one MySQL SELECT per request:
import pymysql.cursors
from scrapy.exceptions import IgnoreRequest


class CheckDuplicatesFromDB(object):

    def process_request(self, request, spider):
        # url_list is just a plain Python list with some urls in it.
        if request.url not in url_list:
            self.crawled_urls = dict()
            # One new database connection and one SELECT for every request.
            connection = pymysql.connect(host='123',
                                         user='123',
                                         password='1234',
                                         db='123',
                                         charset='utf8',
                                         cursorclass=pymysql.cursors.DictCursor)
            try:
                with connection.cursor() as cursor:
                    # Read a single record for the current request url
                    sql = "SELECT `url` FROM `url` WHERE `url`=%s"
                    cursor.execute(sql, (request.url,))
                    self.crawled_urls = cursor.fetchone()
                connection.commit()
            finally:
                connection.close()

            if self.crawled_urls is None:
                return None
            elif request.url == self.crawled_urls['url']:
                # The url was already crawled: drop the request.
                raise IgnoreRequest()
            else:
                return None
        else:
            return None
If I disable DOWNLOADER_MIDDLEWARES in settings.py, Scrapy crawls at a decent speed.
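For completeness, the middleware is enabled in settings.py roughly like this (the module path and the priority value are placeholders, not my exact project values):

# settings.py
DOWNLOADER_MIDDLEWARES = {
    # Placeholder module path; the priority 543 is arbitrary.
    'myproject.middlewares.CheckDuplicatesFromDB': 543,
}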
Before disabling it:
[scrapy.extensions.logstats] INFO: Crawled 4 pages (at 0 pages/min), scraped 4 items (at 2 items/min)
After disabling it:
[scrapy.extensions.logstats] INFO: Crawled 55 pages (at 55 pages/min), scraped 0 items (at 0 items/min)
I guess the SELECT query is the problem. So I would like to run the SELECT query only once, get all the url data, and put it into the request finger_prints.
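Roughly what I have in mind is the sketch below: run one SELECT when the middleware is created, keep every crawled url in an in-memory set, and only do a set lookup per request. The class name CheckDuplicatesFromDBOnce is just a placeholder, and I am using a plain Python set here instead of Scrapy's real request fingerprints; I have not tested this.

import pymysql.cursors
from scrapy.exceptions import IgnoreRequest


class CheckDuplicatesFromDBOnce(object):

    def __init__(self):
        # Load every already-crawled url once, when the middleware is created.
        connection = pymysql.connect(host='123',
                                     user='123',
                                     password='1234',
                                     db='123',
                                     charset='utf8',
                                     cursorclass=pymysql.cursors.DictCursor)
        try:
            with connection.cursor() as cursor:
                cursor.execute("SELECT `url` FROM `url`")
                self.crawled_urls = {row['url'] for row in cursor.fetchall()}
        finally:
            connection.close()

    def process_request(self, request, spider):
        # In-memory membership test instead of one SELECT per request.
        if request.url in self.crawled_urls:
            raise IgnoreRequest()
        return None

I am not sure whether a set like this is good enough, or whether the urls should instead go into the finger_prints that Scrapy already keeps for duplicate filtering.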
I am running the spiders with CrawlerProcess (roughly as sketched after the example below): the more spiders I start, the fewer pages get crawled per minute.
For example:
- 1 spider => 50 pages/min
- 2 spiders => 30 pages/min in total
- 6 spiders => 10 pages/min in total
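This is roughly how the spiders are started with CrawlerProcess (the spider classes and module paths are placeholders standing in for the real project):

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Placeholder spiders; the real project starts several spiders this way.
from myproject.spiders.spider_a import SpiderA
from myproject.spiders.spider_b import SpiderB

process = CrawlerProcess(get_project_settings())
process.crawl(SpiderA)
process.crawl(SpiderB)
process.start()  # blocks until all spiders have finished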
What I want to do is:
- fetch the url data from MySQL with a single query
- put that url data into the request finger_prints
How can I do this?