Scrapy pipeline to update MySQL for each start_url

I have a spider that reads its start_urls from a MySQL database and scrapes an unknown number of links from each page. I want to use pipelines.py to update the database with those links, but I don't know how to get the start_url back into the pipeline for use in the SQL UPDATE statement.
Here is the spider code, which works:
import scrapy
import MySQLdb
import MySQLdb.cursors
from scrapy.http.request import Request
from youtubephase2.items import Youtubephase2Item

class youtubephase2(scrapy.Spider):
    name = 'youtubephase2'

    def start_requests(self):
        # Read the start URLs from the SearchResults table
        conn = MySQLdb.connect(user='uname', passwd='password', db='YouTubeScrape', host='localhost', charset="utf8", use_unicode=True)
        cursor = conn.cursor()
        cursor.execute('SELECT resultURL FROM SearchResults;')
        rows = cursor.fetchall()
        for row in rows:
            if row:
                yield Request(row[0], self.parse)
        cursor.close()

    def parse(self, response):
        # Collect every matching link on the page
        for sel in response.xpath('//a[contains(@class, "yt-uix-servicelink")]'):
            item = Youtubephase2Item()
            item['pageurl'] = sel.xpath('@href').extract()
            yield item
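For reference, a minimal sketch of one common way to carry the originating URL along with each scraped link: copy response.url onto the item inside parse. This assumes a hypothetical extra start_url field on Youtubephase2Item; it is not part of the code above.

    def parse(self, response):
        # Sketch only: assumes Youtubephase2Item declares a hypothetical 'start_url' field
        for sel in response.xpath('//a[contains(@class, "yt-uix-servicelink")]'):
            item = Youtubephase2Item()
            item['pageurl'] = sel.xpath('@href').extract()
            item['start_url'] = response.url  # the start URL this response was requested from
            yield item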
And here is the pipeline.py where I want to update the database with the scraped links, using the start_url as the WHERE condition of the SQL UPDATE statement. The start_url in the SQL statement below is a placeholder for what I am trying to accomplish.
import MySQLdb
import MySQLdb.cursors
import hashlib
import re
from scrapy import log
from scrapy.exceptions import DropItem
from twisted.enterprise import adbapi
from youtubephase2.items import Youtubephase2Item

class MySQLStorePipeline(object):
    def __init__(self):
        self.conn = MySQLdb.connect(user='uname', passwd='password', db='YouTubeScrape', host='localhost', charset="utf8", use_unicode=True)
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        try:
            # "start_url" below is the placeholder I want to fill with the originating start URL
            self.cursor.execute("""UPDATE SearchResults SET PageURL = %s WHERE ResultURL = start_url""",
                                (item['pageurl'],))
            self.conn.commit()
        except MySQLdb.Error as e:
            log.msg("Error %d: %s" % (e.args[0], e.args[1]))
        return item
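If each item carried the hypothetical start_url field sketched earlier, the placeholder in the UPDATE could be filled in roughly like this (a sketch, not the statement from my code):

# Sketch only: uses the hypothetical item['start_url'] as the WHERE value
self.cursor.execute("""UPDATE SearchResults SET PageURL = %s WHERE ResultURL = %s""",
                    (item['pageurl'], item['start_url']))
self.conn.commit()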
I hope my question is clear. In the past I have successfully used pipeline.py to insert items into a database.
Thanks so much for the quick reply. This works perfectly! – SDailey