Asked 2017-04-13

Scrapy pipeline SQL syntax error

I have a spider that grabs URLs from a MySQL database and uses those URLs as start_urls; the spider in turn grabs any number of new links from the scraped pages. When I set the pipeline to insert both the start_url and the newly scraped URL into a new table, or when I set the pipeline to update an already existing row using the start_url in the WHERE clause, I get a SQL syntax error.

When I insert only one or the other, I do not get the error.

Here is spider.py:

import scrapy
import MySQLdb
import MySQLdb.cursors
from scrapy.http.request import Request

from youtubephase2.items import Youtubephase2Item

class youtubephase2(scrapy.Spider):
    name = 'youtubephase2'

    def start_requests(self):
        conn = MySQLdb.connect(user='uname', passwd='password', db='YouTubeScrape', host='localhost', charset="utf8", use_unicode=True)
        cursor = conn.cursor()
        cursor.execute('SELECT resultURL FROM SearchResults;')
        rows = cursor.fetchall()

        for row in rows:
            if row:
                yield Request(row[0], self.parse, meta=dict(start_url=row[0]))
        cursor.close()

    def parse(self, response):
        for sel in response.xpath('//a[contains(@class, "yt-uix-servicelink")]'):
            item = Youtubephase2Item()
            item['newurl'] = sel.xpath('@href').extract()
            item['start_url'] = response.meta['start_url']
            yield item

Here is pipeline.py, showing all three self.cursor.execute statements:

import MySQLdb
import MySQLdb.cursors
import hashlib
from scrapy import log
from scrapy.exceptions import DropItem
from twisted.enterprise import adbapi
from youtubephase2.items import Youtubephase2Item

class MySQLStorePipeline(object):
    def __init__(self):
        self.conn = MySQLdb.connect(user='uname', passwd='password', db='YouTubeScrape', host='localhost', charset="utf8", use_unicode=True)
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        try:
            #self.cursor.execute("""UPDATE SearchResults SET NewURL = %s WHERE ResultURL = %s VALUES (%s, %s)""", (item['newurl'], item['start_url']))
            #self.cursor.execute("""UPDATE SearchResults SET NewURL = %s WHERE ResultURL = %s""", (item['newurl'], item['start_url']))
            self.cursor.execute("""INSERT INTO TestResults (NewURL, StartURL) VALUES (%s, %s)""", (item['newurl'], item['start_url']))
            self.conn.commit()
        except MySQLdb.Error, e:
            log.msg("Error %d: %s" % (e.args[0], e.args[1]))

        return item

The top SQL execute statement returns this error:

2017-04-13 18:29:34 [scrapy.core.scraper] ERROR: Error processing {'newurl': [u'http://www.tagband.co.uk/'], 
'start_url': u'https://www.youtube.com/watch?v=UqguztfQPho'} 
Traceback (most recent call last): 
File "/usr/local/lib/python2.7/dist-packages/twisted/internet/defer.py", line 653, in _runCallbacks 
current.result = callback(current.result, *args, **kw) 
File "/root/scraping/youtubephase2/youtubephase2/pipelines.py", line 18, in process_item 
self.cursor.execute("""UPDATE SearchResults SET AffiliateURL = %s WHERE ResultURL = %s VALUES (%s, %s)""",(item['affiliateurl'], item['start_url'])) 
File "/usr/lib/python2.7/dist-packages/MySQLdb/cursors.py", line 159, in execute 
query = query % db.literal(args) 
TypeError: not enough arguments for format string 
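That TypeError can be reproduced without a database: MySQLdb substitutes parameters with ordinary %-formatting after escaping them, so the number of %s placeholders must equal the number of supplied values. The top statement has four %s but only two parameters (and it mixes UPDATE ... SET with a VALUES clause, which MySQL would reject anyway). A minimal sketch, using plain string formatting to stand in for MySQLdb's substitution:

```python
# The top statement: four %s placeholders, but only two values supplied.
bad_query = "UPDATE SearchResults SET NewURL = %s WHERE ResultURL = %s VALUES (%s, %s)"
# Corrected form: UPDATE has no VALUES clause, and two %s match two values.
good_query = "UPDATE SearchResults SET NewURL = %s WHERE ResultURL = %s"

# Pre-escaped stand-in values, as MySQLdb would produce after quoting.
params = ("'http://example.com/'", "'http://example.org/'")

try:
    bad_query % params   # four placeholders, two values
except TypeError as e:
    print(e)             # not enough arguments for format string

rendered = good_query % params
print(rendered)
```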

The middle SQL execute statement returns this error:

2017-04-13 18:33:18 [scrapy.log] INFO: Error 1064: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near ') WHERE ResultURL = 'https://www.youtube.com/watch?v=UqguztfQPho'' at line 1 
2017-04-13 18:33:18 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.youtube.com/watch?v=UqguztfQPho> 
{'newurl': [u'http://www.tagband.co.uk/'], 
'start_url': u'https://www.youtube.com/watch?v=UqguztfQPho'} 

The last SQL execute statement returns the same error as the middle one, even though it uses INSERT against a new table. Something seems to be adding an extra single quote. The last statement works fine when I insert only one of the two items into the database.

2017-04-13 18:36:40 [scrapy.log] INFO: Error 1064: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '), 'https://www.youtube.com/watch?v=UqguztfQPho')' at line 1 
2017-04-13 18:36:40 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.youtube.com/watch?v=UqguztfQPho> 
{'newurl': [u'http://www.tagband.co.uk/'], 
'start_url': u'https://www.youtube.com/watch?v=UqguztfQPho'} 
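Both Error 1064 messages point at the same culprit: extract() returns a list, and when MySQLdb escapes a Python list it renders it as a parenthesized SQL tuple, which is where the stray ) before WHERE and the doubled parentheses in the INSERT come from. A sketch that simulates the escaping (the literal helper below is a simplified, hypothetical stand-in for the escaping a MySQLdb connection performs; no database needed):

```python
def literal(value):
    # Simplified stand-in for MySQLdb's value escaping: strings are
    # quoted, but a list/tuple becomes a parenthesized SQL tuple.
    if isinstance(value, (list, tuple)):
        return "(%s)" % ", ".join(literal(v) for v in value)
    return "'%s'" % value

newurl = [u'http://www.tagband.co.uk/']   # a one-element list, from extract()
start_url = u'https://www.youtube.com/watch?v=UqguztfQPho'

query = "INSERT INTO TestResults (NewURL, StartURL) VALUES (%s, %s)"
rendered = query % tuple(literal(v) for v in (newurl, start_url))
print(rendered)
# The NewURL value is wrapped in an extra pair of parentheses,
# which MySQL rejects with Error 1064.
```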

Sorry for the long post. Trying to be thorough.

Answer


I figured this out. The problem was related to the fact that I was passing a list to the MySQL execute in the pipeline.

I created a pipeline that converts the list to a string with ''.join(item['newurl']) and returns the item before it hits the MySQL pipeline.

There may be a better way, such as changing the item['newurl'] = sel.xpath('@href').extract() line in spider.py to extract the first item of the list or convert it to text, but an extra pipeline worked for me.
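For reference, the list-to-string pipeline described above can be sketched roughly like this (the class name ListToStringPipeline is my own; it would be registered in ITEM_PIPELINES with a lower order number than MySQLStorePipeline so it runs first):

```python
class ListToStringPipeline(object):
    """Flatten the one-element list that extract() returns into a
    plain string before the item reaches the MySQL pipeline."""

    def process_item(self, item, spider):
        if isinstance(item.get('newurl'), list):
            item['newurl'] = ''.join(item['newurl'])
        return item
```

Using extract_first() in the spider instead, as suggested in the comments, avoids the list entirely and makes this extra pipeline unnecessary.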


Yes, there is a more idiomatic way to select the first element: item['newurl'] = sel.xpath('@href').extract_first() –


I feel silly. I have used that method before but didn't realize it would be the simple solution in this case. Thanks. – SDailey


Don't feel silly. If you didn't find that information, it probably means the selector documentation could be improved (assuming you have read https://docs.scrapy.org/en/latest/topics/selectors.html) –