
Scrapy: connect to MySQL

I'm writing a Scrapy crawler and I want it to send its data to a database, but I can't get it to work, probably because of the pipeline. Here is my spider:

from scrapy.contrib.spiders import CrawlSpider
from scrapy.selector import Selector
from scrapy.http import Request

class YourCrawler(CrawlSpider):
    name = "bookstore"
    start_urls = [
        'https://example.com/materias/?novedades=LC&p',
    ]
    allowed_domains = ["example.com"]

    def parse(self, response):
        # go to the urls in the list
        s = Selector(response)
        page_list_urls = s.xpath('///*[@id="results"]/ul/li/div[1]/h4/a[2]/@href').extract()
        for url in page_list_urls:
            yield Request(response.urljoin(url), callback=self.parse_following_urls, dont_filter=True)

        # Go back and follow the next page in div#paginat ul li.next a::attr(href), then begin again
        next_page = response.css('div#paginat ul li.next a::attr(href)').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield Request(next_page, callback=self.parse)

    # Don't know if this has to go here
    if not s.select('//*[@id="logo"]/a/img'):
        yield Request(url=response.url, dont_filter=True)

    # For the urls in the list, go inside, and in div#main, take the div.ficha > div.caracteristicas > ul > li
    def parse_following_urls(self, response):
        # Parsing rules go here
        for each_book in response.css('div#main'):
            yield {
                'book_isbn': each_book.css('div.ficha > div.caracteristicas > ul > li').extract(),
            }

    custom_settings = {
        "DOWNLOAD_DELAY": 5,
        "CONCURRENT_REQUESTS_PER_DOMAIN": 2
    }

And I want it to send the data to the database, so in pipelines.py I have:

import pymysql
from scrapy.exceptions import DropItem
from scrapy.http import Request

class to_mysql(object):
    def __init__(self):
        self.connection = pymysql.connect("***", "***", "***", "***", charset="utf8", use_unicode=True)
        self.cursor = self.connection.cursor()

    def process_item(self, item, spider):
        self.cursor.execute("INSERT INTO _b (book_isbn) VALUES (%s)", (item['book_isbn'].encode('utf-8')))
        self.connection.commit()
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.connection.close()

And in settings.py:

ITEM_PIPELINES = { 
    'bookstore.pipelines.BookstorePipeline': 300, 
    'bookstore.pipelines.to_mysql': 300, 
} 

If I enable the «to_mysql» pipeline in settings.py, it doesn't work and returns this traceback:

Traceback (most recent call last): 
    File "/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 653, in _runCallbacks 
    current.result = callback(current.result, *args, **kw) 
    File "/Users/***/scrapy/bookstore/bookstore/pipelines.py", line 27, in process_item 
    self.cursor.execute("INSERT INTO _b (book_isbn) VALUES (%s)", (item['book_isbn'].encode('utf-8'))) 
AttributeError: 'list' object has no attribute 'encode' 
2017-07-09 16:19:48 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://example.com/book/?id=9788416495412> (referer: https://example.com/materias/?novedades=LC&p) 
2017-07-09 16:19:48 [scrapy.core.scraper] ERROR: Error processing {'book_isbn': [u'<li>Editorial: <a href="/search/avanzada/?go=1&amp;editorial=Galaxia%20Gutenberg">Galaxia Gutenberg</a></li>', u'<li>P\xe1ginas: 325</li>', u'<li>A\xf1o: 2017</li>', u'<li>Precio: 21.90 \u20ac</li>', u'<li>Traductor: Pablo Moreno</li>', u'<li>EAN: 9788416495412</li>']} 

Any idea why this is happening?


Can you post the spider's log? –


Just added the traceback! – Nikita

Answer


That happens because you are returning a list for the book_isbn field: .extract() returns a list, and a list can't be encoded into the SQL query.

You have to serialize that value, or, if you don't actually want a list, use extract_first() instead.
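For example, a minimal sketch of both options, reusing the field, table, and column names from the question (the newline separator in option 2 is just an arbitrary choice):

# Option 1: in the spider, keep only the first matching <li> as a string
def parse_following_urls(self, response):
    for each_book in response.css('div#main'):
        yield {
            'book_isbn': each_book.css('div.ficha > div.caracteristicas > ul > li').extract_first(),
        }

# Option 2: in the pipeline, serialize the list before inserting it
def process_item(self, item, spider):
    # join the list elements into one string; the separator is an example
    value = '\n'.join(item['book_isbn'])
    self.cursor.execute("INSERT INTO _b (book_isbn) VALUES (%s)", (value,))
    self.connection.commit()
    return item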


Thanks! I'd like to store every value of book_isbn. How can I do that? – Nikita


That depends on how you want to store it in your db – eLRuLL


Hm, what do you mean? Imagine I want to store every book_isbn value in a table, how can I do that? – Nikita
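One possible sketch of that, assuming each list element should become its own row and keeping the _b table and book_isbn column from the question:

def process_item(self, item, spider):
    # insert each element of the extracted list as a separate row
    rows = [(value,) for value in item['book_isbn']]
    self.cursor.executemany("INSERT INTO _b (book_isbn) VALUES (%s)", rows)
    self.connection.commit()
    return item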