Python Scrapy - populating start_urls from MySQL

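I am trying to populate start_urls in spider.py with a SELECT from a MySQL table. When I run "scrapy runspider spider.py" I get no output, only a run that finishes without errors.

I have tested the SELECT query in a standalone Python script, and there start_urls is populated with the entries from the MySQL table.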

spider.py

from scrapy.spider import BaseSpider
from scrapy.selector import Selector
import MySQLdb


class ProductsSpider(BaseSpider):
    name = "Products"
    allowed_domains = ["test.com"]
    start_urls = []

    def parse(self, response):
        print self.start_urls

    def populate_start_urls(self, url):
        conn = MySQLdb.connect(
            user='user',
            passwd='password',
            db='scrapy',
            host='localhost',
            charset="utf8",
            use_unicode=True
        )
        cursor = conn.cursor()
        cursor.execute(
            'SELECT url FROM links;'
        )
        rows = cursor.fetchall()

        for row in rows:
            start_urls.append(row[0])
        conn.close()


2013-11-21 maryo

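A better approach is to override the start_requests method.

It can query your database, much like populate_start_urls does, and return a sequence of Request objects.

You only need to rename your populate_start_urls method to start_requests and change the following lines: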

for row in rows: 
    yield self.make_requests_from_url(row[0])


2013-11-22 04:43:19

Thanks for the response. It works; I only needed to change 'def populate_start_urls(self, url):' to 'def start_requests(self):'. I have marked this as accepted because it is the closest to the code I posted. – maryo

How would you do this for a broad crawl over 22M sites? I suppose you would have to iterate 1000 at a time. Can you show how to iterate over them with start_requests?
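
One possible way to handle a table that large is to read it in batches with the DB-API fetchmany call and yield URLs from a generator. This is only a sketch: iter_urls and batch_size are names made up here, and an in-memory SQLite table stands in for the MySQL one so the example runs on its own. With MySQLdb the default cursor buffers the whole result set on the client, so a server-side cursor such as MySQLdb.cursors.SSCursor is probably needed to actually keep memory flat.

```python
import sqlite3


def iter_urls(conn, query="SELECT url FROM links", batch_size=1000):
    """Yield URLs from a DB-API 2.0 connection, one batch at a time."""
    cursor = conn.cursor()
    try:
        cursor.execute(query)
        while True:
            # Pull at most batch_size rows per round trip.
            rows = cursor.fetchmany(batch_size)
            if not rows:
                break
            for row in rows:
                yield row[0]
    finally:
        cursor.close()


# Demo: an in-memory SQLite table standing in for the MySQL links table.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE links (url TEXT)")
demo.executemany(
    "INSERT INTO links VALUES (?)",
    [("http://test.com/%d" % i,) for i in range(2500)],
)
urls = list(iter_urls(demo, batch_size=1000))
print(len(urls), urls[0])
```

Inside the spider, start_requests would loop over iter_urls(conn) and yield one scrapy.Request per URL. As far as I know Scrapy pulls from that generator as it has capacity rather than draining it up front, so the connection has to stay open until the generator is exhausted.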

Do the population in __init__:

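def __init__(self, *args, **kwargs):
    # Forward any arguments Scrapy passes (for example from -a name=value).
    super(ProductsSpider, self).__init__(*args, **kwargs)
    self.start_urls = get_start_urls()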

This assumes get_start_urls() returns the URLs.
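
The answer leaves get_start_urls undefined, so here is one way it might be written, reusing the query and the placeholder credentials from the question. The connect_mysql helper and the connect parameter are assumptions added for illustration; passing a different connection factory makes the function easy to try without a MySQL server.

```python
def connect_mysql():
    """Open the connection from the question (credentials are placeholders)."""
    import MySQLdb  # third-party driver, imported only when needed

    return MySQLdb.connect(
        user="user",
        passwd="password",
        db="scrapy",
        host="localhost",
        charset="utf8",
        use_unicode=True,
    )


def get_start_urls(connect=connect_mysql):
    """Return every URL in the links table as a list of strings."""
    conn = connect()
    try:
        cursor = conn.cursor()
        cursor.execute("SELECT url FROM links")
        # Each row is a 1-tuple holding the url column.
        return [row[0] for row in cursor.fetchall()]
    finally:
        conn.close()
```

Because the whole list is built when the spider is constructed, this suits tables that fit comfortably in memory.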


2013-11-21 15:20:22 Biswanath
