
Scrapy: send a condition from start_requests(self) to parse

Depending on the type of item I'm scraping, I scrape a site whose pages have differing rows. I have a working scraper that looks like the first code block below; however, I would like to be able to pull a type from the database in start_requests(self) and send it along to the parse function. I have 11 different types that each have a different number of rows for one table on part of the page, whereas the rows in the other tables on the page are the same. I have tried to show this in the second code block.

How do I pull the type from the database in start_requests and send it to parse?

First code block:

# -*- coding: utf-8 -*- 
from scrapy.spiders import Spider 
from scrapy.selector import Selector 
from scrapeInfo.items import infoItem 
import pyodbc 


class scrapeInfo(Spider):
    name = "info"
    allowed_domains = ["nevermind.com"]  # domain only, not a full URL
    start_urls = []

    def start_requests(self):
        # Get InfoID and Type from the database
        self.conn = pyodbc.connect('DRIVER={SQL Server};SERVER=server;DATABASE=dbname;UID=user;PWD=password')
        self.cursor = self.conn.cursor()
        self.cursor.execute("SELECT InfoID FROM dbo.infostage")

        rows = self.cursor.fetchall()

        for row in rows:
            url = 'http://www.nevermind.com/info/'
            yield self.make_requests_from_url(url + row[0])

    def parse(self, response):
        hxs = Selector(response)
        infodata = hxs.xpath('div[2]/div[2]')  # input item path

        itemPool = []

        # Slice the InfoID out of the URL (response.url is already a string)
        InfoID = response.url
        id = InfoID[29:len(InfoID)-1]

        for info in infodata:
            item = infoItem()

            # Details
            item['id'] = id  # response.url
            item['field'] = info.xpath('tr[1]/td[2]/p/b/text()').extract()
            item['field2'] = info.xpath('tr[2]/td[2]/p/b/text()').extract()
            item['field3'] = info.xpath('tr[3]/td[2]/p/b/text()').extract()
            item['field4'] = info.xpath('tr[4]/td[2]/p/b/text()').extract()
            item['field5'] = info.xpath('tr[5]/td[2]/p/b/text()').extract()
            item['field6'] = info.xpath('tr[6]/td[2]/p/b/text()').extract()

            itemPool.append(item)
            yield item

Second code block:
This does not work, but I am not sure how to get it working. Do I create a global list, or a new function?

# -*- coding: utf-8 -*- 
from scrapy.spiders import Spider 
from scrapy.selector import Selector 
from scrapeInfo.items import infoItem 
import pyodbc 


class scrapeInfo(Spider):
    name = "info"
    allowed_domains = ["nevermind.com"]  # domain only, not a full URL
    start_urls = []

    def start_requests(self):
        # Get InfoID and Type from the database
        self.conn = pyodbc.connect('DRIVER={SQL Server};SERVER=server;DATABASE=dbname;UID=user;PWD=password')
        self.cursor = self.conn.cursor()
        self.cursor.execute("SELECT InfoID, type FROM dbo.infostage")

        rows = self.cursor.fetchall()

        for row in rows:
            url = 'http://www.nevermind.com/info/'
            type = row[1]  # how do I send this value to the parse function?
            yield self.make_requests_from_url(url + row[0])

    def parse(self, response):
        hxs = Selector(response)
        infodata = hxs.xpath('div[2]/div[2]')  # input base path

        itemPool = []

        # Slice the InfoID out of the URL (response.url is already a string)
        InfoID = response.url
        id = InfoID[29:len(InfoID)-1]

        for info in infodata:
            item = infoItem()

            # Details
            item['id'] = id  # response.url

            # Here I need a condition that comes from start_requests(self):
            # if the condition is met, scrape the following fields, else the others.
            # This is where I would like to use it. I have 11 different types that
            # all have a different number of rows for one table on part of the page,
            # whereas the rows in the other tables on the page are the same.
            if type == 'type1':  # 'type' is not available here -- this is the problem
                # Type 1
                item['field'] = info.xpath('tr[1]/td[2]/p/b/text()').extract()
                item['field2'] = info.xpath('tr[2]/td[2]/p/b/text()').extract()
                item['field3'] = info.xpath('tr[3]/td[2]/p/b/text()').extract()
                item['field4'] = info.xpath('tr[4]/td[2]/p/b/text()').extract()
                item['field5'] = info.xpath('tr[5]/td[2]/p/b/text()').extract()
                item['field6'] = info.xpath('tr[6]/td[2]/p/b/text()').extract()
            else:
                item['field2'] = info.xpath('tr[2]/td[2]/p/b/text()').extract()
                item['field4'] = info.xpath('tr[4]/td[2]/p/b/text()').extract()
                item['field6'] = info.xpath('tr[6]/td[2]/p/b/text()').extract()

            itemPool.append(item)
            yield item


Thank you all for your help and insight!

Answer


You can use request.meta:

import scrapy

def make_requests_from_url(self, url, type, callback):
    request = scrapy.Request(url, callback)
    request.meta['type'] = type
    return request

In parse you can then access the type via response.meta['type'].
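
A minimal sketch of how this could be wired into the spider above. No make_requests_from_url override is strictly needed, since meta can be passed straight to scrapy.Request; the hard-coded rows and type names below are placeholders standing in for the pyodbc query from the question:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.spiders import Spider


class scrapeInfo(Spider):
    name = "info"
    allowed_domains = ["nevermind.com"]

    def start_requests(self):
        # placeholder rows; in the real spider these come from
        # the pyodbc query shown in the question
        rows = [('id1', 'type1'), ('id2', 'type2')]

        for info_id, row_type in rows:
            url = 'http://www.nevermind.com/info/' + info_id
            # meta travels with the request and comes back on the response
            yield scrapy.Request(url, callback=self.parse,
                                 meta={'type': row_type})

    def parse(self, response):
        row_type = response.meta['type']
        if row_type == 'type1':
            # scrape the type1-specific table rows here
            pass
        else:
            # scrape the shared table rows here
            pass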