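Creating web scraping rules

I am scraping this page: http://www.metacritic.com/browse/games/title/ps4/a?view=condensed

I also want to go into each item and get the developer and genre, but my code does not seem to work.

For example, I want to go into this page: http://www.metacritic.com/game/playstation-4/angry-birds-star-wars

Then I want to leave it and do the same for the rest of the items, adding everything to the database. What can I change in my code to make this work? Right now the developer and genre columns in the database are empty, but the rest of the data is there, so it is as if the spider never enters parseGame.

I also added print statements to parseGame and none of them print.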

from scrapy.spider import BaseSpider 
from scrapy.selector import Selector 
from scrapy.http import Request 
from scrapy.selector import HtmlXPathSelector 
from metacritic.items import MetacriticItem 
import MySQLdb 
import re 
from string import lowercase 

class MetacriticSpider(BaseSpider):
    def start_requests(self):
        #iterate through ps4 pages
        for c in lowercase:
            for i in range(self.max_id):
                yield Request('http://www.metacritic.com/browse/games/title/ps4/{0}?page={1}'.format(c, i), callback = self.parseps4)

    #gets the developer and genre of a game
    def parseGame(self, response):

        print("Here")

        item = response.meta['item']

        db1 = MySQLdb.connect("localhost", "root", "andy", "metacritic")
        cursor = db1.cursor()
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//div[@class="product_wrap"]')
        items = []

        item['dev'] = site.xpath('.//span[contains(@class, "summary_detail developer")]/span[1]/text()').extract()
        item['genre'] = site.xpath('.//span[contains(@class, "summary_detail product_genre")]/span[1]/text()').extract()

        cursor.execute("INSERT INTO ps4 (dev, genre) VALUES (%s,%s)",[item['dev'][0],item['genre'][0]])
        items.append(item)

        print item['dev']
        print item['genre']

    def parseps4(self, response):
        #some local variables
        db1 = MySQLdb.connect("localhost", "root", "andy", "metacritic")
        cursor = db1.cursor()
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//div[@class="product_wrap"]')
        items = []

        #iterates through each site
        for site in sites:
            with db1:
                item = MetacriticItem()

                #sets the item
                item['title'] = site.xpath('.//div[contains(@class, "basic_stat product_title")]/a/text()').extract()
                item['cscore'] = site.xpath('.//div[contains(@class, "basic_stat product_score brief_metascore")]/div[1]/text()').extract()
                item['uscore'] = site.xpath('.//div/ul/li/span[contains(@class, "data textscore")]/text()').extract()
                item['release'] = site.xpath('.//li[contains(@class, "stat release_date full_release_date")]/span[2]/text()').extract()

                #some processing to check if there is a score attached, if there is, it adds it to the database
                if ("tbd" in item['cscore'][0] and "tbd" not in item['uscore'][0]) or ("tbd" not in item['cscore'][0] and "tbd" in item['uscore'][0]) or ("tbd" not in item['cscore'][0] and "tbd" not in item['uscore'][0]):
                    cursor.execute("INSERT INTO ps4 (title, criticalscore, userscore, releasedate) VALUES (%s,%s,%s, %s)",[(' '.join(item['title'][0].split())).replace("(PS4)","",1),item['cscore'][0],item['uscore'][0],item['release'][0]])
                    items.append(item)

                itemLink = site.xpath('.//div[contains(@class, "basic_stat product_title")]/a/@href').extract()

                req = Request('http://www.metacritic.com' + itemLink[0], callback = self.parseGame)
                req.meta['item'] = item

Source

2014-04-01 AndyOHart

It looks like you forgot to put 'yield' before 'Request('http://www.metacritic.com' + itemLink[0], callback=self.parseGame)'. – alecxe

@alecxe I tried that and unfortunately it does not work. Any other ideas? – AndyOHart

There is at least one more problem. In 'parseGame', 'item' is not defined. You need to pass 'item' from 'parseps4' to 'parseGame' in 'meta': see http://doc.scrapy.org/en/latest/topics/request-response.html#scrapy.http.Request. – alecxe

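There are several issues in the code:

The meta argument should contain a dictionary: {'item': item}
HtmlXPathSelector is deprecated - use Selector instead
I don't think you should do the MySQL inserts inside the spider - use a database pipeline instead:
- Writing items to a MySQL database in Scrapy
You need to take the first element of the extract() result and call strip() on it (this way the field holds a string rather than a list, with no leading or trailing whitespace or newlines)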
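
A minimal sketch of that last point, not taken from the original code: extract() returns a list of strings, so indexing with [0] raises IndexError when the XPath matches nothing. The helper name first_stripped and its default parameter are hypothetical, and plain lists stand in for the output of extract() so the snippet runs without Scrapy.

```python
def first_stripped(values, default=None):
    """Return the first extracted string with surrounding whitespace removed.

    `values` stands in for the list returned by extract(); `default` is
    returned when that list is empty. Both names are illustrative only.
    """
    if not values:
        return default
    return values[0].strip()


# Plain lists stand in for site.xpath(...).extract() results.
print(first_stripped(["\n      Angry Birds Star Wars\n   "]))
print(first_stripped([], default="tbd"))
```

In the spider this would replace the '.extract()[0].strip()' chain wherever a field may be missing from the page.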

Here is the code without the MySQL-related calls:

from string import lowercase

from scrapy.spider import BaseSpider
from scrapy.http import Request
from scrapy.selector import Selector

from metacritic.items import MetacriticItem


class MetacriticSpider(BaseSpider):
    name = 'metacritic'
    allowed_domains = ['metacritic.com']

    max_id = 1 # your max_id value goes here!!!

    def start_requests(self):
        # one listing request per letter and page number
        for c in lowercase:
            for i in range(self.max_id):
                yield Request('http://www.metacritic.com/browse/games/title/ps4/{0}?page={1}'.format(c, i), callback=self.parseps4)

    def parseGame(self, response):
        # the item started in parseps4 arrives through the request meta
        item = response.meta['item']
        sel = Selector(response)
        site = sel.xpath('//div[@class="product_wrap"]')

        # get additional data!!!

        yield item

    def parseps4(self, response):
        sel = Selector(response)
        sites = sel.xpath('//div[@class="product_wrap"]')
        for site in sites:
            item = MetacriticItem()
            # take the first match of each field and strip whitespace and newlines
            item['title'] = site.xpath('.//div[contains(@class, "basic_stat product_title")]/a/text()').extract()[0].strip()
            item['cscore'] = site.xpath('.//div[contains(@class, "basic_stat product_score brief_metascore")]/div[1]/text()').extract()[0].strip()
            item['uscore'] = site.xpath('.//div/ul/li/span[contains(@class, "data textscore")]/text()').extract()[0].strip()
            item['release'] = site.xpath('.//li[contains(@class, "stat release_date full_release_date")]/span[2]/text()').extract()[0].strip()

            link = site.xpath('.//div[contains(@class, "basic_stat product_title")]/a/@href').extract()[0]
            # the href is expected to start with '/', so the base URL carries no trailing slash
            yield Request('http://www.metacritic.com' + link, meta={'item': item}, callback=self.parseGame)

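It works for me - I see the items yielded from parseGame() on the console.

Make sure the items come through first, then look at the !!! comments and fill in the lines that follow them accordingly.

After that, once you see the items on the console, try creating a database pipeline to write the items to MySQL.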
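
A rough sketch of what such a pipeline could look like, under stated assumptions. Scrapy calls open_spider, process_item and close_spider on each enabled pipeline, so the class needs no Scrapy import. To keep the example self-contained, sqlite3 from the standard library stands in for MySQL here; with MySQLdb the connection call would change and the placeholders would be %s instead of ?. The class name SQLitePipeline and the db_path parameter are hypothetical. The table and column names are borrowed from the INSERT statements in the question, and the sample developer and genre values are placeholders, not real data.

```python
import sqlite3


class SQLitePipeline(object):
    """Item pipeline sketch: one connection per crawl, one row per item."""

    def __init__(self, db_path=":memory:"):
        self.db_path = db_path  # hypothetical parameter, not a Scrapy setting
        self.conn = None

    def open_spider(self, spider):
        # Called once when the crawl starts: connect and make sure the table exists.
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS ps4 ("
            "title TEXT, criticalscore TEXT, userscore TEXT, "
            "releasedate TEXT, dev TEXT, genre TEXT)"
        )

    def process_item(self, item, spider):
        # Called for every item the spider yields; missing fields become NULL.
        self.conn.execute(
            "INSERT INTO ps4 "
            "(title, criticalscore, userscore, releasedate, dev, genre) "
            "VALUES (?, ?, ?, ?, ?, ?)",
            (item.get("title"), item.get("cscore"), item.get("uscore"),
             item.get("release"), item.get("dev"), item.get("genre")),
        )
        self.conn.commit()
        return item  # hand the item on to any later pipeline

    def close_spider(self, spider):
        # Called once when the crawl ends.
        self.conn.close()


# Drive the pipeline by hand with a plain dict in place of a MetacriticItem.
pipeline = SQLitePipeline()
pipeline.open_spider(spider=None)
pipeline.process_item(
    {"title": "Angry Birds Star Wars", "cscore": "tbd", "uscore": "tbd",
     "release": "placeholder date", "dev": "placeholder developer",
     "genre": "placeholder genre"},
    spider=None,
)
print(pipeline.conn.execute("SELECT title, dev, genre FROM ps4").fetchall())
pipeline.close_spider(spider=None)
```

In a real project the class would be enabled through the ITEM_PIPELINES setting, and the spider would only yield items, with no database code of its own.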

Source

2014-04-01 20:55:50 alecxe

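Thanks a lot for the help, man, and thanks for writing all the code! I have tried it, but I still cannot see anything from parseGame in the console. I put print("Hello") in parseGame to see whether the spider enters it, but nothing is printed. How can I make sure it is yielding items? – AndyOHart

@AndyOHart hm, if you put prints into 'start_requests' and 'parseps4', do you see them? – alecxe

Yes, I do see those. It prints the title in ps4parse and prints Start in start_requests. – AndyOHart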
