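Creating rules for web crawling. I am scraping this page: http://www.metacritic.com/browse/games/title/ps4/a?view=condensed
I want to go into each item and get its developer and genre, but my code doesn't seem to work.
For example, I want to enter this page: http://www.metacritic.com/game/playstation-4/angry-birds-star-wars
then leave it and carry on doing the same for the remaining items, adding each one to the database. What can I change in my code to make this work? Right now the developer and genre columns in the database are empty, but the rest of the data is stored, so it is as if the spider never enters parseGame.
I also added print statements to parseGame, and none of them print.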
from scrapy.spider import BaseSpider
from scrapy.selector import Selector
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector
from metacritic.items import MetacriticItem
import MySQLdb
import re
from string import lowercase
class MetacriticSpider(BaseSpider):
    def start_requests(self):
        #iterate through ps4 pages
        for c in lowercase:
            for i in range(self.max_id):
                yield Request('http://www.metacritic.com/browse/games/title/ps4/{0}?page={1}'.format(c, i), callback = self.parseps4)
    #gets the developer and genre of a game
    def parseGame(self, response):
        print("Here")
        item = response.meta['item']
        db1 = MySQLdb.connect("localhost", "root", "andy", "metacritic")
        cursor = db1.cursor()
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//div[@class="product_wrap"]')
        items = []
        item['dev'] = site.xpath('.//span[contains(@class, "summary_detail developer")]/span[1]/text()').extract()
        item['genre'] = site.xpath('.//span[contains(@class, "summary_detail product_genre")]/span[1]/text()').extract()
        cursor.execute("INSERT INTO ps4 (dev, genre) VALUES (%s,%s)",[item['dev'][0],item['genre'][0]])
        items.append(item)
        print item['dev']
        print item['genre']
    def parseps4(self, response):
        #some local variables
        db1 = MySQLdb.connect("localhost", "root", "andy", "metacritic")
        cursor = db1.cursor()
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//div[@class="product_wrap"]')
        items = []
        #iterates through each site
        for site in sites:
            with db1:
                item = MetacriticItem()
                #sets the item
                item['title'] = site.xpath('.//div[contains(@class, "basic_stat product_title")]/a/text()').extract()
                item['cscore'] = site.xpath('.//div[contains(@class, "basic_stat product_score brief_metascore")]/div[1]/text()').extract()
                item['uscore'] = site.xpath('.//div/ul/li/span[contains(@class, "data textscore")]/text()').extract()
                item['release'] = site.xpath('.//li[contains(@class, "stat release_date full_release_date")]/span[2]/text()').extract()
                #some processing to check if there is a score attached, if there is, it adds it to the database
                if ("tbd" in item['cscore'][0] and "tbd" not in item['uscore'][0]) or ("tbd" not in item['cscore'][0] and "tbd" in item['uscore'][0]) or ("tbd" not in item['cscore'][0] and "tbd" not in item['uscore'][0]):
                    cursor.execute("INSERT INTO ps4 (title, criticalscore, userscore, releasedate) VALUES (%s,%s,%s, %s)",[(' '.join(item['title'][0].split())).replace("(PS4)","",1),item['cscore'][0],item['uscore'][0],item['release'][0]])
                    items.append(item)
                    itemLink = site.xpath('.//div[contains(@class, "basic_stat product_title")]/a/@href').extract()
                    req = Request('http://www.metacritic.com' + itemLink[0], callback = self.parseGame)
                    req.meta['item'] = item
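As a side note on the score check in parseps4: the three-way condition appears to be true in every case except when both scores are "tbd", so it can probably be written as a single negation. The sketch below (Python 3, with the helper names has_score and original_condition made up for illustration) compares the two forms over all four combinations; the sample score strings are placeholders, not real Metacritic data.

```python
def has_score(cscore, uscore):
    """True unless both the critic score and the user score are 'tbd'."""
    return not ("tbd" in cscore and "tbd" in uscore)


def original_condition(cscore, uscore):
    # The three-way test from parseps4, copied here for comparison.
    return (
        ("tbd" in cscore and "tbd" not in uscore)
        or ("tbd" not in cscore and "tbd" in uscore)
        or ("tbd" not in cscore and "tbd" not in uscore)
    )


# Check every combination of "has a score" / "is tbd" for both fields.
for c in ("tbd", "81"):
    for u in ("tbd", "7.5"):
        assert has_score(c, u) == original_condition(c, u)
```

In the spider this would be called as has_score(item['cscore'][0], item['uscore'][0]).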
It looks like you forgot to put 'yield' before Request('http://www.metacritic.com' + itemLink[0], callback = self.parseGame). – alecxe
@alecxe I tried that, but unfortunately it doesn't work. Any other ideas? – AndyOHart
There is at least one more problem. In 'parseGame', 'item' is not defined. You need to pass 'item' from 'parseps4' to 'parseGame' in 'meta': see http://doc.scrapy.org/en/latest/topics/request-response.html#scrapy.http.Request. – alecxe
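To illustrate why the missing yield matters, here is a minimal sketch in plain Python 3. It does not use Scrapy; FakeRequest and run are hypothetical stand-ins for Scrapy's Request and its scheduling loop, and the callbacks receive the request itself rather than a response. The idea it tries to show is that a request which is built but never handed back by the callback is never scheduled, so its callback never runs.

```python
from collections import deque


class FakeRequest:
    """Hypothetical stand-in for scrapy.http.Request (not the real class)."""

    def __init__(self, url, callback, meta=None):
        self.url = url
        self.callback = callback
        self.meta = meta or {}


def run(first_request):
    """Toy scheduler: only requests that a callback yields get processed."""
    queue = deque([first_request])
    visited = []
    while queue:
        request = queue.popleft()
        visited.append(request.url)
        # A callback may return None (no yield at all) or an iterable.
        for result in request.callback(request) or ():
            if isinstance(result, FakeRequest):
                queue.append(result)
    return visited


def parse_game(request):
    return ()  # nothing further to schedule


def parse_list_without_yield(request):
    # The request object is built and then simply discarded.
    req = FakeRequest("/game/playstation-4/angry-birds-star-wars", parse_game)
    req.meta["item"] = {"title": "placeholder"}


def parse_list_with_yield(request):
    req = FakeRequest("/game/playstation-4/angry-birds-star-wars", parse_game)
    req.meta["item"] = {"title": "placeholder"}
    yield req  # handing it back is what gets it scheduled


print(run(FakeRequest("/browse/games/title/ps4/a", parse_list_without_yield)))
# ['/browse/games/title/ps4/a']
print(run(FakeRequest("/browse/games/title/ps4/a", parse_list_with_yield)))
# ['/browse/games/title/ps4/a', '/game/playstation-4/angry-birds-star-wars']
```

In the first run the detail page is never visited, which matches the symptom in the question where the print statements in parseGame never fire.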
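A rough sketch of the meta hand-off described above, again in plain Python 3 without Scrapy. FakeResponse, parse_list and parse_game are hypothetical names, sqlite3 stands in for MySQLdb (so the placeholders are ? rather than %s), and the table is cut down to four columns. It also assumes one design change: as far as I can tell, the INSERT in the original parseGame would create a second row holding only dev and genre, so the sketch waits until the second callback and writes one complete row. The game data is made up.

```python
import sqlite3


class FakeResponse:
    """Hypothetical stand-in for a Scrapy response; only .data and .meta are modelled."""

    def __init__(self, data, meta=None):
        self.data = data  # pretend this is what the XPath queries returned
        self.meta = meta or {}


def parse_list(response):
    # First callback: collect what the listing page offers, no DB write yet.
    item = {"title": response.data["title"], "cscore": response.data["cscore"]}
    # In Scrapy this would roughly be:
    #   yield Request(url, callback=self.parseGame, meta={'item': item})
    return {"item": item}


def parse_game(response, cursor):
    # Second callback: pick the partial item back up and complete it.
    item = response.meta["item"]
    item["dev"] = response.data["dev"]
    item["genre"] = response.data["genre"]
    # One INSERT with every column, so no row is left with empty dev/genre.
    cursor.execute(
        "INSERT INTO ps4 (title, criticalscore, dev, genre) VALUES (?, ?, ?, ?)",
        (item["title"], item["cscore"], item["dev"], item["genre"]),
    )
    return item


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ps4 (title TEXT, criticalscore TEXT, dev TEXT, genre TEXT)")
cur = conn.cursor()

meta = parse_list(FakeResponse({"title": "Example Game", "cscore": "70"}))
parse_game(FakeResponse({"dev": "Example Studio", "genre": "Puzzle"}, meta), cur)
conn.commit()

rows = conn.execute("SELECT title, criticalscore, dev, genre FROM ps4").fetchall()
print(rows)
# [('Example Game', '70', 'Example Studio', 'Puzzle')]
```

If the early INSERT in parseps4 has to stay, an UPDATE keyed on the title in parseGame would be the other option; either way the second callback should not insert a fresh row.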