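exceptions.ValueError: Expecting property name: line 1 column 3 (char 2)
I have some Scrapy code that uses a regular expression to search a website's page source for a piece of non-standard source code, in the form of a dictionary, that contains my data. When this data is found, it is printed to the screen.
The table that shows this data to the user has multiple tabs. As the user moves between tabs, XHR requests refresh the underlying data. The second part of the code tries to print the dictionary that is returned when the user moves from the 'Overall' tab to the 'Home' tab on the following page: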
http://www.whoscored.com/Teams/32/
The code is here:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from scrapy.item import Item
from scrapy.spider import BaseSpider
from scrapy import log
from scrapy.cmdline import execute
from scrapy.utils.markup import remove_tags
import time
import re
import json
import requests
class ExampleSpider(CrawlSpider):
    name = "goal2"
    allowed_domains = ["whoscored.com"]
    start_urls = ["http://www.whoscored.com"]
    download_delay = 5
    rules = [Rule(SgmlLinkExtractor(allow=('\Teams'),deny=(),), follow=False, callback='parse_item')]

    def parse_item(self, response):
        match1 = re.search(re.escape("DataStore.prime('stage-player-stat', defaultTeamPlayerStatsConfigParams.defaultParams , ") \
                           + '(\[.*\])' + re.escape(");"), response.body) #regex to match initial data item
        if match1 is not None:
            playerdata1 = match1.group(1) #if match1 isn't empty then print the dictionary embedded in the source code of the page
            print '**********Players by team (Summary - Overall):**********'
            print '-' * 170
            for player in json.loads(playerdata1):
                print ("{TeamId},{PlayerId},{Name}".decode().format(**player))

        #submit xhr request to obtain the dictionary that contains the 'Home' data, rather than the 'Overall' data embedded in the source code.
        url = 'http://www.whoscored.com/stageplayerstatfeed'
        params = {
            'field': '1',
            'isAscending': 'false',
            'orderBy': 'Rating',
            'playerId': '-1',
            'stageId': '9155',
            'teamId': '32'
        }
        headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36',
                   'X-Requested-With': 'XMLHttpRequest',
                   'Host': 'www.whoscored.com',
                   'Referer': 'http://www.whoscored.com/Teams/32/'}
        response = requests.get(url, params=params, headers=headers)
        fixtures = response.json()
        print '**********Players by team (Summary - Home):**********'
        print '-' * 170
        for player in json.loads(fixtures): #print 'Home' dictionary here:
            print ("{TeamId},{PlayerId},{Name}".decode().format(**player))

execute(['scrapy','crawl','goal2'])
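This code throws an error saying that a string or buffer is expected. When I try to convert the variable 'fixtures' to a string before using it in the statement for player in json.loads(fixtures): I get an error saying: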
  File "C:\Python27\lib\json\__init__.py", line 338, in loads
    return _default_decoder.decode(s)
  File "C:\Python27\lib\json\decoder.py", line 366, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "C:\Python27\lib\json\decoder.py", line 382, in raw_decode
    obj, end = self.scan_once(s, idx)
exceptions.ValueError: Expecting property name: line 1 column 3 (char 2)
I assume the error is related to the statement .decode().format(**player)), but I'm not sure what needs to change.
Can anyone help?
Thanks
'fixtures' is already a Python object. Why are you passing it to 'json.loads()' **again**? – 2014-09-06 13:32:32
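To illustrate the point made in the comment, here is a minimal sketch that reproduces both errors without touching the network. It assumes the XHR endpoint returns a JSON list of player dictionaries; the sample body and its values are made up for illustration, and only the keys TeamId, PlayerId and Name come from the question. The helper try_loads is hypothetical. The sketch is written to run on Python 3, where the second error is a json.JSONDecodeError (a subclass of ValueError) with slightly different wording than the Python 2.7 message in the traceback.

```python
import json

# Hypothetical stand-in for the body of the XHR response (values are made up).
raw_body = '[{"TeamId": 32, "PlayerId": 1, "Name": "Example Player"}]'

# Roughly what response.json() does in requests: parse the JSON text once.
fixtures = json.loads(raw_body)


def try_loads(value):
    """Return the exception raised by json.loads(value), or None on success."""
    try:
        json.loads(value)
    except (TypeError, ValueError) as exc:
        return exc
    return None


# Passing the already parsed list back in fails: json.loads wants text.
already_parsed_error = try_loads(fixtures)

# str() yields the Python repr with single quotes, which is not valid JSON,
# so the decoder stops at the first key ("Expecting property name").
stringified_error = try_loads(str(fixtures))

# The likely fix: iterate over the parsed object directly.
lines = ["{TeamId},{PlayerId},{Name}".format(**player) for player in fixtures]
for line in lines:
    print(line)
```

If that assumption about the response holds, the second loop in the question can probably be written as for player in fixtures: with no further call to json.loads(). The .decode() call on the format string does not appear to be the cause of either error.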