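Parsing a JSON element

I am using Scrapy and a regular expression to parse some non-standard web page source code. I then want to parse the first element of the dictionary that is returned: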
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from scrapy.item import Item
from scrapy.spider import BaseSpider
from scrapy import log
from scrapy.cmdline import execute
from scrapy.utils.markup import remove_tags
import time
import re
import json
import requests


class ExampleSpider(CrawlSpider):
    name = "goal2"
    allowed_domains = ["whoscored.com"]
    start_urls = ["http://www.whoscored.com"]
    download_delay = 5

    rules = [Rule(SgmlLinkExtractor(allow=('\Teams'),deny=(),), follow=False, callback='parse_item')]

    def parse_item(self, response):

        sel = Selector(response)
        titles = sel.xpath("normalize-space(//title)")
        print '-' * 170
        myheader = titles.extract()[0]
        print '********** Page Title:', myheader.encode('utf-8'), '**********'
        print '-' * 170

        match1 = re.search(re.escape("DataStore.prime('stage-player-stat', defaultTeamPlayerStatsConfigParams.defaultParams , ") \
                     + '(\[.*\])' + re.escape(");"), response.body)

        if match1 is not None:
            playerdata1 = match1.group(1)

            teamid = json.loads(playerdata1[0])
            print teamid
The key of the first element of 'playerdata1' is called 'TeamId'. I thought the approach above would work, but I get the following error:
    teamid = json.loads(playerdata1[0])
  File "C:\Python27\lib\json\__init__.py", line 338, in loads
    return _default_decoder.decode(s)
  File "C:\Python27\lib\json\decoder.py", line 366, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "C:\Python27\lib\json\decoder.py", line 382, in raw_decode
    obj, end = self.scan_once(s, idx)
exceptions.ValueError: Expecting object: line 1 column 1 (char 0)
Can anyone see what the problem is?
Are you expecting 'match1.group(1)' to be a JSON string? Try 'teamid = json.loads(playerdata1)[0]' instead? – shaktimaan 2014-09-06 19:05:54
It would help if you could at least give us a sample URL to test against, one with the 'DataStore.prime' text in it. – 2014-09-06 19:17:49
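@MartijnPieters OK, no problem... here is a link... view the source: http://www.whoscored.com/Teams/32/#team-squad-stats-offensive#team-squad-stats-offensive In this example I want the value of the variable 'teamid' to equal '32', which is the ID of the team on that page. Thanks – gdogg371 2014-09-06 19:19:10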
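
The suggestion in the first comment can be sketched as follows. 'playerdata1' is a string, so 'playerdata1[0]' is just its first character, '[', which is not a complete JSON document; that would explain the ValueError at char 0. Decoding the whole string first and indexing the resulting list afterwards avoids this. The sketch below is written to run on Python 3 and is not tested against the real site: 'sample_body', the "Name" field, and the 'extract_team_id' helper are hypothetical, made up for illustration. It also assumes the captured text is strict JSON; if the page actually embeds a JavaScript literal that is not valid JSON, 'json.loads' would still fail.

```python
import json
import re

# Hypothetical stand-in for response.body; the real page's data will differ.
sample_body = (
    "DataStore.prime('stage-player-stat', "
    "defaultTeamPlayerStatsConfigParams.defaultParams , "
    '[{"TeamId": 32, "Name": "Example Player"}]);'
)

# Same pattern as in the question: literal prefix, captured array, literal suffix.
pattern = (
    re.escape("DataStore.prime('stage-player-stat', "
              "defaultTeamPlayerStatsConfigParams.defaultParams , ")
    + r"(\[.*\])"
    + re.escape(");")
)


def extract_team_id(body):
    """Return the TeamId of the first record, or None if the pattern is absent."""
    match = re.search(pattern, body)
    if match is None:
        return None
    records = json.loads(match.group(1))  # decode the whole array first
    return records[0]["TeamId"]           # then index into the decoded list


team_id = extract_team_id(sample_body)
print(team_id)
```

Inside the spider, the same two steps would replace the 'json.loads(playerdata1[0])' line, with 'response.body' in place of the sample string.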