2014-09-06 95 views
1

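Parsing a JSON element

I am using Scrapy and a regular expression to parse some non-standard web page source. I then want to parse the first element of the data that comes back: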

from scrapy.contrib.spiders import CrawlSpider, Rule 
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor 
from scrapy.selector import Selector 
from scrapy.item import Item 
from scrapy.spider import BaseSpider 
from scrapy import log 
from scrapy.cmdline import execute 
from scrapy.utils.markup import remove_tags 
import time 
import re 
import json 
import requests 


class ExampleSpider(CrawlSpider): 
    name = "goal2" 
    allowed_domains = ["whoscored.com"] 
    start_urls = ["http://www.whoscored.com"] 
    download_delay = 5 

    rules = [Rule(SgmlLinkExtractor(allow=('\Teams'),deny=(),), follow=False, callback='parse_item')] 

    def parse_item(self, response): 

     sel = Selector(response) 
     titles = sel.xpath("normalize-space(//title)") 
     print '-' * 170 
     myheader = titles.extract()[0] 
     print '********** Page Title:', myheader.encode('utf-8'), '**********' 
     print '-' * 170 

     match1 = re.search(re.escape("DataStore.prime('stage-player-stat', defaultTeamPlayerStatsConfigParams.defaultParams , ") \ 
        + '(\[.*\])' + re.escape(");"), response.body) 


     if match1 is not None: 
      playerdata1 = match1.group(1) 

      teamid = json.loads(playerdata1[0]) 
      print teamid 

One of the keys in the first element of `playerdata1` is called 'TeamId'. I thought the approach above would work, but I get the following error:

teamid = json.loads(playerdata1[0]) 
    File "C:\Python27\lib\json\__init__.py", line 338, in loads 
    return _default_decoder.decode(s) 
    File "C:\Python27\lib\json\decoder.py", line 366, in decode 
    obj, end = self.raw_decode(s, idx=_w(s, 0).end()) 
    File "C:\Python27\lib\json\decoder.py", line 382, in raw_decode 
    obj, end = self.scan_once(s, idx) 
exceptions.ValueError: Expecting object: line 1 column 1 (char 0) 

Can anyone see what the problem is?

+1

Are you expecting `match1.group(1)` to be a JSON string? Try `teamid = json.loads(playerdata1)[0]` instead? – shaktimaan 2014-09-06 19:05:54

+0

It would help if you could give us at least one sample URL to test against, one that has the `DataStore.prime` text in it. – 2014-09-06 19:17:49

+0

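@MartijnPieters OK, no problem... here is a link... view the source: http://www.whoscored.com/Teams/32/#team-squad-stats-offensive#team-squad-stats-offensive In this example I want the variable `teamid` to have the value `32`, which is the ID of the team on that page. Thanks – gdogg371 2014-09-06 19:19:10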

Answers

2

`match1.group(1)` returns a string. You then index into that string:

teamid = json.loads(playerdata1[0]) 

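Here, `[0]` gives you only the first character of the string. Remove the index expression so that the whole string is passed to `json.loads()`: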

teamid = json.loads(playerdata1) 
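
To see why the original call fails, here is a minimal sketch using a made-up stand-in for the matched text (the sample string is an assumption, not the real page data, and the snippet is written for Python 3 although the thread uses Python 2). Indexing a string yields a single character, and `[` on its own is not a complete JSON document:

```python
import json

# Hypothetical stand-in for match1.group(1); the real page data is much larger.
playerdata1 = '[{"TeamId": 32, "Name": "Example Player"}]'

first_char = playerdata1[0]   # indexing a str gives a single character
print(repr(first_char))

try:
    json.loads(first_char)    # "[" alone is an incomplete JSON array
    failed = False
except ValueError as exc:     # on Python 3, json.JSONDecodeError subclasses ValueError
    failed = True
    print("decode failed:", exc)

players = json.loads(playerdata1)  # parse the whole string instead
print(type(players).__name__, len(players))
```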

Now `teamid` is a list of player objects:

>>> len(teamid) 
22 
>>> teamid[0].keys() 
[u'FirstName', u'LastName', u'KnownName', u'Field', u'GameStarted', u'AerialWon', u'TeamRegionCode', u'SecondYellow', u'ShotsBlocked', u'TotalShots', u'Assists', u'Red', u'Name', u'PositionText', u'Ranking', u'PositionLong', u'PlayerId', u'SubOff', u'Dispossesed', u'TeamId', u'TotalTackles', u'TotalLongBalls', u'Goals', u'SubOn', u'WasDribbled', u'AerialLost', u'Turnovers', u'ShotsOnTarget', u'WSName', u'Fouls', u'ManOfTheMatch', u'Height', u'TeamName', u'RegionCode', u'TotalPasses', u'TotalThroughBalls', u'Dribbles', u'DateOfBirth', u'OwnGoals', u'WasFouled', u'TotalClearances', u'Rating', u'PlayedPositionsRaw', u'Weight', u'AccurateLongBalls', u'OffsidesWon', u'AccuratePasses', u'Yellow', u'KeyPasses', u'TotalCrosses', u'AccurateCrosses', u'IsCurrentPlayer', u'Age', u'PositionShort', u'AccurateThroughBalls', u'Interceptions', u'Offsides'] 
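
Since the parsed result is a list of dictionaries, the team ID asked about in the question can be read from the first entry by key. A minimal sketch, assuming every player entry carries the same `TeamId`; the sample data and the helper name `extract_team_id` are made up for illustration and written for Python 3:

```python
import json


def extract_team_id(raw_json):
    """Return the TeamId of the first player entry, or None if the list is empty."""
    players = json.loads(raw_json)   # list of dicts, one per player
    if not players:
        return None
    return players[0]["TeamId"]      # index the parsed list, then look up the key


# Hypothetical sample shaped like the page data (only two keys shown).
sample = '[{"TeamId": 32, "PlayerId": 1}, {"TeamId": 32, "PlayerId": 2}]'
teamid = extract_team_id(sample)
print(teamid)
```

In the spider from the question, this would correspond to calling `json.loads(playerdata1)[0]['TeamId']` on the matched text.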
+0

Hi, I'm not sure what you mean by 'remove the index expression so that the whole string is used'. Thanks... – gdogg371 2014-09-06 19:17:46

+0

@user3045351: With the URL you gave, the above works; it gives you a list of dictionaries, one per player. – 2014-09-06 19:22:19

+0

I'm still confused about how, in the example above, I can use that code to get 'TeamId' parsed as '32'. Thanks... – gdogg371 2014-09-06 19:32:52

0

When I need to query small parts of a complex JSON document, I often use ObjectPath.

It is a query language that looks similar to CSS selectors. See the examples at http://adriank.github.io/ObjectPath/