2017-10-20

Scraping a JSON web page

I'm very new to web scraping and am having some trouble scraping NBA player data from nba.com. I first tried to scrape the page with bs4, but ran into a problem; after some research I believe it is due to "XHR", based on an article I read. I was able to find the URL for the data in JSON format, but my Python program seems to get stuck and never loads the data. Again, I'm quite new to web scraping, but I thought I'd check whether I'm off track here... any suggestions? Thanks! (Code below.)

import requests 
import json 

url = "http://stats.nba.com/stats/leaguedashplayerstats?College=&Conference=&Country=&DateFrom=&DateTo=&Division=&DraftPick=&DraftYear=&GameScope=&GameSegment=&Height=&LastNGames=0&LeagueID=00&Location=&MeasureType=Base&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=PerGame&Period=0&PlayerExperience=&PlayerPosition=&PlusMinus=N&Rank=N&Season=2017-18&SeasonSegment=&SeasonType=Regular+Season&ShotClockRange=&StarterBench=&TeamID=0&VsConference=&VsDivision=&Weight=" 

resp = requests.get(url=url) 
data = json.loads(resp.text) 
print(data) 

Why not look at a library to help? https://github.com/seemethere/nba_py — or at least look at how they do it? – corn3lius


Hadn't found that yet, thanks, I'll take a look! – johankent30

Answers


Give this a shot. It will print all the categories from that page based on the fields I picked out. By the way, you got no response on your initial attempt because the site expects a User-Agent header in your request, to make sure the request comes from a real browser and not from a bot. I spoofed it, and that solved it.

import requests 

url = "http://stats.nba.com/stats/leaguedashplayerstats?College=&Conference=&Country=&DateFrom=&DateTo=&Division=&DraftPick=&DraftYear=&GameScope=&GameSegment=&Height=&LastNGames=0&LeagueID=00&Location=&MeasureType=Base&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=PerGame&Period=0&PlayerExperience=&PlayerPosition=&PlusMinus=N&Rank=N&Season=2017-18&SeasonSegment=&SeasonType=Regular+Season&ShotClockRange=&StarterBench=&TeamID=0&VsConference=&VsDivision=&Weight=" 
resp = requests.get(url,headers={'User-Agent':'Mozilla/5.0'}) 
data = resp.json() 

storage = data['resultSets']
for elem in storage:
    all_list = elem['rowSet']

    for item in all_list:
        Player_Id = item[0]
        Player_name = item[1]
        Team_Id = item[2]
        Team_abbr = item[3]
        print("Player_Id: {} Player_name: {} Team_Id: {} Team_abbr: {}".format(
            Player_Id, Player_name, Team_Id, Team_abbr))
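As a side note, the positional indexing above (`item[0]`, `item[1]`, ...) can be made self-describing by zipping each row with the result set's `headers` list, which the stats.nba.com response also carries alongside `rowSet`. A minimal sketch, using a made-up sample payload that only imitates the response shape:

```python
# Pair each row with its column names, assuming the usual
# stats.nba.com response shape: resultSets -> headers + rowSet.
# The sample payload below is invented for illustration only.
sample = {
    "resultSets": [
        {
            "headers": ["PLAYER_ID", "PLAYER_NAME", "TEAM_ID", "TEAM_ABBREVIATION"],
            "rowSet": [
                [201939, "Stephen Curry", 1610612744, "GSW"],
            ],
        }
    ]
}

def rows_as_dicts(data):
    """Yield one dict per row, keyed by the result set's header names."""
    for result in data["resultSets"]:
        headers = result["headers"]
        for row in result["rowSet"]:
            yield dict(zip(headers, row))

records = list(rows_as_dicts(sample))
print(records[0]["PLAYER_NAME"])  # Stephen Curry
```

This way the code keeps working even if the endpoint reorders its columns.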

I tried your approach with this url: "http://www.enciclovida.mx/explora-por-region/especies-por-grupo?utf8=✓&grupo_id=Plantas&region_id=&parent_id=&pagina=&nombre=" — I always get a 500. Any ideas on how to adapt it? –


I can't reproduce any error. It still works. – SIM


Hmm, I tried setting a region_id and I got results back, but when it comes to paging (pagina=) I only get the first 10, and there should be more than 500 pages; I see that in the basketball example all the data is on the same page. Any tips? –


Just realized it was because the User-Agent header was missing... once that is added, it works.
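The same fix can be sketched with only the standard library: building a request that carries a browser-like User-Agent (the endpoint path is taken from the question; actually opening the request with `urlopen` is omitted here, since it needs network access):

```python
from urllib.request import Request

# Attach a browser-like User-Agent so the stats endpoint does not
# reject the request as coming from a bot.
url = "http://stats.nba.com/stats/leaguedashplayerstats"
req = Request(url, headers={"User-Agent": "Mozilla/5.0"})

# urllib normalizes header names to "Capitalized" form internally.
print(req.get_header("User-agent"))  # Mozilla/5.0
```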


You can also use r.json() directly, as shown [here](http://docs.python-requests.org/en/master/). – Thecave3