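Reading a JSON file with pandas for analysis in Python

I'm having some trouble loading a JSON file in my Python editor so that I can run some analysis on the data in it.

The JSON file is at the following location: 'C:\Users\Admin\JSON files\file1.JSON'

It contains the following tweet data (this is just one record; there are hundreds in there):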

{ 
    "created": "Fri Mar 13 18:09:33 GMT 2014", 
    "description": "Tweeting the latest Playstation news!", 
    "favourites_count": 4514, 
    "followers": 235, 
    "following": 1345, 
    "geo_lat": null, 
    "geo_long": null, 
    "hashtags": "", 
    "id": 2144411414, 
    "is_retweet": false, 
    "is_truncated": false, 
    "lang": "en", 
    "location": "", 
    "media_urls": "", 
    "mentions": "", 
    "name": "Playstation News", 
    "original_text": null, 
    "reply_status_id": 0, 
    "reply_user_id": 0, 
    "retweet_count": 4514, 
    "retweet_id": 0, 
    "score": 0.0, 
    "screen_name": "SevenPS4", 
    "source": "<a href=\"http://twitterfeed.com\" rel=\"nofollow\">twitterfeed</a>", 
    "text": "tweetinfohere", 
    "timezone": "Amsterdam", 
    "url": null, 
    "urls": "http://bit.ly/1lcbBW6", 
    "user_created": "2013-05-19", 
    "user_id": 13313, 
    "utc_offset": 3600 
}

I'm using the following code to try to test this data:

import json 
import pandas as pa 
z = pa.read_json('C:\Users\Admin\JSON files\file1.JSON') 
d = pa.DataFrame.from_dict([{k:v} for k,v in z.iteritems() if k in ["retweet_count", "user_id", "is_retweet"]]) 
print d.retweet_count.sum()

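When I run it, it reads the JSON file successfully and then prints out a list of the retweet_count values like this:

0, 4514 1, 300 2, 450 3, 139 etc. etc.

My questions: how do I actually sum up all of the retweet_count/user_id values, rather than just listing them as shown above?

How do I then divide this total by the number of entries to get an average?

How do I pick a sample size of the JSON data rather than using all of it? (I thought it was d.iloc[:10], but that doesn't work.)

Using the 'is_retweet' field in the JSON file, is it possible to count the number of false/true values? I.e., I want the number of tweets that were retweeted and the number that weren't.

Thanks in advance, and yes, I'm very new to this.

z.info() gives: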

<class 'pandas.core.frame.DataFrame'>
Int64Index: 506 entries, 0 to 505
Data columns (total 31 columns):
created             506 non-null object
description         506 non-null object
favourites_count    506 non-null int64
followers           506 non-null int64
following           506 non-null int64
geo_lat             10 non-null float64
geo_long            10 non-null float64
hashtags            506 non-null object
id                  506 non-null int64
is_retweet          506 non-null bool
is_truncated        506 non-null bool
lang                506 non-null object
location            506 non-null object
media_urls          506 non-null object
mentions            506 non-null object
name                506 non-null object
original_text       172 non-null object
reply_status_id     506 non-null int64
reply_user_id       506 non-null int64
retweet_id          506 non-null int64
retweet_count       506 non_null int64
score               506 non-null int64
screen_name         506 non-null object
source              506 non-null object
status_count        506 non-null int64
text                506 non-null object
timezone            415 non-null object
url                 273 non-null object
urls                506 non-null object
user_created        506 non-null object
user_id             506 non-null int64
utc_offset          506 non-null int64
dtypes: bool(2), float64(2), int64(11), object(16)

Why does it show retweet_count and user_id as object when I run d.info()?

Source

2014-04-02 user1745447

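df.info() shows the columns as non-null object, when I assume they need to be numeric values, right? How do I change them to values rather than objects? ' Int64Index: 2 entries, 0 to 1 Data columns (total 2 columns): retweet_count 1 non-null object user_id 1 non-null object dtypes: object(2)' – user1745447

What is the data type of 'z'? – myacobucci

Check my edit at the bottom @myacobucci – user1745447

d.retweet_count is a list of dictionaries of your retweet_counts, right?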

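So to get the sum:

keys = d.retweet_count.keys()   # index labels of the retweet_count column
total = 0                       # named total to avoid shadowing the built-in sum()
for key in keys:
    total += d.retweet_count[key]

To get the average:

avg = total / float(len(keys))  # float() avoids integer division on Python 2

To take a sample, just slice keys:

sample_keys = keys[0:10]

To get the mean of the sample:

sample_total = 0                # start from zero instead of reusing the full total
for key in sample_keys:
    sample_total += d.retweet_count[key]
sample_avg = sample_total / float(len(sample_keys))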
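
As a side note, if retweet_count is an ordinary numeric column (as the z.info() output in the question suggests it is on z), pandas can do all of this without an explicit loop. The sketch below is only an illustration: the small DataFrame is a hypothetical stand-in for the data loaded from file1.JSON, and all values except the first row are made up.

```python
import pandas as pd

# Hypothetical stand-in for the DataFrame loaded from file1.JSON.
z = pd.DataFrame({
    "retweet_count": [4514, 300, 450, 139],
    "user_id": [13313, 20001, 20002, 20003],
    "is_retweet": [False, True, False, False],
})

total = z["retweet_count"].sum()          # sum of the whole column
average = z["retweet_count"].mean()       # total divided by the number of rows
first_two = z["retweet_count"].iloc[:2]   # first two rows, selected by position

# Number of rows for each distinct value of is_retweet (True / False).
retweet_flags = z["is_retweet"].value_counts()

print(total, average, first_two.mean())
print(retweet_flags)
```

Here value_counts() covers the retweeted vs. not-retweeted question, and iloc[:n] is one way to work on only the first n rows.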

Source

2014-04-02 13:48:29 myacobucci

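I think that when I use the line 'd = pa.DataFrame.from_dict([{k:v} for k,v in z.iteritems() if k in ["retweet_count", "user_id", "is_retweet"]])' it turns my value columns into objects, so I can't run sum/average/etc. on them. Will try the sample size in a sec, thanks. – user1745447

When you run 'print d.retweet_count', what is the exact output? – myacobucci

I've fixed that now, thanks. Now I'm trying to use a sample of the JSON data rather than all of it. I have 'z = pa.read_json('C:\Users\Admin\JSON files\file1.JSON')', but I'd like to use samples of 100/200 of the retweet_count values so that I can find the mean/max/etc. for different sample sizes. I tried 'keys = z.retweet_count.keys() sample_keys = keys[:200]' and then called the mean with 'sample_keys.mean', but it doesn't work. Any ideas @myacobucci? Thanks – user1745447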
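
A likely explanation, assuming d was built exactly as in that line: each {k: v} dict stores a whole column (a Series) as a single cell, so the resulting frame holds object cells rather than numeric columns. One way around it is to select the columns by name, which keeps their dtypes. The sketch below uses a hypothetical stand-in for z with made-up rows; it is not the asker's actual data.

```python
import pandas as pd

# Hypothetical stand-in for z; only the column names come from the question.
z = pd.DataFrame({
    "retweet_count": [4514, 300, 450, 139],
    "user_id": [13313, 20001, 20002, 20003],
    "is_retweet": [False, True, False, False],
    "lang": ["en", "en", "nl", "en"],
})

# Selecting columns by name returns a new DataFrame and keeps each dtype.
d = z[["retweet_count", "user_id", "is_retweet"]]

print(d.dtypes)
print(d["retweet_count"].sum())
```

Note also that DataFrame.iteritems() was removed in pandas 2.0, so on a current install the original line would need items() instead.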
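
One probable reason this fails: keys() returns the index labels (0, 1, 2, ...) rather than the retweet counts, and 'sample_keys.mean' without parentheses only refers to a method instead of calling it. A minimal sketch of taking samples of different sizes directly from the column follows; the Series here is a hypothetical stand-in for z.retweet_count, filled with made-up values.

```python
import pandas as pd

# Hypothetical stand-in for z.retweet_count (the real column has 506 rows).
retweet_count = pd.Series(range(1, 501), name="retweet_count")

for n in (100, 200):
    sample = retweet_count.iloc[:n]   # first n values, selected by position
    print(n, sample.mean(), sample.max())

# A random sample instead of the first n rows; random_state fixes the draw
# so the result is repeatable.
random_sample = retweet_count.sample(n=100, random_state=0)
print(random_sample.mean())
```

Slicing with iloc[:n] always takes the same leading rows, while sample() draws rows at random, which may be closer to what is wanted when comparing statistics across sample sizes.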
