2017-03-02 84 views
3

Python and JSON problem

I have a problem decoding JSON-formatted data.

Here is my data.

20110312010116730|{"place":{"country_code":"US","url":"http:\/\/api.twitter.com\/1\/geo\/id\/9fbe124c83c364fe.json","bounding_box":{"type":"Polygon","coordinates":[[[-78.894441,35.03811699],[-78.85501596,35.03811699],[-78.85501596,35.08142904],[-78.894441,35.08142904]]]},"place_type":"neighborhood","name":"Downtown Fayetteville","country":"United States","attributes":{},"id":"9fbe124c83c364fe","full_name":"Downtown Fayetteville, Fayetteville"},"user":{"is_translator":false,"listed_count":9,"statuses_count":3695,"profile_link_color":"9ede14","url":"http:\/\/www.facebook.com\/nicholasd.whitehead","following":null,"verified":false,"profile_sidebar_border_color":"a7ed11","contributors_enabled":false,"profile_use_background_image":true,"friends_count":354,"profile_background_color":"131516","description":" #TEAMDROID #TAURUS #TEAMRATCHET #TEAMFITTEDS !!!! \u2752Single \u2752Taken \u2714SLiCK","profile_background_image_url":"http:\/\/a2.twimg.com\/profile_background_images\/213719493\/lime_green_logo.jpg","created_at":"Thu Jun 18 21:07:16 +0000 2009","protected":false,"profile_image_url":"http:\/\/a0.twimg.com\/profile_images\/1263451862\/my_shirt_off_normal.jpg","follow_request_sent":null,"time_zone":"Eastern Time (US & Canada)","favourites_count":3,"profile_text_color":"b6e82c","location":"from the 252 to the 910","name":"\u015bl\u00ef\u00e7k \u0148\u00ef\u00e7k","show_all_inline_media":false,"geo_enabled":true,"notifications":null,"profile_sidebar_fill_color":"080808","screen_name":"infamous_SLiCK","id":48490066,"id_str":"48490066","lang":"en","profile_background_tile":true,"utc_offset":-18000,"followers_count":224},"coordinates":{"type":"Point","coordinates":[-78.883968,35.052185]},"text":"i dont even know who Sam & Ronnie is !!","in_reply_to_status_id":null,"truncated":false,"source":"\u003Ca href=\"http:\/\/twidroyd.com\" rel=\"nofollow\"\u003Etwidroyd\u003C\/a\u003E","favorited":false,"in_reply_to_screen_name":null,"in_reply_to_user_id":null,"created_at":"Sat Mar 
12 06:01:16 +0000 2011","in_reply_to_status_id_str":null,"geo":{"type":"Point","coordinates":[35.052185,-78.883968]},"contributors":null,"retweeted":false,"id":46450665555378176,"in_reply_to_user_id_str":null,"id_str":"46450665555378176","entities":{"urls":[],"user_mentions":[],"hashtags":[]},"retweet_count":0} 

I have more than 200 GB of data in this text format.

Here is my code.

tweets_data = []
tweets_file = open(tweets_data_path, "r").readlines()
for i,line in enumerate(tweets_file):
    if i%2 == 0:
        temp = line.split('|')
        tweet = json.loads(temp[1])
        #tweets_data.append(tweet)

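Here is my problem. I tried to decode the lines, but it failed. At first I thought the number at the start of each line was causing the error, so I tried to separate the number from the JSON data. It still does not work, because something unexpected shows up in my lists, like this: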

['20110312015935803', '{"place":{"country_code":"US","url":"http:\\/\\/api.twitter.com\\/1\\/geo\\/id\\/a1f2dacd80a51287.json","bounding_box":{"type":"Polygon","coordinates":[[[-73.002796,42.990631],[-72.866051,42.990631],[-72.866051,43.119106],[-73.002796,43.119106]]]},"place_type":"city","name":"Stratton","country":"United States","attributes":{},"id":"a1f2dacd80a51287","full_name":"Stratton, VT"},"user":{"follow_request_sent":null,"show_all_inline_media":false,"geo_enabled":true,"profile_link_color":"546080","url":"http:\\/\\/www.facebook.com\\/br.vivizanatta","following":null,"verified":false,"profile_sidebar_border_color":"bcc7e3","is_translator":false,"listed_count":0,"statuses_count":330,"profile_use_background_image":true,"profile_background_color":"2d313f","description":"Stay up to date with news, photos, videos, blog, bio and more from the brazilian journalist and photographer Vivian Zanatta.","contributors_enabled":false,"profile_background_image_url":"http:\\/\\/a2.twimg.com\\/profile_background_images\\/211883639\\/aspen1_829_49514.jpg","created_at":"Sat Jul 10 15:05:48 +0000 2010","friends_count":79,"protected":false,"profile_image_url":"http:\\/\\/a2.twimg.com\\/profile_images\\/1259071695\\/VIVI_DDH7912_normal.jpg","time_zone":"Eastern Time (US & Canada)","favourites_count":0,"profile_text_color":"537de6","location":"Washington, DC, USA","name":"Vivi Zanatta \\u2714","notifications":null,"profile_sidebar_fill_color":"191e2a","screen_name":"vivizanatta_","id":165082798,"id_str":"165082798","lang":"en","profile_background_tile":false,"utc_offset":-18000,"followers_count":83},"coordinates":{"type":"Point","coordinates":[-72.9053683,43.1134486]},"text":"I\'m at Stratton Mountain Ski Resort (5 Village Lodge Rd, Stratton Mountain) http:\\/\\/4sq.com\\/i3ULvp","in_reply_to_status_id":null,"truncated":false,"source":"\\u003Ca href=\\"http:\\/\\/foursquare.com\\" 
rel=\\"nofollow\\"\\u003Efoursquare\\u003C\\/a\\u003E","favorited":false,"in_reply_to_screen_name":null,"in_reply_to_user_id":null,"created_at":"Sat Mar 12 06:59:35 +0000 2011","in_reply_to_status_id_str":null,"geo":{"type":"Point","coordinates":[43.1134486,-72.9053683]},"contributors":null,"retweeted":false,"id":46465342800797698,"in_reply_to_user_id_str":null,"id_str":"46465342800797698","entities":{"hashtags":[],"urls":[{"indices":[76,97],"url":"http:\\/\\/4sq.com\\/i3ULvp","expanded_url":null}],"user_mentions":[]},"retweet_count":0}\n'] 
['\n'] 

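Suddenly ['\n'] appears. I guess that is because the records are separated by two '\n' characters. Anyway, when I use partition, the output looks like this: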

('20110312015935977', '|', '{"place":{"country_code":"US","url":"http:\\/\\/api.twitter.com\\/1\\/geo\\/id\\/b8b87894eb3d7849.json","bounding_box":{"type":"Polygon","coordinates":[[[-95.542521,29.670631],[-95.492419,29.670631],[-95.492419,29.694855],[-95.542521,29.694855]]]},"place_type":"neighborhood","name":"Braeburn","country":"United States","attributes":{},"id":"b8b87894eb3d7849","full_name":"Braeburn, Houston"},"user":{"profile_link_color":"ed0909","url":null,"following":null,"verified":false,"profile_sidebar_border_color":"f00505","follow_request_sent":null,"show_all_inline_media":true,"geo_enabled":true,"profile_use_background_image":true,"profile_background_color":"61b8c2","description":"#TeamPlaystation #TeamLRG #TeamAquarius and #PvNation .It bring me great pleasure to welcome the real and banish the Fake...","is_translator":false,"profile_background_image_url":"http:\\/\\/a2.twimg.com\\/profile_background_images\\/179334599\\/screwston7jsredc.jpg","listed_count":0,"statuses_count":163,"created_at":"Wed Dec 08 04:04:16 +0000 2010","protected":false,"profile_image_url":"http:\\/\\/a0.twimg.com\\/profile_images\\/1256895503\\/image_normal.jpg","time_zone":"Central America","favourites_count":2,"profile_text_color":"fa0505","location":"Houston, Tx","name":"Craig Irving","contributors_enabled":false,"notifications":null,"profile_sidebar_fill_color":"020303","screen_name":"xxMinion","id":224098461,"id_str":"224098461","lang":"en","profile_background_tile":true,"utc_offset":-21600,"friends_count":36,"followers_count":35},"coordinates":null,"text":"If your White or Mexican #WhoSaidItWasOk to say \\"whats up my nigga\\" and then call your homeboys the word Nigga lol","in_reply_to_status_id":null,"truncated":false,"source":"web","favorited":false,"in_reply_to_screen_name":null,"in_reply_to_user_id":null,"created_at":"Sat Mar 12 06:59:35 +0000 
2011","in_reply_to_status_id_str":null,"geo":null,"contributors":null,"retweeted":false,"id":46465343463505920,"in_reply_to_user_id_str":null,"id_str":"46465343463505920","entities":{"urls":[],"user_mentions":[],"hashtags":[{"indices":[25,40],"text":"WhoSaidItWasOk"}]},"retweet_count":0}\n') 
('\n', '', '') 

That is what shows up.

Oh, and my data is in gz format. How can I read it in Python without decompressing it first?

+0

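Ohhh, the error says ValueError: Unterminated string starting at: line 1 column 664 (char 663). Sorry, it is 11 PM here in Korea and I was so tired that I forgot to include it.

Answers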

4

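If your data contains a | character, split cuts the line into too many pieces and the JSON string is truncated.

You can use the maxsplit parameter: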

temp = line.split('|',1) 

or partition:

temp = line.partition('|') 

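(in this case use temp[2], because the separator is returned as well)

If there are still other problems, consider wrapping the processing of each line in a try/except block so that you can narrow down where the problem is.

Edit: following up on your edit, I also added protection against empty lines.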
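To make the difference concrete, here is a minimal sketch. The sample line is made up for illustration (it is not taken from your file); it only assumes that a | can occur inside the tweet text.

```python
import json

# Hypothetical record: timestamp, '|' separator, then JSON whose text contains '|'
line = '20110312010116730|{"text": "a|b", "retweet_count": 0}\n'

naive = line.split('|')        # splits on every '|', so the JSON is cut in two
limited = line.split('|', 1)   # splits on the first '|' only
parts = line.partition('|')    # (head, separator, tail), split at the first '|'

try:
    json.loads(naive[1])       # naive[1] is only the first half of the JSON
    naive_failed = False
except ValueError:
    naive_failed = True        # the truncated JSON cannot be decoded

tweet_from_split = json.loads(limited[1])
tweet_from_partition = json.loads(parts[2])
```

Both split with maxsplit=1 and partition give the same dictionary here; the plain split yields three pieces and the middle one is not valid JSON on its own.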

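import json

# tweets_data_path is defined elsewhere, as in the question
with open(tweets_data_path, "r") as tweets_file:
    for i, line in enumerate(tweets_file):
        if i % 2 == 0:
            try:
                # keep everything after the first '|'
                data = line.partition('|')[2]
                if data:  # empty for blank lines, so they are skipped
                    tweet = json.loads(data)
            except ValueError as e:
                print("Cannot parse '{}'".format(data))
                print("Error line {}: {}".format(i + 1, str(e)))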
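On the gz part of the question: the standard gzip module can open a compressed file in text mode and iterate over it line by line, so nothing has to be unpacked to disk and the 200 GB never sit in memory at once. The following is only a sketch under assumptions: the helper name iter_tweets is made up, the file is assumed to be UTF-8, and blank lines are skipped by content instead of by line parity.

```python
import gzip
import json

def iter_tweets(path):
    """Yield (line_number, tweet) pairs from a gzip file of 'timestamp|json' lines.

    Hypothetical helper; assumes the file is UTF-8 encoded.
    """
    # "rt" opens the compressed stream in text mode and decompresses on the fly
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            # keep everything after the first '|' and drop the trailing newline
            data = line.partition('|')[2].strip()
            if not data:
                continue  # blank separator line, or a line without '|'
            try:
                yield lineno, json.loads(data)
            except ValueError as e:
                # report the bad line and keep going
                print("Error line {}: {}".format(lineno, e))
```

You would use it as `for lineno, tweet in iter_tweets(tweets_data_path): ...` and process each tweet as it arrives, instead of appending everything to a list.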
+0

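The problem is that whenever I split them, something else pops up every time. I just removed the '|', and then two lists appeared: one with the number and the JSON data, and one with '\n'. These two lists show up over and over. When I removed the '\n', suddenly '' appeared. This is driving me crazy.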

+1

@YooInhyeok Don't panic. First, _isolate_ the problem.

+0

@YooInhyeok Check my edit; it should help make your code more robust and isolate the problematic lines.