2016-06-14 43 views

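I'm trying to use OpenRefine to analyze a dataset of messy JSON strings (40k lines), but because of JSON's unordered nature, some lines of JSON objects got mixed up when they were returned and logged to a file. Sticking strictly to JSON, how can I reorder the key-value pairs to a specific JSON schema for OpenRefine?

Some objects are missing keys, and some have their keys in the wrong order. For example: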

1 {"about":"foo", "category":"bar", "id":"123", "cat_list": ["category1":"foo2"]} 
2 {"id":"22","about":"barFoo", "category":"NotABar"} 
3 {"about":"barbar", "category":"website", "id":"3333", "cat_list": ["category1":"foo22"]} 
.... 
.... 
.... 
40,000 {"about":"bar123", "category":"publish", "id":"3323", "cat_list": ""} 

ISSUE:

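When I import the data into OpenRefine, the program asks for a specific schema to compare against as it reads the file. It then reads the supplied file, compares each JSON object on a line with the schema, and imports or discards the object depending on how well it matches. As a result, many entries are excluded!

Ideal situation:

Using Python, I would like to reorder the JSON objects into a specific schema that I specify.

Example:

Specified schema: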

{"about":"", "category":"", "id":"", "cat_list": ""} 

Each line of JSON, with its keys and values, would then be rearranged into this specific format:

1 {"about": .... 
2 {"about": .... 
3 {"about": .... 
.... 
.... 
.... 
40,000 {"about": .... 

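I'm not entirely sure how to do this efficiently?

Edit:

I decided to write a script to organize this. I removed some of the complex fields and now have a complete .JSON file: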

{"name":"Carstar Bridgewater", 
"category":"Automotive", 
"about":"We are Bridgewaters largest professional collision centre and are committed to being there for customer cars and communities when they need us.", 
"country":"Canada", 
"state":"NS", 
"city":"Bridgewater 
"}, 
{"name":"Febreze", 
"category":"Product/Service 
", 
"about":"Freshness that eliminates odorsso you can breathe happy.", 
"country":"Added Nothing", 
"state":"Added Nothing", 
"city":"Added Nothing"}, 
{"name":"Custom Wood & Acrylic Turnings", 
"category":"Professional Services", 
"about":"Hand crafted item turned on a wood lath pen pencil bottle stopper cork screw bottle opener perfume applicator or other custom turnings", 
"country":"Canada", 
"state":"NS 
", 
"city":"Middle Sackville"}, 
{"name":"The Hunger Games", 
"category":"Movie 
", 
"about":"THE HUNGER GAMES: MOCKINGJAY - PART 1 - In theatres November 2 2014. www.hungergamesmovie.ca", 
"country":"Added Nothing", 
"state":"Added Nothing", 
"city":"Added Nothing"}, 

However, Google Refine still refuses to accept my file. What am I doing wrong?

Objects in JSON have no inherent order; only arrays do. – Barmar

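Your 'cat_list' values are not valid JSON. Arrays can't contain 'key:value' pairs like that. And on line 40,000 the value is a string rather than an array, which probably violates the schema. I think the problems you're having are related to these issues, not to the order of the elements in the objects. – Barmar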

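As @Barmar said, your problem is probably not ordering-related. ... But if you're using the plain 'json' module, it simply emits the keys in whatever order dict.items() / dict.iteritems() provides, unless you ask it to sort them. You can use a collections.OrderedDict to remember insertion order, or write a dict wrapper that returns the keys in the order you want. – Wuggy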
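
A minimal sketch of the collections.OrderedDict idea from the comment above, assuming the schema order shown in the question. The names SCHEMA and reorder are hypothetical, and filling missing keys with an empty string is an assumption about what the importer expects, not something confirmed by the thread:

```python
import json
from collections import OrderedDict

# Hypothetical schema order, copied from the "specified schema" in the question.
SCHEMA = ["about", "category", "id", "cat_list"]

def reorder(obj, schema=SCHEMA, default=""):
    """Return a copy of obj with keys in schema order; missing keys get default."""
    ordered = OrderedDict((key, obj.get(key, default)) for key in schema)
    # Keep keys that are not in the schema at the end rather than dropping them.
    for key, value in obj.items():
        if key not in ordered:
            ordered[key] = value
    return ordered

line = '{"id":"22","about":"barFoo", "category":"NotABar"}'
print(json.dumps(reorder(json.loads(line))))
# {"about": "barFoo", "category": "NotABar", "id": "22", "cat_list": ""}
```

This only normalizes key order and fills gaps; it does not repair lines that are invalid JSON in the first place, which is the problem Barmar points out.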

Answers

0

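Not sure whether you ever solved this.

The JSON needs to be valid before it can be imported successfully. At the moment, the text you posted in the question above does not validate with tools such as http://jsonlint.com.

The problem you have with importing this into OpenRefine (a.k.a. Google Refine) is that the JSON objects must be inside an array: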
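
The same kind of check can be run locally. The following is a rough sketch using Python's standard json module instead of jsonlint.com; check_json is a hypothetical helper name, and the sample strings are cut-down versions of the text in the question:

```python
import json

def check_json(text):
    """Return None if text parses as JSON, else a description of the first error."""
    try:
        json.loads(text)
    except ValueError as err:  # json.JSONDecodeError is a subclass of ValueError
        return str(err)
    return None

# A raw line break inside a string, like the "city" value in the edited question.
broken = '{"city":"Bridgewater\n"}'
fixed = '{"city":"Bridgewater"}'

print(check_json(broken))  # an error message that includes the position
print(check_json(fixed))   # None
```

Several objects separated by commas with no surrounding brackets fail the same check, because a JSON document may hold only one top-level value.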

[{"name":"Carstar Bridgewater", 
"category":"Automotive", 
"about":"We are Bridgewaters largest professional collision centre and are committed to being there for customer cars and communities when they need us.", 
"country":"Canada", 
"state":"NS", 
"city":"Bridgewater"}, 
{"name":"Febreze", 
"category":"Product/Service", 
"about":"Freshness that eliminates odorsso you can breathe happy.", 
"country":"Added Nothing", 
"state":"Added Nothing", 
"city":"Added Nothing"}, 
{"name":"Custom Wood & Acrylic Turnings", 
"category":"Professional Services", 
"about":"Hand crafted item turned on a wood lath pen pencil bottle stopper cork screw bottle opener perfume applicator or other custom turnings", 
"country":"Canada", 
"state":"NS", 
"city":"Middle Sackville"}, 
{"name":"The Hunger Games", 
"category":"Movie", 
"about":"THE HUNGER GAMES: MOCKINGJAY - PART 1 - In theatres November 2 2014. www.hungergamesmovie.ca", 
"country":"Added Nothing", 
"state":"Added Nothing", 
"city":"Added Nothing"}] 

Pasting the JSON above into OpenRefine, I can import it successfully and it works fine (two screenshots of the import accompanied the original post).
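
The hand repair in this answer (wrapping the objects in an array and removing the stray line breaks inside the strings) could be scripted roughly as follows. This is a sketch under assumptions: to_array is a hypothetical helper, the input is assumed to be comma-separated objects like the file in the question, and stripping whitespace from every string value is my guess at the intended cleanup:

```python
import json

def to_array(text):
    """Wrap comma-separated JSON objects in [ ] and tidy their string values."""
    body = text.strip().rstrip(",")  # drop a trailing comma, if any
    # strict=False lets the parser accept raw line breaks inside strings.
    records = json.loads("[" + body + "]", strict=False)
    return [
        {k: v.strip() if isinstance(v, str) else v for k, v in rec.items()}
        for rec in records
    ]

# Cut-down sample with the same defects as the file in the question.
sample = ('{"name":"Febreze",\n"category":"Product/Service\n"},\n'
          '{"name":"The Hunger Games",\n"category":"Movie\n"},\n')
print(json.dumps(to_array(sample), indent=1))
```

The result of json.dumps is a single valid array, which is the shape this answer says the JSON importer accepts.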

That's really odd, but I've accepted your answer, even though I decided to write a tool to convert my large JSON file to CSV :) – Rob
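
Rob's tool is not shown in the thread. As an illustration only, a JSON-to-CSV conversion along those lines might look like this with the standard csv module; records_to_csv and the column list are hypothetical:

```python
import csv
import io
import json

def records_to_csv(records, fieldnames):
    """Write a list of dicts as CSV text, one column per name in fieldnames."""
    buf = io.StringIO()
    # restval fills in missing keys; extrasaction="ignore" drops unlisted keys.
    writer = csv.DictWriter(buf, fieldnames=fieldnames, restval="",
                            extrasaction="ignore", lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()

records = json.loads('[{"name":"Febreze","category":"Product/Service"},'
                     '{"name":"The Hunger Games","city":"Added Nothing"}]')
print(records_to_csv(records, ["name", "category", "city"]))
```

Because the column order comes from fieldnames, objects with missing or differently ordered keys still end up in consistent columns.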

0

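"When I import the data into Open Refine, the program asks for a specific schema to compare against as it reads the file."

That sounds like the file was accidentally detected as XML rather than JSON, or even as Lines.

However, you can choose the importer you want to use (such as Line-based or JSON) rather than accept the one OpenRefine guesses and picks automatically.

To my eyes, it looks like you may be dealing with the up-and-coming "JSON Lines" or "newline-delimited JSON" format, as documented here: http://jsonlines.org/

We have an open issue to eventually add JSON Lines support to OpenRefine: https://github.com/OpenRefine/OpenRefine/issues/1135

In the meantime, check the "On the Web" section of the jsonlines.org site to find tools that may help with your needs.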
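
Until that support exists, one workaround is to convert the file yourself. A small sketch, assuming one JSON object per line as in the original sample; jsonl_to_array is a hypothetical name, and reporting bad line numbers instead of discarding them silently is a design choice of this sketch:

```python
import json

def jsonl_to_array(lines):
    """Parse one JSON value per non-blank line; return (records, bad_line_numbers)."""
    records, bad = [], []
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        try:
            records.append(json.loads(line))
        except ValueError:
            bad.append(number)  # remember the line instead of silently dropping it
    return records, bad

lines = [
    '{"id":"22","about":"barFoo", "category":"NotABar"}',
    '{"about":"foo", "cat_list": ["category1":"foo2"]}',  # invalid: key:value inside an array
    '{"about":"bar123", "category":"publish", "id":"3323", "cat_list": ""}',
]
records, bad = jsonl_to_array(lines)
print(json.dumps(records))  # a single JSON array for the JSON importer
print(bad)                  # [2]
```

The list of bad line numbers shows which entries need fixing by hand before the array is imported.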

Hi Thad, please take a look at my edit. – Rob