2017-10-21 33 views
0

Regex with multiple pipes on a JSON file

I have the following command to fetch a JSON document on UNIX:

wget -q -O- https://www.reddit.com/r/NetflixBestOf/.json 

which (obviously with different results each time) gives me output in the following format:

{ 
"kind": "...", 
"data": { 
"modhash": "", 
"whitelist_status": "...", 
"children": [ 
e1, 
e2, 
e3, 
... 
], 
"after": "...", 
"before": "..." 
} 
} 

where each element of the children array is an object structured as follows:

{ 
"kind": "...", 
"data": { 
... 
} 
} 

Here is a full example of the .json output (the body is too long to post directly): https://pastebin.com/20p4kk3u

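I need to print the complete data object present in each element of the children array. I know I need to pipe at least twice, first to get children [...] and then data {...}. This is what I have so far: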

wget -q -O- https://www.reddit.com/r/NetflixBestOf/.json | tr -d '\r\n' | grep -oP '"children"\s*:\s*\[\s*\K({.+?})(?=\s*\])' | grep -oP '"data"\s*:\s*\K({.+?})(?=\s*},)' 

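I am new to regular expressions, so I am not sure how to handle the brackets or braces around the elements I am grepping. The line above prints nothing and I don't know why. Any help is appreciated.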

+2

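Are you open to using third-party utilities? I usually use the jq binary to parse JSON data easily. For your requirement, you only need to pass the JSON data to jq, which has its own query language: cat /tmp/data | jq '.data.children | .[]' (here /tmp/data contains the full JSON). With such utilities you can get the job done with shorter queries and advanced features (raw output, queries, etc.). – akskap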

+0

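Well, getting the data is not the only goal here; this time it happens to be in .json format, but I would like to know how to handle any file with regular expressions. –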

Answers

1

Code

wget -q -O- https://www.reddit.com/r/NetflixBestOf/.json | tr -d '\r\n' | grep -oP '"children"\s*:\s*\[\s*\K({.+?})(?=\s*\])' | grep -oP '"data"\s*:\s*\K({.+?})(?=\s*},)' 

Some notes on regular expressions

* == zero or more times
+ == one or more times
? == zero or one time
\s == a whitespace character: space, tab, carriage return, new line, vertical tab or form feed
\w == a word character: A to Z (upper or lower case), 0 to 9, or underscore (_)
\d == a digit from 0 to 9
\r == carriage return
\n == new line character (line feed)
\ == escapes a special character so it is read as a literal character
[...] == character class. Example: [abc] matches a or b or c
(?=) == positive lookahead, a type of zero-width assertion. The match must be followed by whatever is within the parentheses, but that part is not included in the match.
\K == resets the start of the match to this position; whatever was matched before \K is left out of the output.
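
To make \K and the lookahead more concrete, here is a minimal sketch. It assumes GNU grep built with -P support, and the sample string is made up for illustration (it is not taken from the Reddit feed).

```shell
# Hypothetical sample text, used only to illustrate the tokens above.
sample='key =   value123 ]'

# \K drops "key = " from the reported match; (?=\s*\]) requires a closing
# bracket to follow, but does not include it in the output.
kept=$(printf '%s\n' "$sample" | grep -oP 'key\s*=\s*\K\w+(?=\s*\])')
printf '%s\n' "$kept"

# \d+ matches one or more digits; -o prints only the matched part.
digits=$(printf '%s\n' "$sample" | grep -oP '\d+')
printf '%s\n' "$digits"
```

The first command prints value123 and the second prints 123.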

Anyway, you can read more about regular expressions here: Regex Tutorial

Now I can try to explain the code

wget downloads the source.
tr removes all line feeds and carriage returns, so the whole output is on one line and can be handled by grep.
grep -o prints only the matching part of the line.
grep -P enables Perl-compatible regular expressions.

So here
grep -oP '"children"\s*:\s*\[\s*\K({.+?})(?=\s*\])'
we have said:
"children" match the literal text "children"
\s* zero or more whitespace characters
: a literal colon
\s* zero or more whitespace characters
\[ escaped, so it is a literal [ and not the start of a character class
\s* zero or more whitespace characters
\K force the reported match to start from here
( open the group
{.+?} everything inside braces (the braces are part of the output because they come after \K; .+? is non-greedy, see greedy vs. non-greedy in the regex tutorial to understand how it works)
) close the group
(?=\s*\]) stop the match where zero or more whitespace characters are followed by a literal ], without including them in the match.
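
The walkthrough above can be tried offline on a small input. This is only a sketch: the one-line sample below is hypothetical and merely mimics the shape of the Reddit listing, and it assumes GNU grep with -P support.

```shell
# Hypothetical one-line sample that mimics the structure of the listing.
sample='{"kind": "Listing", "data": {"children": [ {"kind": "t3", "data": {"title": "A"}}, {"kind": "t3", "data": {"title": "B"}} ], "after": null}}'

# Stage 1: keep what lies between the [ after "children" and the first ]
# that follows a closing brace.
children=$(printf '%s\n' "$sample" | grep -oP '"children"\s*:\s*\[\s*\K({.+?})(?=\s*\])')
printf '%s\n' "$children"

# Stage 2: extract each "data" object that is followed by "},".
data=$(printf '%s\n' "$children" | grep -oP '"data"\s*:\s*\K({.+?})(?=\s*},)')
printf '%s\n' "$data"
```

On this sample, stage 1 prints both child objects, but stage 2 prints only {"title": "A"}: the last element is not followed by "}," so the lookahead never succeeds for it. Nested braces inside a real data object would end a match early in the same way, so the pattern is fragile compared with a JSON-aware tool such as the jq suggestion in the comments.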
+0

Thanks for the detailed explanation, very helpful. Follow-up question: what would be the difference if I used egrep instead of the Perl regex syntax? –

+0

Take a look here: https://en.wikipedia.org/wiki/Perl_Compatible_Regular_Expressions –
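
As a rough illustration of the difference: \K and (?=...) are Perl/PCRE features, and POSIX extended regular expressions (egrep, grep -E, sed -E) do not provide them. One possible workaround is a capture group with sed -E. This sketch assumes a hypothetical sample string and a data object with no nested braces.

```shell
# Hypothetical sample; the data object here contains no nested braces.
sample='"data" : {"title": "A"} ]'

# ERE has no \K or lookahead, so capture the wanted part in a group and
# replace the whole line with that group. [^}]* stands in for the lazy .+?
value=$(printf '%s\n' "$sample" | sed -E 's/.*"data"[[:space:]]*:[[:space:]]*(\{[^}]*\}).*/\1/')
printf '%s\n' "$value"
```

This prints {"title": "A"}.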

1

If you want to get the children array, try this, though I am not sure it is what you are looking for.

wget -O - https://www.reddit.com/r/NetflixBestOf/.json | sed -n '/children/,/],/p'
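
The sed address range is line-based, so it only isolates the array when the JSON is spread over several lines. A minimal offline sketch, assuming a hypothetical pretty-printed sample rather than the live feed:

```shell
# Hypothetical pretty-printed sample with one JSON token group per line.
sample='{
  "data": {
    "modhash": "",
    "children": [
      {"kind": "t3"},
      {"kind": "t3"}
    ],
    "after": null
  }
}'

# Print from the first line matching "children" up to the next line
# that contains "],".
block=$(printf '%s\n' "$sample" | sed -n '/children/,/],/p')
printf '%s\n' "$block"
```

Here the output is the four lines from "children": [ through the closing ],. If the whole document arrives as a single line, the range would select that entire line instead.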