2016-10-08 40 views

Pig: Counting the frequency of distinct tuples in a file

I have a file whose lines look like JSON objects:

{"child_pos": "NN", "parent_pos": "NN", "parent": "fighter", "child_dep": "nn", "parent_dep": "nsubj", "child": "virtua"} 
{"child_pos": "NN", "parent_pos": "NN", "parent": "case", "child_dep": "nn", "parent_dep": "nsubj", "child": "martin"} 
{"child_pos": "NN", "parent_pos": "NN", "parent": "fighter", "child_dep": "nn", "parent_dep": "nsubj", "child": "virtua"} 
{"child_pos": "NN", "parent_pos": "NN", "parent": "fighter", "child_dep": "nn", "parent_dep": "nsubj", "child": "virtua"} 
{"child_pos": "NN", "parent_pos": "NN", "parent": "case", "child_dep": "nn", "parent_dep": "nsubj", "child": "martin"} 

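I want to count how often each distinct JSON object occurs in the file. I have seen other answers that use GROUP BY and the COUNT() function in Pig. I am not sure whether I am using them correctly, but I am not getting the result I need. My output should look like this: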

{"child_pos": "NN", "parent_pos": "NN", "parent": "fighter", "child_dep": "nn", "parent_dep": "nsubj", "child": "virtua", "count": "3"} 
{"child_pos": "NN", "parent_pos": "NN", "parent": "case", "child_dep": "nn", "parent_dep": "nsubj", "child": "martin", "count": "2"} 

The order does not matter. Can someone give me some pointers?


Please share what you have already tried. Why do you think it does not work? – Mzf

Answer


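Here is code you can use. It groups the records on all of their fields. If you want the result in a different format, you can read the fields from the group tuple and emit them in whatever format you need.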

-- Load each JSON line using an explicit schema.
A = LOAD '/user/root/test12.json' USING JsonLoader('child_pos:chararray, parent_pos:chararray, parent:chararray, child_dep:chararray, parent_dep:chararray, child:chararray');
-- Group identical records by using every field as the grouping key.
B = GROUP A BY (child_pos, parent_pos, parent, child_dep, parent_dep, child);
-- Count the records in each group.
C = FOREACH B GENERATE group, COUNT(A.child_pos) AS COUNTX;
STORE C INTO 'user/data/json_out.json' USING JsonStorage();

The output is:
{"group":{"child_pos":"NN","parent_pos":"NN","parent":"case","child_dep":"nn","parent_dep":"nsubj","child":"martin"},"COUNTX":2}
{"group":{"child_pos":"NN","parent_pos":"NN","parent":"fighter","child_dep":"nn","parent_dep":"nsubj","child":"virtua"},"COUNTX":3}
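
The output above nests the grouped fields under "group", while the question asks for flat records with a count field. One possible variation, offered as a sketch rather than a tested script, is to FLATTEN the group tuple and rename its fields. It reuses relations A and B from the script above; the relation name D and the output path 'user/data/json_flat_out' are assumptions chosen only for illustration.

```pig
-- Sketch only: assumes relations A and B from the script above.
-- FLATTEN expands the group tuple into top-level fields, and the AS clause
-- gives them plain names instead of the default group::-prefixed names.
D = FOREACH B GENERATE
        FLATTEN(group) AS (child_pos, parent_pos, parent, child_dep, parent_dep, child),
        COUNT(A) AS count;
-- Hypothetical output location; change it to suit your cluster.
STORE D INTO 'user/data/json_flat_out' USING JsonStorage();
```

With this variation each stored record should carry the six original fields plus count at the top level. Note that COUNT returns a number, so count would be written as a numeric value rather than the quoted string shown in the question.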