2014-11-03

Parsing JSON with Elephant Bird in Pig

I can't get the following data to parse in Pig. It is what the Twitter API returns after fetching all the tweets from a given user.

Source data (I changed some of the digits so as not to invade anyone's privacy by accident):

[{"created_at":"Sat Nov 01 23:15:45 +0000 2014","id":5286804225,"id_str":"5286864225","text":"@Beace_ your nan makes me laugh with some of the things she comes out with","source":"\u003ca href=\"http:\/\/twitter.com\/download\/iphone\" rel=\"nofollow\"\u003eTwitter for iPhone\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":52812992878592,"in_reply_to_status_id_str":"522","in_reply_to_user_id":398098,"in_reply_to_user_id_str":"3","in_reply_to_screen_name":"Be_","user":{"id":425,"id_str":"42433395","name":"SAINS","screen_name":"sa3","location":"Lincoln","profile_location":null,"description":"","url":null,"entities":{"description":{"urls":[]}},"protected":false,"followers_count":92,"friends_count":526,"listed_count":0,"created_at":"Mon May 25 16:18:05 +0000 2009","favourites_count":6,"utc_offset":0,"time_zone":"London","geo_enabled":true,"verified":false,"statuses_count":19,"lang":"en","contributors_enabled":false,"is_translator":false,"is_translation_enabled":false,"profile_background_color":"EDECE9","profile_background_image_url":"http:\/\/abs.twimg.com\/images\/themes\/theme3\/bg.gif","profile_background_image_url_https":"https:\/\/abs.twimg.com\/images\/themes\/theme3\/bg.gif","profile_background_tile":false,"profile_image_url":"http:\/\/pbs.twimg.com\/profile_images\/52016\/DGDCj67z_normal.jpeg","profile_image_url_https":"https:\/\/pbs.twimg.com\/profile_images\/526\/DGDCj67z_normal.jpeg","profile_banner_url":"https:\/\/pbs.twimg.com\/profile_banners\/424395\/13743515","profile_link_color":"088253","profile_sidebar_border_color":"D3D2CF","profile_sidebar_fill_color":"E3E2DE","profile_text_color":"634047","profile_use_background_image":true,"default_profile":false,"default_profile_image":false,"following":false,"follow_request_sent":false,"notifications":false},"geo":null,"coordinates":null,"place":null,"contributors":null,"retweet_count":0,"favorite_count":1,"entities":{"hashtags":[],"symbols":[],"user_mentions":[{"screen_name":"e_","name":"\u2601\ufe0f effy","id":3998,"id_str":"398","indices":[0,15]}],"urls":[]},"favorited":false,"retweeted":false,"lang":"en"}, {another one goes here ....} ]

I have tried a lot of things, but this is the code I currently have:

REGISTER 'hdfs:///user/cloudera/elephant-bird-pig-4.1.jar'; 
REGISTER 'hdfs:///user/cloudera/elephant-bird-core-4.1.jar'; 
REGISTER 'hdfs:///user/cloudera/elephant-bird-hadoop-compat-4.1.jar'; 

--Load Json 
loadJson = LOAD '/user/cloudera/tweetwall' USING com.twitter.elephantbird.pig.load.JsonLoader() AS (json:map []); 

describe loadJson; 

--dump loadJson; 

--PARSING JSON 
--txt 
--a = FOREACH loadJson GENERATE json#'text' AS ParsedInput; 

dump loadJson; 

c = FOREACH loadJson GENERATE flatten(json#'text') as (m:map[]); 

When I don't get an error, I simply get nothing back (the script finishes and returns 0 bytes).

For example:

success! 

Input(s): 
Successfully read 0 records (532459 bytes) from: "/user/cloudera/tweetwall" 

Output(s): 
Successfully stored 0 records in: "hdfs://quickstart.cloudera:8020/tmp/temp-988640258/tmp-9" 

Counters: 
Total records written : 0 
Total bytes written : 0 
Spillable Memory Manager spill count : 0 
Total bags proactively spilled: 0 
Total records proactively spilled: 0 

Some interesting names they chose :) – simonzack 2014-11-03 15:52:16

Answer

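1. Your input JSON needs a root name, because its top-level element is an array rather than an object. As far as I can tell, JsonLoader treats each input line as one JSON object, which would explain why the bare array yields 0 records.
    I added "tweets" as the root name:
    {"tweets":[<your input>]}

2. This is nested JSON, so load the file with the loader's '-nestedLoad' option.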

input.json

{"tweets":[{"created_at":"Sat Nov 01 23:15:45 +0000 2014","id":5286804225,"id_str":"5286864225","text":"@Beace_ your nan makes me laugh with some of the things she comes out with","source":"\u003ca href=\"http:\/\/twitter.com\/download\/iphone\" rel=\"nofollow\"\u003eTwitter for iPhone\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":52812992878592,"in_reply_to_status_id_str":"522","in_reply_to_user_id":398098,"in_reply_to_user_id_str":"3","in_reply_to_screen_name":"Be_","user":{"id":425,"id_str":"42433395","name":"SAINS","screen_name":"sa3","location":"Lincoln","profile_location":null,"description":"","url":null,"entities":{"description":{"urls":[]}},"protected":false,"followers_count":92,"friends_count":526,"listed_count":0,"created_at":"Mon May 25 16:18:05 +0000 2009","favourites_count":6,"utc_offset":0,"time_zone":"London","geo_enabled":true,"verified":false,"statuses_count":19,"lang":"en","contributors_enabled":false,"is_translator":false,"is_translation_enabled":false,"profile_background_color":"EDECE9","profile_background_image_url":"http:\/\/abs.twimg.com\/images\/themes\/theme3\/bg.gif","profile_background_image_url_https":"https:\/\/abs.twimg.com\/images\/themes\/theme3\/bg.gif","profile_background_tile":false,"profile_image_url":"http:\/\/pbs.twimg.com\/profile_images\/52016\/DGDCj67z_normal.jpeg","profile_image_url_https":"https:\/\/pbs.twimg.com\/profile_images\/526\/DGDCj67z_normal.jpeg","profile_banner_url":"https:\/\/pbs.twimg.com\/profile_banners\/424395\/13743515","profile_link_color":"088253","profile_sidebar_border_color":"D3D2CF","profile_sidebar_fill_color":"E3E2DE","profile_text_color":"634047","profile_use_background_image":true,"default_profile":false,"default_profile_image":false,"following":false,"follow_request_sent":false,"notifications":false},"geo":null,"coordinates":null,"place":null,"contributors":null,"retweet_count":0,"favorite_count":1,"entities":{"hashtags":[],"symbols":[],"user_mentions":[{"screen_name":"e_","name":"\u2601\ufe0f effy","id":3998,"id_str":"398","indices":[0,15]}],"urls":[]},"favorited":false,"retweeted":false,"lang":"en"}]}
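The wrapping in step 1 can be scripted instead of done by hand. Here is a minimal Python sketch, assuming the raw API response is available as a string holding a JSON array. The function name `wrap_tweets` and its `root` parameter are made up for illustration and are not part of Elephant Bird. The result is written on a single line, on the assumption that the loader reads its input one line at a time.

```python
import json


def wrap_tweets(raw_response: str, root: str = "tweets") -> str:
    """Wrap a top-level JSON array in an object with a single root key.

    The result is serialized on one line: json.dumps adds no line breaks
    unless an indent is requested, and it escapes newlines inside strings.
    """
    tweets = json.loads(raw_response)
    if not isinstance(tweets, list):
        raise ValueError("expected a top-level JSON array of tweets")
    return json.dumps({root: tweets})


if __name__ == "__main__":
    # Tiny stand-in for the API response; real tweets carry many more fields.
    sample = '[{"id": 1, "text": "first"}, {"id": 2, "text": "second"}]'
    print(wrap_tweets(sample))
```

The printed line can be saved as input.json and uploaded to HDFS in place of the raw response.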

PigScript:

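-- Register Elephant Bird and the json-simple parser it depends on
REGISTER '/tmp/json-simple-1.1.jar';
REGISTER '/tmp/elephant-bird-hadoop-compat-4.1.jar';
REGISTER '/tmp/elephant-bird-pig-4.1.jar';

-- '-nestedLoad' makes the loader convert nested objects and arrays as well
loadJson = LOAD 'input.json' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') AS (json:map[]);
-- One row per element of the "tweets" array
B = FOREACH loadJson GENERATE FLATTEN(json#'tweets') AS (m:map[]);
-- Pull the text field out of each tweet map
C = FOREACH B GENERATE FLATTEN(m#'text');
DUMP C;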

Output: 
(@Beace_ your nan makes me laugh with some of the things she comes out with)
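
To sanity-check a file like input.json before submitting the Pig job, the same extraction can be mirrored locally. Here is a minimal Python sketch, assuming the file holds one root object per line. `tweet_texts` is a hypothetical helper; it only imitates what the Pig script does and does not call Elephant Bird.

```python
import json
from typing import Iterable, List


def tweet_texts(lines: Iterable[str], root: str = "tweets") -> List[str]:
    """Collect the 'text' field of every tweet stored under `root`."""
    texts = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        record = json.loads(line)
        if not isinstance(record, dict):
            # A bare top-level array has no root name to look up.
            raise ValueError("each line must be a JSON object")
        for tweet in record.get(root, []):
            texts.append(tweet["text"])
    return texts


if __name__ == "__main__":
    # Stand-in for the contents of input.json.
    sample = ['{"tweets": [{"text": "hello"}, {"text": "world"}]}']
    for text in tweet_texts(sample):
        print(text)
```

With a real file, pass the open file object instead of the sample list, for example `tweet_texts(open('input.json', encoding='utf-8'))`. A `ValueError` here points to the missing root name from step 1.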