2017-05-16 60 views

Movie dataset analysis using Pig

I have the following datasets from a movie database:

Ratings: UserID, MovieID, Rating :: Movies: MovieID, Title :: Users: UserID, Gender, Age

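Now I have to join these three datasets and determine which movie is rated highest among women and lowest among men, and vice versa. I have already done the JOIN: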

myusers = LOAD '/user/cloudera/movies/input/users.dat' 
    USING PigStorage(':') 
    AS (user:int, n1, gender:chararray, n2, age:int); 

ratings = LOAD '/user/cloudera/movies/input/ratings.dat' 
    USING PigStorage(':') 
    AS (user:int, n1, movie:int, n2, rating:int); 

movies = LOAD '/user/cloudera/movies/input/movies.dat' 
    USING PigStorage(':') 
    AS (movie:int,n1,title:chararray); 

data = JOIN ratings BY user, myusers BY user; 
data2= JOIN data BY ratings::movie, movies BY movie; 

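After this, however, I ran into a number of problems, such as "ERROR 0: Scalar has more than one row in the output", when I try to print columns from data2. Any ideas to help me complete this task?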

Answer


After the following step

data = JOIN ratings BY user, myusers BY user; 

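use gender as a filter to build two datasets, one for male users and one for female users. Then order each dataset by rating and take the maximum and minimum of each.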

male = FILTER data by gender == 'M'; -- Use the gender value for male 
female = FILTER data by gender == 'F'; 
m_max = LIMIT (ORDER male by rating DESC) 1; 
f_max = LIMIT (ORDER female by rating DESC) 1; 
m_min = LIMIT (ORDER male by rating ASC) 1; 
f_min = LIMIT (ORDER female by rating ASC) 1;
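
Note that the statements above return a single rating row per gender, not a movie-level result. If the goal is the movie with the highest or lowest average rating per gender, a sketch along the following lines may help. It assumes the `data` and `movies` relations from the question; the names `rated`, `grouped`, `avg_by_gender`, `titled`, `named`, `avg_rating`, `f_top` and `m_bottom` are made up for illustration. It also refers to joined columns with the `::` prefix, since writing a column as `relation.column` in a FOREACH over another relation is treated by Pig as a scalar lookup and is a common cause of the "Scalar has more than one row in the output" error.

```pig
-- Sketch only: all relation and alias names introduced here are illustrative.
-- After a JOIN, qualify fields with '::' to say which input they came from.
rated = FOREACH data GENERATE
    ratings::movie  AS movie,
    myusers::gender AS gender,
    ratings::rating AS rating;

-- Average rating for every (movie, gender) pair.
grouped = GROUP rated BY (movie, gender);
avg_by_gender = FOREACH grouped GENERATE
    FLATTEN(group) AS (movie, gender),
    AVG(rated.rating) AS avg_rating;

-- Attach the movie titles loaded in the question.
titled = JOIN avg_by_gender BY movie, movies BY movie;
named = FOREACH titled GENERATE
    movies::title             AS title,
    avg_by_gender::gender     AS gender,
    avg_by_gender::avg_rating AS avg_rating;

female_avg = FILTER named BY gender == 'F';
male_avg   = FILTER named BY gender == 'M';

-- Highest-rated movie among women, lowest-rated among men.
f_top    = LIMIT (ORDER female_avg BY avg_rating DESC) 1;
m_bottom = LIMIT (ORDER male_avg BY avg_rating ASC) 1;

DUMP f_top;
DUMP m_bottom;
```

Swapping `DESC` and `ASC` gives the reverse case. When several movies tie on the average, LIMIT keeps only one of them, so filtering out movies with very few ratings first may give more meaningful results.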