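It looks like you're missing a JOIN statement, which would join the two datasets (ratings and movies) on the movie ID column. I mocked up some test data and have provided some example code below.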
movie_avg.pig
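-- Load the comma-delimited ratings and movie files.
ratings = LOAD 'movie_ratings.txt' USING PigStorage(',') AS (user_id:chararray, movie_id:chararray, rating:int);
movies = LOAD 'movie_data.txt' USING PigStorage(',') AS (movie_id:chararray, genre:chararray);
-- Keep only the Action and War movies.
movies_filter = FILTER movies BY (genre MATCHES '.*Action.*' OR genre MATCHES '.*War.*');
-- Join the filtered movies to their ratings on movie_id (the step that was missing).
movies_join = JOIN movies_filter BY movie_id, ratings BY movie_id;
-- Keep just the movie ID and rating, disambiguating the joined columns with '::'.
movies_cleanup = FOREACH movies_join GENERATE movies_filter::movie_id AS movie_id, ratings::rating AS rating;
-- Group by movie and average the ratings in each group.
movies_group = GROUP movies_cleanup BY movie_id;
data = FOREACH movies_group GENERATE group, AVG(movies_cleanup.rating);
DUMP data;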
Output of movie_avg.pig
(Jarhead,3.0)
(Platoon,4.333333333333333)
(Die Hard,3.0)
(Apocolypse Now,4.5)
(Last Action Hero,2.0)
(Lethal Weapon,4.0)
movie_data.txt
Scrooged,Comedy
Apocolypse Now,War
Platoon,War
Guess Whos Coming To Dinner,Drama
Jarhead,War
Last Action Hero,Action
Die Hard,Action
Lethal Weapon,Action
My Fair Lady,Musical
Frozen,Animation
movie_ratings.txt
12345,Scrooged,4
12345,Frozen,4
12345,My Fair Lady,5
12345,Guess Whos Coming To Dinner,5
12345,Platoon,3
12345,Jarhead,2
23456,Platoon,5
23456,Apocolypse Now,4
23456,Die Hard,3
23456,Last Action Hero,2
34567,Lethal Weapon,4
34567,Jarhead,4
34567,Apocolypse Now,5
34567,Platoon,5
34567,Frozen,5
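As a rough cross-check of the script's logic, the same filter, join, group and average steps can be sketched in plain Python over the mock data above. This is only an illustration, not part of the Pig solution: the function name `average_ratings` and the constants `MOVIE_DATA` and `RATINGS` are made up for this sketch, and it assumes each movie ID appears only once in movie_data.txt.

```python
import re
from collections import defaultdict

# Mock data copied from movie_data.txt and movie_ratings.txt above.
MOVIE_DATA = """\
Scrooged,Comedy
Apocolypse Now,War
Platoon,War
Guess Whos Coming To Dinner,Drama
Jarhead,War
Last Action Hero,Action
Die Hard,Action
Lethal Weapon,Action
My Fair Lady,Musical
Frozen,Animation
"""

RATINGS = """\
12345,Scrooged,4
12345,Frozen,4
12345,My Fair Lady,5
12345,Guess Whos Coming To Dinner,5
12345,Platoon,3
12345,Jarhead,2
23456,Platoon,5
23456,Apocolypse Now,4
23456,Die Hard,3
23456,Last Action Hero,2
34567,Lethal Weapon,4
34567,Jarhead,4
34567,Apocolypse Now,5
34567,Platoon,5
34567,Frozen,5
"""


def average_ratings(movie_text, rating_text, genre_pattern=r"Action|War"):
    """Hypothetical helper mirroring the FILTER, JOIN, GROUP and AVG steps."""
    # FILTER: keep movie IDs whose genre contains Action or War.
    wanted = set()
    for line in movie_text.splitlines():
        movie_id, genre = line.split(",")
        if re.search(genre_pattern, genre):
            wanted.add(movie_id)

    # JOIN + GROUP: collect ratings only for movies that survived the filter.
    grouped = defaultdict(list)
    for line in rating_text.splitlines():
        _user_id, movie_id, rating = line.split(",")
        if movie_id in wanted:
            grouped[movie_id].append(int(rating))

    # AVG: mean rating per movie.
    return {movie: sum(vals) / len(vals) for movie, vals in grouped.items()}


if __name__ == "__main__":
    for movie_id, avg in sorted(average_ratings(MOVIE_DATA, RATINGS).items()):
        print(movie_id, avg)
```

Movies outside the two genres (for example Frozen) drop out at the filter step, which is why they never reach the averages even though they have ratings.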
Thanks a lot.. this seems to solve it :) – Maddy
You're welcome! Glad to help :) – JamCon