2017-04-25 103 views

Filtering data after a join using Pig

I want to filter records after joining two files.

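The file BX-Books.csv contains book data, and the file BX-Book-Ratings.csv contains book rating data. ISBN is the column common to both files, and the inner join between the files is done on this column.
I want to get the books published in 2002.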

I used the script below, but I get 0 records.

grunt> BookXRecords = LOAD '/user/pradeep/BX-Books.csv' USING PigStorage(';') AS (ISBN:chararray,BookTitle:chararray,BookAuthor:chararray,YearOfPublication:chararray, Publisher:chararray,ImageURLS:chararray,ImageURLM:chararray,ImageURLL:chararray); 
grunt> BookXRating = LOAD '/user/pradeep/BX-Book-Ratings.csv' USING PigStorage(';') AS (user:chararray,ISBN:chararray,rating:chararray); 
grunt> BxJoin = JOIN BookXRecords BY ISBN, BookXRating BY ISBN; 
grunt> BxJoin_Mod = FOREACH BxJoin GENERATE $0 AS ISBN, $1, $2, $3, $4; 
grunt> FLTRBx2002 = FILTER BxJoin_Mod BY $3 == '2002'; 

What does DESCRIBE BxJoin_Mod output? And does your data actually contain records with YearOfPublication 2002? – Amit

grunt> DESCRIBE BxJoin_Mod; BxJoin_Mod: {ISBN: chararray,BookXRecords::BookTitle: chararray,BookXRecords::BookAuthor: chararray,BookXRecords::YearOfPublication: chararray,BookXRecords::Publisher: chararray} –

Yes, my data has records with YearOfPublication == 2002 –

Answers

I created a test.csv, a test-rating.csv, and a Pig script that follows the same steps as yours, and it works fine.

test.csv

1;abc;author1;2002 
2;xyz;author2;2003 

test-rating.csv

user1;1;3 
user2;2;5 

Pig script:

A = LOAD 'test.csv' USING PigStorage(';') AS (ISBN:chararray,BookTitle:chararray,BookAuthor:chararray,YearOfPublication:chararray); 
describe A; 
dump A; 

B = LOAD 'test-rating.csv' USING PigStorage(';') AS (user:chararray,ISBN:chararray,rating:chararray); 
describe B; 
dump B; 

C = JOIN A BY ISBN, B BY ISBN; 
describe C; 
dump C; 

D = FOREACH C GENERATE $0 as ISBN,$1,$2,$3; 
describe D; 
dump D; 

E = FILTER D BY $3 == '2002'; 
describe E; 
dump E; 

Output:

A: {ISBN: chararray,BookTitle: chararray,BookAuthor: chararray,YearOfPublication: chararray} 
(1,abc,author1,2002) 
(2,xyz,author2,2003) 
B: {user: chararray,ISBN: chararray,rating: chararray} 
(user1,1,3) 
(user2,2,5) 
C: {A::ISBN: chararray,A::BookTitle: chararray,A::BookAuthor: chararray,A::YearOfPublication: chararray,B::user: chararray,B::ISBN: chararray,B::rating: chararray} 
(1,abc,author1,2002,user1,1,3) 
(2,xyz,author2,2003,user2,2,5) 
D: {ISBN: chararray,A::BookTitle: chararray,A::BookAuthor: chararray,A::YearOfPublication: chararray} 
(1,abc,author1,2002) 
(2,xyz,author2,2003) 
E: {ISBN: chararray,A::BookTitle: chararray,A::BookAuthor: chararray,A::YearOfPublication: chararray} 
(1,abc,author1,2002) 
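
Because the same steps work on clean test data, the empty result on the real files probably comes from the data rather than from the logic. A minimal sketch for checking what the year field actually contains, assuming the BookXRecords relation from the question is already loaded; the alias names Years, DistinctYears, and YearPreview are made up for illustration:

```pig
-- Project only the year column from the relation loaded in the question.
Years = FOREACH BookXRecords GENERATE YearOfPublication;
-- Collapse to unique values so quoting or stray spaces become visible.
DistinctYears = DISTINCT Years;
-- Look at a small preview instead of dumping every value.
YearPreview = LIMIT DistinctYears 20;
DUMP YearPreview;
```

If the values show up wrapped in double quotes or padded with spaces, for example "2002" instead of 2002, a comparison against '2002' will not match.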

Requirement: get the books published in 2002.

This does not need two datasets. It can be done using BookXRecords alone.

grunt> BookXRecords = LOAD '/user/pradeep/BX-Books.csv' USING PigStorage(';') AS (ISBN:chararray,BookTitle:chararray,BookAuthor:chararray,YearOfPublication:chararray, Publisher:chararray,ImageURLS:chararray,ImageURLM:chararray,ImageURLL:chararray);
grunt> A = FILTER BookXRecords BY YearOfPublication == '2002';
grunt> DUMP A;
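
If the ratings are still needed, one option is to filter first and join afterwards, which also avoids positional references such as $3. The following is only a sketch, under the assumption (not confirmed in the question) that the year values may carry surrounding double quotes or spaces; BooksClean, Books2002, and Joined2002 are illustrative alias names, and BookXRecords and BookXRating are the relations loaded in the question:

```pig
-- Strip double quotes and surrounding whitespace from the year (assumed data issue).
BooksClean = FOREACH BookXRecords GENERATE
    ISBN, BookTitle, BookAuthor,
    TRIM(REPLACE(YearOfPublication, '"', '')) AS YearOfPublication,
    Publisher;
-- Filter by field name before the join so fewer rows reach the join.
Books2002 = FILTER BooksClean BY YearOfPublication == '2002';
-- Inner join on ISBN, as in the original script.
Joined2002 = JOIN Books2002 BY ISBN, BookXRating BY ISBN;
DUMP Joined2002;
```

Filtering before the join keeps the join input small, and referring to YearOfPublication by name means the filter does not depend on the column order produced by the join.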