2017-02-02 121 views

Trying a NOT IN-like clause in Pig

select * from A where A.ID NOT IN (select id from B) (in sql) 

sourcenew = LOAD 'hdfs://HADOOPMASTER:54310/DVTTest/Source.txt' USING PigStorage(',') as (ID:int,Name:chararray,FirstName:chararray ,LastName:chararray,Vertical_Name:chararray ,Vertical_ID:chararray,Gender:chararray,DOB:chararray,Degree_Percentage:chararray ,Salary:chararray,StateName:chararray); 
destnew = LOAD 'hdfs://HADOOPMASTER:54310/DVTTest/Destination.txt' USING PigStorage(',') as (ID:int,Name:chararray,FirstName:chararray ,LastName:chararray,Vertical_Name:chararray ,Vertical_ID:chararray,Gender:chararray,DOB:chararray,Degree_Percentage:chararray ,Salary:chararray,StateName:chararray); 
c= FOREACH destnew GENERATE ID; 
D=FILTER sourcenew BY NOT ID (c.ID); 
org.apache.pig.tools.pigscript.parser.ParseException: Encountered " <PATH> "D=FILTER "" at line 1, column 1. 
Was expecting one of: 
<EOF> 
"cat" ... 
"clear" ...<EOF> 

Any help resolving this error would be appreciated. I get it when executing the last line.


Think about cogrouping the two relations by ID and filtering for the groups that have no match – 54l3d
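The cogroup idea from the comment above could be sketched roughly as follows. This is only an illustration, not code from the thread: it assumes the `sourcenew` and `destnew` relations are already loaded as in the question, and the aliases `grouped`, `unmatched`, and `result` are hypothetical names.

```pig
-- Sketch only: assumes sourcenew and destnew were loaded as shown in the question.
-- Group both relations by ID; each output tuple holds one bag per input relation.
grouped = COGROUP sourcenew BY ID, destnew BY ID;
-- Keep the groups whose destnew bag is empty, i.e. IDs that do not occur in destnew.
unmatched = FILTER grouped BY IsEmpty(destnew);
-- Flatten the sourcenew bag to get the original source rows back.
result = FOREACH unmatched GENERATE FLATTEN(sourcenew);
DUMP result;
```

`IsEmpty` is a Pig built-in that tests whether a bag is empty, so the filter retains only source IDs with no counterpart in `destnew`, which is the NOT IN behaviour asked for.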

Answers

1

Use a LEFT OUTER JOIN and filter for nulls

sourcenew = LOAD 'hdfs://HADOOPMASTER:54310/DVTTest/Source.txt' USING PigStorage(',') as (ID:int,Name:chararray,FirstName:chararray ,LastName:chararray,Vertical_Name:chararray ,Vertical_ID:chararray,Gender:chararray,DOB:chararray,Degree_Percentage:chararray ,Salary:chararray,StateName:chararray); 
destnew = LOAD 'hdfs://HADOOPMASTER:54310/DVTTest/Destination.txt' USING PigStorage(',') as (ID:int,Name:chararray,FirstName:chararray ,LastName:chararray,Vertical_Name:chararray ,Vertical_ID:chararray,Gender:chararray,DOB:chararray,Degree_Percentage:chararray ,Salary:chararray,StateName:chararray); 
c = FOREACH destnew GENERATE ID; 
d = JOIN sourcenew BY ID LEFT OUTER,destnew by ID; 
e = FILTER d BY destnew::ID is null; 

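Note: I wrote a sample script with a couple of test files, and the following is a working solution. In your case, check whether the data is being loaded correctly from your files.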

test1.txt

1 abc 
2 def 
3 ghi 
4 jkl 
5 mno 
6 pqr 
7 stu 
8 vwx 
1 abc 
2 def 
3 ghi 
4 jkl 
1 abc 
2 def 
3 ghi 
1 abc 
2 def 

test2.txt

1 
2 
3 
4 

Script

A = LOAD 'test1.txt' USING PigStorage('\t') AS (aid:int,name:chararray); 
B = LOAD 'test2.txt' USING PigStorage('\t') AS (bid:int); 
C = JOIN A BY aid LEFT OUTER,B BY bid; 
D = FILTER C BY bid is null; 
DUMP D; 

So, in the example above, records 5, 6, 7, and 8 should be in the result, because these IDs are not in test2.txt.

Output


ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias d. Backend error: org.apache.pig.backend.executionengine.ExecException: ERROR 0: Scalar has more than one row in the output. 1st : (1), 2nd :(2) (common cause: "JOIN" then "FOREACH ... GENERATE foo.bar" should be "foo::bar") @inquisitive_mind – Vickyster


I even tried d = FILTER sourcenew BY NOT(sourcenew.ID == c.ID); – Vickyster


@Vickyster, I have edited the answer and also included an example. Hope it helps. –