2014-03-19

Pig - join not working

I have a problem with a join in Pig. Let me start with some background. Here is my code:

-- START file loading 
start_file = LOAD 'dir/start_file.csv' USING PigStorage(';') as (PARTRANGE:chararray,  COD_IPUSER:chararray); 

-- trim 
A = FOREACH start_file GENERATE TRIM(PARTRANGE) AS PARTRANGE, TRIM(COD_IPUSER) AS COD_IPUSER; 

dump A; 

This gives the output:

(79.92.147.88,20140310) 
(79.92.147.88,20140310) 
(109.31.67.3,20140310) 
(109.31.67.3,20140310) 
(109.7.229.143,20140310) 
(109.8.114.133,20140310) 
(77.198.79.99,20140310) 
(77.200.174.171,20140310) 
(77.200.174.171,20140310) 
(109.17.117.212,20140310) 

Loading the other file:

-- Load the Hadopi search file 
file2 = LOAD 'dir/file2.csv' USING PigStorage(';') as (IP_RECHERCHEE:chararray, DATE_HADO:chararray); 

dump file2; 

The output looks like this:

(2014/03/10 00:00:00,79.92.147.88) 
(2014/03/10 00:00:01,79.92.147.88) 
(2014/03/10 00:00:00,192.168.2.67) 

Now I want to do a left outer join. Here is the code:

result = JOIN file2 by IP_RECHERCHEE LEFT OUTER, A by COD_IPUSER; 
dump result; 

The output looks like this:

(2014/03/10 00:00:00,79.92.147.88,,) 
(2014/03/10 00:00:00,192.168.2.67,,) 
(2014/03/10 00:00:01,79.92.147.88,,) 

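All the records from file2 are there, which is good, but nothing from start_file appears. It is as if the join had failed.

Do you know where the problem is?

Thanks.

Answers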

2

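You have mislabeled your fields in file2. You are calling the first field the IP and the second field the date, but as the dump shows, it is the other way around. Try FOREACH file2 GENERATE IP_RECHERCHEE and you will see the field you are actually trying to join on.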
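The check suggested above can be sketched as follows. The alias names join_key_file2 and join_key_A are only illustrative. The second projection, on A, is not part of this answer; it is added on the assumption that the same mislabeling applies to start_file, as its dump suggests.

```pig
-- Project only the field used as the join key on the file2 side.
join_key_file2 = FOREACH file2 GENERATE IP_RECHERCHEE;
DUMP join_key_file2;

-- Same check on the other side of the join (assumed to be mislabeled too).
join_key_A = FOREACH A GENERATE COD_IPUSER;
DUMP join_key_A;
```

Going by the dumps in the question, the first projection should show timestamps and the second should show 20140310 rather than IP addresses, so the two join keys can never be equal.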

1

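The result is as expected. You are doing a left outer join, which looks for matches between the IP_RECHERCHEE field of file2 and the COD_IPUSER field of A.
Since there are no matches, it returns every row of file2 and fills A's fields with nulls.
Clearly, 2014/03/10 00:00:00 != 20140310.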

1

Your field names are wrong, so you are joining on the wrong fields. It looks like you want to join on the IP address.

start_file = LOAD 'dir/start_file.csv' USING PigStorage(';') as (IP:chararray, PARTRANGE:chararray); 

A = FOREACH start_file GENERATE TRIM(IP) AS IP, TRIM(PARTRANGE) AS PARTRANGE; 

file2 = LOAD 'dir/file2.csv' USING PigStorage(';') as (DATE_HADO:chararray, IP:chararray); 
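
The join statement itself is not shown in this answer. Presumably it is the left outer join from the question with both sides keyed on the renamed IP field, roughly as in this sketch:

```pig
-- Assumed: the question's left outer join, now keyed on the IP field of each relation.
result = JOIN file2 BY IP LEFT OUTER, A BY IP;
DUMP result;
```

Because both relations now have a field called IP, the joined relation refers to them as file2::IP and A::IP if they need to be projected afterwards.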

This is what I get:

(2014/03/10 00:00:00,192.168.2.67,,) 
(2014/03/10 00:00:00,79.92.147.88,79.92.147.88,20140310) 
(2014/03/10 00:00:00,79.92.147.88,79.92.147.88,20140310) 
(2014/03/10 00:00:01,79.92.147.88,79.92.147.88,20140310) 
(2014/03/10 00:00:01,79.92.147.88,79.92.147.88,20140310)