2016-07-17

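Apache Pig outputs null values when loading int data types

I am using pig-0.16.0 and trying to join two tab-delimited files (.tsv) with a Pig script. Some of the columns hold integers, so I tried to load them as int. However, every column I declare as int comes back without data: it shows up as null. My join produced no output, so I took a step back and found that the problem already occurs at the load step. Here is the relevant part of my Pig script: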

REGISTER /usr/local/pig/lib/piggybank.jar; 
-- $0 = streaminputs/forum_node.tsv 
-- $1 = streaminputs/forum_users.tsv 
u_f_n = LOAD '$file1' USING PigStorage('\t') AS (id: long, title: chararray, tagnames: chararray, author_id: long, body: chararray, node_type: chararray, parent_id: long, abs_parent_id: long, added_at: chararray, score: int, state_string: chararray, last_edited_id: long, last_activity_by_id: long, last_activity_at: chararray, active_revision_id: int, extra:chararray, extra_ref_id: int, extra_count:int, marked: chararray); 

LUFN = LIMIT u_f_n 10; 

STORE LUFN INTO 'pigout/LN'; 

u_f_u = LOAD '$file2' USING PigStorage('\t') AS (author_id: long, reputation: chararray, gold: chararray, silver: chararray, bronze: chararray); 

LUFUU = LIMIT u_f_u 10; 

STORE LUFUU INTO 'pigout/LU'; 

I also tried long, but got the same problem; only chararray seems to work here. So what could the problem be?

Excerpts from the two input .tsv files:

forum_nodes.tsv:

"id" "title" "tagnames" "author_id" "body" "node_type" "parent_id" "abs_parent_id" "added_at" "score" "state_string" "last_edited_id" "last_activity_by_id" "last_activity_at" "active_revision_id" "extra" "extra_ref_id" "extra_count" "marked" 
"5339" "Whether pdf of Unit and Homework is available?" "cs101 pdf" "100000458" "" "question" "\N" "\N" "2012-02-25 08:09:06.787181+00" "1" "" "\N" "100000921" "2012-02-25 08:11:01.623548+00" "6922" "\N" "\N" "204" "f" 

forum_users.tsv:

"user_ptr_id" "reputation" "gold" "silver" "bronze" 
"100006402" "18" "0" "0" "0" 
"100022094" "6354" "4" "12" "50" 
"100018705" "76" "0" "3" "4" 
"100021176" "213" "0" "1" "5" 
"100045508" "505" "0" "1" "5" 

I suggest you [edit](http://stackoverflow.com/posts/38421717/edit) your question to add a small portion of the input files, so that other users can try to reproduce the problem (see also [MCVE](http://stackoverflow.com/help/mcve)). – lfurini


Looking at the data you shared, the problem is that the values are strings because they are quoted, i.e. "18" is a chararray ... –

Answer

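You need to remove the quotes so that Pig can recognize the value as an int; otherwise the field will come back empty. You should use CSVLoader or CSVExcelStorage. See my tests:

Sample file: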

"1","test" 

Test 1 - using CSVLoader:

You can use CSVLoader or CSVExcelStorage if you want the quotes to be ignored - see example here

Pig commands:

register '/usr/lib/pig/piggybank.jar' ; 
define CSVLoader org.apache.pig.piggybank.storage.CSVLoader(); 
file1 = load 'file1.txt' using CSVLoader(',') as (f1:int, f2:chararray); 

Output:

(1,test) 

Test 2 - replacing the double quotes:

Pig commands:

file1 = load 'file1.txt' using PigStorage(','); 
data = foreach file1 generate REPLACE($0,'\\"','') as (f1:int) ,$1 as (f2:chararray); 

Output:

(1,"test") 

Test 3 - using the data as is:

Pig commands:

file1 = load 'file1.txt' using PigStorage(',') as (f1:int, f2:chararray); 

Output:

(,"test") 
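
The tests above use a comma-separated sample, while the files in the question are tab-separated and have a header row. A minimal sketch of the quote-replacement idea from Test 2 applied to forum_users.tsv could look like the following. It assumes the file really contains the literal double quotes shown in the excerpt; the alias names `raw`, `no_header` and `users` are illustrative and not taken from the question.

```pig
-- Load every column as chararray first, so the quoted values survive the load.
raw = LOAD '$file2' USING PigStorage('\t')
      AS (author_id: chararray, reputation: chararray, gold: chararray,
          silver: chararray, bronze: chararray);

-- Drop the header row; its first field is the literal "user_ptr_id", quotes included.
no_header = FILTER raw BY author_id != '"user_ptr_id"';

-- REPLACE takes a regular expression; '"' matches a literal double quote.
-- Once the quotes are gone, the chararray values can be cast to numeric types.
users = FOREACH no_header GENERATE
            (long) REPLACE(author_id, '"', '')  AS author_id,
            (int)  REPLACE(reputation, '"', '') AS reputation,
            (int)  REPLACE(gold, '"', '')       AS gold,
            (int)  REPLACE(silver, '"', '')     AS silver,
            (int)  REPLACE(bronze, '"', '')     AS bronze;

DESCRIBE users;
```

The same pattern should carry over to forum_nodes.tsv. Values such as \N that are still non-numeric after the quotes are stripped are expected to come out of the cast as null, which is probably the desired result for missing ids.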

Wow, thanks for the tip, BigDataLearner. –