I am using the following query in Pig to load data from a CSV file that contains 50000 records. The CSV brings a large amount of data into Pig.
Q2 = LOAD '/home/user/q2.csv' using org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'YES_MULTILINE') as (Id:chararray,
PostTypeId:chararray,
AcceptedAnswerId:chararray,
ParentId:chararray,
CreationDate:chararray,
DeletionDate:chararray,
Score:chararray,
ViewCount:chararray,
Body:chararray,
OwnerUserId:chararray,
OwnerDisplayName:chararray,
LastEditorUserId:chararray,
LastEditorDisplayName:chararray,
LastEditDate:chararray,
LastActivityDate:chararray,
Title:chararray,
Tags:chararray,
AnswerCount:chararray,
CommentCount:chararray,
FavoriteCount:chararray,
ClosedDate:chararray,
CommunityOwnedDate:chararray);
Here the data is cleaned: \n and , are removed from the Body field, and a few more fields are cleaned as well.
Q2Clean = FOREACH Q2 GENERATE
Id as Id,
PostTypeId as PostTypeId,
AcceptedAnswerId as AcceptedAnswerId,
(chararray)REPLACE(ParentId,'"','') as ParentId,
CreationDate as CreationDate,
(chararray)REPLACE(DeletionDate,'"','') as DeletionDate,
Score as Score,
ViewCount as ViewCount,
(chararray)REPLACE(REPLACE(Body,'\n',''),',','') as Body,
OwnerUserId as OwnerUserId,
(chararray)REPLACE(OwnerDisplayName,'"','') as OwnerDisplayName,
LastEditorUserId as LastEditorUserId,
(chararray)REPLACE(LastEditorDisplayName,'"','') as LastEditorDisplayName,
LastEditDate as LastEditDate,
LastActivityDate as LastActivityDate,
(chararray)REPLACE(Title,',','') as Title,
(chararray)REPLACE(Tags,',','') as Tags,
AnswerCount as AnswerCount,
CommentCount as CommentCount,
FavoriteCount as FavoriteCount,
(chararray)REPLACE(ClosedDate,'"','') as ClosedDate,
(chararray)REPLACE(CommunityOwnedDate,'"','') as CommunityOwnedDate;
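The problem is that when I store the output, Pig reports 617538 rows written and creates two files. The first file has 27000 records in the correct format, but the second file is stored incorrectly: it contains about 610000 rows, many of which contain only a "," character. How can I load the data correctly so that the output shows 50000 rows instead of 617538?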
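One way to narrow this down, as a rough sketch rather than a fix: count the records in each relation before storing anything. It assumes the loaded relation is the one referred to as Q2 in the cleaning step; the aliases Q2All, Q2Rows, Q2CleanAll and Q2CleanRows are made up for this illustration. If the count right after the load is already far above 50000, the records are being split at load time; if it is 50000, the extra rows appear later in the pipeline.

```pig
-- Sketch only: count the tuples in each relation to see where 50000 becomes 617538.
-- Q2 and Q2Clean come from the scripts above; the other aliases are hypothetical.
Q2All = GROUP Q2 ALL;
Q2Rows = FOREACH Q2All GENERATE COUNT_STAR(Q2) AS n;
DUMP Q2Rows;

-- FOREACH ... GENERATE is one-to-one, so this count should match the one above.
Q2CleanAll = GROUP Q2Clean ALL;
Q2CleanRows = FOREACH Q2CleanAll GENERATE COUNT_STAR(Q2Clean) AS n;
DUMP Q2CleanRows;
```

COUNT_STAR is used instead of COUNT so that tuples whose first field is null are still counted.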
I tried adding another backslash when replacing \n, but it still showed the same number of records. – user6118910
@user6118910 Can you post some sample data? –