
Because of data type errors, I'm unable to total a bag of values. How can I properly enforce data types on Apache Pig?

When I open the csv file, its lines look like this:

6,"574","false","10.1.72.23","2010-05-16 13:56:19 +0930","fbcdn.net","static.ak.fbcdn.net","304","text/css","1","/rsrc.php/zPTJC/hash/50l7x7eg.css","http","pwong"

I load it using the following:

logs_base = FOREACH raw_logs GENERATE 
    FLATTEN(
    EXTRACT(line, '^(\\d+),"(\\d+)","(\\w+)","(\\S+)","(.+?)","(\\S+)","(\\S+)","(\\d+)","(\\S+)","(\\d+)","(\\S+)","(\\S+)","(\\S+)"') 
) 
    as (
    account_id: int, 
    bytes: long, 
    cached: chararray, 
    ip: chararray, 
    time: chararray, 
    domain: chararray, 
    host: chararray, 
    status: chararray, 
    mime_type: chararray, 
    page_view: chararray, 
    path: chararray, 
    protocol: chararray, 
    username: chararray 
); 

All the fields seem to load fine, and with the correct types, as shown by the "describe" command:

grunt> describe logs_base 
logs_base: {account_id: int,bytes: long,cached: chararray,ip: chararray,time: chararray,domain: chararray,host: chararray,status: chararray,mime_type: chararray,page_view: chararray,path: chararray,protocol: chararray,username: chararray} 

Whenever I execute a SUM using:

bytesCount = FOREACH (GROUP logs_base ALL) GENERATE SUM(logs_base.bytes); 

and store or dump the contents, the map-reduce process fails with the following error:

org.apache.pig.backend.executionengine.ExecException: ERROR 2106: Error while computing sum in Initial 
    at org.apache.pig.builtin.LongSum$Initial.exec(LongSum.java:87) 
    at org.apache.pig.builtin.LongSum$Initial.exec(LongSum.java:65) 
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:216) 
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:253) 
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:334) 
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:332) 
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:284) 
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:290) 
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:256) 
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:267) 
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:262) 
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:64) 
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144) 
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:771) 
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:375) 
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212) 
Caused by: java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Long 
    at org.apache.pig.builtin.LongSum$Initial.exec(LongSum.java:79) 
    ... 15 more 

The line that caught my attention is:

Caused by: java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Long 

This leads me to believe that the EXTRACT function is not casting the bytes field to the desired data type (long).

Is there a way to force the EXTRACT function to cast to the proper data types? How can I cast the values without having to do a FOREACH over all the records? (The same problem happens when converting the time to a unix timestamp and trying to find the MIN; I would definitely like to find a solution that does not require unnecessary projections.)

Any pointers would be appreciated. Thank you very much for your help.

Regards, Jorge C.

P.S. I'm running this in interactive mode on Amazon Elastic MapReduce.

Answer

Have you tried casting the data retrieved from the UDF? Applying the schema here does not perform any casting.

For example:

logs_base = 
    FOREACH raw_logs 
    GENERATE 
     FLATTEN(
      (tuple(INT,LONG,CHARARRAY,....)) EXTRACT(line, '^...') 
     ) 
     AS (account_id: INT, ...); 
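
For reference, here is a minimal sketch of what that cast looks like when spelled out against the full schema from the question. It assumes the same raw_logs relation, the same EXTRACT UDF, and the same regular expression used above; it only illustrates the technique from this answer and is not tested code.

-- Assumes raw_logs and the EXTRACT UDF are defined exactly as in the question.
-- The tuple cast lists a type for every captured group, in the same order as the
-- AS schema, so bytes is already a long by the time SUM sees it.
logs_base = FOREACH raw_logs GENERATE
    FLATTEN(
        (tuple(INT, LONG, CHARARRAY, CHARARRAY, CHARARRAY, CHARARRAY, CHARARRAY,
               CHARARRAY, CHARARRAY, CHARARRAY, CHARARRAY, CHARARRAY, CHARARRAY))
        EXTRACT(line, '^(\\d+),"(\\d+)","(\\w+)","(\\S+)","(.+?)","(\\S+)","(\\S+)","(\\d+)","(\\S+)","(\\d+)","(\\S+)","(\\S+)","(\\S+)"')
    )
    AS (
        account_id: int,
        bytes: long,
        cached: chararray,
        ip: chararray,
        time: chararray,
        domain: chararray,
        host: chararray,
        status: chararray,
        mime_type: chararray,
        page_view: chararray,
        path: chararray,
        protocol: chararray,
        username: chararray
    );

-- With the explicit cast in place, the aggregation from the question should no
-- longer fail with ERROR 2106:
bytesCount = FOREACH (GROUP logs_base ALL) GENERATE SUM(logs_base.bytes);

The point, as noted above, is that the AS clause only declares field names and types; it is the explicit tuple cast that actually converts each value coming out of the UDF.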

Thanks Romain, this worked perfectly. I was under the impression that applying the schema would instruct Pig to do an implicit cast. I wonder why the tutorial at http://aws.amazon.com/articles/2729 uses the aggregate SUM function without explicit casts on the loaded data... Thanks again. – mindonaut 2012-01-16 01:24:31