2013-07-10

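Pig MongoLoader exception when loading data with UUIDs

I want to load data from a Mongo collection that contains fields of type UUID stored in binary form (e.g. BinData(3, "/qHWF5hGQU+w6unYcTQxWw==")). The job fails with: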

org.apache.pig.backend.executionengine.ExecException: ERROR 2108: \ 
    Could not determine data type of field: 1423ed53-5064-0000-784b-7bf2e2dd837b

I built mongo-hadoop version 1.1 from the master branch (https://github.com/mongodb/mongo-hadoop). It works fine except when the documents contain UUIDs. My script and the error output are below. Any ideas?

register '/pig/lib/mongo-java-driver-2.9.3.jar'; 
register '/pig/lib/mongo-hadoop-core_cdh4.3.0-1.1.0.jar'; 
register '/pig/lib/mongo-hadoop-pig_cdh4.3.0-1.1.0.jar'; 
a = LOAD 'mongodb://localhost/TestDb.SocialUser' 
     USING com.mongodb.hadoop.pig.MongoLoader(); 
store a INTO 'a'; 

2013-07-10 15:03:35,630 [Thread-6] INFO org.apache.hadoop.mapred.LocalJobRunner - Map task executor complete. 
2013-07-10 15:03:35,632 [Thread-6] WARN org.apache.hadoop.mapred.LocalJobRunner - job_local402930066_0001 
java.lang.Exception: org.apache.pig.backend.executionengine.ExecException: ERROR 2108: Could not determine data type of field: 1423ed53-5064-0000-784b-7bf2e2dd837b 
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:404) 
Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 2108: \ 
    Could not determine data type of field: 1423ed53-5064-0000-784b-7bf2e2dd837b 
    at org.apache.pig.impl.util.StorageUtil.putField(StorageUtil.java:208) 
    at org.apache.pig.impl.util.StorageUtil.putField(StorageUtil.java:166) 
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigTextOutputFormat$PigLineRecordWriter.write(PigTextOutputFormat.java:68) 
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigTextOutputFormat$PigLineRecordWriter.write(PigTextOutputFormat.java:44) 
    at org.apache.pig.builtin.PigStorage.putNext(PigStorage.java:296) 
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat$PigRecordWriter.write(PigOutputFormat.java:139) 
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat$PigRecordWriter.write(PigOutputFormat.java:98) 
    at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.write(MapTask.java:558) 
    at org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:85) 
    at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.write(WrappedMapper.java:106) 
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapOnly$Map.collect(PigMapOnly.java:48) 
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:264) 
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:64) 
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:140) 
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:672) 
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330) 
    at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:266) 
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) 
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334) 
    at java.util.concurrent.FutureTask.run(FutureTask.java:166) 
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) 
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) 
    at java.lang.Thread.run(Thread.java:724) 
2013-07-10 15:03:39,235 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Ooops! Some job has failed! Specify -stop_on_failure if you want Pig to stop immediately on failure. 

Answer

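MongoLoader has a method, convertBSONtoPigType, that converts each value returned by the record reader into a Pig-compatible type. If the value is not one of the types the method recognizes (java.util.Date is among those it does handle), the method falls through to returning the object unchanged. Pig then cannot determine the data type of that object when it writes the field, and the job fails.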

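If you pass MongoLoader a schema that declares the UUID field with the Pig data type chararray, e.g.

load 'mongodb://mongoserver/db.collection' using MongoLoader('myguid:chararray')

the underlying Java code calls .toString() on the object (in this case a java.util.UUID) and outputs a normal UUID string.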
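To make the effect of the chararray schema concrete, here is a minimal standalone sketch. It does not use Pig or mongo-hadoop; the class UuidAsChararray and its two helpers are hypothetical names made up for illustration, and isSimplePigValue is only a rough stand-in for a writer that accepts text and numbers. The UUID value is the one from the error message in the question.

```java
import java.util.UUID;

public class UuidAsChararray {

    // Hypothetical stand-in for a writer that only understands text and numbers.
    // This is NOT Pig's actual type check, just an illustration of the idea.
    static boolean isSimplePigValue(Object o) {
        return o instanceof String || o instanceof Number;
    }

    // Hypothetical helper: roughly what declaring a field as chararray achieves,
    // namely turning the object into its string form.
    static Object toChararray(Object o) {
        return o == null ? null : o.toString();
    }

    public static void main(String[] args) {
        // The value from the error message, as a java.util.UUID object.
        Object raw = UUID.fromString("1423ed53-5064-0000-784b-7bf2e2dd837b");

        // A raw UUID object is neither a String nor a Number.
        System.out.println(isSimplePigValue(raw));        // prints false

        // After conversion it is a plain String in canonical UUID form.
        Object converted = toChararray(raw);
        System.out.println(isSimplePigValue(converted));  // prints true
        System.out.println(converted);                    // prints 1423ed53-5064-0000-784b-7bf2e2dd837b
    }
}
```

The point of the sketch is only that java.util.UUID.toString() yields the familiar 36-character form, so once the field is treated as a string there is nothing left for the output writer to reject.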

Alternatively, you could change the convertBSONtoPigType method itself to do the same thing, e.g.

@SuppressWarnings("unchecked")
public static Object convertBSONtoPigType(final Object o) throws ExecException {
    if (o == null) {
        return null;
    } else if (o instanceof Number || o instanceof String) {
        return o;
    } else if (o instanceof Date) {
        return ((Date) o).getTime();
    } else if (o instanceof ObjectId) {
        return o.toString();
    } else if (o instanceof UUID) {
        // Added branch: hand Pig the canonical string form of a java.util.UUID
        return o.toString();
    } else if (o instanceof BasicBSONList) {
        // Convert each list element recursively into a Pig tuple
        BasicBSONList bl = (BasicBSONList) o;
        Tuple t = tupleFactory.newTuple(bl.size());
        for (int i = 0; i < bl.size(); i++) {
            t.set(i, convertBSONtoPigType(bl.get(i)));
        }
        return t;
    } else if (o instanceof Map) {
        //TODO make this more efficient for lazy objects?
        Map<String, Object> fieldsMap = (Map<String, Object>) o;
        HashMap<String, Object> pigMap = new HashMap<String, Object>(fieldsMap.size());
        for (Map.Entry<String, Object> field : fieldsMap.entrySet()) {
            pigMap.put(field.getKey(), convertBSONtoPigType(field.getValue()));
        }
        return pigMap;
    } else {
        // Unrecognized types are returned unchanged, which is what breaks Pig
        return o;
    }
}

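What puzzles me is why MongoLoader does not support UUIDs when no schema is given, since UUID/BinData is part of Mongo and widely used.

Perhaps this is something they could fix.

Anyway, I hope this helps.

Regards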