ArrayIndexOutOfBoundsException in MapReduce when reading empty cells in a CSV

I want to run a MapReduce program on the following data (posted as a screenshot, not reproduced here).

Here is my mapper code:

@Override
protected void map(Object key, Text value, Mapper.Context context) throws IOException, ArrayIndexOutOfBoundsException, InterruptedException {
    String tokens[] = value.toString().split(",");
    if (tokens[6] != null) {
        context.write(new Text(tokens[6]), new IntWritable(1));
    }
}

Because some cells in my data are empty, I get the error below when I try to read the Carrier_delay column. Please advise.

17/04/13 20:45:29 INFO mapreduce.Job: Task Id : attempt_1491849620104_0017_m_000000_0, Status : FAILED 
Error: java.lang.ArrayIndexOutOfBoundsException: 6 
    at Test.TestMapper.map(TestMapper.java:22) 
    at Test.TestMapper.map(TestMapper.java:17) 
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145) 
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764) 
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340) 
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168) 
    at java.security.AccessController.doPrivileged(Native Method) 
    at javax.security.auth.Subject.doAs(Subject.java:422)

Configuration conf = new Configuration(); 
Job job = Job.getInstance(conf,"IP Access"); 
job.setJarByClass(Test.class); 
job.setMapperClass(TestMapper.class); 

job.setMapOutputKeyClass(Text.class); 
job.setMapOutputValueClass(IntWritable.class); 

job.setReducerClass(TestReducer.class); 
job.setOutputKeyClass(Text.class); 
job.setOutputValueClass(IntWritable.class); 

FileInputFormat.addInputPath(job, new Path(args[0])); 
FileOutputFormat.setOutputPath(job, new Path(args[1])); 
System.exit(job.waitForCompletion(true) ? 0 : 1);


2017-04-14 Harish

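The problem is in the line: if(tokens[6]!=null){.

You read the value of tokens[6] first and only then check whether it is null. However, some rows yield only six columns (the seventh is empty), so in those cases tokens is a six-element array holding tokens[0] through tokens[5]. When you try to access tokens[6] you go past the end of the array, so you get an ArrayIndexOutOfBoundsException.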
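Why an empty last cell makes the array shorter: Java's String.split(",") discards trailing empty strings, while split(",", -1) keeps them. Below is a minimal sketch of that behaviour. The class name, the method names and the seven-column sample row are made up for illustration and are not taken from the asker's file.

```java
// Hypothetical demo class: shows how String.split treats trailing empty fields.
class SplitTrailingEmpties {

    // Number of tokens with the default split, which drops trailing empty strings.
    static int defaultSplitLength(String line) {
        return line.split(",").length;
    }

    // Number of tokens with a negative limit, which keeps trailing empty strings.
    static int keepEmptiesSplitLength(String line) {
        return line.split(",", -1).length;
    }

    public static void main(String[] args) {
        // Assumed sample row: seven columns, the seventh (CARRIER_DELAY) is empty.
        String row = "2016,19805,11298,DFW,11433,-6.00,";
        System.out.println("split(\",\")     -> " + defaultSplitLength(row) + " tokens");
        System.out.println("split(\",\", -1) -> " + keepEmptiesSplitLength(row) + " tokens");
    }
}
```

With the default split the row above yields six tokens, so index 6 does not exist. With the negative limit the seventh token exists and is an empty string, which can then be tested with isEmpty().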

The correct way to do what you want is:

private final IntWritable one = new IntWritable(1); // reused for every record, which saves some time ;)
private final Text keyOutput = new Text();          // the same goes here

@Override
protected void map(Object key, Text value, Mapper.Context context) throws IOException, ArrayIndexOutOfBoundsException, InterruptedException {
    String[] tokens = value.toString().split(",");
    // Only read tokens[6] when the row actually has a seventh field
    if (tokens.length == 7) {
        keyOutput.set(tokens[6]);
        context.write(keyOutput, one);
    }
}

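More tips: judging from your partial code, I guess you want to count how many times each particular carrier delay value occurs. In that case you can also use a combiner to speed things up, just as in the WordCount program. You can also parse the carrier delay into an IntWritable to save time and space.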
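For the IntWritable tip, note that the sample values posted later in this thread look like 47.00, which Integer.parseInt would reject. The following is a rough sketch of one way to turn such a token into an int. DelayParser and parseDelayMinutes are hypothetical names, and rounding to whole minutes is an assumption about what the asker wants.

```java
import java.util.OptionalInt;

// Hypothetical helper: converts a CSV delay token such as "47.00" into an int.
class DelayParser {

    // Returns the delay rounded to a whole number, or empty for blank,
    // non-numeric (for example a header cell) or non-finite input.
    static OptionalInt parseDelayMinutes(String token) {
        if (token == null || token.trim().isEmpty()) {
            return OptionalInt.empty();
        }
        try {
            double delay = Double.parseDouble(token.trim());
            if (!Double.isFinite(delay)) {
                return OptionalInt.empty();
            }
            return OptionalInt.of((int) Math.round(delay));
        } catch (NumberFormatException e) {
            return OptionalInt.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(parseDelayMinutes("47.00"));
        System.out.println(parseDelayMinutes(""));
    }
}
```

In the mapper, the resulting int could be wrapped in a reused IntWritable and written as the key in place of the Text. As for the combiner: if TestReducer only sums the counts (its code is not shown in this thread), it could presumably be registered with job.setCombinerClass(TestReducer.class), which is how WordCount reuses its reducer.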


2017-04-14 09:31:12 vefthym

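Thanks a lot, it worked – Harish

Are all the columns the ones shown in the image? If so, remember that Java arrays are 0-indexed and your columns run from 0 to 5, so tokens[6] is out of range. Alternatively, depending on the logic you need, you can add a length check to your if: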

if(tokens.length > n && tokens[n]!=null){ context.write(new Text(tokens[n]), new IntWritable(1)); }


2017-04-14 01:23:53

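YEAR, AIRLINE_ID, ORIGIN_AIRPORT_ID, ORIGIN, DEST_AIRPORT_ID, ARR_DELAY, CARRIER_DELAY. This is the column order – Harish

Again, I get the same error after adding the && condition. 17/04/13 21:30:19 INFO mapreduce.Job: Task Id : attempt_1491849620104_0019_m_000000_0, Status : FAILED Error: java.lang.ArrayIndexOutOfBoundsException: 6 (TestMapper.java:17) – Harish

You are using a comma as the delimiter, but perhaps the rows with empty values do not contain the right number of commas, so the tokens array ends up shorter. Paste a screenshot of the file, but open it in a text editor or with cat on the command line so we can check.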

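Carrier delay is the second field, so you need to access it as tokens[1], because array indexes start at 0. You can also do a length check before accessing a specific index. Since there are 6 columns in total, tokens[6] gives the error. If you are accessing the last field, it will be tokens[5], i.e. length minus 1.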


2017-04-14 01:30:20 SurjanSRawat

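YEAR, AIRLINE_ID, ORIGIN_AIRPORT_ID, ORIGIN, DEST_AIRPORT_ID, ARR_DELAY, CARRIER_DELAY. These are my columns. I took the screenshot from the middle of the data – Harish

Can you just cat the file you are trying to access and check whether you get the delimiters for all 7 fields? If some fields are blank, the delimiter commas should still be there, and then the length check can be done. If you can paste the first 10 records from the hdfs file, even better. – SurjanSRawat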

thiyagarajans-MacBook-Pro:Dataset Nirmal$ cat ONTIME.csv | head -10 | column -s -t
"YEAR","AIRLINE_ID","ORIGIN_AIRPORT_ID","ORIGIN","DEST_AIRPORT_ID","ARR_DELAY","CARRIER_DELAY","WEATHER_DELAY","NAS_DELAY","SECURITY_DELAY","LATE_AIRCRAFT_DELAY",
2016,19805,11298,"DFW",11433,-6.00,,,,,,
2016,19805,11298,"DFW",11433,-12.00,,,,,,
2016,19805,11298,"DFW",11433,7.00,,,,,,
2016,19805,11298,"DFW",11433,-5.00,,,,,,
2016,19805,11298,"DFW",11433,113.00,0.00,0.00,47.00,0.00,66.00,
– Harish
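Going by the sample rows posted above, the delimiters are all present and every line ends with a comma, so the short array comes from the default split dropping the trailing empty cells. Here is a possible sketch of pulling out CARRIER_DELAY (index 6) in a way that tolerates empty cells, short lines and the header row. CarrierDelayField and carrierDelay are hypothetical names, and the plain comma split assumes no quoted field ever contains a comma.

```java
import java.util.Optional;

// Hypothetical helper: extracts the CARRIER_DELAY cell from one CSV line.
class CarrierDelayField {

    // Position of CARRIER_DELAY in the posted header (0-based).
    static final int CARRIER_DELAY_INDEX = 6;

    static Optional<String> carrierDelay(String line) {
        // A negative limit keeps trailing empty strings, so empty cells still get a slot.
        String[] tokens = line.split(",", -1);
        if (tokens.length <= CARRIER_DELAY_INDEX) {
            return Optional.empty(); // malformed or truncated line
        }
        // Drop the quotes that wrap header names and string cells.
        String value = tokens[CARRIER_DELAY_INDEX].trim().replace("\"", "");
        if (value.isEmpty() || value.equals("CARRIER_DELAY")) {
            return Optional.empty(); // empty cell or header row
        }
        return Optional.of(value);
    }

    public static void main(String[] args) {
        System.out.println(carrierDelay("2016,19805,11298,\"DFW\",11433,-6.00,,,,,,"));
        System.out.println(carrierDelay("2016,19805,11298,\"DFW\",11433,113.00,0.00,0.00,47.00,0.00,66.00,"));
    }
}
```

Inside map(), the same logic would mean calling context.write only when a value is present. A check such as tokens.length == 7 would skip the fully populated rows in this sample, since they split into more than seven tokens.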
