2012-08-09

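Mahout item similarity job fails randomly with an ArrayIndexOutOfBoundsException on CSV data?

The CSV files contain boolean user-preference data as (userid, itemid) pairs. A preprocessor checks the files for inconsistencies, and I have also checked them manually; the data appears consistent and correctly formatted. Two things to note:

- The job never fails when the Hadoop job has a single input file, that is, when all preferences are exported to one CSV with no duplicate (userid, itemid) entries.
- The job fails randomly when the Hadoop directory contains multiple CSV files: the initial dump of user preferences plus the daily incremental files of user preferences.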

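If the CSV data is consistent and correct, the job should not fail with an ArrayIndexOutOfBoundsException. Could the job fail because (userid, itemid) entries are duplicated across the incremental files? Because the preferences are boolean, many of these entries are repeated in several incremental files.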
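One way to test the duplicate theory is to count how often each (userid, itemid) pair occurs across all input files. The following is a minimal sketch in Python, assuming comma-separated lines; the function name `count_duplicate_pairs` and the sample rows are hypothetical and not part of Mahout or Hadoop.

```python
from collections import Counter


def count_duplicate_pairs(lines):
    """Return {(user_id, item_id): occurrences} for pairs seen more than once."""
    counts = Counter()
    for line in lines:
        fields = line.strip().split(",")
        if len(fields) < 2:
            continue  # malformed line, not a preference pair
        counts[(fields[0], fields[1])] += 1
    return {pair: n for pair, n in counts.items() if n > 1}


# Hypothetical sample data: an initial dump plus one daily incremental file.
initial_dump = ["1,100", "2,200"]
daily_delta = ["2,200", "3,300"]
duplicates = count_duplicate_pairs(initial_dump + daily_delta)
print(duplicates)  # {('2', '200'): 2}
```

If the job also fails on input where this reports no duplicates, duplication is unlikely to be the cause.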

The log does not seem to show which piece of data caused the error. Here is the log:

2012-08-09 15:03:22,652 INFO org.apache.hadoop.mapred.JobInProgress: job_201208021510_0221: nMaps=2 nReduces=1 max=-1 
2012-08-09 15:03:22,652 INFO org.apache.hadoop.mapred.JobTracker: Job job_201208021510_0221 added successfully for user 'deploy' to queue 'default' 
2012-08-09 15:03:22,652 INFO org.apache.hadoop.mapred.AuditLogger: USER=deploy IP=127.0.0.1 OPERATION=SUBMIT_JOB TARGET=job_201208021510_0221 RESULT=SUCCESS 
2012-08-09 15:03:22,652 INFO org.apache.hadoop.mapred.JobTracker: Initializing job_201208021510_0221 
2012-08-09 15:03:22,653 INFO org.apache.hadoop.mapred.JobInProgress: Initializing job_201208021510_0221 
2012-08-09 15:03:23,023 INFO org.apache.hadoop.mapred.JobInProgress: jobToken generated and stored with users keys in /zenius/hadoop/tmp/mapred/system/job_201208021510_0221/jobToken 
2012-08-09 15:03:23,027 INFO org.apache.hadoop.mapred.JobInProgress: Input size for job job_201208021510_0221 = 56518256. Number of splits = 2 
2012-08-09 15:03:23,027 INFO org.apache.hadoop.mapred.JobInProgress: tip:task_201208021510_0221_m_000000 has split on node:/default-rack/localhost 
2012-08-09 15:03:23,028 INFO org.apache.hadoop.mapred.JobInProgress: tip:task_201208021510_0221_m_000001 has split on node:/default-rack/localhost 
2012-08-09 15:03:23,028 INFO org.apache.hadoop.mapred.JobInProgress: job_201208021510_0221 LOCALITY_WAIT_FACTOR=1.0 
2012-08-09 15:03:23,028 INFO org.apache.hadoop.mapred.JobInProgress: Job job_201208021510_0221 initialized successfully with 2 map tasks and 1 reduce tasks. 
2012-08-09 15:03:25,787 INFO org.apache.hadoop.mapred.JobTracker: Adding task (JOB_SETUP) 'attempt_201208021510_0221_m_000003_0' to tip task_201208021510_0221_m_000003, for tracker 'tracker_localhost:localhost/127.0.0.1:50158' 
2012-08-09 15:03:31,794 INFO org.apache.hadoop.mapred.JobInProgress: Task 'attempt_201208021510_0221_m_000003_0' has completed task_201208021510_0221_m_000003 successfully. 
2012-08-09 15:03:31,795 INFO org.apache.hadoop.mapred.JobTracker: Adding task (MAP) 'attempt_201208021510_0221_m_000000_0' to tip task_201208021510_0221_m_000000, for tracker 'tracker_localhost:localhost/127.0.0.1:50158' 
2012-08-09 15:03:31,796 INFO org.apache.hadoop.mapred.JobInProgress: Choosing data-local task task_201208021510_0221_m_000000 
2012-08-09 15:03:31,796 INFO org.apache.hadoop.mapred.JobTracker: Adding task (MAP) 'attempt_201208021510_0221_m_000001_0' to tip task_201208021510_0221_m_000001, for tracker 'tracker_localhost:localhost/127.0.0.1:50158' 
2012-08-09 15:03:31,796 INFO org.apache.hadoop.mapred.JobInProgress: Choosing data-local task task_201208021510_0221_m_000001 
2012-08-09 15:03:37,800 INFO org.apache.hadoop.mapred.JobInProgress: Task 'attempt_201208021510_0221_m_000001_0' has completed task_201208021510_0221_m_000001 successfully. 
2012-08-09 15:03:37,801 INFO org.apache.hadoop.mapred.JobTracker: Adding task (REDUCE) 'attempt_201208021510_0221_r_000000_0' to tip task_201208021510_0221_r_000000, for tracker 'tracker_localhost:localhost/127.0.0.1:50158' 
2012-08-09 15:03:49,807 INFO org.apache.hadoop.mapred.TaskInProgress: Error from attempt_201208021510_0221_m_000000_0: java.lang.ArrayIndexOutOfBoundsException: 1 
at org.apache.mahout.cf.taste.hadoop.item.ItemIDIndexMapper.map(ItemIDIndexMapper.java:47) 
at org.apache.mahout.cf.taste.hadoop.item.ItemIDIndexMapper.map(ItemIDIndexMapper.java:31) 
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144) 
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764) 
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370) 
at org.apache.hadoop.mapred.Child$4.run(Child.java:255) 
at java.security.AccessController.doPrivileged(Native Method) 
at javax.security.auth.Subject.doAs(Subject.java:396) 
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1093) 
at org.apache.hadoop.mapred.Child.main(Child.java:249) 

2012-08-09 15:03:52,810 INFO org.apache.hadoop.mapred.JobInProgress: Choosing a failed task task_201208021510_0221_m_000000 
2012-08-09 15:03:52,810 INFO org.apache.hadoop.mapred.JobTracker: Adding task (MAP) 'attempt_201208021510_0221_m_000000_1' to tip task_201208021510_0221_m_000000, for tracker 'tracker_localhost:localhost/127.0.0.1:50158' 
2012-08-09 15:03:52,810 INFO org.apache.hadoop.mapred.JobInProgress: Choosing data-local task task_201208021510_0221_m_000000 
2012-08-09 15:03:52,810 INFO org.apache.hadoop.mapred.JobTracker: Removing task 'attempt_201208021510_0221_m_000000_0' 
2012-08-09 15:04:14,603 INFO org.apache.hadoop.mapred.TaskInProgress: Error from attempt_201208021510_0221_m_000000_1: java.lang.ArrayIndexOutOfBoundsException: 1 
at org.apache.mahout.cf.taste.hadoop.item.ItemIDIndexMapper.map(ItemIDIndexMapper.java:47) 
at org.apache.mahout.cf.taste.hadoop.item.ItemIDIndexMapper.map(ItemIDIndexMapper.java:31) 
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144) 
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764) 
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370) 
at org.apache.hadoop.mapred.Child$4.run(Child.java:255) 
at java.security.AccessController.doPrivileged(Native Method) 
at javax.security.auth.Subject.doAs(Subject.java:396) 
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1093) 
at org.apache.hadoop.mapred.Child.main(Child.java:249) 

2012-08-09 15:04:17,606 INFO org.apache.hadoop.mapred.JobInProgress: Choosing a failed task task_201208021510_0221_m_000000 
2012-08-09 15:04:17,607 INFO org.apache.hadoop.mapred.JobTracker: Adding task (MAP) 'attempt_201208021510_0221_m_000000_2' to tip task_201208021510_0221_m_000000, for tracker 'tracker_localhost:localhost/127.0.0.1:50158' 
2012-08-09 15:04:17,607 INFO org.apache.hadoop.mapred.JobInProgress: Choosing data-local task task_201208021510_0221_m_000000 
2012-08-09 15:04:17,607 INFO org.apache.hadoop.mapred.JobTracker: Removing task 'attempt_201208021510_0221_m_000000_1' 
2012-08-09 15:04:35,618 INFO org.apache.hadoop.mapred.TaskInProgress: Error from attempt_201208021510_0221_m_000000_2: java.lang.ArrayIndexOutOfBoundsException: 1 
at org.apache.mahout.cf.taste.hadoop.item.ItemIDIndexMapper.map(ItemIDIndexMapper.java:47) 
at org.apache.mahout.cf.taste.hadoop.item.ItemIDIndexMapper.map(ItemIDIndexMapper.java:31) 
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144) 
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764) 
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370) 
at org.apache.hadoop.mapred.Child$4.run(Child.java:255) 
at java.security.AccessController.doPrivileged(Native Method) 
at javax.security.auth.Subject.doAs(Subject.java:396) 
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1093) 
at org.apache.hadoop.mapred.Child.main(Child.java:249) 

2012-08-09 15:04:38,621 INFO org.apache.hadoop.mapred.JobInProgress: Choosing a failed task task_201208021510_0221_m_000000 
2012-08-09 15:04:38,621 INFO org.apache.hadoop.mapred.JobTracker: Adding task (MAP) 'attempt_201208021510_0221_m_000000_3' to tip task_201208021510_0221_m_000000, for tracker 'tracker_localhost:localhost/127.0.0.1:50158' 
2012-08-09 15:04:38,621 INFO org.apache.hadoop.mapred.JobInProgress: Choosing data-local task task_201208021510_0221_m_000000 
2012-08-09 15:04:38,621 INFO org.apache.hadoop.mapred.JobTracker: Removing task 'attempt_201208021510_0221_m_000000_2' 
2012-08-09 15:04:56,632 INFO org.apache.hadoop.mapred.TaskInProgress: Error from attempt_201208021510_0221_m_000000_3: java.lang.ArrayIndexOutOfBoundsException: 1 
at org.apache.mahout.cf.taste.hadoop.item.ItemIDIndexMapper.map(ItemIDIndexMapper.java:47) 
at org.apache.mahout.cf.taste.hadoop.item.ItemIDIndexMapper.map(ItemIDIndexMapper.java:31) 
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144) 
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764) 
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370) 
at org.apache.hadoop.mapred.Child$4.run(Child.java:255) 
at java.security.AccessController.doPrivileged(Native Method) 
at javax.security.auth.Subject.doAs(Subject.java:396) 
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1093) 
at org.apache.hadoop.mapred.Child.main(Child.java:249) 

2012-08-09 15:04:59,635 INFO org.apache.hadoop.mapred.TaskInProgress: TaskInProgress task_201208021510_0221_m_000000 has failed 4 times. 
2012-08-09 15:04:59,635 INFO org.apache.hadoop.mapred.JobInProgress: TaskTracker at 'localhost' turned 'flaky' 
2012-08-09 15:04:59,635 INFO org.apache.hadoop.mapred.JobInProgress: Aborting job job_201208021510_0221 
2012-08-09 15:04:59,635 INFO org.apache.hadoop.mapred.JobInProgress: Killing job 'job_201208021510_0221' 
2012-08-09 15:04:59,635 INFO org.apache.hadoop.mapred.JobTracker: Adding task (JOB_CLEANUP) 'attempt_201208021510_0221_m_000002_0' to tip... 

Answer

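No, it definitely means there is a bad line somewhere in the data. The most likely culprits are a stray blank line, a header line, a "comment" line, or some stray file in the same directory, such as _SUCCESS.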
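An `ArrayIndexOutOfBoundsException: 1` means index 1 was read from an array with fewer than two elements, which is consistent with a line that did not split into a user ID and an item ID. A rough way to locate such lines is to scan every file in the input directory before running the job. The sketch below is in Python and rests on assumptions: that fields are separated by a comma or a tab, that both IDs are plain integers, and that the files have been copied to a local directory. The names `find_bad_lines` and `is_long` are hypothetical helpers, not Mahout or Hadoop APIs.

```python
import os
import re
import tempfile

# Assumption: fields are separated by a comma or a tab.
DELIMITER = re.compile(r"[,\t]")


def is_long(text):
    """True if text looks like a plain (optionally negative) integer ID."""
    return re.fullmatch(r"-?\d+", text) is not None


def find_bad_lines(input_dir):
    """Yield (file name, line number, line) for every suspicious line."""
    for name in sorted(os.listdir(input_dir)):
        path = os.path.join(input_dir, name)
        if not os.path.isfile(path):
            continue
        with open(path, encoding="utf-8", errors="replace") as handle:
            for number, raw in enumerate(handle, start=1):
                line = raw.rstrip("\r\n")
                fields = DELIMITER.split(line)
                # Fewer than two fields would put index 1 out of bounds.
                too_short = len(fields) < 2
                # Header and comment lines usually have non-numeric IDs.
                non_numeric = not all(is_long(f) for f in fields[:2])
                if too_short or non_numeric:
                    yield name, number, line


if __name__ == "__main__":
    # Hypothetical demo data: one clean dump and one delta with bad lines.
    with tempfile.TemporaryDirectory() as demo_dir:
        with open(os.path.join(demo_dir, "dump.csv"), "w") as f:
            f.write("1,100\n2,200\n")
        with open(os.path.join(demo_dir, "delta.csv"), "w") as f:
            f.write("userid,itemid\n3,300\n\n4\n")
        for name, number, line in find_bad_lines(demo_dir):
            print(f"{name}:{number}: {line!r}")
```

Every file in the directory is scanned, including ones that do not end in .csv, so a stray non-data file shows up in the report under its own name.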