2012-12-11 203 views

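Mahout RecommenderJob not converging

This is my first SO post, so please let me know if I have missed anything important. I am a Mahout/Hadoop beginner and am trying to put together a distributed recommendation engine.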

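To simulate working on a remote cluster, I have set up hadoop on my machine to communicate with an Ubuntu VM (using VirtualBox), also located on my machine, which has hadoop installed. This setup seems to work fine, and I am now trying to run Mahout's RecommenderJob on a very small trial dataset as a test.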

The input consists of a .csv file (stored in Hadoop's DFS) containing about 50 user preferences in the format: userID, itemID, preference. The command I am running is:

hadoop jar /Users/MyName/src/trunk/core/target/mahout-core-0.8-SNAPSHOT-job.jar org.apache.mahout.cf.taste.hadoop.item.RecommenderJob -Dmapred.input.dir=/user/MyName/Recommendations/input/TestRatings.csv -Dmapred.output.dir=/user/MyName/Recommendations/output -s SIMILARITY_PEARSON_CORELLATION 

where TestRatings.csv is the file containing the preferences and output is the desired output directory.

At first the job looks like it is running fine, and I get the following output:

12/12/11 12:26:21 INFO common.AbstractJob: Command line arguments: {--booleanData=[false], --endPhase=[2147483647], --maxPrefsPerUser=[10], --maxPrefsPerUserInItemSimilarity=[1000], --maxSimilaritiesPerItem=[100], --minPrefsPerUser=[1], --numRecommendations=[10], --similarityClassname=[SIMILARITY_PEARSON_CORELLATION], --startPhase=[0], --tempDir=[temp]} 
12/12/11 12:26:21 INFO common.AbstractJob: Command line arguments: {--booleanData=[false], --endPhase=[2147483647], --input=[/user/Naaman/Delphi/input/TestRatings.csv], --maxPrefsPerUser=[1000], --minPrefsPerUser=[1], --output=[temp/preparePreferenceMatrix], --ratingShift=[0.0], --startPhase=[0], --tempDir=[temp]} 
12/12/11 12:26:21 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId= 
12/12/11 12:26:21 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 
12/12/11 12:26:22 INFO input.FileInputFormat: Total input paths to process : 1 
12/12/11 12:26:22 WARN snappy.LoadSnappy: Snappy native library not loaded 
12/12/11 12:26:22 INFO mapred.JobClient: Running job: job_local_0001 
12/12/11 12:26:22 INFO mapred.Task: Using ResourceCalculatorPlugin : null 
12/12/11 12:26:22 INFO mapred.MapTask: io.sort.mb = 100 
12/12/11 12:26:22 INFO mapred.MapTask: data buffer = 79691776/99614720 
12/12/11 12:26:22 INFO mapred.MapTask: record buffer = 262144/327680 
12/12/11 12:26:22 INFO mapred.MapTask: Starting flush of map output 
12/12/11 12:26:22 INFO compress.CodecPool: Got brand-new compressor 
12/12/11 12:26:22 INFO mapred.MapTask: Finished spill 0 
12/12/11 12:26:22 INFO mapred.Task: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting 
12/12/11 12:26:22 INFO mapred.LocalJobRunner: 
12/12/11 12:26:22 INFO mapred.Task: Task 'attempt_local_0001_m_000000_0' done. 
12/12/11 12:26:22 INFO mapred.Task: Using ResourceCalculatorPlugin : null 
12/12/11 12:26:22 INFO mapred.ReduceTask: ShuffleRamManager: MemoryLimit=1491035776, MaxSingleShuffleLimit=372758944 
12/12/11 12:26:22 INFO compress.CodecPool: Got brand-new decompressor 
12/12/11 12:26:22 INFO compress.CodecPool: Got brand-new decompressor 
12/12/11 12:26:22 INFO compress.CodecPool: Got brand-new decompressor 
12/12/11 12:26:22 INFO compress.CodecPool: Got brand-new decompressor 
12/12/11 12:26:22 INFO compress.CodecPool: Got brand-new decompressor 
12/12/11 12:26:22 INFO mapred.ReduceTask: attempt_local_0001_r_000000_0 Thread started: Thread for merging on-disk files 
12/12/11 12:26:22 INFO mapred.ReduceTask: attempt_local_0001_r_000000_0 Thread started: Thread for merging in memory files 
12/12/11 12:26:22 INFO mapred.ReduceTask: attempt_local_0001_r_000000_0 Thread waiting: Thread for merging on-disk files 
12/12/11 12:26:22 INFO mapred.ReduceTask: attempt_local_0001_r_000000_0 Need another 1 map output(s) where 0 is already in progress 
12/12/11 12:26:22 INFO mapred.ReduceTask: attempt_local_0001_r_000000_0 Thread started: Thread for polling Map Completion Events 
12/12/11 12:26:22 INFO mapred.ReduceTask: attempt_local_0001_r_000000_0 Scheduled 0 outputs (0 slow hosts and0 dup hosts) 
12/12/11 12:26:23 INFO mapred.JobClient: map 100% reduce 0% 
12/12/11 12:26:28 INFO mapred.LocalJobRunner: reduce > copy > 
12/12/11 12:26:31 INFO mapred.LocalJobRunner: reduce > copy > 
12/12/11 12:26:37 INFO mapred.LocalJobRunner: reduce > copy > 

But then the last three lines repeat indefinitely (I left it running overnight...), with the two lines:

12/12/11 12:27:22 INFO mapred.ReduceTask: attempt_local_0001_r_000000_0 Need another 1 map output(s) where 0 is already in progress 
12/12/11 12:27:22 INFO mapred.ReduceTask: attempt_local_0001_r_000000_0 Scheduled 0 outputs (0 slow hosts and0 dup hosts) 

repeating every ten lines.

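I am not sure whether something is wrong with my input, or whether the tiny size of the trial data is throwing things off. Any help and/or advice on the best way to go about this would be much appreciated.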
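
One way to rule out the input file is to check it locally before uploading it to DFS. The following is only a rough sketch in Python, not part of Mahout: the function name `find_bad_rows` is made up for illustration, and it assumes that every row should have exactly three comma-separated fields, with integer userID and itemID and a numeric preference.

```python
import csv
import io


def find_bad_rows(lines):
    """Return (line_number, reason) pairs for rows that are not userID,itemID,preference."""
    problems = []
    for line_number, row in enumerate(csv.reader(lines), start=1):
        if not row:
            continue  # ignore blank lines
        if len(row) != 3:
            problems.append((line_number, "expected 3 fields, got %d" % len(row)))
            continue
        user_id, item_id, preference = (field.strip() for field in row)
        # Assumption: IDs are plain non-negative integers.
        if not (user_id.isdigit() and item_id.isdigit()):
            problems.append((line_number, "userID and itemID should be integers"))
            continue
        try:
            float(preference)
        except ValueError:
            problems.append((line_number, "preference is not numeric"))
    return problems


# Made-up sample rows, not taken from TestRatings.csv.
sample = io.StringIO("1,101,5.0\n1,102,3.5\n2,101,abc\n3,103\n")
print(find_bad_rows(sample))
```

For a real file, pass an open file object instead of the sample, e.g. `find_bad_rows(open("TestRatings.csv", newline=""))`. An empty list means only that the rows have the expected shape; it does not say anything about the hadoop setup.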

p.s. I have been trying to follow the instructions at https://www.box.com/s/041rdjeh7sny128r2uki

Answers

1

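This is really a Hadoop or cluster problem. The reducer is waiting for mapper output that never arrives. Look for an earlier failure in the map phase.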
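
If the console output has been saved to a file, the search for earlier failures can be partly automated. The sketch below is only an illustration in Python, not a Hadoop tool: the function name `summarize_log` is hypothetical, and it assumes the log uses the same message text as the lines quoted in the question. It counts the reducer's "Need another N map output(s)" messages and collects any WARN, ERROR or FATAL lines printed before the first of them.

```python
import re

# Message the reducer prints while it waits for map output (copied from the question's log).
STALL = re.compile(r"Need another \d+ map output\(s\) where \d+ is already in progress")
PROBLEM = re.compile(r"\b(WARN|ERROR|FATAL)\b")


def summarize_log(lines):
    """Count stall messages and collect WARN/ERROR/FATAL lines seen before the first one."""
    stall_count = 0
    earlier_problems = []
    for line in lines:
        if STALL.search(line):
            stall_count += 1
        elif stall_count == 0 and PROBLEM.search(line):
            earlier_problems.append(line.strip())
    return stall_count, earlier_problems


# A few lines taken from the log in the question.
log = [
    "12/12/11 12:26:22 WARN snappy.LoadSnappy: Snappy native library not loaded",
    "12/12/11 12:26:22 INFO mapred.Task: Task 'attempt_local_0001_m_000000_0' done.",
    "12/12/11 12:26:22 INFO mapred.ReduceTask: attempt_local_0001_r_000000_0 Need another 1 map output(s) where 0 is already in progress",
    "12/12/11 12:27:22 INFO mapred.ReduceTask: attempt_local_0001_r_000000_0 Need another 1 map output(s) where 0 is already in progress",
]
stalls, problems = summarize_log(log)
print(stalls)
for entry in problems:
    print(entry)
```

This only filters lines; it does not diagnose anything. If the console shows nothing suspicious before the stall, the per-task logs and the hadoop configuration are the next places to look.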

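Hi Sean. Thanks a lot for the quick reply. I am looking into the hadoop configuration and will update with any solution I find. Looking forward to reading my new copy of Mahout in Action. – CaslonAmp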