Mahout - ParallelALSFactorizationJob taking too long?

I am trying to run Mahout ALS recommendations on an AWS EMR cluster, but it is taking much longer than I expected.

Here is the command I am running:

aws emr add-steps --cluster-id <cluster_id> \
       --steps Type=CUSTOM_JAR,\
         Name="Mahout ALS Factorization Job",\
         Jar=s3://<my_bucket>/recproto/mahout-mr-0.10.0-job.jar,\
         MainClass=org.apache.mahout.cf.taste.hadoop.als.ParallelALSFactorizationJob,\
         Args=["--input","s3://<my_bucket>/recproto/trainingdata/userClicks.csv.gz",\
          "--output","s3://<my_bucket>/recproto/als-output/",\
          "--implicitFeedback","true",\
          "--lambda","150",\
          "--alpha","0.05",\
          "--numFeatures","100",\
          "--numIterations","3",\
          "--numThreadsPerSolver","4",\
          "--usesLongIDs","true"]

The userClicks.csv file contains 1,567,808 ratings from 335,636 users on 23,934 items.
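To put those figures in perspective, here is a rough Python sketch that derives the density of the user-item matrix and an approximate size for the factor matrices at --numFeatures 100. The `dataset_stats` helper is hypothetical, and the 8 bytes per value is an assumption about dense double-precision storage rather than a measurement of what Mahout actually allocates.

```python
# Dataset figures quoted in the question.
NUM_RATINGS = 1_567_808
NUM_USERS = 335_636
NUM_ITEMS = 23_934
NUM_FEATURES = 100  # value passed as --numFeatures


def dataset_stats(ratings, users, items, features, bytes_per_value=8):
    """Return rough size figures for a user-item rating matrix."""
    cells = users * items  # size of the full (mostly empty) matrix
    return {
        "density": ratings / cells,
        "ratings_per_user": ratings / users,
        "ratings_per_item": ratings / items,
        # Assumption: dense user and item factor matrices, 8 bytes per value.
        "factor_bytes": (users + items) * features * bytes_per_value,
    }


stats = dataset_stats(NUM_RATINGS, NUM_USERS, NUM_ITEMS, NUM_FEATURES)
for name, value in stats.items():
    print(f"{name}: {value:.6g}")
```

Under these assumptions the matrix is very sparse and both factor matrices together come to a few hundred megabytes, so the data set itself is small relative to a 10-node cluster.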

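The job runs on an EMR cluster of 10 c3.xlarge nodes and takes more than 2 hours. Is this normal? Given my ratings file, what EMR cluster size and which parameters should I use to get a more acceptable run time?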

Source

2015-05-18 Fred Pym

I solved this simply by switching to Spark ALS. Training took less than 2 minutes on my laptop, on the same data set with the same parameters.
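A minimal sketch of what a comparable Spark ALS run could look like with the PySpark DataFrame API (pyspark.ml.recommendation.ALS). The `MAHOUT_TO_SPARK` table, the `spark_als_kwargs` and `train` helpers, and the three-column user,item,rating CSV layout without a header are all assumptions for illustration, not taken from the original job. The flag-to-parameter pairing is only by role, so the same numeric values for the regularization and confidence terms may not behave identically in both libraries.

```python
# Mahout flag -> (Spark ALS parameter name, value used in the question).
# Assumption: the parameters are paired by role only.
MAHOUT_TO_SPARK = {
    "--numFeatures": ("rank", 100),
    "--numIterations": ("maxIter", 3),
    "--lambda": ("regParam", 150.0),
    "--alpha": ("alpha", 0.05),
    "--implicitFeedback": ("implicitPrefs", True),
}


def spark_als_kwargs(mapping=MAHOUT_TO_SPARK):
    """Build keyword arguments for pyspark.ml.recommendation.ALS."""
    return {name: value for name, value in mapping.values()}


def train(csv_path):
    """Fit an implicit-feedback ALS model on a user,item,rating CSV file."""
    # Imported lazily so the mapping above can be inspected without Spark.
    from pyspark.sql import SparkSession
    from pyspark.ml.recommendation import ALS

    spark = SparkSession.builder.appName("als-sketch").getOrCreate()
    # Assumption: exactly three columns (user, item, rating), no header row.
    ratings = spark.read.csv(csv_path, inferSchema=True).toDF(
        "user", "item", "rating"
    )
    als = ALS(
        userCol="user",
        itemCol="item",
        ratingCol="rating",
        **spark_als_kwargs(),
    )
    return als.fit(ratings)


if __name__ == "__main__":
    print(spark_als_kwargs())
```

Spark's ALS expects user and item ids that fit in the integer range, so ids that needed --usesLongIDs in Mahout may have to be remapped to a compact index before calling `train`.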

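I can now understand why some machine learning algorithms have been deprecated because of performance problems (the MinHash algorithm, for example).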

Source

2015-05-19 08:59:43
