1

I am working with the matrix multiplication example for MapReduce in Hadoop. Should Spilled Records always be equal to the Map input records or the Map output records when using Hadoop? In my runs I get Spilled Records that differ from both the Map input and the Map output records.

Here is the output I get from one test:

Three by three test 
    IB = 1 
    KB = 2 
    JB = 1 
11/12/14 13:16:22 INFO input.FileInputFormat: Total input paths to process : 2 
11/12/14 13:16:22 INFO mapred.JobClient: Running job: job_201112141153_0003 
11/12/14 13:16:23 INFO mapred.JobClient: map 0% reduce 0% 
11/12/14 13:16:32 INFO mapred.JobClient: map 100% reduce 0% 
11/12/14 13:16:44 INFO mapred.JobClient: map 100% reduce 100% 
11/12/14 13:16:46 INFO mapred.JobClient: Job complete: job_201112141153_0003 
11/12/14 13:16:46 INFO mapred.JobClient: Counters: 17 
11/12/14 13:16:46 INFO mapred.JobClient: Job Counters 
11/12/14 13:16:46 INFO mapred.JobClient:  Launched reduce tasks=1 
11/12/14 13:16:46 INFO mapred.JobClient:  Launched map tasks=2 
11/12/14 13:16:46 INFO mapred.JobClient:  Data-local map tasks=2 
11/12/14 13:16:46 INFO mapred.JobClient: FileSystemCounters 
11/12/14 13:16:46 INFO mapred.JobClient:  FILE_BYTES_READ=1464 
11/12/14 13:16:46 INFO mapred.JobClient:  HDFS_BYTES_READ=528 
11/12/14 13:16:46 INFO mapred.JobClient:  FILE_BYTES_WRITTEN=2998 
11/12/14 13:16:46 INFO mapred.JobClient:  HDFS_BYTES_WRITTEN=384 
11/12/14 13:16:46 INFO mapred.JobClient: Map-Reduce Framework 
11/12/14 13:16:46 INFO mapred.JobClient:  Reduce input groups=36 
11/12/14 13:16:46 INFO mapred.JobClient:  Combine output records=0 
11/12/14 13:16:46 INFO mapred.JobClient:  Map input records=18 
11/12/14 13:16:46 INFO mapred.JobClient:  Reduce shuffle bytes=735 
11/12/14 13:16:46 INFO mapred.JobClient:  Reduce output records=15 
11/12/14 13:16:46 INFO mapred.JobClient:  Spilled Records=108 
11/12/14 13:16:46 INFO mapred.JobClient:  Map output bytes=1350 
11/12/14 13:16:46 INFO mapred.JobClient:  Combine input records=0 
11/12/14 13:16:46 INFO mapred.JobClient:  Map output records=54 
11/12/14 13:16:46 INFO mapred.JobClient:  Reduce input records=54 
11/12/14 13:16:46 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId= 
11/12/14 13:16:46 INFO input.FileInputFormat: Total input paths to process : 1 
11/12/14 13:16:46 INFO mapred.JobClient: Running job: job_local_0001 
11/12/14 13:16:46 INFO input.FileInputFormat: Total input paths to process : 1 
11/12/14 13:16:46 INFO mapred.MapTask: io.sort.mb = 100 
11/12/14 13:16:46 INFO mapred.MapTask: data buffer = 79691776/99614720 
11/12/14 13:16:46 INFO mapred.MapTask: record buffer = 262144/327680 
11/12/14 13:16:46 INFO mapred.MapTask: Starting flush of map output 
11/12/14 13:16:46 INFO mapred.MapTask: Finished spill 0 
11/12/14 13:16:46 INFO mapred.TaskRunner: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting 
11/12/14 13:16:46 INFO mapred.LocalJobRunner: 
11/12/14 13:16:46 INFO mapred.TaskRunner: Task 'attempt_local_0001_m_000000_0' done. 
11/12/14 13:16:46 INFO mapred.LocalJobRunner: 
11/12/14 13:16:46 INFO mapred.Merger: Merging 1 sorted segments 
11/12/14 13:16:46 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 128 bytes 
11/12/14 13:16:46 INFO mapred.LocalJobRunner: 
11/12/14 13:16:46 INFO mapred.TaskRunner: Task:attempt_local_0001_r_000000_0 is done. And is in the process of commiting 
11/12/14 13:16:46 INFO mapred.LocalJobRunner: 
11/12/14 13:16:46 INFO mapred.TaskRunner: Task attempt_local_0001_r_000000_0 is allowed to commit now 
11/12/14 13:16:46 INFO output.FileOutputCommitter: Saved output of task 'attempt_local_0001_r_000000_0' to hdfs://localhost:9000/tmp/MatrixMultiply/out 
11/12/14 13:16:46 INFO mapred.LocalJobRunner: reduce > reduce 
11/12/14 13:16:46 INFO mapred.TaskRunner: Task 'attempt_local_0001_r_000000_0' done. 
11/12/14 13:16:47 INFO mapred.JobClient: map 100% reduce 100% 
11/12/14 13:16:47 INFO mapred.JobClient: Job complete: job_local_0001 
11/12/14 13:16:47 INFO mapred.JobClient: Counters: 14 
11/12/14 13:16:47 INFO mapred.JobClient: FileSystemCounters 
11/12/14 13:16:47 INFO mapred.JobClient:  FILE_BYTES_READ=89412 
11/12/14 13:16:47 INFO mapred.JobClient:  HDFS_BYTES_READ=37206 
11/12/14 13:16:47 INFO mapred.JobClient:  FILE_BYTES_WRITTEN=37390 
11/12/14 13:16:47 INFO mapred.JobClient:  HDFS_BYTES_WRITTEN=164756 
11/12/14 13:16:47 INFO mapred.JobClient: Map-Reduce Framework 
11/12/14 13:16:47 INFO mapred.JobClient:  Reduce input groups=9 
11/12/14 13:16:47 INFO mapred.JobClient:  Combine output records=9 
11/12/14 13:16:47 INFO mapred.JobClient:  Map input records=15 
11/12/14 13:16:47 INFO mapred.JobClient:  Reduce shuffle bytes=0 
11/12/14 13:16:47 INFO mapred.JobClient:  Reduce output records=9 
11/12/14 13:16:47 INFO mapred.JobClient:  Spilled Records=18 
11/12/14 13:16:47 INFO mapred.JobClient:  Map output bytes=180 
11/12/14 13:16:47 INFO mapred.JobClient:  Combine input records=15 
11/12/14 13:16:47 INFO mapred.JobClient:  Map output records=15 
11/12/14 13:16:47 INFO mapred.JobClient:  Reduce input records=9 
...........X[0][0]=30, Y[0][0]=9 
Bad Answer 
...........X[0][1]=36, Y[0][1]=36 
...........X[0][2]=42, Y[0][2]=42 
...........X[1][0]=66, Y[1][0]=24 
Bad Answer 
...........X[1][1]=81, Y[1][1]=81 
...........X[1][2]=96, Y[1][2]=96 
...........X[2][0]=102, Y[2][0]=39 
Bad Answer 
...........X[2][1]=126, Y[2][1]=126 
...........X[2][2]=150, Y[2][2]=150 

This example, together with its code, is described here:

http://www.norstad.org/matrix-multiply/index.html

Could you please tell me where the problem is and how I can get this to work correctly? Thanks,

WL

+0

I also want to mention that it works fine in standalone mode, where the spilled records equal the map input and map output records (both are 18), but in pseudo-distributed mode it does not work: the spilled records are not equal to the map input and map output records. – waqas 2011-12-14 12:48:14

+2

Spilled means the records had to be spilled to disk because there was not enough RAM during the sort/shuffle phase. So at best this should be zero, or very low. – 2011-12-14 12:58:40

Answer

4

According to "Hadoop: The Definitive Guide", "Spilled Records" counts the total number of records that were spilled to disk during the course of the job, on both the map and the reduce side. A "Spilled Records" count of zero is perfectly fine. Generally speaking, spilled records mean that you have exceeded the amount of memory available for the map output buffer, and having a small number of them is usually not a problem. The settings that control the available RAM are io.sort.mb and io.sort.spill.percent in your mapred-site.xml. If performance is a concern, you will want to tune these to minimize spilled records; the presentation Optimizing MapReduce Job Performance has more details, particularly slides #12 and #13.

If you spill more than once, you pay a 3x penalty in IO because the spills have to be merged. You are doing multiple spills whenever "Spilled Records" is greater than "Map output records" + "Reduce output records". Note that the amount of RAM is ultimately limited by the Java VM heap size, so to reduce the number of spills you may need to grow the cluster, or increase the number of map tasks by increasing the number of input splits for the job.
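As a sketch, those two knobs can also be set per job from the driver (the property names are the Hadoop 0.20/1.x ones, matching the io.sort.mb = 100 line in your log; the values below are just the defaults, not tuning advice):

    import org.apache.hadoop.conf.Configuration;

    public class SpillTuning {
        // Sketch: per-job override of the spill-related settings; these
        // normally live in mapred-site.xml. Values are the 0.20/1.x defaults.
        public static Configuration tuned() {
            Configuration conf = new Configuration();
            conf.setInt("io.sort.mb", 100);                // map output buffer size, in MB
            conf.setFloat("io.sort.spill.percent", 0.80f); // buffer fill level that triggers a spill
            return conf;
        }
    }

Raising io.sort.mb (within the task's heap limit) is the usual first step when Spilled Records climbs well above the output record counts.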

In your specific example, "Spilled Records" (108) is greater than "Map output records" + "Reduce output records" (54 + 15 = 69), so you are spilling more than once.
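If you want to make that comparison without scraping the console output, here is a minimal sketch against the 0.20-era org.apache.hadoop.mapreduce API (the class and method names are just illustrative, not from the matrix-multiply example):

    import org.apache.hadoop.mapreduce.Counter;
    import org.apache.hadoop.mapreduce.CounterGroup;
    import org.apache.hadoop.mapreduce.Job;

    public class CounterDump {
        // Sketch: print every framework counter of a finished job, so that
        // "Spilled Records" can be compared with the output record counts.
        public static void dump(Job job) throws Exception {
            for (CounterGroup group : job.getCounters()) {
                for (Counter counter : group) {
                    System.out.println(group.getDisplayName() + ": "
                            + counter.getDisplayName() + " = " + counter.getValue());
                }
            }
        }
    }

You would call dump(job) in your driver after job.waitForCompletion(true) returns.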