2017-03-27 93 views

Getting the maximum word count in a Hadoop MapReduce word count

I have been following the MapReduce Python code on this site (http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/), which returns a word count from a text file (i.e. each word and the number of times it occurs in the text). However, I would like to know how to return the word that occurs the most. The mapper and reducer are as follows -

#Mapper

import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    # increase counters
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        print '%s\t%s' % (word, 1)
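As a quick local sanity check (not part of the original tutorial; the helper name here is mine), the mapper's per-line logic can be exercised without Hadoop:

```python
# Illustrative sketch (Python 3): collect the (word, 1) pairs the mapper
# would print for one line of input.
def map_line(line):
    """Emit (word, 1) pairs for each whitespace-separated word on a line."""
    return [(word, 1) for word in line.strip().split()]

print(map_line("foo foo quux"))  # [('foo', 1), ('foo', 1), ('quux', 1)]
```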

#Reducer

from operator import itemgetter
import sys

current_word = None
current_count = 0
word = None

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()

    # parse the input we got from mapper.py
    word, count = line.split('\t', 1)

    # convert count (currently a string) to int
    try:
        count = int(count)
    except ValueError:
        # count was not a number, so silently
        # ignore/discard this line
        continue

    # this IF-switch only works because Hadoop sorts map output
    # by key (here: word) before it is passed to the reducer
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # write result to STDOUT
            print '%s\t%s' % (current_word, current_count)
        current_count = count
        current_word = word

# do not forget to output the last word if needed!
if current_word == word:
    print '%s\t%s' % (current_word, current_count)
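The full streaming pipeline (map, sort, reduce) can also be simulated in-process to see what the reducer should emit. This is only a sketch with illustrative names, not the tutorial's code:

```python
# Illustrative sketch (Python 3): simulate map -> sort -> reduce on a list
# of input lines, mirroring how Hadoop streaming groups sorted keys.
from itertools import groupby

def word_count(lines):
    # map step: one (word, 1) pair per word
    pairs = [(w, 1) for line in lines for w in line.strip().split()]
    # shuffle/sort step: Hadoop sorts map output by key before reducing
    pairs.sort()
    # reduce step: sum counts within each run of identical words
    return [(word, sum(c for _, c in group))
            for word, group in groupby(pairs, key=lambda p: p[0])]

print(word_count(["foo foo quux", "labs foo bar quux"]))
# [('bar', 1), ('foo', 3), ('labs', 1), ('quux', 2)]
```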

So, I know I need to add something at the end of the reducer, but I am just not entirely sure what.


So you just want to find the word with the maximum count and output it? –


That's right. The word with the highest count, and the count itself. – tattybojangler


I'm guessing some code gets added at the end of the reducer, but everything I have tried has been to no avail. – tattybojangler

Answer


You need to set only one reducer so that it aggregates all the values (-numReduceTasks 1).

This is how your reducer should look:

import sys

current_word = None
current_count = 0
max_word = None
max_count = 0

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()

    # parse the input we got from mapper.py
    word, count = line.split('\t', 1)

    # convert count (currently a string) to int
    try:
        count = int(count)
    except ValueError:
        # count was not a number, so silently
        # ignore/discard this line
        continue

    # this IF-switch only works because Hadoop sorts map output
    # by key (here: word) before it is passed to the reducer
    if current_word == word:
        current_count += count
    else:
        # check if the word we just finished beats the running maximum
        if current_count > max_count:
            max_count = current_count
            max_word = current_word
        current_count = count
        current_word = word

# do not forget to check the last word if needed!
if current_count > max_count:
    max_count = current_count
    max_word = current_word

print '%s\t%s' % (max_word, max_count)
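The max-finding loop can be checked without a cluster by wrapping the same logic in a plain function over already-sorted (word, count) records. This is a sketch with names of my own choosing:

```python
# Illustrative sketch (Python 3): same accumulate-then-compare logic as the
# reducer above, over sorted (word, count) pairs.
def find_max_word(sorted_records):
    current_word, current_count = None, 0
    max_word, max_count = None, 0
    for word, count in sorted_records:
        if current_word == word:
            # same key as before: keep accumulating its total
            current_count += count
        else:
            # key changed: see if the finished word beats the running maximum
            if current_count > max_count:
                max_word, max_count = current_word, current_count
            current_word, current_count = word, count
    # do not forget the last word
    if current_count > max_count:
        max_word, max_count = current_word, current_count
    return max_word, max_count

print(find_max_word([("bar", 1), ("foo", 1), ("foo", 2), ("quux", 2)]))
# ('foo', 3)
```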

But with only one reducer you lose parallelization, so it would probably be faster to run this as a second job after the first one, rather than instead of it. That way, your mapper would be the same as the reducer.