2017-03-27 93 views

Getting the maximum word count in a Hadoop MapReduce word count

I have been following the MapReduce Python code on this site (http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/), which returns a word count from a text file (i.e. each word and the number of times it occurs in the text). However, I would like to know how to return the word that occurs the most. The mapper and reducer are as follows -

#Mapper

import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    # increase counters
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        print '%s\t%s' % (word, 1)
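As a quick local sanity check (not part of the original tutorial; the helper name here is mine), the mapper's per-line logic can be exercised without Hadoop:

```python
# Illustrative sketch (Python 3): collect the (word, 1) pairs the mapper
# would print for one line of input.
def map_line(line):
    """Emit (word, 1) pairs for each whitespace-separated word on a line."""
    return [(word, 1) for word in line.strip().split()]

print(map_line("foo foo quux"))  # [('foo', 1), ('foo', 1), ('quux', 1)]
```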

#Reducer

from operator import itemgetter
import sys

current_word = None
current_count = 0
word = None

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()

    # parse the input we got from mapper.py
    word, count = line.split('\t', 1)

    # convert count (currently a string) to int
    try:
        count = int(count)
    except ValueError:
        # count was not a number, so silently
        # ignore/discard this line
        continue

    # this IF-switch only works because Hadoop sorts map output
    # by key (here: word) before it is passed to the reducer
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # write result to STDOUT
            print '%s\t%s' % (current_word, current_count)
        current_count = count
        current_word = word

# do not forget to output the last word if needed!
if current_word == word:
    print '%s\t%s' % (current_word, current_count)
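The full streaming pipeline (map, sort, reduce) can also be simulated in-process to see what the reducer should emit. This is only a sketch with illustrative names, not the tutorial's code:

```python
# Illustrative sketch (Python 3): simulate map -> sort -> reduce on a list
# of input lines, mirroring how Hadoop streaming groups sorted keys.
from itertools import groupby

def word_count(lines):
    # map step: one (word, 1) pair per word
    pairs = [(w, 1) for line in lines for w in line.strip().split()]
    # shuffle/sort step: Hadoop sorts map output by key before reducing
    pairs.sort()
    # reduce step: sum counts within each run of identical words
    return [(word, sum(c for _, c in group))
            for word, group in groupby(pairs, key=lambda p: p[0])]

print(word_count(["foo foo quux", "labs foo bar quux"]))
# [('bar', 1), ('foo', 3), ('labs', 1), ('quux', 2)]
```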

So, I know I need to add something at the end of the reducer, but I am just not entirely sure what.


So you just want to find the word with the maximum count and output it? –


That's right. The word with the highest count, and the count itself. – tattybojangler


I'm guessing some code gets added at the end of the reducer, but everything I have tried has been to no avail. – tattybojangler

Answer


You need to set only one reducer so that it aggregates all the values (-numReduceTasks 1).

This is how your reducer should look:

import sys

current_word = None
current_count = 0
max_word = None
max_count = 0

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()

    # parse the input we got from mapper.py
    word, count = line.split('\t', 1)

    # convert count (currently a string) to int
    try:
        count = int(count)
    except ValueError:
        # count was not a number, so silently
        # ignore/discard this line
        continue

    # this IF-switch only works because Hadoop sorts map output
    # by key (here: word) before it is passed to the reducer
    if current_word == word:
        current_count += count
    else:
        # check if the word we just finished beats the running maximum
        if current_count > max_count:
            max_count = current_count
            max_word = current_word
        current_count = count
        current_word = word

# do not forget to check the last word if needed!
if current_count > max_count:
    max_count = current_count
    max_word = current_word

print '%s\t%s' % (max_word, max_count)
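The max-finding loop can be checked without a cluster by wrapping the same logic in a plain function over already-sorted (word, count) records. This is a sketch with names of my own choosing:

```python
# Illustrative sketch (Python 3): same accumulate-then-compare logic as the
# reducer above, over sorted (word, count) pairs.
def find_max_word(sorted_records):
    current_word, current_count = None, 0
    max_word, max_count = None, 0
    for word, count in sorted_records:
        if current_word == word:
            # same key as before: keep accumulating its total
            current_count += count
        else:
            # key changed: see if the finished word beats the running maximum
            if current_count > max_count:
                max_word, max_count = current_word, current_count
            current_word, current_count = word, count
    # do not forget the last word
    if current_count > max_count:
        max_word, max_count = current_word, current_count
    return max_word, max_count

print(find_max_word([("bar", 1), ("foo", 1), ("foo", 2), ("quux", 2)]))
# ('foo', 3)
```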

But with only one reducer you lose parallelization, so it would probably be faster to run this as a second job after the first one, rather than instead of it. That way, your mapper would be the same as the reducer.