Finding the minimum number with Hadoop Streaming (Python)

I am new to the Hadoop framework and the MapReduce abstraction.

Basically, I want to find the smallest number in a huge text file (the numbers are separated by ",").

So, here is my code, mapper.py:

#!/usr/bin/env python

import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into numbers
    numbers = line.split(",")
    # increase counters
    for number in numbers:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        print('%s\t%s' % (number, 1))

Reducer (reducer.py):

#!/usr/bin/env python

from operator import itemgetter
import sys

smallest_number = sys.float_info.max
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()

    # parse the input we got from mapper.py
    number, count = line.split('\t', 1)
    try:
        number = float(number)
    except ValueError:
        continue

    if number < smallest_number:
        smallest_number = number
        print(smallest_number)  # <---- I think the error is here... there is no key/value thingy

    print(smallest_number)

The error I get:

12/10/04 12:07:22 ERROR streaming.StreamJob: Job not successful. Error: NA
12/10/04 12:07:22 INFO streaming.StreamJob: killJob...
Streaming Command Failed!

Source

2012-10-04 Fraz

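What result do you get? What is the problem? What "key value" are you talking about? – Junuxx

@Junuxx: Hi.. I just posted the error.. basically.. what would the MapReduce abstraction for finding the smallest number in a text file look like? The error I am talking about is.. the mapper emits (number, 1), basically the same format as the mapper in the word count example. In the reducer, all I care about is the number.. I compare that number with the current smallest number and then swap? – Fraz

It might help to debug without Hadoop: 'cat input | ./mapper.py | sort | ./reducer.py'. Does this run successfully? –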

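First of all, note that your solution will not work unless you use only one reducer. If you use multiple reducers, each reducer will output the smallest number it receives, and you will end up with more than one number.

The next question, then, is: if I have to use only one reducer for this problem (i.e., only one reduce task), what do I gain by using MapReduce? The trick is that the mappers run in parallel. On the other hand, you do not want the mappers to output every number they read; otherwise the single reducer would have to look through the entire data set, which is no improvement over a sequential solution. The way around this is to have each mapper output only the smallest number it reads. In addition, since you want all mapper outputs to go to the same reducer, the mapper output key must be the same across all mappers.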
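To make the point about keys concrete, here is a small local sketch (not Hadoop code). The `partition` function is a hypothetical stand-in for the framework's partitioner, and `run_reducers` is a name invented for this illustration; the only assumption is that records with the same key always reach the same reducer.

```python
from collections import defaultdict


def partition(key, num_reducers):
    # Hypothetical stand-in for the framework's partitioner: any
    # deterministic function of the key is enough for this illustration.
    return sum(ord(ch) for ch in key) % num_reducers


def run_reducers(pairs, num_reducers):
    """Route (key, number) pairs to reducers and return each reducer's minimum."""
    buckets = defaultdict(list)
    for key, number in pairs:
        buckets[partition(key, num_reducers)].append(float(number))
    # One output per reducer that received any data, in reducer order.
    return [min(values) for _, values in sorted(buckets.items())]


numbers = ["7", "3", "9", "4", "1", "8"]

# Scheme from the question: the number itself is the key, so records scatter.
scattered = run_reducers([(n, n) for n in numbers], num_reducers=2)

# Scheme from this answer: one constant key, so every record meets in one reducer.
single = run_reducers([("0", n) for n in numbers], num_reducers=2)

print(scattered)  # [4.0, 1.0]: one "minimum" per reducer
print(single)     # [1.0]: a single global minimum
```

With the number as the key, each of the two simulated reducers reports its own minimum; with the constant key there is exactly one result.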

The mapper will look like this:

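#!/usr/bin/env python

import sys

smallest = None
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    if not line:
        continue  # skip blank lines, which float() cannot parse
    # split the line into numbers and keep the smallest seen so far
    numbers = line.split(",")
    s = min(float(x) for x in numbers)
    if smallest is None or s < smallest:
        smallest = s

# emit one record under a constant key so that every mapper's output
# reaches the same reducer; emit nothing if this split had no numbers
# (%r keeps full precision, whereas %f would round to six decimals)
if smallest is not None:
    print('%d\t%r' % (0, smallest))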

The reducer:

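#!/usr/bin/env python

import sys

smallest = None
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    if not line:
        continue
    # each record is "key<TAB>value"; only the value matters here
    s = float(line.split('\t')[1])
    if smallest is None or s < smallest:
        smallest = s

# print the global minimum, or nothing if no records arrived
if smallest is not None:
    print(smallest)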
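If you want to check the logic without a cluster, the two scripts can be mirrored as plain functions and run on a few hand-made splits. This is only a sketch: `map_split` and `reduce_records` are names made up for this example, and the `sorted` call merely stands in for the shuffle between the map and reduce phases.

```python
def map_split(lines):
    """Return the 'key<TAB>value' records one mapper emits for its split."""
    smallest = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        local_min = min(float(x) for x in line.split(","))
        if smallest is None or local_min < smallest:
            smallest = local_min
    # At most one record per split, always under the constant key 0.
    return [] if smallest is None else ["0\t%r" % smallest]


def reduce_records(records):
    """Return the minimum over all mapper records, or None if there are none."""
    values = [float(record.split("\t", 1)[1]) for record in records]
    return min(values) if values else None


# Three hypothetical input splits, one per mapper.
splits = [["5,3,8", "12,7"], ["4,9", "2.5,6"], ["10,11"]]

# Stand-in for the shuffle: gather and sort every mapper's output.
shuffled = sorted(rec for split in splits for rec in map_split(split))
result = reduce_records(shuffled)

print("%d records reach the reducer, minimum %r" % (len(shuffled), result))
```

Nine input numbers shrink to three records before the reduce step, which is where the gain over a single sequential pass comes from.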

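There are other possible ways to solve this problem, for example letting the MapReduce framework itself sort the numbers, so that the first number the reducer receives is the smallest. If you want to learn more about the MapReduce programming paradigm, you can read this tutorial with examples, from my blog.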
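The sort-based alternative can be sketched locally as follows. This is an illustration under a stated assumption, not a tested Hadoop job: it assumes the job is configured so that keys are compared as numbers. Streaming keys are compared as text by default, and the second result shows what goes wrong in that case. `map_numbers` and `first_key` are hypothetical names.

```python
def map_numbers(lines):
    """Mapper side: emit every number as its own key (no value is needed)."""
    for line in lines:
        for token in line.strip().split(","):
            if token:
                yield float(token)


def first_key(sorted_keys):
    """Reducer side: the first key of a sorted stream, or None if it is empty."""
    return next(iter(sorted_keys), None)


lines = ["10,9", "100,25"]
keys = list(map_numbers(lines))

# Stand-in for the shuffle, assuming keys are compared numerically.
numeric = first_key(sorted(keys))

# If keys were compared as text instead, "10.0" would sort before "9.0".
textual = first_key(sorted(str(k) for k in keys))

print("%r %r" % (numeric, textual))  # 9.0 '10.0'
```

Note that this variant sends every number through the shuffle, so it gives up the data reduction that the per-mapper minimum provides.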

Source

2013-08-18 02:22:47 had00b
