Hadoop MapReduce job on files containing HTML tags

I have a bunch of large HTML files and I want to run a Hadoop MapReduce job on them to find the most frequently used words. I wrote both my mapper and my reducer in Python and run them with Hadoop streaming.

Here is my mapper:

#!/usr/bin/env python 

import sys 
import re 
import string 

def remove_html_tags(in_text):
    '''
    Remove any HTML tags that are found.

    '''
    global flag
    in_text = in_text.lstrip()
    in_text = in_text.rstrip()
    in_text = in_text + "\n"

    if flag == True:
        in_text = "<" + in_text
        flag = False
    if re.search('^<', in_text) != None and re.search('(>\n+)$', in_text) == None:
        in_text = in_text + ">"
        flag = True
    p = re.compile(r'<[^<]*?>')
    in_text = p.sub('', in_text)
    return in_text

# input comes from STDIN (standard input) 
global flag 
flag=False 
for line in sys.stdin: 
    # remove leading and trailing whitespace, set to lowercase and remove HTML tags
    line = line.strip().lower() 
    line = remove_html_tags(line) 
    # split the line into words 
    words = line.split() 
    # increase counters 
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        if word == '': continue
        for c in string.punctuation:
            word = word.replace(c, '')

        print '%s\t%s' % (word, 1)

Here is my reducer:

#!/usr/bin/env python 

from operator import itemgetter 
import sys 

# maps words to their counts 
word2count = {} 

# input comes from STDIN 
for line in sys.stdin: 
    # remove leading and trailing whitespace 
    line = line.strip() 

    # parse the input we got from mapper.py 
    word, count = line.split('\t', 1) 
    # convert count (currently a string) to int 
    try:
        count = int(count)
        word2count[word] = word2count.get(word, 0) + count
    except ValueError:
        pass

sorted_word2count = sorted(word2count.iteritems(),
                           key=lambda(k,v):(v,k), reverse=True)

# write the results to STDOUT (standard output) 
for word, count in sorted_word2count: 
    print '%s\t%s'% (word, count)

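Whenever I pipe in a small sample string such as "hello world hello hello world ..." I get the correct output: a ranked list. However, when I take a small HTML file and use cat to pipe the HTML into my mapper, I get the following error (input2 contains some HTML code):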

$ cat input2 | /home/rohanbk/mapper.py | sort | /home/rohanbk/reducer.py
Traceback (most recent call last): 
    File "/home/rohanbk/reducer.py", line 15, in <module> 
    word, count = line.split('\t', 1) 
ValueError: need more than 1 value to unpack

Can anyone explain why I am getting this? Also, what is a good way to debug MapReduce jobs?


2009-12-03 GobiasKoffi

You can reproduce the bug with just:

echo "hello - world" | ./mapper.py | sort | ./reducer.py

The problem is here:

if word == '': continue
for c in string.punctuation:
    word = word.replace(c, '')

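If word consists only of punctuation, as is the case for the "-" in the input above once the line has been split, it is converted to an empty string. So just move the empty-string check to after the replacement.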
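A minimal sketch of that reordering, assuming the rest of the mapper stays as posted. The helper names `clean_word` and `map_line` are hypothetical and only there to make the logic easy to try in isolation; the real script would keep printing inside its `for line in sys.stdin` loop. The last lines illustrate why the original ordering ends in the reducer's ValueError.

```python
import string

def clean_word(word):
    """Strip every punctuation character from word (hypothetical helper)."""
    for c in string.punctuation:
        word = word.replace(c, '')
    return word

def map_line(line):
    """Return the tab-delimited records the mapper would emit for one line."""
    records = []
    for word in line.split():
        word = clean_word(word)
        # Check for emptiness *after* stripping punctuation, so a token
        # such as "-" no longer produces a record with an empty key.
        if word == '':
            continue
        records.append('%s\t%s' % (word, 1))
    return records

for record in map_line("hello - world"):
    print(record)

# With the original ordering, "-" became the record "\t1". The reducer's
# line.strip() removes the leading tab, leaving "1", so split('\t', 1)
# returns a single element and the two-name unpacking fails.
bad_record = '%s\t%s' % ('', 1)
print(bad_record.strip().split('\t', 1))
```

Running the same `echo "hello - world"` pipeline against a mapper changed this way should give the reducer only well-formed `word<TAB>count` lines.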


2009-12-03 21:53:22 codelogic

Is it safe to assume that if the cat pipeline gives the expected output, the MapReduce step will work as well? – GobiasKoffi 2009-12-04 02:44:31

For a more pleasant Python/Hadoop integration experience, you may want to consider using Dumbo. – drxzcl 2009-12-22 15:50:27
