2015-10-25

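Using Hadoop and mrjob: "Counters from step 1: (no counters found)"

I have a Python file that counts bigrams using mrjob on Hadoop (version 2.6.0), but I'm not getting the output I expect, and I can't decipher the output in my terminal well enough to see where I went wrong.

My code: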

import re

import mrjob.protocol
from mrjob.job import MRJob

# Matches runs of word characters and apostrophes (e.g. "don't")
regex_for_words = re.compile(r"\b[\w']+\b")


class BiCo(MRJob):
    OUTPUT_PROTOCOL = mrjob.protocol.RawProtocol

    def mapper(self, _, line):
        words = regex_for_words.findall(line)
        wordsinline = list()
        for word in words:
            wordsinline.append(word.lower())
        wordscounter = 0
        totalwords = len(wordsinline)
        for word in wordsinline:
            if wordscounter < (totalwords - 1):
                nextword_pos = wordscounter + 1
                nextword = wordsinline[nextword_pos]
                # bigram is a tuple here
                bigram = word, nextword
                wordscounter += 1
                yield (bigram, 1)

    def combiner(self, bigram, counts):
        yield (bigram, sum(counts))

    def reducer(self, bigram, counts):
        yield (bigram, str(sum(counts)))


if __name__ == '__main__':
    BiCo.run()

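I wrote the code in my mapper function (basically everything up through the "yield" line) on my local machine to make sure it was grabbing bigrams as expected, so I figured it should work fine... but, of course, something went wrong.

When I run the code on the Hadoop server, I get the following output (apologies if this is more than necessary; the screen prints a lot of information and I'm not yet sure what will help home in on the problem area):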
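For reference, the kind of local check described above can be sketched without mrjob or Hadoop. This is only a minimal illustration: the helper name `bigrams_in_line` and the sample sentence are made up for the example, and `zip` is used in place of the counter-based loop, which should produce the same pairs.

```python
import re

regex_for_words = re.compile(r"\b[\w']+\b")


def bigrams_in_line(line):
    """Return the (word, nextword) pairs the mapper logic would emit for one line.

    Hypothetical helper for local testing; not part of the mrjob job itself.
    """
    words = [w.lower() for w in regex_for_words.findall(line)]
    # zip pairs each word with its successor; a line with fewer than
    # two words produces no pairs at all.
    return list(zip(words, words[1:]))


print(bigrams_in_line("The boy saw the dog"))
```

Note that a check like this only exercises the pairing logic. It says nothing about how the keys are encoded once the job writes its output.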

HADOOP: 2015-10-25 17:00:46,992 INFO [main] mapreduce.Job (Job.java:monitorAndPrintJob(1355)) - Running job: job_1438612881113_6410 
HADOOP: 2015-10-25 17:00:52,110 INFO [main] mapreduce.Job (Job.java:monitorAndPrintJob(1376)) - Job job_1438612881113_6410 running in uber mode : false 
HADOOP: 2015-10-25 17:00:52,111 INFO [main] mapreduce.Job (Job.java:monitorAndPrintJob(1383)) - map 0% reduce 0% 
HADOOP: 2015-10-25 17:00:58,171 INFO [main] mapreduce.Job (Job.java:monitorAndPrintJob(1383)) - map 33% reduce 0% 
HADOOP: 2015-10-25 17:01:00,184 INFO [main] mapreduce.Job (Job.java:monitorAndPrintJob(1383)) - map 100% reduce 0% 
HADOOP: 2015-10-25 17:01:07,222 INFO [main] mapreduce.Job (Job.java:monitorAndPrintJob(1383)) - map 100% reduce 100% 
HADOOP: 2015-10-25 17:01:08,239 INFO [main] mapreduce.Job (Job.java:monitorAndPrintJob(1394)) - Job job_1438612881113_6410 completed successfully 
HADOOP: 2015-10-25 17:01:08,321 INFO [main] mapreduce.Job (Job.java:monitorAndPrintJob(1401)) - Counters: 51 
HADOOP:   File System Counters 
HADOOP:     FILE: Number of bytes read=2007840 
HADOOP:     FILE: Number of bytes written=4485245 
HADOOP:     FILE: Number of read operations=0 
HADOOP:     FILE: Number of large read operations=0 
HADOOP:     FILE: Number of write operations=0 
HADOOP:     HDFS: Number of bytes read=1013129 
HADOOP:     HDFS: Number of bytes written=0 
HADOOP:     HDFS: Number of read operations=12 
HADOOP:     HDFS: Number of large read operations=0 
HADOOP:     HDFS: Number of write operations=2 
HADOOP:   Job Counters 
HADOOP:     Killed map tasks=1 
HADOOP:     Launched map tasks=4 
HADOOP:     Launched reduce tasks=1 
HADOOP:     Rack-local map tasks=4 
HADOOP:     Total time spent by all maps in occupied slots (ms)=33282 
HADOOP:     Total time spent by all reduces in occupied slots (ms)=12358 
HADOOP:     Total time spent by all map tasks (ms)=16641 
HADOOP:     Total time spent by all reduce tasks (ms)=6179 
HADOOP:     Total vcore-seconds taken by all map tasks=16641 
HADOOP:     Total vcore-seconds taken by all reduce tasks=6179 
HADOOP:     Total megabyte-seconds taken by all map tasks=51121152 
HADOOP:     Total megabyte-seconds taken by all reduce tasks=18981888 
HADOOP:   Map-Reduce Framework 
HADOOP:     Map input records=28214 
HADOOP:     Map output records=133627 
HADOOP:     Map output bytes=2613219 
HADOOP:     Map output materialized bytes=2007852 
HADOOP:     Input split bytes=304 
HADOOP:     Combine input records=133627 
HADOOP:     Combine output records=90382 
HADOOP:     Reduce input groups=79518 
HADOOP:     Reduce shuffle bytes=2007852 
HADOOP:     Reduce input records=90382 
HADOOP:     Reduce output records=0 
HADOOP:     Spilled Records=180764 
HADOOP:     Shuffled Maps =3 
HADOOP:     Failed Shuffles=0 
HADOOP:     Merged Map outputs=3 
HADOOP:     GC time elapsed (ms)=93 
HADOOP:     CPU time spent (ms)=7940 
HADOOP:     Physical memory (bytes) snapshot=1343377408 
HADOOP:     Virtual memory (bytes) snapshot=14458105856 
HADOOP:     Total committed heap usage (bytes)=4045406208 
HADOOP:   Shuffle Errors 
HADOOP:     BAD_ID=0 
HADOOP:     CONNECTION=0 
HADOOP:     IO_ERROR=0 
HADOOP:     WRONG_LENGTH=0 
HADOOP:     WRONG_MAP=0 
HADOOP:     WRONG_REDUCE=0 
HADOOP:   Unencodable output 
HADOOP:     TypeError=79518 
HADOOP:   File Input Format Counters 
HADOOP:     Bytes Read=1012825 
HADOOP:   File Output Format Counters 
HADOOP:     Bytes Written=0 
HADOOP: 2015-10-25 17:01:08,321 INFO [main] streaming.StreamJob (StreamJob.java:submitAndMonitorJob(1022)) - Output directory: hdfs:///user/andersaa/si601f15lab5_output 
Counters from step 1: 
    (no counters found) 

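I'm puzzled as to why no counters would be found from step 1 (which I'm assuming is the mapper portion of my code, though that may be a wrong assumption). If I'm reading the Hadoop output correctly, it looks like the job at least reached the reduce phase (since there are Reduce input groups) and hit no shuffle errors. I suspect the answer to what went wrong lies in "Unencodable output: TypeError=79518", but no amount of Googling has helped me pin down what kind of error this is.

Any help or insight is much appreciated.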

Answer

One problem is in how the bigram is encoded in the mapper. The way it is coded above, the bigram is of the Python type "tuple":

>>> word = 'the' 
>>> word2 = 'boy' 
>>> bigram = word, word2 
>>> type(bigram) 
<type 'tuple'> 
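To see why a tuple key is a problem for a raw text output format, here is a small sketch. It assumes the raw protocol essentially writes the key and value as text joined by a tab; `write_raw` and `try_write` are hypothetical stand-ins written for this illustration, not mrjob's actual code.

```python
def write_raw(key, value):
    """Hypothetical stand-in for a raw text protocol: key and value joined by a tab."""
    return "\t".join((key, value))


def try_write(key, value):
    """Attempt to encode one record, reporting a TypeError instead of raising it."""
    try:
        return write_raw(key, value)
    except TypeError as exc:
        return "TypeError: " + str(exc)


# A tuple key cannot be joined as text, so encoding fails.
print(try_write(("the", "boy"), "1"))
# A plain string key is encoded without trouble.
print(try_write("the-boy", "1"))
```

If something along these lines happens for every reducer record, each record would be counted as unencodable rather than written, which is consistent with "Reduce output records=0" in the log.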

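Generally, plain strings are used as keys. With OUTPUT_PROTOCOL set to RawProtocol, the key and value are written out as raw text, so a non-string key most likely cannot be encoded; that would explain why the TypeError=79518 counter matches the 79518 reduce input groups. Instead, create the bigram as a string. One way you can do that is: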

bigram = '-'.join((word, nextword)) 

When I make that change to your program, the output I see looks like this:

automatic-translation 1 
automatic-vs 1 
automatically-focus 1 
automatically-learn 1 
automatically-learning 1 
automatically-translate 1 
available-including 1 
available-without 1 
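If you want to sanity-check the string-key approach locally before resubmitting to Hadoop, a rough sketch like the following imitates the map and reduce steps in plain Python. The function names `string_bigrams` and `count_bigrams` and the sample lines are invented for this example, and `collections.Counter` stands in for the combiner and reducer; it is not how mrjob runs the job.

```python
import re
from collections import Counter

regex_for_words = re.compile(r"\b[\w']+\b")


def string_bigrams(line):
    """Yield each bigram in the line as a single 'word-nextword' string key."""
    words = [w.lower() for w in regex_for_words.findall(line)]
    for word, nextword in zip(words, words[1:]):
        yield "-".join((word, nextword))


def count_bigrams(lines):
    """Sum the occurrences of every string bigram across all lines."""
    counts = Counter()
    for line in lines:
        counts.update(string_bigrams(line))
    return counts


# Made-up sample input, only for illustration.
sample = ["Automatic translation is available", "automatic translation works"]
for bigram, count in sorted(count_bigrams(sample).items()):
    # Key and value are both strings, so they join cleanly with a tab.
    print(bigram + "\t" + str(count))
```

Because the regex never matches a hyphen, joining on "-" keeps the two words of each key unambiguous.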

One other tip: try -q on your command line to silence all the intermediate Hadoop noise. Sometimes it just gets in your way.

HTH。