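Using Hadoop and mrjob to find bigrams: "Counters from step 1: No counters found"
I have a Python file that counts bigrams using mrjob on Hadoop (version 2.6.0), but I'm not getting the output I expect, and I'm having trouble deciphering the terminal output to see where I'm going wrong.
My code: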
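import re

import mrjob.protocol
from mrjob.job import MRJob

# A word is a run of word characters and apostrophes.
regex_for_words = re.compile(r"\b[\w']+\b")


class BiCo(MRJob):
    # Write the final output as raw key/value text.
    OUTPUT_PROTOCOL = mrjob.protocol.RawProtocol

    def mapper(self, _, line):
        # Lowercase every word found in the line.
        words = regex_for_words.findall(line)
        wordsinline = list()
        for word in words:
            wordsinline.append(word.lower())
        wordscounter = 0
        totalwords = len(wordsinline)
        # Pair each word with the word that follows it; the last word has no successor.
        for word in wordsinline:
            if wordscounter < (totalwords - 1):
                nextword_pos = wordscounter + 1
                nextword = wordsinline[nextword_pos]
                bigram = word, nextword
                wordscounter += 1
                yield (bigram, 1)

    def combiner(self, bigram, counts):
        # Pre-aggregate counts on the map side.
        yield (bigram, sum(counts))

    def reducer(self, bigram, counts):
        # Emit the total count for each bigram as a string.
        yield (bigram, str(sum(counts)))


if __name__ == '__main__':
    BiCo.run()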
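I ran the code from my mapper function (basically everything up through the "yield" line) on my local machine to make sure it was grabbing bigrams as expected, so I think it should be working fine... but, of course, something is going wrong.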
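A standalone sketch of that kind of local check, assuming the mapper logic is pulled out into a plain function. The helper name `bigrams` is illustrative and not part of the job above, and it uses `zip` instead of the explicit counter, so treat it as an equivalent-looking approximation rather than the exact code I ran:

```python
import re

# Same pattern as in the job above.
regex_for_words = re.compile(r"\b[\w']+\b")


def bigrams(line):
    """Return the adjacent word pairs in a line, lowercased (hypothetical helper)."""
    words = [w.lower() for w in regex_for_words.findall(line)]
    # zip pairs each word with its successor; the last word has no successor.
    return list(zip(words, words[1:]))


if __name__ == "__main__":
    for pair in bigrams("The quick brown fox"):
        print(pair)
```

A line with fewer than two words produces an empty list, which matches the mapper yielding nothing for such lines.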
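When I run the code on the Hadoop server, I get the following output (apologies if this is more than necessary; the screen prints a ton of information and I'm not yet sure what will help in homing in on the problem area):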
HADOOP: 2015-10-25 17:00:46,992 INFO [main] mapreduce.Job (Job.java:monitorAndPrintJob(1355)) - Running job: job_1438612881113_6410
HADOOP: 2015-10-25 17:00:52,110 INFO [main] mapreduce.Job (Job.java:monitorAndPrintJob(1376)) - Job job_1438612881113_6410 running in uber mode : false
HADOOP: 2015-10-25 17:00:52,111 INFO [main] mapreduce.Job (Job.java:monitorAndPrintJob(1383)) - map 0% reduce 0%
HADOOP: 2015-10-25 17:00:58,171 INFO [main] mapreduce.Job (Job.java:monitorAndPrintJob(1383)) - map 33% reduce 0%
HADOOP: 2015-10-25 17:01:00,184 INFO [main] mapreduce.Job (Job.java:monitorAndPrintJob(1383)) - map 100% reduce 0%
HADOOP: 2015-10-25 17:01:07,222 INFO [main] mapreduce.Job (Job.java:monitorAndPrintJob(1383)) - map 100% reduce 100%
HADOOP: 2015-10-25 17:01:08,239 INFO [main] mapreduce.Job (Job.java:monitorAndPrintJob(1394)) - Job job_1438612881113_6410 completed successfully
HADOOP: 2015-10-25 17:01:08,321 INFO [main] mapreduce.Job (Job.java:monitorAndPrintJob(1401)) - Counters: 51
HADOOP: File System Counters
HADOOP: FILE: Number of bytes read=2007840
HADOOP: FILE: Number of bytes written=4485245
HADOOP: FILE: Number of read operations=0
HADOOP: FILE: Number of large read operations=0
HADOOP: FILE: Number of write operations=0
HADOOP: HDFS: Number of bytes read=1013129
HADOOP: HDFS: Number of bytes written=0
HADOOP: HDFS: Number of read operations=12
HADOOP: HDFS: Number of large read operations=0
HADOOP: HDFS: Number of write operations=2
HADOOP: Job Counters
HADOOP: Killed map tasks=1
HADOOP: Launched map tasks=4
HADOOP: Launched reduce tasks=1
HADOOP: Rack-local map tasks=4
HADOOP: Total time spent by all maps in occupied slots (ms)=33282
HADOOP: Total time spent by all reduces in occupied slots (ms)=12358
HADOOP: Total time spent by all map tasks (ms)=16641
HADOOP: Total time spent by all reduce tasks (ms)=6179
HADOOP: Total vcore-seconds taken by all map tasks=16641
HADOOP: Total vcore-seconds taken by all reduce tasks=6179
HADOOP: Total megabyte-seconds taken by all map tasks=51121152
HADOOP: Total megabyte-seconds taken by all reduce tasks=18981888
HADOOP: Map-Reduce Framework
HADOOP: Map input records=28214
HADOOP: Map output records=133627
HADOOP: Map output bytes=2613219
HADOOP: Map output materialized bytes=2007852
HADOOP: Input split bytes=304
HADOOP: Combine input records=133627
HADOOP: Combine output records=90382
HADOOP: Reduce input groups=79518
HADOOP: Reduce shuffle bytes=2007852
HADOOP: Reduce input records=90382
HADOOP: Reduce output records=0
HADOOP: Spilled Records=180764
HADOOP: Shuffled Maps =3
HADOOP: Failed Shuffles=0
HADOOP: Merged Map outputs=3
HADOOP: GC time elapsed (ms)=93
HADOOP: CPU time spent (ms)=7940
HADOOP: Physical memory (bytes) snapshot=1343377408
HADOOP: Virtual memory (bytes) snapshot=14458105856
HADOOP: Total committed heap usage (bytes)=4045406208
HADOOP: Shuffle Errors
HADOOP: BAD_ID=0
HADOOP: CONNECTION=0
HADOOP: IO_ERROR=0
HADOOP: WRONG_LENGTH=0
HADOOP: WRONG_MAP=0
HADOOP: WRONG_REDUCE=0
HADOOP: Unencodable output
HADOOP: TypeError=79518
HADOOP: File Input Format Counters
HADOOP: Bytes Read=1012825
HADOOP: File Output Format Counters
HADOOP: Bytes Written=0
HADOOP: 2015-10-25 17:01:08,321 INFO [main] streaming.StreamJob (StreamJob.java:submitAndMonitorJob(1022)) - Output directory: hdfs:///user/andersaa/si601f15lab5_output
Counters from step 1:
(no counters found)
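I'm puzzled as to why no counters would be found from step 1 (which I'm assuming is the mapper portion of my code, though that may be a false assumption). If I'm reading the Hadoop output correctly, the job at least makes it to the reduce stage (since there are Reduce input groups), and it isn't finding any Shuffle Errors. I think the answer to what's going wrong might be in "Unencodable output: TypeError=79518", but no amount of Googling I've done has helped me pin down what this error is.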
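One possible reading of that counter, offered as an assumption and not a confirmed diagnosis: the TypeError count (79518) equals Reduce input groups (79518), and Reduce output records is 0, which would be consistent with every record the reducer yields failing to encode. mrjob's documentation describes RawProtocol as writing the key and value separated by a tab and expecting both to already be strings, while the reducer above yields a tuple as the key. The sketch below does not call mrjob; `raw_write` and `encode_bigram` are hypothetical stand-ins that only mimic a tab-joined raw format to show how a tuple key could raise TypeError and how a string key would not:

```python
def raw_write(key, value):
    """Stand-in for a raw tab-separated writer (NOT mrjob's implementation).

    Both key and value must already be strings.
    """
    return "\t".join((key, value))


def encode_bigram(bigram):
    """Flatten a (word, nextword) tuple into one space-separated string."""
    return " ".join(bigram)


if __name__ == "__main__":
    bigram = ("the", "quick")

    # A tuple key cannot be joined as text, so this raises TypeError.
    try:
        raw_write(bigram, "3")
    except TypeError as exc:
        print("TypeError:", exc)

    # A string key and a string value encode without error.
    print(raw_write(encode_bigram(bigram), "3"))
```

If this reading is right, yielding a string key from the reducer, or using a protocol that can serialize tuples, would be the two things to try; I have not verified either on the cluster.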
Any help or insight is greatly appreciated.