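I am currently taking a big data class, and one of my projects is to run my Mapper/Reducer on a Hadoop cluster that is set up locally. How can I run an MRJob on a local Hadoop cluster using Hadoop Streaming?
I have been using Python, along with the MRJob library, for the class.
Here is the Python code I am currently using for the Mapper/Reducer.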
from mrjob.job import MRJob
from mrjob.step import MRStep
import re
import os

WORD_RE = re.compile(r"[\w']+")


class MRPrepositionsFinder(MRJob):

    def steps(self):
        return [
            MRStep(mapper=self.mapper_get_words),
            MRStep(reducer=self.reducer_find_prep_word)
        ]

    def mapper_get_words(self, _, line):
        # load the indicator words into a set, lowercased and stripped of whitespace
        with open("/hdfs/user/user/indicators.txt") as indicators:
            word_list = set(entry.lower().strip() for entry in indicators)
        # path of the input file currently being processed by this mapper
        fileName = os.environ['map_input_file']
        # iterate through each word in the line
        for word in WORD_RE.findall(line):
            # if the word is an indicator, yield a path component of the file name as the key
            if word.lower() in word_list:
                choice = fileName.split('/')[5]
                yield (choice, 1)

    def reducer_find_prep_word(self, choice, counts):
        # counts holds every 1 emitted for this choice,
        # so this yields key=choice, value=total count
        yield (choice, sum(counts))


if __name__ == '__main__':
    MRPrepositionsFinder.run()
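To check the counting logic on its own, without Hadoop or mrjob, a plain-Python sketch along the following lines might help. The function name `count_indicator_hits` and the sample data are made up for illustration; the sketch only mirrors the map and reduce steps of the job above and is not how mrjob executes them.

```python
import re
from collections import Counter

WORD_RE = re.compile(r"[\w']+")


def count_indicator_hits(files, indicators):
    """Count indicator-word occurrences per input label.

    files: dict mapping a label (hypothetical stand-in for an input file)
           to an iterable of text lines.
    indicators: iterable of indicator words, possibly with trailing newlines.
    """
    # same normalisation as the mapper: lowercase and strip whitespace
    indicator_set = {entry.lower().strip() for entry in indicators}
    totals = Counter()
    for label, lines in files.items():
        for line in lines:
            for word in WORD_RE.findall(line):
                if word.lower() in indicator_set:
                    # the mapper emits (label, 1); the reducer sums these
                    totals[label] += 1
    return dict(totals)


# made-up sample input, purely for illustration
sample = {
    "mail_a": ["The report is ON the desk", "nothing here"],
    "mail_b": ["in and out, in again"],
}
result = count_indicator_hits(sample, ["on\n", "in\n"])
```

If this gives the expected counts on a small sample, any remaining failure is more likely to come from the Hadoop environment than from the job logic.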
When I try to run the code on my Hadoop cluster, I use the following command:
python hrc_discover.py /hdfs/user/user/HRCmail/* -r hadoop --hadoop-bin /usr/bin/hadoop > /hdfs/user/user/output
Unfortunately, every time I run the command I get the following error:
No configs found; falling back on auto-configuration
STDERR: Error: JAVA_HOME is not set and could not be found.
Traceback (most recent call last):
  File "hrc_discover.py", line 37, in <module>
    MRPrepositionsFinder.run()
  File "/usr/lib/python3.5/site-packages/mrjob-0.6.0.dev0-py3.5.egg/mrjob/job.py", line 432, in run
    mr_job.execute()
  File "/usr/lib/python3.5/site-packages/mrjob-0.6.0.dev0-py3.5.egg/mrjob/job.py", line 453, in execute
    super(MRJob, self).execute()
  File "/usr/lib/python3.5/site-packages/mrjob-0.6.0.dev0-py3.5.egg/mrjob/launch.py", line 161, in execute
    self.run_job()
  File "/usr/lib/python3.5/site-packages/mrjob-0.6.0.dev0-py3.5.egg/mrjob/launch.py", line 231, in run_job
    runner.run()
  File "/usr/lib/python3.5/site-packages/mrjob-0.6.0.dev0-py3.5.egg/mrjob/runner.py", line 437, in run
    self._run()
  File "/usr/lib/python3.5/site-packages/mrjob-0.6.0.dev0-py3.5.egg/mrjob/hadoop.py", line 346, in _run
    self._find_binaries_and_jars()
  File "/usr/lib/python3.5/site-packages/mrjob-0.6.0.dev0-py3.5.egg/mrjob/hadoop.py", line 361, in _find_binaries_and_jars
    self.get_hadoop_version()
  File "/usr/lib/python3.5/site-packages/mrjob-0.6.0.dev0-py3.5.egg/mrjob/hadoop.py", line 198, in get_hadoop_version
    return self.fs.get_hadoop_version()
  File "/usr/lib/python3.5/site-packages/mrjob-0.6.0.dev0-py3.5.egg/mrjob/fs/hadoop.py", line 117, in get_hadoop_version
    stdout = self.invoke_hadoop(['version'], return_stdout=True)
  File "/usr/lib/python3.5/site-packages/mrjob-0.6.0.dev0-py3.5.egg/mrjob/fs/hadoop.py", line 172, in invoke_hadoop
    raise CalledProcessError(proc.returncode, args)
subprocess.CalledProcessError: Command '['/usr/bin/hadoop', 'version']' returned non-zero exit status 1
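I looked around online and found that I need to export my JAVA_HOME variable, but I don't want to set anything that might break my setup.
Any help would be greatly appreciated, thanks!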
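One low-risk way to experiment with JAVA_HOME might be to set it only for a single child process instead of changing any system-wide or shell configuration. The sketch below is a rough illustration under assumptions: the helper names `env_with_java_home` and `hadoop_version` are invented here, and the Java path in the usage comment is a placeholder that would need to be replaced with the real JDK location on the machine.

```python
import os
import subprocess


def env_with_java_home(java_home, base_env=None):
    """Return a copy of the environment with JAVA_HOME set.

    The copy is what gets passed to the child process; neither os.environ
    nor base_env is modified, so nothing outside that process changes.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["JAVA_HOME"] = java_home
    return env


def hadoop_version(hadoop_bin, java_home):
    """Run `<hadoop_bin> version` with JAVA_HOME set for that process only."""
    return subprocess.run(
        [hadoop_bin, "version"],
        env=env_with_java_home(java_home),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )


# Hypothetical usage; the JDK path below is a placeholder, not a known location:
# completed = hadoop_version("/usr/bin/hadoop", "/path/to/your/jdk")
# print(completed.returncode, completed.stdout)
```

The same idea works directly in a shell: prefixing a single command with `JAVA_HOME=/path/to/your/jdk` applies the variable to that command only, and a plain `export` lasts only for the current shell session unless it is written into a startup file.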