I have a small problem: I want to use an NLTK corpus in HDFS, but it fails. For example, I want to load the NLTK stopwords corpus in my Python code.
I followed this guide: http://eigenjoy.com/2009/11/18/how-to-use-cascading-with-hadoop-streaming/
I followed it as closely as I could, but I don't know how to adapt it to my setup. My NLTK directory is named nltk-2.0.1.rc1 and my PyYAML directory is named PyYAML.3.0.1, so my command is:
zip -r nltkandyaml.zip nltk-2.0.1.rc1 PyYAML.3.0.1
Then the guide says: "mv ntlkandyaml.zip /path/to/where/your/mapper/will/be/nltkandyaml.mod"
My mapper.py is saved at /home/mapreduce/mapper.py, so my command is:
mv ntlkandyaml.zip /home/mapreduce/nltkandyaml.mod
Is that right?
Then I zipped my stopwords corpus:
zip -r /nltk_data/corpora/stopwords-flat.zip *
In my code I use:
importer = zipimport.zipimporter('nltkandyaml.mod')
yaml = importer.load_module('PyYAML-3.09')
nltk = importer.load_module('nltk-2.1.0.1rc1')
from nltk.corpus.reader import stopwords
from nltk.corpus.reader import StopWordsCorpusReader
nltk.data.path+=["."]
stopwords = StopWordsCorpusReader(nltk.data.find('lib/stopwords-flat.zip'))
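For comparison, here is a minimal sketch of importing a package from a zip archive through Python's built-in zip import support, by putting the archive on sys.path instead of calling load_module directly. The helper name import_from_zip and its subdir parameter are hypothetical and not part of the guide. It assumes the module is requested by its importable name (for example nltk) rather than by the versioned directory name, and that subdir names the directory inside the archive that holds the package.

```python
import importlib
import os
import sys


def import_from_zip(archive, module_name, subdir=""):
    """Import module_name from a zip archive placed on sys.path.

    archive     -- path to the archive; the extension does not matter (e.g. ".mod")
    module_name -- the importable name, e.g. "nltk", not "nltk-2.0.1.rc1"
    subdir      -- hypothetical parameter: directory inside the archive that
                   directly contains the package or module
    """
    # A sys.path entry may point at a directory inside a zip archive.
    entry = os.path.join(archive, subdir) if subdir else archive
    if entry not in sys.path:
        sys.path.insert(0, entry)
    # Make sure the import system notices the newly added entry.
    importlib.invalidate_caches()
    return importlib.import_module(module_name)
```

With the archive built above, a call might look like import_from_zip('nltkandyaml.mod', 'nltk', subdir='nltk-2.0.1.rc1'), assuming the nltk package directory sits directly under nltk-2.0.1.rc1 inside the zip.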
Finally, I run the command:
bin/hadoop jar /home/../streaming/hadoop-0.21.0-streaming.jar -input
/user/root/input/voa.txt -output /user/root/output -mapper /home/../mapper.py -reducer
/home/../reducer.py -file /home/../nltkandyaml.mod -file /home/../stopwords-flat.zip
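Files shipped with -file are made available in the task's working directory, so the mapper can open stopwords-flat.zip by its bare name. The following is only a sketch of reading a word list straight out of that archive with the standard zipfile module, bypassing the NLTK corpus reader. The function name load_stopwords is hypothetical, and the member name english is an assumption about what the archive contains.

```python
import zipfile


def load_stopwords(zip_path, member="english"):
    """Return the set of words listed one per line in `member` of the archive.

    member -- assumed name of the word-list file inside the zip
    """
    with zipfile.ZipFile(zip_path) as zf:
        lines = zf.read(member).decode("utf-8").splitlines()
    # Drop blank lines and surrounding whitespace.
    return {line.strip() for line in lines if line.strip()}
```

Inside mapper.py this could be called as load_stopwords('stopwords-flat.zip'), after which each input token is checked against the returned set.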
Please tell me where I went wrong.
Thank you all.
Could you paste the error message you get when running the hadoop streaming job? – viper