Hadoop and NLTK: fails when using stopwords
I am trying to run a Python program on Hadoop. The program uses the NLTK library. It also uses the Hadoop Streaming API, as described here.
mapper.py:
#!/usr/bin/env python
import sys
import nltk
from nltk.corpus import stopwords
#print stopwords.words('english')
for line in sys.stdin:
    print line,
reducer.py:
#!/usr/bin/env python
import sys
for line in sys.stdin:
    print line,
Console command:
bin/hadoop jar contrib/streaming/hadoop-streaming.jar \
    -file /hadoop/mapper.py -mapper /hadoop/mapper.py \
    -file /hadoop/reducer.py -reducer /hadoop/reducer.py \
    -input /hadoop/input.txt -output /hadoop/output
This runs perfectly, and the output contains only the lines of the input file.
However, when this line (from mapper.py):
#print stopwords.words('english')
is uncommented, the program fails and says
Job not successful. Error: # of failed Map Tasks exceeded allowed limit. FailedCount: 1.
I have checked, and in a standalone Python program,
print stopwords.words('english')
works perfectly, so I am absolutely stumped as to why it makes my Hadoop program fail.
I would appreciate any help! Thanks.
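One way to see the real Python exception, rather than only the job-level failure count, is to catch it inside the mapper and write the traceback to stderr, which ends up in the task attempt logs. A minimal debugging sketch of that idea (not the original mapper):

#!/usr/bin/env python
# mapper.py (debugging sketch): surface the underlying exception in the task logs
import sys
import traceback

try:
    from nltk.corpus import stopwords
    stop = stopwords.words('english')
except Exception:
    # stderr from a streaming task is captured in the task attempt logs,
    # so the actual error (e.g. a missing corpus) becomes visible there
    traceback.print_exc(file=sys.stderr)
    sys.exit(1)

for line in sys.stdin:
    print line,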
You don't have the NLTK corpus in your Hadoop directory. Try this: http://stackoverflow.com/questions/10716302/how-to-import-nltk-corpus-in-hdfs-when-i-use-hadoop-streaming – user1525721
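For reference, one way to make the stopwords corpus visible to each map task is to ship the NLTK data with the streaming job and point nltk.data.path at it before loading the corpus. A rough sketch, assuming an nltk_data directory (containing corpora/stopwords) is shipped and unpacked into the task's working directory:

#!/usr/bin/env python
# mapper.py (sketch): point NLTK at data shipped with the job
import sys
import os
import nltk

# Assumption: an "nltk_data" directory with corpora/stopwords sits next to
# this script in the task's working directory (shipped via -file/-archives)
nltk.data.path.append(os.path.join(os.getcwd(), 'nltk_data'))

from nltk.corpus import stopwords
stop = set(stopwords.words('english'))  # now resolvable on the task node

for line in sys.stdin:
    print line,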
Try this: http://stackoverflow.com/questions/6811549/how-can-i-include-a-python-package-with-hadoop-streaming-job – user1525721
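For the case where NLTK itself is not installed on the task nodes, a common pattern is to ship the library as a zip via -file and put that zip on sys.path, since Python can import pure-Python packages directly from a zip archive. A sketch under that assumption (the archive name nltk.zip is hypothetical):

#!/usr/bin/env python
# mapper.py (sketch): import nltk from a zip shipped with the streaming job
import sys

# Assumption: the job was launched with an extra "-file nltk.zip", so the
# archive lands in the task's working directory; packages inside it become
# importable once the zip is on sys.path
sys.path.insert(0, 'nltk.zip')

import nltk
from nltk.corpus import stopwords  # still needs the stopwords data (see above)

for line in sys.stdin:
    print line,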
@user1525721 Thanks for the reply. I will try it and report back. Is this still necessary if I have NLTK on all the nodes? – Objc55