Hadoop and NLTK: Fails with stopwords

I am trying to run a Python program on Hadoop. The program involves the NLTK library. The program also uses the Hadoop Streaming API, as described here.

mapper.py:

#!/usr/bin/env python 
import sys 
import nltk 
from nltk.corpus import stopwords 

#print stopwords.words('english') 

for line in sys.stdin: 
    print line,

reducer.py:

#!/usr/bin/env python 

import sys 
for line in sys.stdin: 
    print line,

Console command:

bin/hadoop jar contrib/streaming/hadoop-streaming.jar \
    -file /hadoop/mapper.py -mapper /hadoop/mapper.py \
    -file /hadoop/reducer.py -reducer /hadoop/reducer.py \
    -input /hadoop/input.txt -output /hadoop/output

This runs perfectly, and the output just contains the lines of the input file.

However, when this line (from mapper.py):

#print stopwords.words('english')

is uncommented, the program fails and says

Job not successful. Error: # of failed Map Tasks exceeded allowed limit. FailedCount: 1.

I have checked, and in a standalone Python program,

print stopwords.words('english')

works perfectly fine, so I am completely stumped as to why it causes my Hadoop program to fail.

I would greatly appreciate any help! Thank you.


2013-09-27 Objc55

You don't have the NLTK corpus in your Hadoop directory. Try this: http://stackoverflow.com/questions/10716302/how-to-import-nltk-corpus-in-hdfs-when-i-use-hadoop-streaming – user1525721

Try this as well: http://stackoverflow.com/questions/6811549/how-can-i-include-a-python-package-with-hadoop-streaming-job – user1525721

@user1525721 Thanks for the reply. I will try it and post back. If I have NLTK on all the nodes, is this still necessary? – Objc55

Is 'english' in print stopwords.words('english') a file? If so, you need to use -file to send it to the nodes as well.
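If the stopword list does get shipped as a plain file, the mapper could read it directly instead of going through the NLTK corpus lookup. Here is a minimal sketch of that idea, written for Python 3 (the thread's scripts are Python 2). It assumes the file holds one word per line, which I believe is how the NLTK stopword files are laid out, and that Hadoop Streaming places files sent with -file in the task's working directory. The helper names `load_stopwords` and `remove_stopwords` and the throwaway demo file are made up for illustration.

```python
import io
import os
import tempfile


def load_stopwords(path):
    """Read one stopword per line from a plain-text file."""
    with io.open(path, encoding="utf-8") as handle:
        return set(word.strip().lower() for word in handle if word.strip())


def remove_stopwords(line, stopwords):
    """Drop every whitespace-separated token that is a stopword."""
    kept = [token for token in line.split() if token.lower() not in stopwords]
    return " ".join(kept)


# Demo with a throwaway file standing in for the shipped stopword list.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "english")
with io.open(demo_path, "w", encoding="utf-8") as handle:
    handle.write(u"the\nis\na\n")

demo_stopwords = load_stopwords(demo_path)
demo_result = remove_stopwords("The cat is on a mat", demo_stopwords)
print(demo_result)
```

In a real mapper.py you would presumably call `load_stopwords("english")` once at startup and then run `remove_stopwords` on each line read from sys.stdin.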


2013-09-30 22:07:21

Use these commands to load the zipped packages:

import zipimport

# Load nltk and yaml straight from the zip archives shipped with the job.
importer = zipimport.zipimporter('nltk.zip')
importer2 = zipimport.zipimporter('yaml.zip')
yaml = importer2.load_module('yaml')
nltk = importer.load_module('nltk')

Check the links I pasted above. They mention all the steps.
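As far as I know, `load_module` is deprecated in newer Python 3 releases, so a variant that may age better is to put the archive on sys.path and use a normal import. The sketch below shows the mechanism with a tiny throwaway archive. The names `demo_mod` and `demo_mod.zip` are made up for the demo, and whether a full nltk.zip imports cleanly this way is not something I have verified.

```python
import importlib
import os
import sys
import tempfile
import zipfile


def build_demo_zip(directory):
    """Create a small zip holding a hypothetical module named demo_mod."""
    zip_path = os.path.join(directory, "demo_mod.zip")
    with zipfile.ZipFile(zip_path, "w") as archive:
        archive.writestr("demo_mod.py", "GREETING = 'hello from zip'\n")
    return zip_path


def import_from_zip(zip_path, module_name):
    """Put the archive on sys.path and import the named module from it."""
    if zip_path not in sys.path:
        sys.path.insert(0, zip_path)
    importlib.invalidate_caches()
    return importlib.import_module(module_name)


workdir = tempfile.mkdtemp()
demo = import_from_zip(build_demo_zip(workdir), "demo_mod")
print(demo.GREETING)
```

Importing from a zip only works for pure-Python code, so a package that needs compiled extensions would still have to be installed on the nodes.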


2013-09-27 23:56:28 user1525721

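Do I need to send these files via the console command, or store them locally on each machine? Also, do I need nltk.zip or nltk_data.zip? How can I find the former? What role does yaml play in this? Thanks! – Objc55

I tried your suggestion, and importing nltk and yaml works without any issues. However, I still cannot use the stopwords. 'from nltk.corpus import stopwords' does not cause the program to fail, but as soon as I add 'print stopwords.words('english')', it fails. Any ideas how to fix this? I have added this to the console command: '-archives ./stopwords.zip' Thanks! – Objc55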
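The "failed Map Tasks" message hides the actual Python exception, so it may help to catch it in the mapper and write the traceback to stderr, which Hadoop Streaming keeps in the task logs. This is a rough Python 3 sketch of that. `run_guarded` and `failing_task` are hypothetical names, and the error raised in the demo is a stand-in, not the real NLTK message.

```python
import io
import sys
import traceback


def run_guarded(task, stream_err=None):
    """Run task(); on any exception, write the traceback and return 1."""
    if stream_err is None:
        stream_err = sys.stderr
    try:
        task()
    except Exception:
        stream_err.write(u"mapper failed:\n")
        traceback.print_exc(file=stream_err)
        return 1
    return 0


def failing_task():
    # Stand-in for the real mapper body; the message is made up for the demo.
    raise LookupError("resource not found (demo)")


demo_log = io.StringIO()
demo_status = run_guarded(failing_task, demo_log)
print(demo_status)
```

In mapper.py the whole body, including the stopwords call, could go into one function that is passed to `run_guarded`, with the return value handed to sys.exit so the task still counts as failed.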
