2013-10-30 51 views

Actually I am new to Hadoop, and to Python too. So my doubt is how to run a Python script in Hadoop. I am also writing a wordcount program using Python, so can the script be executed without using map-reduce? With the code I have written I can see output like the following:

darkness 1
heaven 2
it 3
light 4
age 5
ages 6
all 7
all 8
authorities 9
before 10
before 11
before 12
belief 13
best 14
comparison 15
degree 16
despair 17
direct 18
direct 19

How to write a wordcount program in Python without using map-reduce?

It is counting the number of words in a list, but what I have to achieve is grouping the words, deleting the duplicates, and also counting the number of occurrences of each.

Below is my code. Can somebody please tell me where I have made the mistake?

******************************************************** 
    Wordcount.py 
******************************************************** 

import urllib2
import random
from operator import itemgetter

current_word = {}
current_count = 0
story = 'http://sixty-north.com/c/t.txt'
request = urllib2.Request(story)
response = urllib2.urlopen(request)
each_word = []
words = None
count = 1
same_words = {}
word = []
""" looping the entire file """
for line in response:
    line_words = line.split()
    for word in line_words:  # looping each line and extracting words
        each_word.append(word)
        random.shuffle(each_word)
        Sort_word = sorted(each_word)
for words in Sort_word:
    same_words = words.lower(), int(count)
    #print same_words
    #print words
    if not words in current_word:
        current_count = current_count + 1
        print '%s\t%s' % (words, current_count)
    else:
        current_count = 1
        #if Sort_word == words.lower():
            #current_count += count
current_count = count
current_word = word
#print '2. %s\t%s' % (words, current_count)

Answer


For running Python-based MR tasks, have a look at:

http://hadoop.apache.org/docs/r1.1.2/streaming.html
http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/

You need to design your code in terms of a Mapper and a Reducer so that Hadoop can execute your Python script. Before you start writing code, read up on the Map-Reduce programming paradigm. It is important to understand the MR paradigm and the role that {key, value} pairs play in solving the problem.
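To make the mapper/reducer split concrete, here is a rough sketch of how a Streaming-style wordcount could be structured (this is my own illustration, not the exact scripts from the tutorial linked above; the names `mapper`, `reducer`, and `wc.py` are assumptions):

```python
import sys
from itertools import groupby

def mapper(lines):
    """Emit one tab-separated (word, 1) pair per word, as Streaming expects."""
    for line in lines:
        for word in line.split():
            yield '%s\t1' % word.lower()

def reducer(lines):
    """Sum counts per word; Streaming delivers mapper output sorted by key."""
    pairs = (line.rstrip('\n').split('\t', 1) for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield '%s\t%d' % (word, total)

if __name__ == '__main__' and len(sys.argv) > 1:
    # Run as either stage: python wc.py map < input
    #                  or: python wc.py reduce < sorted_pairs
    stage = mapper if sys.argv[1] == 'map' else reducer
    for out in stage(sys.stdin):
        print(out)
```

Hadoop Streaming runs the mapper and reducer as separate processes and sorts the mapper's output by key in between; the same pipeline can be simulated locally with `cat input.txt | python wc.py map | sort | python wc.py reduce`.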

#Modified your above code to generate the required output
import urllib2

story = 'http://sixty-north.com/c/t.txt'
request = urllib2.Request(story)
response = urllib2.urlopen(request)
each_word = []
same_words = {}

# Collect all the words into a list
for line in response:
    line_words = line.split()
    for word in line_words:  # looping each line and extracting words
        each_word.append(word)

# For every word collected, in dict same_words:
# if a key exists such that key == word, increment its mapped value by 1,
# else add the word as a new key with mapped value 1
for words in each_word:
    if words.lower() not in same_words:
        same_words[words.lower()] = 1
    else:
        same_words[words.lower()] = same_words[words.lower()] + 1

for each in same_words:
    print "word = ", each, ", count = ", same_words[each]
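For what it's worth, outside Hadoop the same grouping-and-counting can be done in one step with `collections.Counter` from the standard library. A minimal sketch in Python 3 (where `urllib2` has become `urllib.request`); the sample string below is a stand-in for the text the original code fetched from the URL:

```python
from collections import Counter

def word_counts(text):
    """Lower-case the text, split on whitespace, and count duplicates."""
    return Counter(text.lower().split())

# Hypothetical sample input; the question fetched
# http://sixty-north.com/c/t.txt instead.
sample = "It was the best of times it was the worst of times"
for word, count in word_counts(sample).most_common(3):
    print('%s\t%d' % (word, count))
```

`Counter` subclasses `dict`, so membership tests and `counter[word] += 1` style updates also work if you prefer an explicit loop.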
Thank you for your reply. I have seen the links you gave; I just ran mapper.py and reducer.py as an MR task, but what I actually want is to write it as plain Python code. – user2732609

Do you mean a standalone Python script? – Thejas

Yes, I do!!!!!!! – user2732609
