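Mincemeat map function returning a dictionary

I am working with a MapReduce implementation called mincemeat.py. It contains a map function and a reduce function. First, here is what I am trying to accomplish. I am taking a course on big data that includes a programming assignment. There are hundreds of files containing records of the form paperid:::author1::author2::author3:::papertitle
We have to go through all the files and report, for a given author, the word he has used most often in his titles. So I wrote the following code for it.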
import re
import glob
import mincemeat
from collections import Counter

text_files = glob.glob('test/*')

def file_contents(file_name):
    f = open(file_name)
    try:
        return f.read()
    finally:
        f.close()

datasource = dict((file_name, file_contents(file_name)) for file_name in text_files)

def mapfn(key, value):
    for line in value.splitlines():
        wordsinsentence = line.split(":::")
        authors = wordsinsentence[1].split("::")
        # print authors
        words = str(wordsinsentence[2])
        words = re.sub(r'([^\s\w-])+', '', words)
        # re.sub(r'[^a-zA-Z0-9: ]', '', words)
        words = words.split(" ")
        for author in authors:
            for word in words:
                word = word.replace("-"," ")
                word = word.lower()
                yield author, word

def reducefn(key, value):
    return Counter(value)

s = mincemeat.Server()
s.datasource = datasource
s.mapfn = mapfn
s.reducefn = reducefn
results = s.run_server(password="changeme")
# print results

i = open('outfile','w')
i.write(str(results))
i.close()
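My understanding is that, for each author, the reduce function receives the author name and all the words he has used in his titles. So I expected output like
{authorname: Counter({'word1': countofword1, 'word2': countofword2, 'word3': countofword3, ...})}
But the output I actually get is
authorname: (authorname, Counter({'word1': countofword1, 'word2': countofword2}))
Can anyone tell me why this happens? I don't need help solving the problem; I need help understanding why it behaves this way!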
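One way to narrow this down is to run the same map and reduce logic through a plain-Python stand-in for the framework. The sketch below does not use mincemeat at all: `run_local` and the sample record are hypothetical, made up for illustration, and the grouping step is only an assumption about what a MapReduce framework does between the two phases.

```python
import re
from collections import Counter, defaultdict

def mapfn(key, value):
    # Same logic as the mapfn in the question: one (author, word) pair per title word.
    for line in value.splitlines():
        fields = line.split(":::")
        authors = fields[1].split("::")
        title = re.sub(r'([^\s\w-])+', '', fields[2])
        for author in authors:
            for word in title.split(" "):
                yield author, word.replace("-", " ").lower()

def reducefn(key, values):
    # Same as the question: count the words collected for one author.
    return Counter(values)

def run_local(datasource, mapfn, reducefn):
    # Hypothetical stand-in for the framework: map, group by key, then reduce.
    grouped = defaultdict(list)
    for key, value in datasource.items():
        for out_key, out_value in mapfn(key, value):
            grouped[out_key].append(out_value)
    return dict((k, reducefn(k, vs)) for k, vs in grouped.items())

# Made-up sample data in the paperid:::authors:::title format.
sample = {"file1": "p1:::alice::bob:::Map Reduce basics\np2:::alice:::Map tricks"}
results = run_local(sample, mapfn, reducefn)
print(results["alice"])
```

Here each value in `results` is a bare `Counter`. If the real run still wraps it as `(authorname, Counter(...))`, that would suggest the tuple is introduced somewhere outside these two functions.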
Please remove the code; it violates the Coursera honor code. – vamosrafa