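Hadoop streaming: how to inner join two different files using Python
I want to find the most-visited websites among users in the 18 to 25 age group. I have two files: one contains the user name and age, and the other contains the user name and website name. Example: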
users.txt
john,22
pages.txt
john,google.com
I have written the Python script below, and it works as I expect outside of Hadoop.
import os
os.chdir("/home/pythonlab")
#Top sites visited by users aged 18 to 25
#read the users file
lines = open("users.txt")
users = [ line.split(",") for line in lines] #user name, age (eg - john, 22)
userlist = [ (u[0],int(u[1])) for u in users] #split the user name and age
#read the page visit file
pages = open("pages.txt")
page = [p.split(",") for p in pages] #user name, website visited (eg - john,google.com)
pagelist = [ (p[0],p[1]) for p in page]
#map user and page visits & filter age group between 18 and 25
usrpage = [[p[1],u[0]] for u in userlist for p in pagelist if (u[0] == p[0] and u[1]>=18 and u[1]<=25) ]
for z in usrpage:
    print(z[0].strip('\r\n')+",1") #print website name, 1
Sample output:
yahoo.com,1
google.com,1
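Now I want to solve this using Hadoop streaming.
My question is: how do I handle these two named files (users.txt, pages.txt) in my mapper? We normally pass only an input directory to Hadoop streaming.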
Please be careful with formatting; you have large text next to small text, and the code is not formatted. – bcr
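One possible approach to the question above is a reduce-side join: the mapper tags each record with the file it came from, and the reducer joins the records that share a user name. The sketch below is an assumption-laden illustration, not a tested Hadoop job. It assumes Hadoop streaming exports the current input path to the mapper as the environment variable `mapreduce_map_input_file` (or `map_input_file` on older releases; check which one your version sets). The script name `join.py`, the `map`/`reduce` arguments, and the `U`/`P` tags are hypothetical names chosen for this example.

```python
import os
import sys
from itertools import groupby


def map_line(line, source):
    """Tag one record with its origin: U for users.txt, P for anything else."""
    fields = [f.strip() for f in line.split(",")]
    if len(fields) < 2 or not fields[0]:
        return None  # skip blank or malformed lines
    tag = "U" if os.path.basename(source) == "users.txt" else "P"
    # Key (user name) comes first, separated by a tab, so streaming sorts on it.
    return "%s\t%s\t%s" % (fields[0], tag, fields[1])


def reduce_lines(sorted_lines, low=18, high=25):
    """Inner join on user name; yield 'site,1' for users aged low..high."""
    rows = (l.rstrip("\n").split("\t") for l in sorted_lines if l.strip())
    for user, group in groupby(rows, key=lambda r: r[0]):
        ages, sites = [], []
        for _, tag, value in group:
            (ages if tag == "U" else sites).append(value)
        # A user with no age record or no page record produces no output.
        if any(a.isdigit() and low <= int(a) <= high for a in ages):
            for site in sites:
                yield site + ",1"


if __name__ == "__main__" and len(sys.argv) > 1:
    if sys.argv[1] == "reduce":
        for out in reduce_lines(sys.stdin):
            print(out)
    elif sys.argv[1] == "map":
        # Assumed environment variable names; verify them on your cluster.
        source = (os.environ.get("mapreduce_map_input_file")
                  or os.environ.get("map_input_file", ""))
        for line in sys.stdin:
            record = map_line(line, source)
            if record:
                print(record)
```

Hadoop streaming accepts the `-input` option more than once, so both files (or a single directory holding both) can be given to the same job, with `join.py map` as the mapper and `join.py reduce` as the reducer. The reducer collects all values for a user before joining, so it does not depend on the order in which the age and page records arrive. Like the original script, it emits `site,1` lines; ranking the top sites would still need a second counting step.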