mapreduce

一個玩具的例子我是hadoop和python的新手。我想知道如何改進算法。mapreduce

這就是問題：（使用映射精簡結構解決它）

我們將提供數據集具有不同尺寸，從新浪微博用戶的關係生成。較小的數據集包含1000個用戶，中等數據集包含約250萬用戶，而大數據集包含480萬用戶。每個用戶都由其唯一的ID號代表。

數據文件的格式是如下（由空格隔開不同的追隨者）：

followee_1_id:follower_1_id follower_2_id follower_3_id .... 
followee_2_id:follower_1_id follower_6_id follower_7_id .... ...

例如。

A:B D 
B:A 
C:A B E 
E:A B C

社區檢測的輸出是用於EVERY用戶，我們想知道前k最相似的人。輸出格式應該（由空格分隔不同類似的人）：

User_1:Similiar_Person_1 Similiar_Person_2 ... Similiar_Person_K 
User_2:Similiar_Person_1 Similiar_Person_2 ... Similiar_Person_K

（其中K是指10,000）

我的解決辦法：
我的算法是保持在一個列表最多10,000個類似的人，並且每當類似人數達到10,001時對該清單進行分類。然後彈出最後一個。之後，我發現當數據集很大時，大概需要(n-10000).n.log(n)時間來執行，關於如何提高呢？

我的觀察：
一些粗略計算後，我發現，如果類似的人少，我們應該保持緩衝區大。例如，如果一個人擁有5000個相似的人，那麼我們可以將該列表的上限限制爲10萬。然後，我們只需要對列表進行一次排序，即在打印結果之前。

#!/usr/bin/env python 

from operator import itemgetter 
import sys 

def print_list_of_dict(list_of_dic): 
    for v in list_of_dic: 
     print v['name'], 
    print 
return 

current_person1 = None 
current_person2 = None 
current_S = 0 
#declare a list of dictionary 
ranking = [] 
d = {} 
flag = 0 

# input comes from STDIN 
for line in sys.stdin: 
    # remove leading and trailing whitespace 
    line = line.strip() 

    # parse the input we got from mapper.py 
    person1, person2 = line.split() 

    # first person first relation 
    if not current_person1: 
     current_person1 = person1 
     current_person2 = person2 
     current_S += 1 
    else: 
     # same person , same relation 
     if current_person1 == person1 and current_person2 == person2: 
      current_S += 1 
      flag = 0 
     # same person , different relation 
     elif current_person1 == person1 and current_person2 != person2: 
      d['name'] = current_person2 
      d['similarity'] = current_S 
      ranking.append(d.copy()) 
      if len(ranking) == 10001: 
       ranking = sorted(ranking,key=itemgetter('similarity'),reverse = True) 
       ranking.pop() 
      current_person2 = person2 
      current_S = 1 
      flag = 1 
     # different person 
     else: 
      d['name'] = current_person2 
      d['similarity'] = current_S 
      ranking.append(d.copy()) 
      if len(ranking) == 10001: 
       ranking = sorted(ranking,key=itemgetter('similarity'),reverse = True) 
       ranking.pop() 
      ranking = sorted(ranking,key=itemgetter('similarity'),reverse = True) 
      print current_person1,':', 
      print_list_of_dict(ranking) 
      # a new dictionary 
      ranking = [] 
      current_person1 = person1 
      current_person2 = person2 
      current_S = 1 
      flag = 2 
# add and print the last relation to dictionary 
d['name'] = current_person2 
d['similarity'] = current_S 
ranking.append(d.copy()) 
if len(ranking) == 10001: 
    ranking = sorted(ranking,key=itemgetter('similarity'),reverse = True) 
    ranking.pop() 
ranking = sorted(ranking,key=itemgetter('similarity'),reverse = True) 
print current_person1,':', 
print_list_of_dict(ranking)

來源

2016-02-14 good BOB

問題是，如果列表滿了（達到其限制 - 10,000），我需要每次對列表進行排序。 –

經過一些粗略的計算，我發現如果相似的人很小，我們應該保持較大的緩衝區。例如，如果一個人擁有5000個相似的人，那麼我們可以將列表的上限設爲100,000。然後，我們只需要對列表進行一次排序，即在打印結果之前。 –

已解決，將所有內容全部存儲在內存中，並且僅在分類一次後纔打印前10000個。

來源

2016-02-24 13:50:06

回答

相關問題