2017-06-11

Find all combinations by group in PySpark

I want to find all possible combinations within an Iterable object.

My input:

Object1|DrDre|1.0 
Object1|Plane and a Disaster|2.0 
Object1|Tikk Takk Tikk|3.5 
Object1|Tennis Dope|5.0 
Object2|DrDre|11.0 
Object2|Plane and a Disaster|14.0 
Object2|Just My Luck|2.0 
Object2|Tennis Dope|45.0 

The expected result would look like this:

[(('DrDre', 'Plane and a Disaster'), (11.0, 14.0, 1.0, 2.0)), 
(('DrDre', 'Tikk Takk Tikk'), (1.0, 3.5)), 
(('DrDre', 'Tennis Dope'), (11.0, 45.0, 1.0, 5.0)), 
(('Plane and a Disaster', 'Tikk Takk Tikk'), (2.0, 3.5)), 
(('Plane and a Disaster', 'Tennis Dope'), (14.0, 45.0, 2.0, 5.0)), 
(('Tikk Takk Tikk', 'Tennis Dope'), (3.5, 5.0)), 
(('DrDre', 'Just My Luck'), (11.0, 2.0)), 
(('Plane and a Disaster', 'Just My Luck'), (14.0, 2.0)), 
(('Just My Luck', 'Tennis Dope'), (2.0, 45.0))] 

This is my current code; it does not give me the correct combinations in the end.

def iterate(iterable): 
    r = [] 
    for v1_iterable in iterable: 
        for v2 in v1_iterable: 
            r.append(v2) 

    return tuple(r) 

def parseVector(line): 
    ''' 
    Parse each line of the specified data file, assuming a "|" delimiter. 
    Converts each rating to a float 
    ''' 
    line = line.split("|") 
    return line[0],(line[1],float(line[2])) 

def FindPairs(object_id,items_with_usage): 
    ''' 
    For each objects, find all item-item pairs combos. (i.e. items with the same user) 
    ''' 
    for item1,item2 in combinations(items_with_usage,2): 
        return (item1[0],item2[0]),(item1[1],item2[1]) 


''' 
Obtain the sparse object-item matrix: 
    user_id -> [(object_id_1, rating_1), 
                (object_id_2, rating_2), 
                ...] 
''' 
object_item_pairs = lines.map(parseVector).groupByKey().map(
    lambda p: sampleInteractions(p[0],p[1],500)).cache() 


''' 
Get all item-item pair combos: 
    (item1,item2) -> [(item1_rating,item2_rating), 
         (item1_rating,item2_rating), 
         ...] 
''' 

pairwise_objects = object_item_pairs.filter(
    lambda p: len(p[1]) > 1).map(
    lambda p: FindPairs(p[0],p[1])).groupByKey() 



x = pairwise_objects.mapValues(iterate) 
x.collect() 

This only gives me back the first pair and nothing else.

[(('DrDre', 'Plane and a Disaster'), (11.0, 14.0, 1.0, 2.0))]

Am I misunderstanding how the combinations() function works?

Thanks for your input.

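You put the 'return' statement inside the _for_ loop, which means the function exits during the loop's first iteration. That is why you only get the first pair: you are not storing all the elements of 'combinations(items_with_usage,2)', you only return the first pair of items. – titiro89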

Ah, thank you so much titiro89! :) – ponthu

Answer

I think you can do it this way:

from itertools import combinations

def FindPairs(object_id, items_with_usage):
    '''
    For each object, find all item-item pair combos. (i.e. items with the same user)
    '''
    # Collect every pair instead of returning inside the loop.
    t = []
    for item1, item2 in combinations(items_with_usage, 2):
        t.append(((item1[0], item2[0]), (item1[1], item2[1])))
    return t

Change your FindPairs like this; your function will now return a list of all the pair combinations.
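
To see the difference concretely, here is a minimal plain-Python sketch (no Spark required) that uses the Object1 rows from the question. The function names `first_pair_only` and `all_pairs` are hypothetical and exist only for this illustration; they are not part of the original code.

```python
from itertools import combinations

# Object1's (item, usage) values, taken from the question's input.
items_with_usage = [
    ("DrDre", 1.0),
    ("Plane and a Disaster", 2.0),
    ("Tikk Takk Tikk", 3.5),
    ("Tennis Dope", 5.0),
]

def first_pair_only(items):
    # Mirrors the original bug: `return` exits during the first iteration.
    for item1, item2 in combinations(items, 2):
        return (item1[0], item2[0]), (item1[1], item2[1])

def all_pairs(items):
    # Builds the full list of pairs before returning.
    return [((item1[0], item2[0]), (item1[1], item2[1]))
            for item1, item2 in combinations(items, 2)]

print(first_pair_only(items_with_usage))  # (('DrDre', 'Plane and a Disaster'), (1.0, 2.0))
print(len(all_pairs(items_with_usage)))   # 6
```

Four items yield six unordered pairs, so the second function returns six entries where the first returns only one.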

Then:

pairwise_objects = object_item_pairs.filter(lambda p: len(p[1]) > 1)
pairwise_objects = pairwise_objects.map(lambda p: FindPairs(p[0], p[1]))

[[(('DrDre', 'Plane and a Disaster'), (1.0, 2.0)), 
(('DrDre', 'Tikk Takk Tikk'), (1.0, 3.5)), 
(('DrDre', 'Tennis Dope'), (1.0, 5.0)), 
(('Plane and a Disaster', 'Tikk Takk Tikk'), (2.0, 3.5)), 
(('Plane and a Disaster', 'Tennis Dope'), (2.0, 5.0)), 
(('Tikk Takk Tikk', 'Tennis Dope'), (3.5, 5.0))], # end of the first line of the RDD 
[(('DrDre', 'Plane and a Disaster'),(11.0, 14.0)), 
(('DrDre', 'Just My Luck'), (11.0, 2.0)), 
(('DrDre', 'Tennis Dope'), (11.0, 45.0)), 
(('Plane and a Disaster', 'Just My Luck'), (14.0, 2.0)), 
(('Plane and a Disaster', 'Tennis Dope'), (14.0, 45.0)), 
(('Just My Luck', 'Tennis Dope'), (2.0, 45.0))]] 

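Use flatMap (so that all your pairs end up in a single flat RDD, one pair per element), then group your RDD by key and apply the iterate function: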

pairwise_objects=pairwise_objects.flatMap(lambda p: p).groupByKey().mapValues(iterate) 

Example of the final output:

[(('DrDre', 'Tennis Dope'), (1.0, 5.0, 11.0, 45.0)), 
(('DrDre', 'Plane and a Disaster'), (1.0, 2.0, 11.0, 14.0)), 
(('Plane and a Disaster', 'Tennis Dope'), (2.0, 5.0, 14.0, 45.0)), 
(('Plane and a Disaster', 'Just My Luck'), (14.0, 2.0)), 
(('Plane and a Disaster', 'Tikk Takk Tikk'), (2.0, 3.5)), 
(('DrDre', 'Tikk Takk Tikk'), (1.0, 3.5)), 
(('Tikk Takk Tikk', 'Tennis Dope'), (3.5, 5.0)), 
(('DrDre', 'Just My Luck'), (11.0, 2.0)), 
(('Just My Luck', 'Tennis Dope'), (2.0, 45.0))]
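
For anyone who wants to check the logic without a Spark cluster, the whole pipeline can be emulated in plain Python. This is only a sketch under stated assumptions: `parse_line` and `find_pairs` are hypothetical stand-ins for `parseVector` and `FindPairs`, a `defaultdict` plays the role of `groupByKey`, and `chain.from_iterable` plays the role of `flatMap`. It does not use the PySpark API.

```python
from collections import defaultdict
from itertools import chain, combinations

# The question's input, one "object|item|usage" record per line.
raw_lines = [
    "Object1|DrDre|1.0",
    "Object1|Plane and a Disaster|2.0",
    "Object1|Tikk Takk Tikk|3.5",
    "Object1|Tennis Dope|5.0",
    "Object2|DrDre|11.0",
    "Object2|Plane and a Disaster|14.0",
    "Object2|Just My Luck|2.0",
    "Object2|Tennis Dope|45.0",
]

def parse_line(line):
    # Hypothetical stand-in for parseVector.
    object_id, item, usage = line.split("|")
    return object_id, (item, float(usage))

def find_pairs(items_with_usage):
    # Hypothetical stand-in for FindPairs: every item-item pair of one object.
    return [((a[0], b[0]), (a[1], b[1]))
            for a, b in combinations(items_with_usage, 2)]

# Emulates map(parseVector).groupByKey(): object -> [(item, usage), ...]
items_by_object = defaultdict(list)
for line in raw_lines:
    object_id, item_usage = parse_line(line)
    items_by_object[object_id].append(item_usage)

# Emulates filter(...).map(FindPairs).flatMap(lambda p: p): one flat stream of pairs.
flat_pairs = chain.from_iterable(
    find_pairs(items) for items in items_by_object.values() if len(items) > 1
)

# Emulates groupByKey().mapValues(iterate): concatenate usages per item pair.
usage_by_pair = defaultdict(tuple)
for pair, usages in flat_pairs:
    usage_by_pair[pair] += usages

result = dict(usage_by_pair)
for pair, usages in result.items():
    print(pair, usages)
```

One caveat worth checking on real data: `combinations` follows the order of the items it receives, so if two objects list the same items in a different order, `('A', 'B')` and `('B', 'A')` become separate keys. Sorting each object's items by name before pairing would avoid that.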