pyspark地圖函數返回不同

-1

>>> rdd.collect() 
[1, 2, 3, 4]

定義兩個地圖的功能，我相信應該是相同的pyspark地圖函數返回不同

def map2(l): 
    result=[] 
    for i in l: 
     result.append(((i,i),1)) 
     result.append((i,1)) 
    for j in result: 
     yield j

輸出

>>>rdd.mapPartitions(map2).collect() 
[((1, 1), 1), (1, 1), ((2, 2), 1), (2, 1), ((3, 3), 1), (3, 1), ((4, 4), 1), (4, 1)]

另一個功能

def map2(l): 
    result=[] 
    for i in l: 
      result.append(((i,i),1)) 
    for i in l: 
      result.append((i,1)) 
    for j in result: 
      yield j

輸出

>>> rdd.mapPartitions(map2).collect() 
[((1, 1), 1), ((2, 2), 1), ((3, 3), 1), ((4, 4), 1)]

來源

2017-02-10 llllo

爲什麼他們應該是相同的？ – khelwood

迭代器是有狀態的，不能遍歷多次。在第二個示例中執行的第一個for循環使用了所有可用的項目，並留下了空的迭代器，因此第二個循環中沒有任何內容可以添加。

如果你想穿越多次轉換迭代器列表：

def map2(l): 
    l = list(l) 
    result=[] 
    ...

來源

2017-03-24 01:29:40 user7759702

pyspark地圖函數返回不同

回答

相關問題