2015-03-02 24 views

Spark .stdev() Python problem

I'm trying to do some statistical analysis, and I'm computing the sum a little differently from the stdev.

The sum works fine like this:

stats[0] = myData2.map(lambda (Column, values): (sum(values))).collect() 

The stdev, which is written differently, does not work:

stats[4] = myData2.map(lambda (Column, values): (values)).stdev() 

I get the following error:

TypeError: unsupported operand type(s) for -: 'ResultIterable' and 'float' 

Answers


First solution: use NumPy

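import numpy

data = [(1, [1, 2, 3, 4, 5]), (2, [6, 7, 8, 9]), (3, [1, 3, 5, 7])]
dataRdd = sc.parallelize(data)  # sc is the SparkContext
# numpy.std computes the population standard deviation (ddof=0) of each key's list
dataRdd.mapValues(lambda values: float(numpy.std(values))).collect()
# Result
# [(1, 1.4142135623730951), (2, 1.1180339887498949), (3, 2.2360679774997898)]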
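A note on which standard deviation this is: numpy.std defaults to ddof=0, i.e. the population standard deviation, and that is what the figures above are. The following is only a sketch using the standard library `statistics` module (not part of the original answer, and the dictionary name `values_by_key` is made up for illustration) to show how the population and sample versions differ on the same data:

```python
import statistics

# Same per-key values as in the NumPy example
values_by_key = {1: [1, 2, 3, 4, 5], 2: [6, 7, 8, 9], 3: [1, 3, 5, 7]}

# Population standard deviation: divides by n (what numpy.std does by default)
population = {k: statistics.pstdev(v) for k, v in values_by_key.items()}
# Sample standard deviation: divides by n - 1
sample = {k: statistics.stdev(v) for k, v in values_by_key.items()}

print(population)
print(sample)
```

If the sample version is what you need with NumPy, passing ddof=1 to numpy.std should give it.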

Second solution: DIY, which makes it more distributed

import math

data = [(1, 1), (1, 2), (1, 3), (1, 4), (1, 5), (2, 6), (2, 7), (2, 8), (2, 9), (3, 1), (3, 3), (3, 5), (3, 7)]
dataRdd = sc.parallelize(data)  # rebuild the RDD from the flat (key, value) pairs
# Generate RDD of (Key, (Sum, Sum of squares, Count))
dataSumsRdd = dataRdd.aggregateByKey(
    (0.0, 0.0, 0.0),
    # Fold one value into the accumulator within a partition
    lambda acc, value: (acc[0] + float(value), acc[1] + float(value) ** 2, acc[2] + 1.0),
    # Merge the accumulators of two partitions
    lambda a, b: (a[0] + b[0], a[1] + b[1], a[2] + b[2]))
# Generate RDD of (Key, (Count, Average, Std Dev))
# Indexing replaces tuple-unpacking lambdas, which are a syntax error in Python 3
dataStatsRdd = dataSumsRdd.mapValues(
    lambda s: (s[2], s[0] / s[2], math.sqrt(s[1] / s[2] - (s[0] / s[2]) ** 2)))
dataStatsRdd.collect()
# Result
# [(1, (5.0, 3.0, 1.4142135623730951)), (2, (4.0, 7.5, 1.118033988749895)), (3, (4.0, 4.0, 2.23606797749979))]
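To see why keeping (sum, sum of squares, count) per key is enough, here is a plain-Python sketch of the same two-step aggregation without Spark. The function names `seq_op`, `comb_op` and `stats`, and the split of key 1's values into two partitions, are assumptions made for illustration only:

```python
import math
from functools import reduce

def seq_op(acc, value):
    """Fold one value into a (sum, sum of squares, count) accumulator."""
    total, total_sq, count = acc
    return (total + float(value), total_sq + float(value) ** 2, count + 1.0)

def comb_op(a, b):
    """Merge two accumulators that were built on different partitions."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])

def stats(acc):
    """Turn an accumulator into (count, mean, population std dev)."""
    total, total_sq, count = acc
    mean = total / count
    # max() guards against a tiny negative variance caused by rounding
    variance = max(total_sq / count - mean ** 2, 0.0)
    return (count, mean, math.sqrt(variance))

zero = (0.0, 0.0, 0.0)
# Hypothetical split of key 1's values across two partitions
partition_a = [1, 2]
partition_b = [3, 4, 5]
merged = comb_op(reduce(seq_op, partition_a, zero),
                 reduce(seq_op, partition_b, zero))
result = stats(merged)
print(result)
```

Because the accumulators simply add up, each partition can be reduced independently and only three numbers per key need to be shuffled. One caveat: the sum-of-squares formula can lose precision when the mean is large compared with the spread of the values.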

I get this when I try to print it later: Std Dev: PythonRDD[166] at RDD at PythonRDD.scala:43 – theMadKing 2015-03-02 17:39:12


Thanks. How do I get just the standard deviation out of the second solution? – theMadKing 2015-03-02 17:44:06


Just change the final mapValues call so that it returns only the std dev instead of the tuple – 2015-03-02 17:58:57
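A rough sketch of what that change could look like, written as a plain function so it can be tried without a cluster. The name `std_only` and the `sums_by_key` dictionary are hypothetical; the tuples are the per-key (sum, sum of squares, count) values that the data in the answer works out to:

```python
import math

def std_only(sums):
    """Map a (sum, sum of squares, count) tuple to the population std dev alone."""
    total, total_sq, count = sums
    mean = total / count
    # max() guards against a tiny negative variance caused by rounding
    return math.sqrt(max(total_sq / count - mean ** 2, 0.0))

# Per-key sums for the example data in the answer
sums_by_key = {1: (15.0, 55.0, 5.0), 2: (30.0, 230.0, 4.0), 3: (16.0, 84.0, 4.0)}
std_by_key = {key: std_only(sums) for key, sums in sums_by_key.items()}
print(std_by_key)
```

With Spark this would presumably become dataSumsRdd.mapValues(std_only).collect(). The .collect() call is what brings the values back to the driver; printing the RDD object itself only shows a description such as PythonRDD[166] at RDD at PythonRDD.scala:43.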