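(Py)Spark: parallelizing a maximum likelihood calculation
I have two quick beginner questions about (Py)Spark. I have the DataFrame below, and I want to compute the likelihood of each value in the "reading" column using SciPy's multivariate_normal.pdf().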
rdd_dat = spark.sparkContext.parallelize([(0, .12, "a"),(1, .45, "b"),(2, 1.01, "c"),(3, 1.2, "a"),
(4, .76, "a"),(5, .81, "c"),(6, 1.5, "b")])
df = rdd_dat.toDF(["id", "reading", "category"])
df.show()
+---+-------+--------+
| id|reading|category|
+---+-------+--------+
|  0|   0.12|       a|
|  1|   0.45|       b|
|  2|   1.01|       c|
|  3|    1.2|       a|
|  4|   0.76|       a|
|  5|   0.81|       c|
|  6|    1.5|       b|
+---+-------+--------+
Here is my attempt, using UserDefinedFunction:
from scipy.stats import multivariate_normal
from pyspark.sql.functions import UserDefinedFunction
from pyspark.sql.types import DoubleType
mle = UserDefinedFunction(multivariate_normal.pdf, DoubleType())
mean =1
cov=1
df_with_mle = df.withColumn("MLE", mle(df['reading']))
This runs without raising an error, but when I try to look at the resulting df_with_mle, I get the following error:
df_with_mle.show()
An error occurred while calling o149.showString.
1) Any idea why I am getting this error?
2) If I want to specify mean and cov, e.g. df.withColumn("MLE", mle(df['reading'], 1, 1)), how would I do that?
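For reference, here is a minimal sketch of one direction that might address both questions. It rests on two assumptions that the truncated error message above does not confirm: that the failure comes from the UDF returning a NumPy scalar rather than a built-in float, and that mean and cov are constants that can be captured in a closure instead of being passed as columns. The helper names make_pdf and add_mle_column are hypothetical, not part of Spark or SciPy.

```python
from scipy.stats import multivariate_normal


def make_pdf(mean, cov):
    """Return a one-argument function that evaluates the normal pdf at x."""
    def pdf(x):
        # multivariate_normal.pdf returns a NumPy scalar; convert it to a
        # built-in float so the value matches the declared DoubleType.
        return float(multivariate_normal.pdf(x, mean=mean, cov=cov))
    return pdf


def add_mle_column(df, mean=1, cov=1):
    """Add an "MLE" column to a DataFrame that has a "reading" column."""
    # Imported here so make_pdf can be tried out without Spark installed.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import DoubleType

    # mean and cov are fixed inside the closure, so the UDF itself only
    # receives the "reading" column.
    mle = udf(make_pdf(mean, cov), DoubleType())
    return df.withColumn("MLE", mle(df["reading"]))
```

With the DataFrame from the question, this would be called as add_mle_column(df, mean=1, cov=1).show(). If mean and cov had to vary per row, they would need to be passed as columns instead (for example with pyspark.sql.functions.lit for constants), which this sketch does not cover.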