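PCA analysis with PySpark
I am using PySpark to run a PCA analysis, but I get an error caused by the format of the data read from the CSV file. What should I do? Can you help me?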
from __future__ import print_function
from pyspark.ml.feature import PCA
from pyspark.ml.linalg import Vectors, VectorUDT
from pyspark.sql import SparkSession
from pyspark import SparkConf, SparkContext
from pyspark.sql.functions import udf
import pandas as pd
import numpy as np
from numpy import array
conf = SparkConf().setAppName("building a warehouse")
sc = SparkContext(conf=conf)
if __name__ == "__main__":
    spark = SparkSession\
        .builder\
        .appName("PCAExample")\
        .getOrCreate()
    data = sc.textFile('dataset.csv') \
        .map(lambda line: line.split(','))\
        .collect()
    # create a data frame from data read from csv file
    df = spark.createDataFrame(data, ["features"])
    # convert data to vector udt
    df.show()
    pca = PCA(k=3, inputCol="features", outputCol="pcaFeatures")
    model = pca.fit(df)
    result = model.transform(df).select("pcaFeatures")
    result.show(truncate=False)
    spark.stop()
Here is the error I get:
File "C:/spark/spark-2.1.0-bin-hadoop2.7/bin/pca_bigdata.py", line 38, in <module>
model = pca.fit(df)
pyspark.sql.utils.IllegalArgumentException: u'requirement failed: Column features must be of type org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 but was actually StringType.'
Can you provide an example of the file? Thanks. – Keith
It contains data like this: 15,447176933288574,58783,89453125,117,73371124267578,0,0,0,30145,232421875,127,86238861083984,30113,59375,126,52108001708984,512,08636474609375,514,4246826171875,571,90142822265625,573,742431640625,586,60888671875,571,6429443359375,, –
Your numbers are still being read as strings, not floats. Do the map like this: `data = sc.textFile("dataset.csv").map(lambda line: [float(k) for k in line.split(',')])` –
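Converting to floats is only half of what the error message asks for: PCA wants the `features` column to hold a vector, not a string or a plain list. Below is a minimal sketch of one way to do both steps. It assumes that fields are separated by commas, that decimals use a `.`, and that empty fields (such as the trailing `,,` in the sample) can simply be dropped. `parse_line` and `run_pca` are hypothetical helper names, not part of PySpark.

```python
def parse_line(line):
    """Turn one CSV line into a list of floats, skipping empty fields.

    Assumption: empty fields only appear at the end of a line. If they can
    appear in the middle, rows will end up with different lengths and PCA
    will not accept them.
    """
    return [float(field) for field in line.split(",") if field.strip()]


def run_pca(spark, path, k=3):
    """Read a numeric CSV, wrap each row in a dense vector, and apply PCA."""
    # Imported here so parse_line can be tried without a Spark installation.
    from pyspark.ml.feature import PCA
    from pyspark.ml.linalg import Vectors

    rows = (spark.sparkContext.textFile(path)
            .map(parse_line)
            .filter(lambda values: len(values) > 0)        # drop blank lines
            .map(lambda values: (Vectors.dense(values),)))  # one-column rows
    # Each row is a 1-tuple, so the DataFrame has a single vector column.
    df = spark.createDataFrame(rows, ["features"])
    pca = PCA(k=k, inputCol="features", outputCol="pcaFeatures")
    return pca.fit(df).transform(df).select("pcaFeatures")
```

With the `spark` session from the question, this could be called as `run_pca(spark, "dataset.csv").show(truncate=False)`. Note that there is no `collect()` before `createDataFrame`, so the data stays distributed instead of being pulled to the driver.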