Spark SQL query returns StringType instead of ArrayType?
When I apply my UDF during a spark.sql query, instead of returning the cleaned words as an array, the query returns one long string that merely looks like my array. This then gives me an error when I try to apply CountVectorizer. The error it raises is 'requirement failed: Column cleanedWords must be of type equal to one of the following types: [ArrayType(StringType,true), ArrayType(StringType,false)] but was actually of type StringType.'
Here is my code:
from string import punctuation
from hebrew import stop_words

hebrew_stopwords = stop_words()

def removepuncandstopwords(listofwords):
    newlistofwords = []
    for word in listofwords:
        if word not in hebrew_stopwords:
            for punc in punctuation:
                word = word.strip(punc)
            newlistofwords.append(word)
    return newlistofwords

from pyspark.ml.feature import CountVectorizer, IDF, Tokenizer, Normalizer
from pyspark.sql.types import ArrayType, StringType

sqlctx.udf.register("removepuncandstopwords", removepuncandstopwords, ArrayType(StringType()))

sentenceData = spark.createDataFrame([
    (0, "Hello my friend; i am sam"),
    (1, "Hello, my name is sam")
], ["label", "sentence"])

tokenizer = Tokenizer(inputCol="sentence", outputCol="words")
wordsData = tokenizer.transform(sentenceData)
wordsData.registerTempTable("wordsData")
wordsDataCleaned = spark.sql("select label, sentence, words, removepuncandstopwords(words) as cleanedWords from wordsData")
wordsDataCleaned[['cleanedWords']].rdd.take(2)[0]
Out[163]:
Row(cleanedWords='[hello, my, friend, i, am, sam]')
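For what it's worth, the cleaning function itself does return a Python list when called outside Spark, which suggests the string conversion happens at the SQL/registration layer rather than in the function. A minimal standalone check (using a hypothetical stop-word set, since `hebrew.stop_words()` is project-specific and not shown):

```python
from string import punctuation

# Hypothetical stand-in for hebrew.stop_words()
hebrew_stopwords = {"my", "is", "am"}

def removepuncandstopwords(listofwords):
    newlistofwords = []
    for word in listofwords:
        if word not in hebrew_stopwords:
            # strip leading/trailing punctuation characters, one at a time
            for punc in punctuation:
                word = word.strip(punc)
            newlistofwords.append(word)
    return newlistofwords

print(removepuncandstopwords(["hello,", "my", "friend;", "i", "am", "sam"]))
# → ['hello', 'friend', 'i', 'sam']
```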
How can I fix this?