
Spark SQL query returns StringType instead of ArrayType?

When I apply my UDF during a spark.sql query, instead of returning the cleaned words as an array, the query returns one long string that merely looks like my array. This then causes an error when I try to apply CountVectorizer: 'requirement failed: Column cleanedWords must be of type equal to one of the following types: [ArrayType(StringType,true), ArrayType(StringType,false)] but was actually of type StringType.'

Here is my code:

from string import punctuation 
from hebrew import stop_words 
hebrew_stopwords = stop_words() 

def removepuncandstopwords(listofwords): 
    """Drop stop words and strip leading/trailing punctuation from each word.""" 
    newlistofwords = [] 
    for word in listofwords: 
        if word not in hebrew_stopwords: 
            # strip each punctuation character from both ends of the word 
            for punc in punctuation: 
                word = word.strip(punc) 
            newlistofwords.append(word) 
    return newlistofwords 
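
For reference, the function behaves as expected on a plain Python list (the English sample words below are assumed not to appear in hebrew_stopwords), so the problem is not in the Python logic itself:

# sanity check outside Spark; assumes the definitions above are in scope 
print(removepuncandstopwords(["hello,", "friend;", "sam."])) 
# ['hello', 'friend', 'sam'] 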

from pyspark.ml.feature import CountVectorizer, IDF, Tokenizer, Normalizer 
from pyspark.sql.types import ArrayType, StringType 

sqlctx.udf.register("removepuncandstopwords", removepuncandstopwords, ArrayType(StringType())) 

sentenceData = spark.createDataFrame([ 
    (0, "Hello my friend; i am sam"), 
    (1, "Hello, my name is sam") 
], ["label", "sentence"]) 

tokenizer = Tokenizer(inputCol="sentence", outputCol="words") 
wordsData = tokenizer.transform(sentenceData) 
wordsData.registerTempTable("wordsData") 
wordsDataCleaned = spark.sql("select label, sentence, words, removepuncandstopwords(words) as cleanedWords from wordsData") 



wordsDataCleaned[['cleanedWords']].rdd.take(2)[0] 
Out[163]: 
Row(cleanedWords='[hello, my, friend, i, am, sam]') 
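
A quick way to confirm the mismatch is to print the schema; cleanedWords should come back as a plain string rather than array<string>:

wordsDataCleaned.printSchema() 
# root 
#  |-- label: long (nullable = true) 
#  |-- sentence: string (nullable = true) 
#  |-- words: array (nullable = true) 
#  |    |-- element: string (containsNull = true) 
#  |-- cleanedWords: string (nullable = true) 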

How can I fix this?

Answer


I just ran into this error myself. The way the docs want the data structured is

cleanedWords=['hello', 'my', 'friend', 'is', 'sam'] 

However, yours seems to be different. So instead of this

sentenceData = spark.createDataFrame([ 
    (0, "Hello my friend; i am sam"), 
    (1, "Hello, my name is sam") 
], ["label", "sentence"]) 

I believe it should be this

documentDF = spark.createDataFrame([ 
    (0, "Hello my friend; i am sam".split(" ")), 
    (1, "Hello, my name is sam".split(" ")) 
], ["label", "sentence"]) 

Source: I just went through the docs, where they structure it similarly

documentDF = spark.createDataFrame([ 
    ("Hi I heard about Spark".split(" "),), 
    ("I wish Java could use case classes".split(" "),), 
    ("Logistic regression models are neat".split(" "),) 
], ["text"]) 

Link to their code - https://spark.apache.org/docs/2.1.0/ml-features.html#word2vec
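
If you want to keep the tokenize-then-clean flow from the question rather than pre-splitting, an alternative is to wrap the same function with pyspark.sql.functions.udf and an explicit ArrayType return type, then apply it with withColumn; the column then keeps its array type and CountVectorizer accepts it. A minimal sketch, assuming Spark 2.x and the wordsData DataFrame from the question:

from pyspark.ml.feature import CountVectorizer 
from pyspark.sql.functions import udf 
from pyspark.sql.types import ArrayType, StringType 

# wrap the Python function as a DataFrame UDF with an explicit array return type 
clean_udf = udf(removepuncandstopwords, ArrayType(StringType())) 
wordsDataCleaned = wordsData.withColumn("cleanedWords", clean_udf("words")) 

# cleanedWords is now array<string>, so CountVectorizer no longer complains 
cv = CountVectorizer(inputCol="cleanedWords", outputCol="features") 
cvModel = cv.fit(wordsDataCleaned) 
cvModel.transform(wordsDataCleaned).select("cleanedWords", "features").show(truncate=False) 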
