I have the following DataFrame, from which I extract features with CountVectorizer:
+------------------------------------------------+
|filtered |
+------------------------------------------------+
|[human, interface, computer] |
|[survey, user, computer, system, response, time]|
|[eps, user, interface, system] |
|[system, human, system, eps] |
|[user, response, time] |
|[trees] |
|[graph, trees] |
|[graph, minors, trees] |
|[graph, minors, survey] |
+------------------------------------------------+
After running CountVectorizer on the column above, I get the following output:
+------------------------------------------------+---------------------------------------------+
|filtered |features |
+------------------------------------------------+---------------------------------------------+
|[human, interface, computer] |(12,[4,7,9],[1.0,1.0,1.0]) |
|[survey, user, computer, system, response, time]|(12,[0,2,6,7,8,11],[1.0,1.0,1.0,1.0,1.0,1.0])|
|[eps, user, interface, system] |(12,[0,2,4,10],[1.0,1.0,1.0,1.0]) |
|[system, human, system, eps] |(12,[0,9,10],[2.0,1.0,1.0]) |
|[user, response, time] |(12,[2,8,11],[1.0,1.0,1.0]) |
|[trees] |(12,[1],[1.0]) |
|[graph, trees] |(12,[1,3],[1.0,1.0]) |
|[graph, minors, trees] |(12,[1,3,5],[1.0,1.0,1.0]) |
|[graph, minors, survey] |(12,[3,5,6],[1.0,1.0,1.0]) |
+------------------------------------------------+---------------------------------------------+
Now I want to run a map function over the features column and transform it into something like this:
+------------------------------------------------+--------------------------------------------------------+
|features |transformed |
+------------------------------------------------+--------------------------------------------------------+
|(12,[4,7,9],[1.0,1.0,1.0]) |["1 4 1", "1 7 1", "1 9 1"] |
|(12,[0,2,6,7,8,11],[1.0,1.0,1.0,1.0,1.0,1.0]) |["2 0 1", "2 2 1", "2 6 1", "2 7 1", "2 8 1", "2 11 1"] |
|(12,[0,2,4,10],[1.0,1.0,1.0,1.0]) |["3 0 1", "3 2 1", "3 4 1", "3 10 1"] |
[TRUNCATED]
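The features are transformed by extracting the middle array from features and then creating sub-arrays from it. For example, in row 1 of the features column we have: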
(12,[4,7,9],[1.0,1.0,1.0])
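Now take its middle array, which is [4,7,9], and pair each element with its frequency from the third array, which is [1.0,1.0,1.0]. Prepend "1" to each pair, because this is row 1, to get the following output: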
["1 4 1", "1 7 1", "1 9 1"]
In general, it looks like this:
["RowNumber MiddleFeatEl CorrespondingFreq", ....]
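To make the target format concrete, here is a minimal plain-Python sketch of the string-building step on its own, without Spark. The function name `to_triples` and its parameters are made up for illustration, and it assumes the frequencies should be printed as integers, as in the expected output above.

```python
def to_triples(row_number, indices, values):
    """Build one "RowNumber MiddleFeatEl CorrespondingFreq" string per entry."""
    # zip pairs each index from the middle array with its frequency;
    # int(v) turns 1.0 into 1 to match the desired output.
    return [f"{row_number} {i} {int(v)}" for i, v in zip(indices, values)]


print(to_triples(1, [4, 7, 9], [1.0, 1.0, 1.0]))
# ['1 4 1', '1 7 1', '1 9 1']
```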
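I am not able to extract the middle list and the last (frequency) list separately by applying a map function over the features column generated by CountVectorizer.
Here is the map code: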
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def corpus_create(feats):
    return feats[1]  # Here I want to get [4,7,9] instead of one single feature score.
corpus_udf = udf(lambda feats: corpus_create(feats), StringType())
df3 = df.withColumn("corpus", corpus_udf("features"))
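One possible direction, offered as a sketch rather than a confirmed answer: indexing a `SparseVector` with `feats[1]` looks up the value at vector position 1, which would explain the single score. The two arrays should be reachable through the vector's `indices` and `values` attributes instead. The helper names `split_features` and `add_split_columns`, and the output column name `parts`, are assumptions made up for this example. The Spark imports are kept inside the second function so the first can be tried without a Spark installation.

```python
def split_features(feats):
    """Return (indices, values) of a sparse vector as plain Python lists."""
    # On pyspark.ml.linalg.SparseVector, .indices and .values are NumPy
    # arrays; tolist() converts them to plain Python ints and floats.
    return feats.indices.tolist(), feats.values.tolist()


def add_split_columns(df, col="features"):
    """Attach both arrays to df as a struct column named 'parts' (hypothetical name)."""
    from pyspark.sql.functions import udf
    from pyspark.sql.types import (ArrayType, DoubleType, IntegerType,
                                   StructField, StructType)

    # The return type is a struct of two arrays rather than StringType,
    # so each array comes back whole instead of as a single value.
    schema = StructType([
        StructField("indices", ArrayType(IntegerType())),
        StructField("values", ArrayType(DoubleType())),
    ])
    split_udf = udf(split_features, schema)
    return df.withColumn("parts", split_udf(col))
```

With a DataFrame like the one above, `add_split_columns(df).select("parts.indices", "parts.values")` would expose the two lists as separate columns. Prepending the row number still needs a row identifier column, which this sketch does not create.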