How do I convert a list to str and expand each word into its own row in SparkSQL?

I need to use Spark MLlib's StringIndexer to rank words by frequency, but it expects input in a format like this:

df = spark.createDataFrame(
[(0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c")], 
["id", "category"])

What I currently have looks like this instead:

df = spark.createDataFrame(
[(0, ['a', 'b']), (1, ['b', 'c']), (2, ['c','g']), (3, ['a','b']), (4, ['a','b']), (5, ['c','a'])], 
["id", "category"])

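So I need to turn each row's list into individual words, expanding one row into two so that each row holds a single word. Then I need to bring the ranks that StringIndexer produces back onto the original rows: for example, if 'a' ranks 1 and 'b' ranks 3, the first row should get a new column containing 1,3. How can I do this?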


2017-04-05 Liu Chong

Not sure this is the exact output you're looking for, but here is an approach using explode() and collect_list():

from pyspark.sql.functions import explode, collect_list, asc
from pyspark.ml.feature import StringIndexer

# One row per list element: (id, word)
df_exploded = df.select("id", explode("category").alias("category"))

# Fit the indexer on the exploded words and attach an index to each one
indexer = StringIndexer(inputCol="category", outputCol="categoryIndex")
indexed = indexer.fit(df_exploded).transform(df_exploded)

# Collect the words and their indices back into one row per id
indexed.groupBy("id") \
    .agg(collect_list("category").alias("category"),
         collect_list("categoryIndex").alias("categoryIndex")) \
    .sort(asc("id")) \
    .show()
+---+--------+-------------+
| id|category|categoryIndex|
+---+--------+-------------+
|  0|  [a, b]|   [0.0, 1.0]|
|  1|  [b, c]|   [1.0, 2.0]|
|  2|  [c, g]|   [2.0, 3.0]|
|  3|  [a, b]|   [0.0, 1.0]|
|  4|  [a, b]|   [0.0, 1.0]|
|  5|  [c, a]|   [2.0, 0.0]|
+---+--------+-------------+
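
If you want to sanity-check the index values without a Spark session, here is a rough plain-Python sketch of the same explode / index / regroup steps. It is not Spark code and not how StringIndexer is implemented. It only mimics the default behaviour of ranking labels by descending frequency, and the alphabetical tie-break between 'a' and 'b' is an assumption made here to keep the result deterministic. The names rows, exploded, index_of and result are made up for this illustration.

```python
from collections import Counter

# Same data as in the question: (id, list of words)
rows = [(0, ['a', 'b']), (1, ['b', 'c']), (2, ['c', 'g']),
        (3, ['a', 'b']), (4, ['a', 'b']), (5, ['c', 'a'])]

# "Explode": one (id, word) pair per list element
exploded = [(row_id, word) for row_id, words in rows for word in words]

# Count how often each word occurs across all rows
counts = Counter(word for _, word in exploded)

# Rank by descending frequency; ties are broken alphabetically
# (an assumption for this sketch, so the output is deterministic)
ranked = sorted(counts, key=lambda w: (-counts[w], w))
index_of = {word: float(i) for i, word in enumerate(ranked)}

# Regroup: map each row's words to their indices, keeping list order
result = [(row_id, words, [index_of[w] for w in words])
          for row_id, words in rows]

for row in result:
    print(row)
```

One thing to watch in the Spark version: collect_list() after a groupBy does not promise to keep the original element order, so if the order inside each list matters you may want to verify it on your real data.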


2017-04-05 09:37:49 mtoto

Thank you very much!
