Spark - single column to X columns

Currently, I have a DataFrame with a column that looks like this:

color 
----- 
green 
blue 
green 
red 
yellow 
red 
orange

and so on (30 different colors).

From this column, I would like to get a DataFrame like this:

green  blue  red  yellow  orange  purple  ... more colors
    1     0    0       0       0       0
    0     1    0       0       0       0
    1     0    0       0       0       0
    0     0    1       0       0       0
    0     0    0       1       0       0
    0     0    1       0       0       0
    0     0    0       0       1       0

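Every column is set to 0, except the one matching the color at the same index in the original DataFrame, which is set to 1.

So far I have tried different functions and solutions, none of which worked (and the code looks really messy). I was wondering if there is a "simple" way to do this, or if I should use another library like Pandas (I'm using Python). If you know R, what I want is the table function.

Thanks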

2015-09-01 user3276768

Something like this should do the trick:

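from pyspark.sql.functions import when, lit, col

# Collect the distinct colors to the driver.
# Going through .rdd also works on Spark 2.0+, where DataFrame has no map().
colors = df.select("color").distinct().rdd.map(lambda x: x[0]).collect()

# One 0/1 indicator column per color, named after that color.
cols = [
    when(col("color") == lit(color), 1).otherwise(0).alias(color)
    for color in colors
]

df.select(*cols)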
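
To sanity-check the expected result on a small sample, the same logic can be mirrored in plain Python without Spark. This is only an illustrative sketch: the `one_hot_rows` helper and the sample list are hypothetical, and the column order is fixed by sorting here, whereas `distinct()` in Spark gives no ordering guarantee.

```python
def one_hot_rows(values):
    """Return (column_names, rows) with one 0/1 column per distinct value."""
    # Sorted for a reproducible column order (an assumption of this sketch).
    columns = sorted(set(values))
    rows = [[1 if value == column else 0 for column in columns] for value in values]
    return columns, rows


# Hypothetical sample mirroring the colors shown in the question.
sample = ["green", "blue", "green", "red", "yellow", "red", "orange"]
columns, rows = one_hot_rows(sample)

print(columns)
for row in rows:
    print(row)
```

Each row contains exactly one 1, in the column named after that row's color, which is what the `when(...).otherwise(0)` expressions produce column by column.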

If you are looking for an alternative closer to R's table, you may want to take a look at crosstab and cube.
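
For comparison, R's table applied to a single vector reports how often each level occurs, which is the same as summing each indicator column. A plain-Python sketch of that relationship, using `collections.Counter` on a hypothetical sample list (no Spark involved):

```python
from collections import Counter

# Hypothetical sample mirroring the colors shown in the question.
sample = ["green", "blue", "green", "red", "yellow", "red", "orange"]

# Per-level counts, as a one-way frequency table would report them.
counts = Counter(sample)

# Summing each 0/1 indicator column gives the same numbers.
column_sums = {
    color: sum(1 if value == color else 0 for value in sample)
    for color in set(sample)
}

print(dict(counts))
print(column_sums == dict(counts))
```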

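Note

When the number of levels is large, creating a dense DataFrame becomes rather inefficient. In that case you should consider using sparse vectors: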

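from pyspark.sql import Row
from pyspark.mllib.linalg import Vectors
from pyspark.ml.feature import StringIndexer

def toVector(n):
    # Build a mapper from a Row holding an index to a sparse one-hot vector of size n.
    def _toVector(row):
        return Row("vec")(Vectors.sparse(n, {int(row[0]): 1.0}))
    return _toVector

# Map each color to a numeric index (stored as a double in colorIdx).
indexer = StringIndexer(inputCol="color", outputCol="colorIdx")
indexed = indexer.fit(df).transform(df)
n = indexed.select("colorIdx").distinct().count()

# Going through .rdd also works on Spark 2.0+, where DataFrame has no map().
vectorized = indexed.select("colorIdx").rdd.map(toVector(n)).toDF()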
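
To see why the sparse form scales better, here is a plain-Python sketch of what the indexed, sparse encoding stores per row: the vector size and a single index/value pair, instead of one cell per level. The `index_labels` and `to_sparse` helpers are hypothetical stand-ins, not Spark APIs. Labels are indexed by descending frequency, which is StringIndexer's default ordering; the alphabetical tie-break is an assumption of this sketch.

```python
from collections import Counter


def index_labels(values):
    """Map each label to an integer index, most frequent label first."""
    counts = Counter(values)
    # Ties are broken alphabetically here (an assumption of this sketch).
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: idx for idx, label in enumerate(ordered)}


def to_sparse(index, size):
    """Represent a one-hot vector as (size, {index: 1.0})."""
    return (size, {index: 1.0})


# Hypothetical sample mirroring the colors shown in the question.
sample = ["green", "blue", "green", "red", "yellow", "red", "orange"]
label_index = index_labels(sample)
n = len(label_index)
vectors = [to_sparse(label_index[color], n) for color in sample]

for color, vec in zip(sample, vectors):
    print(color, vec)
```

With 30 colors, each row would still carry a single non-zero entry, while the dense layout would need 30 cells per row.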

2015-09-01 02:36:25 zero323

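Hi, thank you for your answer. I tried your solution, but unfortunately it does not work properly. The output is a DataFrame where every column has the same name ("color") instead of the actual color name, and the values are always 0. – user3276768

Sorry, there was a typo in `alias`. It should work now. – zero323

It works. Thank you very much! – user3276768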
