2017-07-25 50 views
12

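I would like to break down a pandas column that holds lists of elements into as many columns as there are unique elements, i.e. one-hot-encode them (with 1 meaning a given element is present in a row and 0 meaning it is absent). How can I one-hot-encode a pandas column that contains lists?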

For example, take the dataframe df

Col1 Col2   Col3 
C  33  [Apple, Orange, Banana] 
A  2.5 [Apple, Grape] 
B  42  [Banana] 
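For anyone who wants to reproduce the answers, the example frame can be rebuilt roughly as follows. This is a minimal sketch that assumes Col3 holds real Python lists rather than strings that look like lists.

```python
import pandas as pd

# Rebuild the example frame; Col3 holds a Python list in every row.
df = pd.DataFrame({
    "Col1": ["C", "A", "B"],
    "Col2": [33, 2.5, 42],
    "Col3": [["Apple", "Orange", "Banana"], ["Apple", "Grape"], ["Banana"]],
})

print(df)
```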

I would like to convert this to:

df

Col1 Col2 Apple Orange Banana Grape 
C  33  1  1  1  0 
A  2.5 1  0  0  1 
B  42  0  0  1  0 

How can I achieve this with pandas/sklearn?

Answers

15

We can also use sklearn.preprocessing.MultiLabelBinarizer:

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer() 
df = df.join(pd.DataFrame(mlb.fit_transform(df.pop('Col3')), 
          columns=mlb.classes_, 
          index=df.index)) 

Result:

In [77]: df 
Out[77]: 
    Col1 Col2 Apple Banana Grape Orange 
0 C 33.0  1  1  0  1 
1 A 2.5  1  0  1  0 
2 B 42.0  0  1  0  0 
+1

You might find the timings interesting. – piRSquared

6

Using get_dummies:

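# .sum(level=0) was removed in pandas 2.0; groupby(level=0).sum() is the equivalent spelling
df_out = df.assign(**pd.get_dummies(df.Col3.apply(lambda x: pd.Series(x)).stack().reset_index(level=1, drop=True)).groupby(level=0).sum())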

Output:

Col1 Col2      Col3 Apple Banana Grape Orange 
0 C 33.0 [Apple, Orange, Banana]  1  1  0  1 
1 A 2.5   [Apple, Grape]  1  0  1  0 
2 B 42.0     [Banana]  0  1  0  0 

Clean up the column:

df_out.drop('Col3',axis=1) 

Output:

Col1 Col2 Apple Banana Grape Orange 
0 C 33.0  1  1  0  1 
1 A 2.5  1  0  1  0 
2 B 42.0  0  1  0  0 
+1

+1 for the use of `**` with `get_dummies`, but this might be slow for large dataframes because of `.stack()` and the method chaining. –

+0

@BradSolomon Thank you. –

+0

I'm not sure this works correctly... Try it after running: `df = pd.concat([df, df])` – Alexander

5

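You can loop through Col3 with apply and convert each element into a Series that uses the list as its index; those index labels become the column headers of the resulting data frame: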

pd.concat([ 
     df.drop(columns="Col3"),  # the positional axis argument was removed in pandas 2.0
     df.Col3.apply(lambda x: pd.Series(1, x)).fillna(0) 
    ], axis=1) 

#Col1 Col2 Apple Banana Grape Orange 
#0 C 33.0  1.0  1.0 0.0  1.0 
#1 A 2.5  1.0  0.0 1.0  0.0 
#2 B 42.0  0.0  1.0 0.0  0.0 
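The output above is float because the missing entries start out as NaN before `fillna(0)`. If integer 0/1 columns are wanted, one possible variation (a sketch, not part of the original answer) is to cast after filling:

```python
import pandas as pd

df = pd.DataFrame({
    "Col1": ["C", "A", "B"],
    "Col2": [33, 2.5, 42],
    "Col3": [["Apple", "Orange", "Banana"], ["Apple", "Grape"], ["Banana"]],
})

# Each list becomes a Series of ones indexed by its own elements;
# absent fruits come back as NaN, so fill them and cast to int.
indicators = (
    df["Col3"]
    .apply(lambda x: pd.Series(1, index=x))
    .fillna(0)
    .astype(int)
)

out = pd.concat([df.drop(columns="Col3"), indicators], axis=1)
print(out)
```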
5

You can get all the unique fruits in Col3 with a set comprehension, as follows:

set(fruit for fruits in df.Col3 for fruit in fruits) 

Using a dictionary comprehension, you can then go through each unique fruit and check whether it appears in each row of the column.

>>> df[['Col1', 'Col2']].assign(**{fruit: [1 if fruit in cell else 0 for cell in df.Col3] 
            for fruit in set(fruit for fruits in df.Col3 
                for fruit in fruits)}) 
    Col1 Col2 Apple Banana Grape Orange 
0 C 33.0  1  1  0  1 
1 A 2.5  1  0  1  0 
2 B 42.0  0  1  0  0 

Timings

dfs = pd.concat([df] * 1000) # Use 3,000 rows in the dataframe. 

# Solution 1 by @Alexander (me) 
%%timeit -n 1000 
dfs[['Col1', 'Col2']].assign(**{fruit: [1 if fruit in cell else 0 for cell in dfs.Col3] 
           for fruit in set(fruit for fruits in dfs.Col3 for fruit in fruits)}) 
# 10 loops, best of 3: 4.57 ms per loop 

# Solution 2 by @Psidom 
%%timeit -n 1000 
pd.concat([ 
     dfs.drop(columns="Col3"), 
     dfs.Col3.apply(lambda x: pd.Series(1, x)).fillna(0) 
    ], axis=1) 
# 10 loops, best of 3: 748 ms per loop 

# Solution 3 by @MaxU 
from sklearn.preprocessing import MultiLabelBinarizer 
mlb = MultiLabelBinarizer() 

%%timeit -n 10 
dfs.join(pd.DataFrame(mlb.fit_transform(dfs.Col3), 
          columns=mlb.classes_, 
          index=dfs.index)) 
# 10 loops, best of 3: 283 ms per loop 

# Solution 4 by @ScottBoston 
%%timeit -n 10 
df_out = dfs.assign(**pd.get_dummies(dfs.Col3.apply(lambda x: pd.Series(x)).stack().reset_index(level=1, drop=True)).groupby(level=0).sum())
# 10 loops, best of 3: 512 ms per loop 

But... 
>>> print(df_out.head()) 
    Col1 Col2      Col3 Apple Banana Grape Orange 
0 C 33.0 [Apple, Orange, Banana] 1000 1000  0 1000 
1 A 2.5   [Apple, Grape] 1000  0 1000  0 
2 B 42.0     [Banana]  0 1000  0  0 
0 C 33.0 [Apple, Orange, Banana] 1000 1000  0 1000 
1 A 2.5   [Apple, Grape] 1000  0 1000  0 
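The counts of 1000 appear because `pd.concat([df] * 1000)` repeats the index labels 0, 1, 2, and summing on index level 0 adds up every row that shares a label. A minimal sketch of the effect on a frame concatenated twice, assuming it is acceptable to renumber the rows with `ignore_index=True`; `stack_dummies` is a hypothetical helper name, and `groupby(level=0).sum()` is used as the current spelling of `sum(level=0)`:

```python
import pandas as pd

df = pd.DataFrame({
    "Col1": ["C", "A", "B"],
    "Col2": [33, 2.5, 42],
    "Col3": [["Apple", "Orange", "Banana"], ["Apple", "Grape"], ["Banana"]],
})

def stack_dummies(frame):
    # Same idea as Solution 4: one row per list element, then sum per index label.
    stacked = frame.Col3.apply(pd.Series).stack().reset_index(level=1, drop=True)
    return pd.get_dummies(stacked).groupby(level=0).sum()

dup = pd.concat([df] * 2)                      # index labels 0, 1, 2, 0, 1, 2
uniq = pd.concat([df] * 2, ignore_index=True)  # index labels 0 through 5

dup_result = stack_dummies(dup)    # 3 rows, counts doubled
uniq_result = stack_dummies(uniq)  # 6 rows, only 0 and 1

print(dup_result)
print(uniq_result)
```

With unique labels each original row keeps its own group, so the result stays a 0/1 indicator table.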
10

Option 1
Short answer
pir_slow

df.drop(columns='Col3').join(df.Col3.str.join('|').str.get_dummies()) 

    Col1 Col2 Apple Banana Grape Orange 
0 C 33.0  1  1  0  1 
1 A 2.5  1  0  1  0 
2 B 42.0  0  1  0  0 
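One caveat with this option: the labels are glued together with `'|'` and split again on the same character, so a label that itself contains `'|'` would be cut in two. A small sketch of a workaround, assuming `';'` never occurs in the data (the sample labels here are made up for illustration):

```python
import pandas as pd

# Hypothetical data where one label contains the default separator "|".
s = pd.Series([["Apple", "Sweet|Sour"], ["Grape"]])

# Join and split on a character that cannot occur in the labels (assumed: ";").
dummies = s.str.join(";").str.get_dummies(sep=";")

print(dummies)
```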

Option 2
Fast answer
pir_fast

v = df.Col3.values 
l = [len(x) for x in v.tolist()] 
f, u = pd.factorize(np.concatenate(v)) 
n, m = len(v), u.size 
i = np.arange(n).repeat(l) 

dummies = pd.DataFrame(
    np.bincount(i * m + f, minlength=n * m).reshape(n, m), 
    df.index, u 
) 

df.drop(columns='Col3').join(dummies) 

    Col1 Col2 Apple Orange Banana Grape 
0 C 33.0  1  1  1  0 
1 A 2.5  1  0  0  1 
2 B 42.0  0  0  1  0 
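To see why the `bincount` trick works, it may help to print the intermediate arrays for the three example rows. The sketch below repeats the same steps on plain lists; the names `lists`, `lengths`, `flat` and `grid` are introduced here for illustration only.

```python
import numpy as np
import pandas as pd

lists = [["Apple", "Orange", "Banana"], ["Apple", "Grape"], ["Banana"]]

lengths = [len(x) for x in lists]
# f: integer code for every fruit occurrence; u: unique fruits in order of first appearance
f, u = pd.factorize(np.concatenate(lists))
n, m = len(lists), len(u)
# i: row number of every fruit occurrence
i = np.arange(n).repeat(lengths)

# Each (row, fruit) pair maps to slot i * m + f of a flattened n x m grid;
# bincount counts the hits per slot and reshape restores the grid.
flat = i * m + f
grid = np.bincount(flat, minlength=n * m).reshape(n, m)

print(list(u))
print(flat)
print(grid)
```

Because `bincount` counts occurrences, a fruit listed twice in the same row would produce a 2 rather than a 1.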

Option 3
pir_alt1

df.drop(columns='Col3').join(
    pd.get_dummies(
        pd.DataFrame(df.Col3.tolist()).stack()
    ).astype(int).groupby(level=0).sum()  # replaces .sum(level=0), removed in pandas 2.0
)

    Col1 Col2 Apple Orange Banana Grape 
0 C 33.0  1  1  1  0 
1 A 2.5  1  0  0  1 
2 B 42.0  0  0  1  0 
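On pandas 0.25 or later, `Series.explode` can stand in for the `DataFrame(...).stack()` step. This is a different technique from the one above, sketched here under the assumption that the index labels are unique:

```python
import pandas as pd

df = pd.DataFrame({
    "Col1": ["C", "A", "B"],
    "Col2": [33, 2.5, 42],
    "Col3": [["Apple", "Orange", "Banana"], ["Apple", "Grape"], ["Banana"]],
})

# One row per list element; the original row label is repeated for each element.
exploded = df["Col3"].explode()

# Indicator columns per element, then collapse back to one row per original label.
dummies = pd.get_dummies(exploded).groupby(level=0).sum()

out = df.drop(columns="Col3").join(dummies)
print(out)
```

The indicator columns come out in sorted order here, not in order of first appearance as in Option 2.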

Timing results
Code below

[Figure: timing results]


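import numpy as np
import pandas as pd
from timeit import timeit
from sklearn.preprocessing import MultiLabelBinarizer

def maxu(df): 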
    mlb = MultiLabelBinarizer() 
    d = pd.DataFrame(
     mlb.fit_transform(df.Col3.values) 
     , df.index, mlb.classes_ 
    ) 
    return df.drop(columns='Col3').join(d) 


def bos(df): 
    return df.drop(columns='Col3').assign(**pd.get_dummies(df.Col3.apply(lambda x: pd.Series(x)).stack().reset_index(level=1, drop=True)).groupby(level=0).sum()) 

def psi(df): 
    return pd.concat([ 
     df.drop(columns="Col3"), 
     df.Col3.apply(lambda x: pd.Series(1, x)).fillna(0) 
    ], axis=1) 

def alex(df): 
    return df[['Col1', 'Col2']].assign(**{fruit: [1 if fruit in cell else 0 for cell in df.Col3] 
             for fruit in set(fruit for fruits in df.Col3 
                 for fruit in fruits)}) 

def pir_slow(df): 
    return df.drop(columns='Col3').join(df.Col3.str.join('|').str.get_dummies()) 

def pir_alt1(df): 
    return df.drop(columns='Col3').join(pd.get_dummies(pd.DataFrame(df.Col3.tolist()).stack()).astype(int).groupby(level=0).sum()) 

def pir_fast(df): 
    v = df.Col3.values 
    l = [len(x) for x in v.tolist()] 
    f, u = pd.factorize(np.concatenate(v)) 
    n, m = len(v), u.size 
    i = np.arange(n).repeat(l) 

    dummies = pd.DataFrame(
     np.bincount(i * m + f, minlength=n * m).reshape(n, m), 
     df.index, u 
    ) 

    return df.drop(columns='Col3').join(dummies) 

results = pd.DataFrame(
    index=(1, 3, 10, 30, 100, 300, 1000, 3000), 
    columns='maxu bos psi alex pir_slow pir_fast pir_alt1'.split() 
) 

for i in results.index: 
    d = pd.concat([df] * i, ignore_index=True) 
    for j in results.columns: 
        stmt = '{}(d)'.format(j)
        setp = 'from __main__ import d, {}'.format(j)
        # set_value was removed from pandas; .at is the label-based scalar setter
        results.at[i, j] = timeit(stmt, setp, number=10)
+1

That's awesome! PS: I just used my last vote for today ;-) – MaxU

+0

@MaxU Thank you (-: – piRSquared

+0

That's fast! As for your timing chart, I assume the *x-axis* is the number of rows in the dataframe? – Alexander